Status Updates (Only for hosted API on www.chunkr.ai)
- We have temporarily switched to Textract for OCR from PaddleOCR. Textract is provided for free until we resolve PaddleOCR issues. Textract occasionally misses tables that PaddleOCR wouldn't. For self-deploys, you can still set PaddleOCR as your OCR strategy in the task service .env variables.
- We are still experiencing extremely high loads, which have affected throughputs. We're working hard to get ingestion speeds back to our standard.
We're Lumina. We've built a search engine that's five times more relevant than Google Scholar. You can check us out at lumina.sh. We achieved this by bringing state-of-the-art search technology (the best in dense and sparse vector embeddings) to academic research.
While search is one problem, sourcing high-quality data is another. We needed to process millions of PDFs in-house to build Lumina, and we found that existing solutions to extract structured information from PDFs were too slow and too expensive ($$ per page).
Chunk my docs provides a self-hostable solution that leverages state-of-the-art (SOTA) vision models for segment extraction and OCR, unifying the output through a Rust Actix server. This setup allows you to process PDFs and extract segments at an impressive speed of approximately 5 pages per second on a single NVIDIA L4 instance, offering a cost-effective and scalable solution for high-accuracy bounding box segment extraction and OCR. This solution has models that accommodate both GPU and CPU environments. Try the UI on chunkr.ai!
https://docs.chunkr.ai/introduction
- Go to chunkr.ai
- Make an account and copy your API key
- Create a task:
curl -X POST https://api.chunkr.ai/api/v1/task \ -H "Content-Type: multipart/form-data" \ -H "Authorization: ${YOUR_API_KEY}" \ -F "file=@/path/to/your/file" \ -F "model=HighQuality" \ -F "target_chunk_length=512" \ -F "ocr_strategy=Auto"
- Poll your created task:
curl -X GET https://api.chunkr.ai/api/v1/task/${TASK_ID} \ -H "Authorization: ${YOUR_API_KEY}"
- You'll need K8s and docker.
- Follow the steps in
self-deployment.md
This project is dual-licensed:
- GNU Affero General Public License v3.0 (AGPL-3.0)
- Commercial License
To use Chunkr without complying with the AGPL-3.0 license terms you can contact us or visit our website.