
Status Updates (only for the hosted API at www.chunkr.ai)

  1. We have temporarily switched from PaddleOCR to Textract for OCR. Textract is provided for free until we resolve the PaddleOCR issues, but it occasionally misses tables that PaddleOCR would catch. For self-deploys, you can still set PaddleOCR as your OCR strategy in the task service's .env variables.
  2. We are still experiencing extremely high load, which has affected throughput. We're working hard to get ingestion speeds back to our standard.

Chunkr

We're Lumina. We've built a search engine that's five times more relevant than Google Scholar. You can check us out at lumina.sh. We achieved this by bringing state-of-the-art search technology (the best in dense and sparse vector embeddings) to academic research.

While search is one problem, sourcing high-quality data is another. We needed to process millions of PDFs in-house to build Lumina, and we found that existing solutions to extract structured information from PDFs were too slow and too expensive ($$ per page).

Chunkr provides a self-hostable solution that uses state-of-the-art (SOTA) vision models for segment extraction and OCR, unifying the output through a Rust Actix server. It can process PDFs and extract segments at roughly 5 pages per second on a single NVIDIA L4 instance, making it a cost-effective and scalable option for high-accuracy bounding box segment extraction and OCR. Models are available for both GPU and CPU environments. Try the UI on chunkr.ai!

Docs

https://docs.chunkr.ai/introduction

(Super) Quick Start

  1. Go to chunkr.ai
  2. Make an account and copy your API key
  3. Create a task:
    curl -X POST https://api.chunkr.ai/api/v1/task \
       -H "Content-Type: multipart/form-data" \
       -H "Authorization: ${YOUR_API_KEY}" \
       -F "file=@/path/to/your/file" \
       -F "model=HighQuality" \
       -F "target_chunk_length=512" \
       -F "ocr_strategy=Auto"
  4. Poll your created task (a Python sketch of both steps follows this list):
    curl -X GET https://api.chunkr.ai/api/v1/task/${TASK_ID} \
      -H "Authorization: ${YOUR_API_KEY}"

Self Deployments

  1. You'll need Kubernetes (K8s) and Docker.
  2. Follow the steps in self-deployment.md

Licensing

This project is dual-licensed:

  1. GNU Affero General Public License v3.0 (AGPL-3.0)
  2. Commercial License

To use Chunkr without complying with the AGPL-3.0 license terms, you can contact us or visit our website.

Want to talk to a founder?

https://cal.com/mehulc/30min
