Docker image for unstructured+langchain PDF document loading with OCR
docker build --target ready -t unstructured_pdf .
docker run --rm -it unstructured_pdf python
from langchain_community.document_loaders import UnstructuredFileLoader
loader = UnstructuredFileLoader("/home/appuser/.dockerinit/test.pdf")
docs = loader.load()
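After `loader.load()` returns, a quick check can confirm that text was actually extracted from the PDF. The helper below is a minimal sketch and is not part of the image: `summarize_docs` and `preview_chars` are hypothetical names, and the code assumes only that each item exposes `page_content` and `metadata`, as LangChain `Document` objects do.

```python
from typing import Any, Iterable


def summarize_docs(docs: Iterable[Any], preview_chars: int = 80) -> list[dict]:
    """Return one small summary dict per loaded document (illustrative helper)."""
    summaries = []
    for index, doc in enumerate(docs):
        # Treat a missing or None body as empty text.
        text = doc.page_content or ""
        summaries.append(
            {
                "index": index,
                # "source" is the metadata key we assume the loader sets.
                "source": doc.metadata.get("source"),
                "chars": len(text),
                # Flatten newlines so the preview fits on one line.
                "preview": text[:preview_chars].replace("\n", " "),
            }
        )
    return summaries
```

In the interactive session above, `summarize_docs(docs)` would list one entry per returned document. An entry with `chars` equal to 0 suggests that extraction produced no text for that document.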
- https://unstructured-io.github.io/unstructured/
- https://python.langchain.com/docs/integrations/providers/unstructured
- Add OCR
- More robust layout support
- Table support
- Markdown conversion with LLM support
- Reduce image size