A simple API built with FastAPI for extracting the text content of PDF files using pdfminer.
Several PDF parsers were tried, including PyPDF2 and pdfminer, and pdfminer gave the better results. For OCR support, first install Tesseract and Ghostscript; the code requires both.
You can try out and compare the output of pdfminer and Tika through the API endpoints. The results are available in the API response or in the app/results directory.
Note: if Tesseract is installed somewhere other than the default location, change the path accordingly in the pdfapi.py file.
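Before starting the API, it can help to confirm that the OCR dependencies are discoverable. The following is a minimal sketch using only the Python standard library; it is not part of the repository. The helper name `find_ocr_dependencies` is hypothetical, and the executable names are assumptions based on typical installs (`tesseract`; `gs` on Linux/macOS, `gswin64c` or `gswin32c` on Windows).

```python
import shutil

# Candidate executable names per dependency (assumed, based on typical installs).
CANDIDATES = {
    "tesseract": ["tesseract"],
    "ghostscript": ["gs", "gswin64c", "gswin32c"],
}


def find_ocr_dependencies(candidates=CANDIDATES, which=shutil.which):
    """Return {dependency: resolved path or None} by searching PATH."""
    found = {}
    for name, executables in candidates.items():
        # Take the first candidate name that resolves to an executable on PATH.
        found[name] = next((p for p in map(which, executables) if p), None)
    return found


if __name__ == "__main__":
    for name, path in find_ocr_dependencies().items():
        print(f"{name}: {path or 'not found on PATH'}")
```

If Tesseract is installed but reported as not found, that is the case the note above covers: set its location in pdfapi.py.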
To install and run locally:
git clone https://github.com/soham-1/fastapi_pdfextractor.git
cd fastapi_pdfextractor
pip install -r requirements.txt
cd app
uvicorn pdfapi:app --host 0.0.0.0 --port 8000 --reload
To build and run with Docker instead:
docker-compose up -d --build
To stop the fast_api service and start it again:
docker-compose stop fast_api
docker-compose up -d
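Once the server is running, the routes it exposes can be listed from the OpenAPI schema that FastAPI serves at /openapi.json by default. The sketch below is illustrative and not part of the repository: `list_routes` and `fetch_schema` are hypothetical helper names, and the base URL assumes the API is reachable on localhost port 8000 and that the app has not changed the default schema URL.

```python
import json
from urllib.request import urlopen

# Keys of an OpenAPI path item that denote HTTP operations.
HTTP_METHODS = {"get", "post", "put", "delete", "patch", "options", "head", "trace"}


def list_routes(schema):
    """Return sorted (METHOD, path) pairs from an OpenAPI schema dict."""
    routes = []
    for path, operations in schema.get("paths", {}).items():
        for key in operations:
            # Skip non-operation keys such as "parameters" or "summary".
            if key in HTTP_METHODS:
                routes.append((key.upper(), path))
    return sorted(routes)


def fetch_schema(base_url="http://localhost:8000"):
    """Download the OpenAPI schema (assumes the default /openapi.json URL)."""
    with urlopen(f"{base_url}/openapi.json", timeout=5) as response:
        return json.load(response)
```

With the server up, `list_routes(fetch_schema())` returns the method and path of each route. FastAPI also serves interactive documentation at /docs by default, which is a convenient way to try the routes from a browser.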
This API has the following endpoints: