- pdf2image
  - PDF-to-image converter
  - Requires poppler to be installed (see the README of the repo above)
- pytesseract
  - Python wrapper for Tesseract
- Convert the PDF file to images using pdf2image
- Adjust the images to improve the OCR result
- Run Tesseract OCR on the images
- Tokenize the OCR result
- Export a CSV file
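
The steps above can be sketched roughly as follows. This is a minimal illustration, not the actual `main.py`: the function names, the grayscale/autocontrast adjustment, the whitespace tokenizer, and the `page,token` CSV layout are all assumptions, since the README does not specify them. Only `convert_from_path` (pdf2image), `image_to_string` (pytesseract), and the Pillow `ImageOps` calls are real library functions.

```python
import csv


def pdf_to_images(pdf_path, dpi=300):
    # Imported lazily: pdf2image needs poppler installed on the system.
    from pdf2image import convert_from_path
    return convert_from_path(pdf_path, dpi=dpi)  # list of PIL images, one per page


def adjust_image(image):
    # Assumed adjustment (hypothetical): grayscale plus autocontrast.
    from PIL import ImageOps
    return ImageOps.autocontrast(ImageOps.grayscale(image))


def ocr_image(image, lang="eng"):
    # Imported lazily: pytesseract needs the tesseract binary installed.
    import pytesseract
    return pytesseract.image_to_string(image, lang=lang)


def tokenize(text):
    # Assumed tokenization (hypothetical): split on any whitespace.
    return text.split()


def write_csv(rows, out_file):
    # Assumed layout (hypothetical): one row per token, with its page number.
    writer = csv.writer(out_file)
    writer.writerow(["page", "token"])
    writer.writerows(rows)


def run(pdf_path, output_path):
    rows = []
    for page_number, image in enumerate(pdf_to_images(pdf_path), start=1):
        text = ocr_image(adjust_image(image))
        rows.extend((page_number, token) for token in tokenize(text))
    with open(output_path, "w", newline="", encoding="utf-8") as f:
        write_csv(rows, f)
```

Keeping each step in its own function makes it possible to swap the image adjustment or the tokenizer without touching the rest of the pipeline.
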
> python main.py [-o output_file_path] pdf_file_path
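
A command line of this shape could be parsed with `argparse` along these lines. This is only a sketch: the `build_parser` name, the help strings, and the `None` default for `-o` are assumptions, because the README does not document what `main.py` does when `-o` is omitted.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(
        prog="main.py",
        description="OCR a PDF file and export the result as CSV.",
    )
    # Required positional argument: the PDF to process.
    parser.add_argument("pdf_file_path", help="path to the input PDF file")
    # Optional output path; the real default is not documented, so None is used here.
    parser.add_argument(
        "-o",
        dest="output_file_path",
        default=None,
        help="path of the CSV file to write",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.pdf_file_path, args.output_file_path)
```
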