GitHub - aaronplasek/OCR_PDFs: a python3/jupyter script using ocrmypdf & tesseract to batch process all PDFs in a directory and all its subdirectories

description

This jupyter notebook script does the following:

preprocesses PDFs for OCR (deskews, auto-rotates, removes backgrounds, and cleans pages using ocrmypdf and unpaper),
OCRs the PDFs (using ocrmypdf with tesseract 4.1),
outputs the following files for every PDF in the directory, except PDFs whose names end in .processed.pdf:
- OCRed PDF/A with the processing applied (filename.processed.pdf)
- text file with the pre-existing OCR of the original PDF, if any exists (filename.original.txt)
- text file with the tesseract-generated OCR (filename.ocr.txt)
The script leaves the original PDFs unchanged.
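
The naming scheme and preprocessing steps above can be sketched in plain Python. This is only an illustration, not the notebook's actual code: the helper names `output_paths` and `build_ocrmypdf_command` are hypothetical, and the choice of ocrmypdf flags is an assumption based on the steps listed in the description (the availability of `--remove-background` depends on the installed ocrmypdf version). The command is assembled but not executed.

```python
from pathlib import Path


def output_paths(pdf):
    """Return the three output paths described above for one input PDF.

    Hypothetical helper: the notebook may derive these names differently.
    """
    pdf = Path(pdf)
    return {
        "processed": pdf.with_suffix(".processed.pdf"),
        "original_text": pdf.with_suffix(".original.txt"),
        "ocr_text": pdf.with_suffix(".ocr.txt"),
    }


def build_ocrmypdf_command(pdf):
    """Assemble (but do not run) an ocrmypdf command line for one PDF.

    Assumption: these flags mirror the deskew / auto-rotate /
    de-background / clean steps named in the description.
    """
    out = output_paths(pdf)
    return [
        "ocrmypdf",
        "--deskew",             # straighten skewed pages
        "--rotate-pages",       # auto-rotate pages to the detected orientation
        "--remove-background",  # may be unavailable in some ocrmypdf versions
        "--clean",              # clean pages with unpaper before OCR
        "--sidecar", str(out["ocr_text"]),  # write the OCR text to a .txt file
        str(pdf),               # input PDF, left unchanged
        str(out["processed"]),  # output PDF
    ]


print(" ".join(build_ocrmypdf_command("report.pdf")))
```

Because the input and output paths differ, a command built this way never writes over the original PDF.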

Copy the OCR_PDFs.ipynb notebook into the directory containing the PDFs to be OCRed.
Open the notebook by typing jupyter notebook OCR_PDFs.ipynb in a terminal. (You will need jupyter, ocrmypdf, and tesseract installed.)
Run all code blocks in the notebook.
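
Before running the notebook, it may help to confirm that the required tools are on the PATH. The following is a minimal sketch using only the Python standard library; `REQUIRED_TOOLS` and `missing_tools` are hypothetical names, not part of the notebook. Since the description also mentions unpaper, that name could be added to the tuple as well.

```python
import shutil

# Executables named in the requirements above (assumed to be the command names).
REQUIRED_TOOLS = ("jupyter", "ocrmypdf", "tesseract")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools from `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if which(tool) is None]


missing = missing_tools()
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All required tools were found on PATH.")
```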

The script processes all PDFs in its own directory, but does not recursively process PDFs in subfolders.
The script uses shell commands (via the jupyter "!" prefix) alongside python 3, so the code will not work outside of a jupyter notebook.
The script ignores PDFs named [filename].processed.pdf. If keep_processed_PDFs = False, it will overwrite all PDFs ending in .processed.pdf.
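
The selection rule described above (same directory only, skipping already-processed files) could look roughly like the sketch below. `select_pdfs` and `PROCESSED_SUFFIX` are hypothetical names, and the handling of `keep_processed_PDFs` is deliberately left out because its exact behavior is defined by the notebook itself.

```python
from pathlib import Path

PROCESSED_SUFFIX = ".processed.pdf"


def select_pdfs(directory="."):
    """List the PDFs directly inside `directory` that still need OCR.

    Path.glob("*.pdf") does not descend into subfolders, which matches
    the non-recursive behavior described above.
    """
    candidates = Path(directory).glob("*.pdf")
    return sorted(
        p for p in candidates
        if p.is_file() and not p.name.endswith(PROCESSED_SUFFIX)
    )


for pdf in select_pdfs("."):
    print(pdf.name)
```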
