Skip to content

A program that splits and groups pages of a PDF via the data that is extracted using OCR, over a specified area.

License

Notifications You must be signed in to change notification settings

StellaTec/scanned_pdf_sorter

Repository files navigation

scanned_pdf_sorter

Introduction

The purpose of this project is to sort and group pages of a scanned pdf file by the selected information that was extracted from each page.

Project Links

Prerequisites

Programs / Libraries

  • Python is required if the project is being built or if being ran without compiling

  • ODBC Diver 17 for connection and querying an SQL Database

  • Poppler is required for the reading and writing of pdf files.

  • Tesseract is required for OCR functionality (Optical Character Recognition)

    • Install Tesseract via your package manager
    • Tesseract building instructions for Windows are located in the Notes section
  • Poppler and Tesseract are included in wintools.zip

Pip Packages

  • All necessary pip packages are listed in setup.py

  • In general, here are the packages that are needed for this program to run

    • matplotlib
    • Pillow
    • pdf2image
    • pytesseract
    • pyodbc

Installation & Running

Linux

  • Create and Navigate to the directory for this program

  • Create and activate a virtual environment:

    • python -m venv venv
    • source ./venv/bin/activate
  • pip install via the git repo, replace BRANCH with the desired git repo branch that you want to install

    pip install git+https://www.gitlab.com/cblacktech/scanned_pdf_sorter@BRANCH
    
  • To run the program, run this entry_point command in the terminal

    pdf_sorter_app_run
    

Windows

  • Create and navigate into a directory for this program

  • Create and activate a virtual environment:

    • python -m venv venv
    • .\env\Scripts\activate
  • pip install via the git repo, replace BRANCH with the desired git repo branch that you want to install (omit @BRANCH for default branch)

    pip install git+https://www.gitlab.com/cblacktech/scanned_pdf_sorter@BRANCH
    
  • Download wintools.zip from repo and move the zip file to the program directory (Do one of the following):

    • curl -LJO https://gitlab.com/cblacktech/scanned_pdf_sorter/-/raw/BRANCH/wintools.zip
      
    • Download the zip file from this the repo. By clicking the wintools.zip file in the repo, then clicking the download button.

  • To run the program, run this entry_point command in the terminal

    pdf_sorter_app_run
    

Building

  • Follow the Installation & Running section until the pip installation part

  • Pip install the packages needed for development / building purposes

    pip install -e .[dev]
    

    (If project is git cloned, or downloaded)

    or

    pip install git+https://www.gitlab.com/cblacktech/scanned_pdf_sorter@BRANCH[dev]
    

    (where you replace BRANCH with the desired git repo branch that you want to install, omit @BRANCH for default branch)

  • Run pyinstaller using a spec file, for example

    pyinstaller pdf_sorter_app.spec
    
  • The finished program will be located inside the dist folder

    • Run the executable to start the program after you navigate into the program directory inside of dist
      • Linux: ./pdf_sorter_app
      • Windows: pdf_sorter_app.exe

Config.ini

  • image_type determines the image file type that is used (currently supports the values png and jpeg).

  • file_initial_search_dir determines where the pdf file selector will first open upo at.

  • The CROP_BOX stores the top-left coordinates and the bottom-right coordinates what the images will be cropped to.

  • Currently all of the other options are for testing and development purposes (it is not recommended for any of these values to be changed at this time)

Notes

  • If you want a custom window icon, have a .png file in the same directory that you are launching your application from.

  • To reinstall poppler download it from here and extract it into a folder called wintools.

  • tesseract.exe was build from source using the build instructions in the tesseract docs, and then was moved inside the wintools folder.

  • The tessdata folder contains the data for recognizing english characters and numbers, eng.traindata. The traindata was downloaded from their git repo.

  • Due to not having access to Mac hardware for testing purposes therefore, Mac OS is not supported.

Authors

About

A program that splits and groups pages of a PDF via the data that is extracted using OCR, over a specified area.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages