PDF-Reader converts your PDF files to HTML and raw text, and includes a function for correcting the extracted raw text.
- Convert PDF to HTML and raw text: creates 2 directories (html_output and raw_txt_output)
python scripts/convert_pdf_to_raw.py /path/to/pdf
- Correct the raw text: creates 1 directory (corrected_txt_output)
python scripts/corrected_text.py
- Convert PDF and correct the text: creates 3 directories (html_output, raw_txt_output and corrected_txt_output)
python scripts/convert_pdf_to_correct.py /path/to/pdf
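To run the combined conversion over a whole folder of PDFs, a small wrapper along these lines may help. This is only a sketch: it assumes the script takes the path of a single PDF file as its one argument (as the usage line above suggests) and that it is run from the repository root. The names `build_command`, `find_pdfs` and `convert_all` are hypothetical helpers, not part of the repository.

```python
import subprocess
import sys
from pathlib import Path

# Script path taken from the usage line above; assumes the current
# working directory is the repository root.
SCRIPT = Path("scripts") / "convert_pdf_to_correct.py"


def build_command(pdf_path):
    """Return the argument list for converting one PDF (hypothetical helper)."""
    return [sys.executable, str(SCRIPT), str(pdf_path)]


def find_pdfs(folder):
    """Return the PDF files directly inside `folder`, sorted by path."""
    return sorted(p for p in Path(folder).iterdir() if p.suffix.lower() == ".pdf")


def convert_all(folder):
    """Run the conversion script once per PDF found in `folder`.

    Assumption: the script accepts exactly one PDF file path.
    """
    for pdf in find_pdfs(folder):
        # check=True raises CalledProcessError if the script exits with an error.
        subprocess.run(build_command(pdf), check=True)
```

Calling `convert_all("/path/to/folder")` would then invoke the script once for each PDF in that folder, stopping at the first failure.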
- Clone the repository
git clone https://github.com/OpenThaiGPT/pdf-reader.git
cd pdf-reader
- Create Virtual Environment
conda create --name pdf python=3.9 -y
conda activate pdf
- Install Dependencies
pip install -e .
- Install Java in Conda
conda install conda-forge::openjdk
- Install PDFBox
- on Linux
wget https://dlcdn.apache.org/pdfbox/3.0.3/pdfbox-app-3.0.3.jar
- on Windows
wget https://dlcdn.apache.org/pdfbox/3.0.3/pdfbox-app-3.0.3.jar -O pdfbox-app-3.0.3.jar
or
curl https://dlcdn.apache.org/pdfbox/3.0.3/pdfbox-app-3.0.3.jar -o pdfbox-app-3.0.3.jar
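If neither wget nor curl is available, the same download can be done with Python's standard library, followed by a quick check that Java and the jar are in place. The sketch below reuses the URL from the commands above and saves the jar into the current directory, as those commands do; whether the scripts expect the jar in that location is an assumption. `jar_name_from_url`, `download_pdfbox` and `check_setup` are hypothetical helper names.

```python
import shutil
import urllib.request
from pathlib import Path, PurePosixPath
from urllib.parse import urlparse

# Same URL as in the wget/curl commands above.
PDFBOX_URL = "https://dlcdn.apache.org/pdfbox/3.0.3/pdfbox-app-3.0.3.jar"


def jar_name_from_url(url):
    """Return the file name at the end of the URL path."""
    return PurePosixPath(urlparse(url).path).name


def download_pdfbox(url=PDFBOX_URL, dest_dir="."):
    """Download the jar into `dest_dir` unless it is already there."""
    target = Path(dest_dir) / jar_name_from_url(url)
    if not target.exists():
        urllib.request.urlretrieve(url, str(target))
    return target


def check_setup(jar_path):
    """Return a list of problems found; an empty list means nothing was flagged."""
    problems = []
    # PDFBox is a Java library, so a `java` executable must be on PATH.
    if shutil.which("java") is None:
        problems.append("java not found on PATH")
    if not Path(jar_path).is_file():
        problems.append(f"{jar_path} is missing")
    return problems
```

For example, `check_setup(download_pdfbox())` returns an empty list when both the `java` executable and the jar file are found.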
- Converting the Constitution (early 2016 draft, B.E. 2559) from PDF to HTML - https://github.com/bact/constitution by Arthit Suriyawongkul (อาทิตย์ สุริยะวงศ์กุล)
- Apache PDFBox® - A Java PDF Library - https://pdfbox.apache.org/