Skip to content

OpenThaiGPT/PDF2ThaiText

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

29 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

PDF-To-ThaiText

PDF-Reader convert your PDF files to html and raw txt, And contain Function for correcting your raw txt.

How to use

  • Convert PDF to HTML and RAW text : create 2 dirs 1.html_output 2.raw_txt_output
python scripts/convert_pdf_to_raw.py /path/to/pdf
  • Correct Text in RAW text : create 1 dir "corrected_txt_output"
python scripts/corrected_text.py
  • Convert PDF and Correct Text : create 3 dirs 1.html_output 2.raw_txt_output 3.corrected_txt_output"
python scripts/convert_pdf_to_correct.py /path/to/pdf

Setup

  1. Clone the repository
git clone https://github.com/OpenThaiGPT/pdf-reader.git
cd pdf-reader
  1. Create Virtual Environment
conda create --name pdf python=3.9 -y
conda activate pdf
  1. Install Dependencies
pip install -e .
  1. Install Java in Conda
conda install conda-forge::openjdk
  1. Install PDFBox
  • on Linux
wget https://dlcdn.apache.org/pdfbox/3.0.3/pdfbox-app-3.0.3.jar
  • on Windows
wget https://dlcdn.apache.org/pdfbox/3.0.3/pdfbox-app-3.0.3.jar -o pdfbox-app-3.0.3.jar

or

curl https://dlcdn.apache.org/pdfbox/3.0.3/pdfbox-app-3.0.3.jar -o pdfbox-app-3.0.3.jar

Reference

แปลงรัฐธรรมนูญ (ร่างต้นปี 2559) จาก PDF เป็น HTML - https://github.com/bact/constitution โดย อาทิตย์ สุริยะวงศ์กุล Apache PDFBox® - A Java PDF Library - https://pdfbox.apache.org/

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 3

  •  
  •  
  •  

Languages