Check out the task list to see what new features are in the works!
- Add a new function, `get_files`, to quickly generate a list of all the files in a folder and keep the file structure consistent before and after processing.
- Doc2X API does not return an obvious error when uploading files over 100MB (API limit).
Please see releases
pdfdeal makes PDFs easier to work with: it extracts readable text, uses OCR to recognise text in images, and cleans up the formatting, which makes the result more suitable for knowledge base construction.
It uses easyocr or Doc2x to recognise the text in images and adds it to the original text. If the output format is PDF, the new PDF keeps the text on the same page numbers as the original. After processing, you can pass the PDF to knowledge base applications (such as Dify or FastGPT), which in theory should give a better recognition rate.
Added support for Doc2x, which currently has a daily 500-page free usage quota, and its recognition of tables/formulas is excellent.
You can also use the Doc2x support module on its own to convert PDF directly to markdown/latex/docx, as in the snippet that follows. See Doc2x Support for more.
from pdfdeal import gen_folder_list
from pdfdeal.doc2x import Doc2X
Client = Doc2X()
# gen_folder_list is a built-in helper that lists every PDF under the given folder; any list of PDF paths works as well
filelist = gen_folder_list("./test","pdf")
Client.pdfdeal(filelist)
See the example code.
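Because the Doc2X API does not return an obvious error for files over 100MB, it can help to check file sizes before uploading. The following is a minimal sketch, not part of pdfdeal: `split_by_size` and `MAX_BYTES` are hypothetical names, and it assumes the limit means 100 * 1024 * 1024 bytes on disk.

```python
import os

# Assumption: the 100MB API limit is read as 100 * 1024 * 1024 bytes.
MAX_BYTES = 100 * 1024 * 1024


def split_by_size(paths, max_bytes=MAX_BYTES):
    """Split paths into (uploadable, too_large) by file size on disk."""
    uploadable, too_large = [], []
    for p in paths:
        if os.path.getsize(p) <= max_bytes:
            uploadable.append(p)
        else:
            too_large.append(p)
    return uploadable, too_large
```

The first list can then be passed on for conversion, and the second reported to the user instead of failing silently.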
Install from PyPI:
pip install 'pdfdeal[easyocr]'
Using pytesseract (make sure you have installed tesseract first):
pip install 'pdfdeal[pytesseract]'
Using your own custom OCR function, Doc2x, or skipping OCR:
pip install pdfdeal
Install from source:
pip install 'pdfdeal[all] @ git+https://github.com/Menghuan1918/pdfdeal.git'
Import the function with `from pdfdeal import deal_pdf`. The function accepts the following parameters:
- `input`: `str` or `list`
  - Description: The local path to the PDF file that you want to process.
  - Example: `["1.pdf","2.pdf"]`
- `output`: `str`, optional, default: `"texts"`
  - Description: Specifies the type of output you want. The options are:
    - `"texts"`: Extracted text from the PDF as a list of strings, one per page.
    - `"md"`: Markdown formatted text.
    - `"pdf"`: A new PDF file with the extracted text.
  - Example: `"md"`
- `ocr`: `function`, optional, default: `None`
  - Description: A custom OCR (Optical Character Recognition) function. If not provided, the default OCR function will be used. Use the string `"pytesseract"` to use pytesseract, or the string `"pass"` to skip OCR.
  - Example custom OCR function: `custom_ocr_function`, whose input is `(path, language=["ch_sim", "en"], GPU=False)` and which returns a `string` and a `bool`.
- `language`: `list`, optional, default: `["ch_sim", "en"]`
  - Description: A list of languages to be used in OCR. The default languages are Simplified Chinese (`"ch_sim"`) and English (`"en"`). Use `["eng"]` for pytesseract.
  - Example: `["en", "fr"]`
- `GPU`: `bool`, optional, default: `False`
  - Description: A boolean flag indicating whether to use GPU for OCR processing. If set to `True`, GPU will be used.
  - Example: `True`
- `path`: `str`, optional, default: `None`
  - Description: The directory path where the output file will be saved. This parameter is only used when the `output` type is `"md"` or `"pdf"`.
  - Example: `"/path/to/save/output"`
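The `ocr` parameter accepts a custom function with the signature `(path, language=["ch_sim", "en"], GPU=False)` that returns a string and a bool. Below is a rough stub of that shape, not a working OCR engine. Two points are assumptions rather than documented behaviour: that `path` may be either an image file or a folder of images, and that the returned bool is a success flag.

```python
import os


def custom_ocr_function(path, language=["ch_sim", "en"], GPU=False):
    """Stub with the documented signature; returns (text, bool)."""
    # Assumption: `path` is an image file or a folder of images.
    if os.path.isdir(path):
        images = sorted(os.listdir(path))
    else:
        images = [os.path.basename(path)]
    # A real implementation would run an OCR engine on each image here,
    # honouring `language` and `GPU`. This stub only emits placeholders.
    text = "\n".join(f"[image: {name}]" for name in images)
    # Assumption: the bool reports whether OCR completed successfully.
    return text, True
```

Such a function would be passed as `deal_pdf(input="test.pdf", ocr=custom_ocr_function)`.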
Args:
- `mdfile`: `str`, the markdown file path.
- `replace`: `str`, only "local" accepted now, will add "R2", "S3", "OSS" in the future.
- `outputpath`: `str`, the output path to save the images.
- `relative`: `bool`, whether to save the images with relative path. Default is `False`.
from pdfdeal import md_replace_imgs
md_replace_imgs(
    mdfile="Output/sample.md",
    replace="local",
    outputpath="./Output/test/md_replace_imgs",
)
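To illustrate the general idea of pointing remote Markdown images at local copies, here is a small standalone sketch. It is not how `md_replace_imgs` is implemented; `rewrite_image_links` and `IMG_PATTERN` are hypothetical names, and the sketch only rewrites links without downloading anything.

```python
import re

# Matches Markdown images with an http(s) URL, e.g. ![alt](https://host/img.png)
IMG_PATTERN = re.compile(r"!\[(?P<alt>[^\]]*)\]\((?P<url>https?://[^)\s]+)\)")


def rewrite_image_links(markdown, output_dir):
    """Return (new_markdown, urls) with remote image links pointed at output_dir."""
    urls = []

    def to_local(match):
        url = match.group("url")
        urls.append(url)
        # Keep only the last URL segment as the local file name.
        # A real tool would also download the file and handle query strings.
        filename = url.rsplit("/", 1)[-1]
        return f"![{match.group('alt')}]({output_dir}/{filename})"

    return IMG_PATTERN.sub(to_local, markdown), urls
```

The returned `urls` list tells the caller which files still need to be fetched into `output_dir`.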
import os
from pdfdeal import deal_pdf

# Walk ./PPT recursively and process every file found, writing new PDFs to ./Output
for root, dirs, files in os.walk("./PPT"):
    for file in files:
        file_path = os.path.join(root, file)
        deal_pdf(
            input=file_path, output="pdf", language=["en"], path="./Output", GPU=True
        )
        print(f"Deal with {file_path} successfully!")
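The loop above hands every file in the folder to `deal_pdf`, whatever its extension. If the folder can contain other file types, a filter along these lines may help. This is a sketch using only the standard library; `find_pdfs` is a hypothetical helper, not pdfdeal's `gen_folder_list`.

```python
from pathlib import Path


def find_pdfs(folder):
    """Return sorted paths of all .pdf files under folder, searched recursively."""
    return sorted(
        str(p)
        for p in Path(folder).rglob("*")
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
```

Since `input` accepts a list, the result could be passed in a single call, for example `deal_pdf(input=find_pdfs("./PPT"), output="pdf", path="./Output")`.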
from pdfdeal import deal_pdf

# output="texts" returns a list of strings, one per page
Text = deal_pdf(input="test.pdf", output="texts", language=["en"], GPU=True)
for text in Text:
    print(text)
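Since the `"texts"` output is a list of strings with one entry per page, it is easy to attach page numbers before loading the text into a knowledge base. A minimal sketch, assuming a simple dictionary record is wanted; `pages_to_records` and the record keys are hypothetical, not part of pdfdeal.

```python
def pages_to_records(pages, source):
    """Attach a 1-based page number and source name to each non-empty page."""
    return [
        {"source": source, "page": number, "text": text.strip()}
        for number, text in enumerate(pages, start=1)
        if text.strip()
    ]
```

Blank pages are skipped, but the numbering still follows the original PDF, so a record's `page` value always refers to the source page.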
from pdfdeal import deal_pdf, gen_folder_list

# Convert every PDF under tests/pdf to Markdown, using pytesseract for OCR
files = gen_folder_list("tests/pdf", "pdf")
output_path = deal_pdf(
    input=files,
    output="md",
    ocr="pytesseract",
    language=["eng"],
    path="Output",
)
for f in output_path:
    print(f"Save processed file to {f}")
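Note that the language codes differ between backends: the default easyocr backend uses codes such as `"ch_sim"` and `"en"`, while pytesseract expects Tesseract codes such as `"eng"`. When switching backends, a small lookup can translate one into the other. This is a sketch; `EASYOCR_TO_TESSERACT` and `to_tesseract_langs` are hypothetical names, and the table covers only three languages.

```python
# EasyOCR code -> Tesseract code. Extend the table as needed.
EASYOCR_TO_TESSERACT = {"en": "eng", "ch_sim": "chi_sim", "fr": "fra"}


def to_tesseract_langs(languages):
    """Translate EasyOCR-style codes, leaving unknown codes unchanged."""
    return [EASYOCR_TO_TESSERACT.get(code, code) for code in languages]
```

Codes that are not in the table pass through untouched, so a list that already uses Tesseract codes is returned as is.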
from pdfdeal import deal_pdf
print(deal_pdf(input="test.pdf", ocr="pass"))  # "pass" skips OCR
from pdfdeal import gen_folder_list
from pdfdeal.doc2x import Doc2X
Client = Doc2X()
# gen_folder_list is a built-in helper that lists every PDF under the given folder; any list of PDF paths works as well
filelist = gen_folder_list("./test","pdf")
Client.pdfdeal(filelist)
See Doc2x Support.