The PDF Page Classification algorithm analyzes PDF pages by rendering them as high-resolution images and assigning each page to one of three classes: **Image**, **Text**, or **Image+Text**. Each page is first rendered as an image with the PyMuPDF library, then converted to grayscale and smoothed with a Gaussian blur to reduce noise. Binary thresholding is then applied and contours are extracted from the thresholded image, isolating the text and image regions on the page. The page is classified from two measurements, the mean contour area and the ratio of total contour area to page area, which together indicate whether it contains only images, only text, or a combination of both. Separating pages by content type in this way makes it easier to route each page to suitable downstream processing, such as content extraction or further analysis.
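The pipeline described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the function names, the render resolution, the blur kernel, the fixed threshold value, and both classification cutoffs are assumptions, since the description does not state them. The decision rule (both measurements high means Image, both low means Text, otherwise Image+Text) is likewise one plausible reading of the description. OpenCV is assumed for the image steps; only PyMuPDF is named in the description.

```python
from statistics import mean

# Hypothetical cutoffs: the description does not give the values actually used.
MEAN_AREA_THRESHOLD = 5000.0   # px^2; a larger mean contour area suggests image regions
COVERAGE_THRESHOLD = 0.5       # fraction of the page covered by contours


def classify_page(contour_areas, page_area,
                  mean_area_threshold=MEAN_AREA_THRESHOLD,
                  coverage_threshold=COVERAGE_THRESHOLD):
    """Classify a page as 'Image', 'Text', or 'Image+Text' (assumed rule)."""
    if page_area <= 0:
        raise ValueError("page_area must be positive")
    if not contour_areas:
        # Assumption: a page with no contours is grouped with text pages.
        return "Text"
    large = mean(contour_areas) > mean_area_threshold
    dense = sum(contour_areas) / page_area > coverage_threshold
    if large and dense:
        return "Image"
    if not large and not dense:
        return "Text"
    return "Image+Text"


def page_contour_areas(pdf_path, page_number, dpi=300):
    """Render one page and return (contour areas, page area) in pixels."""
    # Imported lazily so classify_page works without these packages installed.
    import cv2
    import fitz  # PyMuPDF
    import numpy as np

    doc = fitz.open(pdf_path)
    try:
        zoom = dpi / 72  # PDF user space is 72 units per inch
        pix = doc[page_number].get_pixmap(matrix=fitz.Matrix(zoom, zoom))
        img = np.frombuffer(pix.samples, dtype=np.uint8)
        img = img.reshape(pix.height, pix.width, pix.n)
    finally:
        doc.close()

    gray = cv2.cvtColor(img, cv2.COLOR_RGB2GRAY)
    blurred = cv2.GaussianBlur(gray, (5, 5), 0)
    # Inverted so dark ink becomes white foreground; 200 is an assumed cutoff.
    _, binary = cv2.threshold(blurred, 200, 255, cv2.THRESH_BINARY_INV)
    # [-2] selects the contour list in both OpenCV 3 and OpenCV 4.
    contours = cv2.findContours(binary, cv2.RETR_EXTERNAL,
                                cv2.CHAIN_APPROX_SIMPLE)[-2]
    areas = [cv2.contourArea(c) for c in contours]
    return areas, float(pix.width * pix.height)
```

A call such as `classify_page(*page_contour_areas("sample.pdf", 0))` would label the first page of a hypothetical `sample.pdf`. Keeping the decision rule separate from the rendering step makes the cutoffs easy to tune on a few known pages.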
forked from Ro-shni/PDF-IMAGE-IDENTIFIER
amutha2002/PDF-IMAGE-IDENTIFIER