PDF-IMAGE-IDENTIFIER

The PDF Page Classification algorithm analyzes PDF pages by rendering them as high-resolution images and assigning each page to one of three classes: **Image**, **Text**, or **Image+Text**. Each page is first rendered as an image with the PyMuPDF library, then converted to grayscale and smoothed with a Gaussian blur to reduce noise. Binary thresholding is applied next, and contours are extracted from the thresholded image to isolate the text and image regions of the page. The page is then classified from two measurements: the mean contour area, and the ratio of the total contour area to the page area. Together these determine whether the page contains only images, only text, or a combination of both. Separating content types in this way supports content management and extraction for further processing or analysis.
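The steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the function names, the threshold values, the 5x5 blur kernel, the use of Otsu thresholding, the direction of each comparison, and the handling of blank pages are all assumptions, since the description does not state them. It also assumes OpenCV 4 (where `cv2.findContours` returns two values) and a PyMuPDF version that accepts `dpi` in `get_pixmap`.

```python
from statistics import mean

# Hypothetical thresholds: the description does not give the real values.
MEAN_AREA_THRESHOLD = 5000.0  # px^2; glyph-sized contours fall well below this
AREA_RATIO_THRESHOLD = 0.5    # fraction of the page covered by contours


def classify_page(contour_areas, page_area):
    """Label a page as "Image", "Text", or "Image+Text" (assumed rule)."""
    if not contour_areas:
        # Assumption: a page with no contours is treated as "Text".
        return "Text"
    mean_area = mean(contour_areas)
    area_ratio = sum(contour_areas) / page_area
    large = mean_area > MEAN_AREA_THRESHOLD
    covered = area_ratio > AREA_RATIO_THRESHOLD
    if large and covered:
        return "Image"
    if not large and not covered:
        return "Text"
    return "Image+Text"


def page_contour_areas(pdf_path, page_index, dpi=300):
    """Render one page and return (contour areas, page area in pixels)."""
    # Imported here so classify_page can be used without these packages.
    import cv2
    import fitz  # PyMuPDF
    import numpy as np

    with fitz.open(pdf_path) as doc:
        pix = doc[page_index].get_pixmap(
            dpi=dpi, colorspace=fitz.csRGB, alpha=False
        )
        img = np.frombuffer(pix.samples, dtype=np.uint8)
        img = img.reshape(pix.height, pix.width, pix.n).copy()

    gray = cv2.cvtColor(img, cv2.COLOR_RGB2GRAY)
    blurred = cv2.GaussianBlur(gray, (5, 5), 0)
    # Inverted so that dark ink becomes the white foreground for contours.
    _, binary = cv2.threshold(
        blurred, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU
    )
    contours, _ = cv2.findContours(
        binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE
    )
    areas = [cv2.contourArea(c) for c in contours]
    return areas, float(pix.width * pix.height)
```

A page would then be labeled with `classify_page(*page_contour_areas("sample.pdf", 0))`, where `sample.pdf` is a placeholder path. Keeping the decision rule separate from the rendering step makes the thresholds easy to tune on a few known pages.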

Files: README.md, doodle_classifier_1.ipynb

amutha2002/PDF-IMAGE-IDENTIFIER
