amutha2002/PDF-IMAGE-IDENTIFIER

PDF-IMAGE-IDENTIFIER

The PDF Page Classification algorithm analyzes PDF pages by rendering each one as a high-resolution image and assigning it to one of three classes: **Image**, **Text**, or **Image+Text**. Each page is first rendered with the PyMuPDF library, then converted to grayscale and smoothed with a Gaussian blur to reduce noise. Binary thresholding follows, and contours are extracted from the thresholded image to isolate the text and image regions of the page. The page is then classified using two measurements: the mean contour area and the ratio of total contour area to page area. Together these determine whether the page contains only images, only text, or a combination of both. Labeling pages this way makes it easier to route each page to suitable downstream processing or extraction.
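
The pipeline described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the function names `contour_areas` and `classify_page`, the 300 DPI render, the 5x5 blur kernel, the use of Otsu thresholding, and both cutoff values (`small_area_max`, `image_ratio_min`) are assumptions chosen for the example, since the description does not state them. The decision rule itself is one plausible reading of "mean contour area plus area ratio".

```python
def contour_areas(pdf_path, page_index, dpi=300):
    """Render one PDF page and return (contour areas, page area in pixels)."""
    # Heavy dependencies are imported here so the decision rule below
    # can be used and tested without PyMuPDF or OpenCV installed.
    import fitz  # PyMuPDF
    import cv2
    import numpy as np

    with fitz.open(pdf_path) as doc:
        page = doc[page_index]
        zoom = dpi / 72.0  # PDF user space is 72 units per inch
        pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom), alpha=False)
        # Copy the raw RGB samples into a (height, width, channels) array.
        img = np.frombuffer(pix.samples, dtype=np.uint8).reshape(
            pix.height, pix.width, pix.n
        ).copy()
        page_area = float(pix.width * pix.height)

    gray = cv2.cvtColor(img, cv2.COLOR_RGB2GRAY)
    blurred = cv2.GaussianBlur(gray, (5, 5), 0)  # assumed kernel size
    # Inverted binary threshold so dark ink becomes white foreground.
    # Otsu's method is an assumption; a fixed threshold would also work.
    _, binary = cv2.threshold(
        blurred, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU
    )
    # [-2] picks the contour list under both OpenCV 3 and OpenCV 4.
    contours = cv2.findContours(
        binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE
    )[-2]
    return [cv2.contourArea(c) for c in contours], page_area


def classify_page(areas, page_area, small_area_max=500.0, image_ratio_min=0.5):
    """Label a page from its contour areas (assumed, illustrative rule)."""
    if not areas:
        # Assumption: a page with no contours is treated as Text.
        return "Text"
    mean_area = sum(areas) / len(areas)
    area_ratio = sum(areas) / page_area
    if mean_area < small_area_max:
        return "Text"        # many small blobs: glyphs or words
    if area_ratio >= image_ratio_min:
        return "Image"       # large blobs covering most of the page
    return "Image+Text"      # large blobs, but most of the page is not covered
```

A page could then be labeled with `classify_page(*contour_areas("sample.pdf", 0))`, where `sample.pdf` is a placeholder path. Both cutoffs depend on the render resolution, so they would need tuning against real documents.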

Languages

  • Jupyter Notebook 100.0%