GitHub - ScrPzz/IG_ocr: Ocr engine leveraging CRAFT + PaddleOCR

Instagram posts OCR processor

This tool isolates text from the background of an image and reads it. Since it is part of a set of Instagram analysis tools*, the images used in this presentation are taken from Instagram, but the tool could easily be adapted to other sources. Two basic blocks:

CRAFT (https://github.com/clovaai/CRAFT-pytorch) model, used to find the text bounding boxes (this repo is a fork of CRAFT-pytorch).
Box processing and MeanShift clustering to better isolate the text area from the background. In the example image, red boxes are produced by CRAFT and blue ones by the clustering, with a 10 px tolerance added in both directions.
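The box-merging step described above can be sketched as follows. This is a minimal illustration, not the notebook's code: it uses a small flat-kernel mean shift written from scratch so that it runs with the standard library only (the repository may well use a library implementation such as scikit-learn's `MeanShift`). The function names, the `(x0, y0, x1, y1)` box format, the sample boxes, and the `bandwidth` value are all assumptions.

```python
from math import hypot


def mean_shift_labels(points, bandwidth, max_iter=50, tol=1e-3):
    """Flat-kernel mean shift: label points by the mode they converge to."""
    modes = []
    for px, py in points:
        x, y = float(px), float(py)
        for _ in range(max_iter):
            # Neighbours inside the window centred on the current estimate.
            near = [(qx, qy) for qx, qy in points
                    if hypot(qx - x, qy - y) <= bandwidth]
            if not near:  # defensive guard against rounding issues
                break
            nx = sum(q[0] for q in near) / len(near)
            ny = sum(q[1] for q in near) / len(near)
            moved = hypot(nx - x, ny - y)
            x, y = nx, ny
            if moved < tol:
                break
        modes.append((x, y))

    # Points whose modes (nearly) coincide share a cluster label.
    centers, labels = [], []
    for mx, my in modes:
        for i, (cx, cy) in enumerate(centers):
            if hypot(mx - cx, my - cy) < bandwidth / 2:
                labels.append(i)
                break
        else:
            centers.append((mx, my))
            labels.append(len(centers) - 1)
    return labels


def cluster_boxes(boxes, bandwidth, pad=10, image_size=None):
    """Merge (x0, y0, x1, y1) boxes per cluster and pad them by `pad` px."""
    centers = [((x0 + x1) / 2, (y0 + y1) / 2) for x0, y0, x1, y1 in boxes]
    labels = mean_shift_labels(centers, bandwidth)

    merged = {}
    for (x0, y0, x1, y1), label in zip(boxes, labels):
        if label in merged:
            m = merged[label]
            merged[label] = (min(m[0], x0), min(m[1], y0),
                             max(m[2], x1), max(m[3], y1))
        else:
            merged[label] = (x0, y0, x1, y1)

    padded = []
    for x0, y0, x1, y1 in merged.values():
        x0, y0, x1, y1 = x0 - pad, y0 - pad, x1 + pad, y1 + pad
        if image_size is not None:  # keep the padded box inside the image
            width, height = image_size
            x0, y0 = max(0, x0), max(0, y0)
            x1, y1 = min(width, x1), min(height, y1)
        padded.append((x0, y0, x1, y1))
    return padded


# Hypothetical CRAFT-style word boxes forming two separate text areas.
word_boxes = [(10, 10, 60, 30), (15, 35, 70, 55),
              (300, 300, 360, 320), (305, 325, 350, 345)]
text_areas = cluster_boxes(word_boxes, bandwidth=80, image_size=(400, 400))
print(text_areas)
```

The bandwidth controls how far apart two word boxes can be while still ending up in the same text area, so it would need tuning against real posts.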

OCR via both Pytesseract and PaddleOCR (https://github.com/PaddlePaddle/PaddleOCR). Both are kept at this benchmark stage, but Paddle will most probably be the final choice because its results are very good.
[TODO] Background classification: solid color, images present, etc. Then, if there are images in the background:
- [TODO] CLIP model to understand the content of the background and compare it with the text, in order to extract information on how various keyword/image pairs affect the distribution of user base reactions.
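The benchmark between the two OCR engines mentioned above could be organised along these lines. This is only a sketch under assumptions: `compare_engines`, `agreement`, and the stand-in engines are hypothetical names, not code from the notebook. The `tesseract_engine` wrapper uses `pytesseract.image_to_string`, which requires `pytesseract`, Pillow, and the Tesseract binary; a PaddleOCR wrapper would have the same shape (image path in, text out), but its exact calls depend on the installed PaddleOCR version and are left out here.

```python
import time
from difflib import SequenceMatcher


def tesseract_engine(image_path):
    """Read text with Pytesseract (imports are local so the sketch runs without it)."""
    import pytesseract
    from PIL import Image
    return pytesseract.image_to_string(Image.open(image_path))


def compare_engines(image_path, engines):
    """Run every engine on the same image and record its text and run time."""
    results = {}
    for name, engine in engines.items():
        start = time.perf_counter()
        text = engine(image_path)
        results[name] = {
            "text": text.strip(),
            "seconds": time.perf_counter() - start,
        }
    return results


def agreement(text_a, text_b):
    """Similarity in [0, 1] between two transcriptions (1.0 means identical)."""
    return SequenceMatcher(None, text_a, text_b).ratio()


# Stand-in engines so the sketch runs without any OCR library installed.
fake_engines = {
    "engine_a": lambda path: "SUMMER SALE 50% OFF\n",
    "engine_b": lambda path: "SUMMER SALE 5O% OFF",
}
results = compare_engines("cropped_text_area.png", fake_engines)
score = agreement(results["engine_a"]["text"], results["engine_b"]["text"])
print(f"agreement: {score:.2f}")
```

With real engines, replacing `fake_engines` by something like `{"tesseract": tesseract_engine, ...}` would give one text and one timing per engine for each cropped text area.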

How to use:

Open the boxes.ipynb notebook, update the _IMAGE variable with the path of the image you are trying to process and run the full notebook.
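As a small illustration of that step, the `_IMAGE` value can be checked before running the rest of the notebook. The helper `check_image_path`, the list of accepted extensions, and the sample path are assumptions made for this sketch; only the `_IMAGE` variable name comes from the notebook.

```python
from pathlib import Path

# Assumed set of extensions; adjust to whatever the notebook actually reads.
SUPPORTED_SUFFIXES = {".jpg", ".jpeg", ".png"}


def check_image_path(path):
    """Return the path as a Path object, or raise if it cannot be an input image."""
    candidate = Path(path).expanduser()
    if candidate.suffix.lower() not in SUPPORTED_SUFFIXES:
        raise ValueError(f"unsupported image type: {candidate.suffix!r}")
    if not candidate.is_file():
        raise FileNotFoundError(f"no such image: {candidate}")
    return candidate


# Hypothetical path; point it at the image you want to process.
_IMAGE = "test_images/sample.jpg"

try:
    _IMAGE = str(check_image_path(_IMAGE))
except (FileNotFoundError, ValueError) as err:
    print(f"Fix _IMAGE before running the notebook: {err}")
```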

Examples:

See the "examples" folder.

*Along with:

an Instagram scraper: https://github.com/ScrPzz/IG_scraper
a tool that extracts information about the gender distribution and sentiment of the user base [TODO: paste IG_nlp link here]

