The scripts in this directory make it possible to test the accuracy of Tesseract for different languages.
Build the evaluation tools and install them in /usr/local/bin.
git clone https://github.com/Shreeshrii/ocr-evaluation-tools.git
cd ocr-evaluation-tools
sudo make install
Use the binaries from the tesseract/src/api and tesseract/src/training directories.
Download images and corresponding ground truth text for the language to be tested.
Each testset should contain only one kind of image (e.g. tif, png or jpg).
Each ground truth text file should have the same base filename as its image, with a txt extension.
As needed, modify the filenames and create the pages file for each testset.
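The pages file can be generated with a short script. The following is a minimal sketch, not part of the repository: make_pages is a hypothetical helper, and it assumes the pages file holds one base filename (without extension) per line. Check the setup scripts for the exact format they expect before relying on it.

```shell
# Hypothetical helper: write a "pages" file for one testset directory.
# Assumption: the pages file lists one image base name per line,
# without the extension.
make_pages() {
    dir="$1"   # testset directory
    ext="$2"   # image extension used by this testset, e.g. tif
    : > "$dir/pages"                        # start with an empty pages file
    for img in "$dir"/*."$ext"; do
        [ -e "$img" ] || continue           # no image matched the pattern
        base=$(basename "$img" ".$ext")
        if [ -f "$dir/$base.txt" ]; then
            printf '%s\n' "$base" >> "$dir/pages"
        else
            # Leave out images that have no matching ground truth file.
            echo "warning: no ground truth for $img" >&2
        fi
    done
}
```

For example, make_pages ./mytestset tif would list every tif image in ./mytestset that has a matching txt file (the directory name is only an illustration).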
The commands for testing Fraktur and Sanskrit are given below as examples.
bash frk_setup.sh
bash frk_test.sh
bash deva_setup.sh
bash deva_test.sh
If you just want to remove all lines which have 100% recognition, you can add an 'awk' command like this:
ocrevalutf8 wordacc ground.txt ocr.txt | awk '$3 != 100 {print $0}' > results.txt
or if you've already got a results file you want to change, you can do this:
awk '$3 != 100 {print $0}' results.txt > newresults.txt
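To see what the filter does without running an evaluation, it can be tried on a few made-up rows. This is only a sketch: drop_perfect is a hypothetical wrapper, and the sample rows are invented, assuming a Count / Missed / %Right / word column layout.

```shell
# Hypothetical wrapper around the awk filter: keep only the lines whose
# third column is not numerically equal to 100.
drop_perfect() {
    awk '$3 != 100'
}

# Invented sample rows (Count, Missed, %Right, word); not real output.
printf '%s\n' \
    '  12   0  100.00  the' \
    '   4   1   75.00  und' \
    '   2   2    0.00  das' | drop_perfect
```

Because awk compares the field numerically, a value printed as 100.00 is treated as 100, so the first row is dropped and the other two are kept. Lines with fewer than three columns, such as headings, also pass through unchanged.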
If you only want the last sections where things are broken down by word, you can add a sed command, like this:
ocrevalutf8 wordacc ground.txt ocr.txt | sed '/^ Count Missed %Right