Project: OCR (Optical Character Recognition)

Full Project Description

Term: Fall 2018

Team #5
Team members
- Chen, Jannie mc4398
- Chen, Sizhu sc4248
- Li, Yunfan yl3838
- Xu, Zhengyang zx2229
- Yu, Chenghao cy2475
Project summary: In this project, we built an OCR post-processing procedure to improve Tesseract OCR output. We studied the assigned papers D2 and C2, which cover the detection algorithm and the correction algorithm, respectively. For detection (D2), we used a 2-gram method to flag erroneous words and extracted the four words before and the four words after each flagged word as context. For correction (C2), we first collected correction candidates and ranked them by Damerau-Levenshtein distance in ascending order. We then defined six functions to compute feature scores for each candidate correction word. Finally, we fit an AdaBoost model that regresses the labels on the six feature scores, and the candidate with the highest predicted probability replaces the erroneous word. The evaluation covers both word-wise and character-wise measures. After post-processing, recall and precision both increase substantially, especially at the word level.
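
As an illustration of the candidate search step, here is a minimal sketch that ranks vocabulary words by Damerau-Levenshtein distance in ascending order. It is not the code in `lib/`: the function names `damerau_levenshtein` and `find_candidates`, the `max_distance` cutoff of 2, and the sample vocabulary are all assumptions made for this example, and the distance shown is the restricted (optimal string alignment) variant.

```python
def damerau_levenshtein(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i leading characters of a
    for j in range(cols):
        d[0][j] = j  # insert all j leading characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # Transposition of two adjacent characters counts as one edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def find_candidates(word, vocabulary, max_distance=2):
    """Return (distance, candidate) pairs sorted by ascending distance.

    max_distance is an assumed cutoff for this sketch, not a value
    taken from the project.
    """
    scored = ((damerau_levenshtein(word, v), v) for v in vocabulary)
    return sorted((dist, v) for dist, v in scored if dist <= max_distance)


if __name__ == "__main__":
    # Hypothetical vocabulary; in practice it would come from ground truth text.
    vocab = ["receive", "relieve", "recipe", "deceive", "reception"]
    for dist, candidate in find_candidates("recieve", vocab):
        print(dist, candidate)
```

In a pipeline like the one described above, each returned candidate would then be scored on the six features before the AdaBoost model picks the replacement.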

Contribution statement: (default)

Chen, Jannie: Studied and discussed paper D2. Wrote the ground truth dictionaries for each group. Responsible for the evaluation part. Discussed and wrote the word-wise evaluation functions. Applied the AdaBoost model in the correction part. Organized the GitHub repository and wrote the README file.
Chen, Sizhu: Studied and discussed paper D2. Cleaned the ground truth texts and Tesseract texts, and filtered out unusable txt pairs and lines of mismatched length. For the evaluation part, designed the character-wise evaluation methods and coded the entire part. Organized the GitHub folders and wrote the summary.
Li, Yunfan: Studied and discussed paper C2. Created the 3-gram and 5-gram candidate sets. Created relaxed-context candidate sets based on the ground truth text. Helped debug the AdaBoost regressions in the correction algorithm. Applied the AdaBoost model in the correction part.
Xu, Zhengyang: Studied and discussed paper D2. Reproduced the detection algorithm, including error detection using the 2-gram algorithm and feature extraction for the error correction part. Discussed and corrected the regression part of the correction algorithm.
Yu, Chenghao: Studied and discussed paper C2. Reproduced the correction algorithm in Python, including the candidate search and the six feature score functions. Applied the AdaBoost model in the correction part using Python. Prepared the presentation and made the slides.

All team members contributed equally in all stages of this project. All team members approve our work presented in this GitHub repository including this contributions statement.

Following suggestions by Rich FitzJohn (@richfitz), this folder is organized as follows.

proj/
├── lib/
├── data/
├── doc/
├── figs/
└── output/

Please see each subfolder for a README file.

TZstatsADS/Fall2018-Project4-sec1-grp5
