# coding: utf-8
# # Basic optical character recognition (OCR) using Tesseract with Python
#
#
# How can we do basic OCR on a PDF whose embedded text information is incomplete or erroneous? Here, a PDF of an article is converted to images and the Tesseract package is used for the OCR.
#
# Tesseract is a popular free OCR software: https://sourceforge.net/projects/tesseract-ocr/. With a Python wrapper it can be used to do OCR in Python.
# Installation of the additional Python packages (assuming you have the basic Anaconda packages installed):
#
# #### basic OCR: install tesseract and opencv
# conda install opencv
#
# conda install tesseract -c conda-forge
#
# pip install pytesseract
#
# #### convert pdf to img
# pip install pdf2image
#
# #### read PDF
# conda install tika
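#
# #### check the installation (optional)
# A minimal sketch, not part of the original notebook, for checking that the packages listed above can be found before running the rest of the script. The helper name `check_ocr_setup` is made up for this example. It only uses the standard library and assumes the import names cv2, pytesseract, pdf2image and tika, an executable called "tesseract" on the PATH, and the poppler tool "pdftoppm", which pdf2image relies on.
# In[ ]:
import importlib.util
import shutil

def check_ocr_setup(modules=("cv2", "pytesseract", "pdf2image", "tika")):
    """Return a dict mapping each requirement to True (found) or False."""
    # find_spec returns None for a top-level module that is not installed
    status = {name: importlib.util.find_spec(name) is not None for name in modules}
    # pytesseract only wraps the tesseract executable, so look for the binary too
    status["tesseract (binary)"] = shutil.which("tesseract") is not None
    # pdf2image calls the poppler command line tools to render the pages
    status["pdftoppm (binary)"] = shutil.which("pdftoppm") is not None
    return status

for requirement, found in check_ocr_setup().items():
    print(requirement, "ok" if found else "MISSING")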
# ## Example PDF file
#
# As an example we use an article by A. Einstein which can be downloaded here: https://zenodo.org/record/1601163#.XsUtA7tR2EI as a PDF file. The file contains the scanned image of the article and text information. However, the text information contains errors. We will see if we can do better using the picture only and doing OCR with Tesseract.
# ## Read badly scanned PDF.
#
# #### First let's have a look at the PDF example
# In[1]:
from pdf2image import convert_from_path
# In[2]:
# read PDF as image
pages = convert_from_path('article.pdf', 500)
# In[3]:
# show first page of pdf as image
pages[0]
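# The second argument of convert_from_path is the resolution in dpi. A PDF page is measured in points (72 per inch), so the size of the rendered image can be estimated in advance. The sketch below is an addition to the original notebook: the helper names `rendered_size` and `load_pages` are hypothetical, and the page size of 595 x 842 points (A4) is only an assumption for illustration, not the measured size of the example article. `first_page` and `last_page` are pdf2image keyword arguments that limit the conversion to a page range.
# In[ ]:
def rendered_size(width_pt, height_pt, dpi):
    """Estimate the pixel size of a page rendered at the given dpi."""
    # 72 points = 1 inch, so pixels = points / 72 * dpi
    return round(width_pt / 72 * dpi), round(height_pt / 72 * dpi)

def load_pages(pdf_path, dpi=500, first_page=None, last_page=None):
    """Convert a page range of a PDF into a list of PIL images."""
    # imported here so that rendered_size can be used without pdf2image
    from pdf2image import convert_from_path
    return convert_from_path(pdf_path, dpi=dpi, first_page=first_page, last_page=last_page)

# an A4 page (assumed size) at 500 dpi gives roughly 4132 x 5847 pixels
print(rendered_size(595, 842, 500))
# Usage, e.g.: first_only = load_pages('article.pdf', dpi=300, first_page=1, last_page=1)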
# ## For Comparison: read PDF using TIKA
#
# This part only illustrates why optical character recognition is interesting for this example. It can be skipped if you just want to know how to do the OCR. It requires the additional module tika for reading PDFs.
# In[4]:
# import module for PDF reading
from tika import parser
# In[5]:
parsedPDF = parser.from_file('article.pdf')
# In[6]:
pdftext = parsedPDF['content']
# Now we look at the result below.
# In[7]:
print(pdftext)
# As you can see, there are lots of mistakes! Please note that this is not an error in the TIKA module. TIKA just reads whatever is stored as text in the PDF and does not do any OCR. For good PDFs it works perfectly.
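# To put a rough number on "lots of mistakes", one can count how many tokens look like ordinary words. This is a crude heuristic added for illustration, not a method used in the original notebook; the function name `wordlike_ratio` is hypothetical. A token counts as word-like if, after stripping surrounding punctuation, it consists of letters only.
# In[ ]:
import string

def wordlike_ratio(sample):
    """Fraction of whitespace-separated tokens that consist of letters only."""
    tokens = [token.strip(string.punctuation) for token in sample.split()]
    tokens = [token for token in tokens if token]  # drop tokens that were pure punctuation
    if not tokens:
        return 0.0
    return sum(token.isalpha() for token in tokens) / len(tokens)

# str.isalpha is Unicode-aware, so German umlauts count as letters
print(wordlike_ratio("Die Grundlage der allgemeinen Relativitätstheorie"))
print(wordlike_ratio("D1e Gr|_|ndl4ge d3r a||gemeinen"))
# Usage, e.g.: wordlike_ratio(pdftext) for the text layer read by TIKA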
# # Here the OCR part starts:
#
# First import all required packages
# In[8]:
# general tools
import os
import numpy as np
# In[9]:
# tools for OCR
import cv2
import pytesseract
from PIL import Image
from pytesseract import image_to_string
# ## Recognize characters from stored image
#
# For this second part you need the PDF page as an image in JPG format. This can be done in various ways, e.g. using GIMP: https://www.gimp.org (Image handling and format conversion in Python will be explained in a separate part of the toolbox.)
#
# In this first example we have stored the first page of the PDF as "Einstein1916_01.jpg"
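#
# As an alternative to an external program, the images returned by convert_from_path are PIL images and can be written to disk with their save method. A minimal sketch, added here as an assumption about how such a file could be produced (the original notebook does not show this step); `page_filename` and `save_pages` are hypothetical helper names.
# In[ ]:
import os

def page_filename(stem, number, ext="jpg"):
    """Build a name such as 'Einstein1916_01.jpg' from a stem and a page number."""
    return "{}_{:02d}.{}".format(stem, number, ext)

def save_pages(pil_pages, stem, out_dir="./"):
    """Save each PIL image as a JPEG file and return the list of written paths."""
    paths = []
    for number, pil_page in enumerate(pil_pages, start=1):
        path = os.path.join(out_dir, page_filename(stem, number))
        pil_page.convert("RGB").save(path, "JPEG")  # make sure the data is RGB before writing JPEG
        paths.append(path)
    return paths

# Usage, e.g.: save_pages(pages, "Einstein1916")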
# In[10]:
# define image name
img_name = "Einstein1916_01.jpg"
# define path where image is stored
src_path = './'
# In[11]:
# read image from file
img = cv2.imread(os.path.join(src_path,img_name))
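# Note that cv2.imread does not raise an error for a missing or unreadable file; it returns None, and the problem only surfaces later in the OCR call. A small defensive sketch (the wrapper name `read_image` is hypothetical and not part of the original notebook):
# In[ ]:
import os

def read_image(path):
    """Read an image with OpenCV and fail early with a clear message."""
    if not os.path.isfile(path):
        raise FileNotFoundError("image file not found: " + path)
    import cv2  # imported here so the path check works even without OpenCV
    image = cv2.imread(path)
    if image is None:
        raise ValueError("file exists but could not be decoded as an image: " + path)
    return image

# Usage, e.g.: img = read_image(os.path.join(src_path, img_name))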
# In[12]:
# now we use tesseract for character recognition
extxt = pytesseract.image_to_string(img)
# In[13]:
# Print result
print(extxt)
# This is slightly better, but there are still a lot of mistakes. Can we do better?
# ### Select language for character recognition
#
# One reason for the bad quality is that language-specific characters (here: German) are not recognized correctly. We can change this by loading the correct language data.
# In[14]:
# tesseract character recognition with german language data
extxt = pytesseract.image_to_string(img, lang='deu')
# In[15]:
# Print result
print(extxt)
# Much better! The missing language data was the main problem. There are still some small errors, but the quality is good enough for text mining or stylometry.
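#
# The language code must match language data that is installed for Tesseract, otherwise Tesseract reports an error. Recent pytesseract versions offer get_languages() to list the installed codes, and several codes can be combined with "+", e.g. 'deu+eng'. The sketch below is an illustrative addition; `choose_lang` is a hypothetical helper and "eng" as fallback is an assumption.
# In[ ]:
def choose_lang(wanted, available, fallback="eng"):
    """Keep the wanted language codes that are installed, joined with '+'."""
    usable = [code for code in wanted if code in available]
    return "+".join(usable) if usable else fallback

print(choose_lang(["deu", "eng"], ["eng", "deu", "osd"]))
# Usage, e.g.:
# installed = pytesseract.get_languages(config='')
# extxt = pytesseract.image_to_string(img, lang=choose_lang(["deu"], installed))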
# ## OCR for entire PDF + additional improvements
#
# Now we read all the pages (in this case just two) from the PDF. The OCR is slightly improved by adding some image conversion steps (this is not so important here, but may be useful in case of bad image quality). In the end we assemble the text from the PDF into one string, which could then be used e.g. to search for specific expressions.
# In[16]:
# first we read all the pdf pages (see above)
pages = convert_from_path('article.pdf', 500)
# In[17]:
# now we do image conversion and OCR for all pages
textpages = []
for page in pages:
    # pdf2image returns PIL images in RGB order; OpenCV expects BGR
    open_cv_image = np.array(page)
    open_cv_image = open_cv_image[:, :, ::-1].copy()
    # Convert to gray
    img = cv2.cvtColor(open_cv_image, cv2.COLOR_BGR2GRAY)
    # Apply dilation and erosion to remove some noise
    # (a 1x1 kernel leaves the image unchanged; try a larger one, e.g. (2, 2), for noisy scans)
    kernel = np.ones((1, 1), np.uint8)
    img = cv2.dilate(img, kernel, iterations=1)
    img = cv2.erode(img, kernel, iterations=1)
    # OCR, again using the German language set
    extxt = pytesseract.image_to_string(img, lang='deu')
    # Store text
    textpages.append(extxt)
# In[18]:
# Assemble document text (works for any number of pages)
text = ''.join(textpages)
# In[19]:
# Print document text
print(text)
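# The assembled string can now be searched for specific expressions, as mentioned above. A minimal sketch using the standard re module; the helper name `find_in_context` and the search term in the usage lines are only examples, not taken from the original notebook.
# In[ ]:
import re

def find_in_context(document, expression, width=30):
    """Return (position, snippet) for every case-insensitive match of expression."""
    hits = []
    for match in re.finditer(re.escape(expression), document, flags=re.IGNORECASE):
        start = max(match.start() - width, 0)
        end = min(match.end() + width, len(document))
        # collapse line breaks so that each snippet fits on one line
        snippet = " ".join(document[start:end].split())
        hits.append((match.start(), snippet))
    return hits

# Usage, e.g.:
# for position, snippet in find_in_context(text, "Relativität"):
#     print(position, snippet)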