Recognition of English Handwritten and Typed Text from Images Using Tesseract on the Android Platform
Abstract
With the advent of digitalization in most spheres of human pursuit, conversion of
digitized text from images has gained considerable momentum over the years, even
though the concept of image character recognition dates back to the period before the
invention of the computer. This paper presents the experimental workflow of
recognizing text from images using Google’s open source Optical Character
Recognition (OCR) engine Tesseract. Here, Tesseract has been trained to recognize
handwritten and typed text in English script and to produce outputs with various
observed levels of accuracy. The method is supported by an additional step of image
quality augmentation prior to text extraction. The paper is predominantly aimed at
building a resourceful Android application that enables the user to digitize text from
images even on a small screen. The experimental analysis shows an accuracy of up to
93% for handwritten text and 98% for typed characters, which is an attempt at
advancing over existing methods. Apart from image-to-text conversion, the application
also includes a text-to-speech feature, which makes it useful for visually impaired
users.
1. Introduction
Imitation of human functions with the aid of machines has long been a shared vision of
mankind. Researchers have, from time to time, turned to artificial intelligence and
machine learning to develop complex software and applications that substitute machine
operation for human labor. Over the years, machine reading has evolved from dream to
reality through the advancement of sophisticated and robust Optical Character Recognition
(OCR) systems [1]. OCR involves alphanumeric and character pattern recognition and
the subsequent mechanical or electronic conversion of such patterns into editable and
searchable data.
The most primitive use of OCR dates back to 1870 with the invention of the retina
scanner by Carey [2-3]. Extensive development and use of OCR technologies began in the
1960s and 1970s, with the creation of simplified fonts such as OCR-7B that were easier to
convert to digitally readable text and are still in vogue as the font imprinted on credit
and debit cards [4]. Gradually, OCR technology was commercialized universally
in postal services to greatly speed up mail sorting. The most widely used commercial OCR
systems of this period include the IBM 1418, designed to identify the IBM 407 font, as well
as OCR-A and OCR-B [5-7]. By the year 2000, OCR had succeeded in deciphering even
inferior-quality printed and handwritten texts to a certain extent. OCR technology was by
then also being used to develop CAPTCHA programs to thwart bots and spammers [8-9].
As one can observe, over the decades OCR has grown more accurate and sophisticated
owing to progress in related technology areas such as artificial intelligence, machine
learning, and computer vision [10-13]. Earlier works were executed on the platforms of
neural networks, image histograms, clustering, and block segmentation, to name a few.
Nowadays, OCR software makes use of techniques such as feature detection, pattern
recognition, and text mining to convert manuscripts faster and more accurately than ever
before [14-18].
The application of OCR for extracting text from images in various languages, apart
from English, has considerably subsided with time, as numerous works accomplished in
this area have already resulted in software of considerable quality and quantity.
The development of such software paved the way for a system that ports the concept of
OCR to Android mobile applications, making it easier for the general public to use it in
day-to-day work. This paper primarily aims at building such an application so as to make
optimum use of OCR through mobile devices, taking into consideration the widespread
availability of smartphones in today’s digital age. The paper draws on Tesseract, the same
open source OCR engine that Google uses for language recognition and image translation.
Tesseract began as a PhD research project at Hewlett-Packard Laboratories, Bristol, and
was developed between 1985 and 1994 [19]. It was rated one of the top three engines in the
1995 UNLV Accuracy Test as ‘HP Labs OCR’. With partial modifications, Tesseract was
released as open source by HP in 2005 and was re-released to the open source community
by Google in 2006 [20]. The most recent (LSTM-based) stable version, 4.1.1, released on
December 26, 2019, is available on GitHub. Tesseract provides Unicode (UTF-8) support
and can recognize more than 100 languages. It can generate output in various formats such
as invisible-text-only PDF, plain text, hOCR (HTML), and TSV.
The paper attempts to go a step beyond the existing image-to-text extraction feature by
training the Tesseract OCR engine to analyze and decipher digitized handwriting in
addition to typed characters. The focus is primarily on the precise conversion of images of
English handwriting into text, followed by vocal translation of the generated output for the
assistance of the visually impaired. The text conversion methodology integrates image
enhancement through realignment, rescaling, resizing, and noise removal applied to the
given image by incorporating suitable algorithms. Moreover, training Tesseract with a
self-adapted dataset of handwritten characters plays a significant role in the final output.
The paper is an attempt at achieving noteworthy improvement in the generated output by
embracing these additional features within its ambit.
Each step of the Android application development proposed in this paper is discussed at
length in the forthcoming sections. The manner in which Tesseract has been trained
specifically for this work, which involves detection of English typed and handwritten text,
is elaborated in Section 2.1. The comprehensive methodology for obtaining text and
speech output from an image input is demonstrated through flowcharts, algorithms, and
illustrations in Section 2.2. Sections 3 and 4 analyze the overall work, weigh the
possibilities of superior output, and conclude by discussing its limitations and future scope.
Since the box file is generated taking into consideration the nature of English
typographic characters, it largely fails to generate suitable box information for English
handwritten characters, thereby making editing inevitable. Editing the box file calls for an
editor that recognizes UTF-8 encoding, such as Notepad++ or Gedit; in this paper, Gedit is
used for editing purposes. In instances where a single character of the training image is
split into two lines within the box file, the bounding boxes are merged manually.
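For reference, each line of a Tesseract 3.x box file describes one glyph as the character followed by the left, bottom, right, and top coordinates of its bounding box (measured from the bottom-left corner of the image) and the page number. The lines below are a purely illustrative example, not taken from the actual training set, of how a split character is merged by widening the coordinates and deleting the duplicate line:

    Before merging:           After merging:
    m 120 64 138 92 0         m 120 64 157 92 0
    m 139 64 157 92 0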
Upon completion of box file generation for the corresponding training images,
Tesseract is run in training mode with the command $tesseract eng.exp2.tif eng.exp2
nobatch box.train, which in turn produces two files: eng.exp2.tr, containing the features of
each character of the training page, and eng.exp2.txt.
Next, to create a character set file, Tesseract’s unicharset file, which contains
information on each symbol, is used. To generate the unicharset data file, the program
unicharset_extractor is run on the already generated box files using the following
command: $unicharset_extractor <box file>
Example: $unicharset_extractor eng.exp0.box
Each line of this character set file corresponds to one character in UTF-8 format,
followed by a hexadecimal numeral embodying a binary mask that encodes its properties.
Once the character features of all training pages have been extracted, the character
shape features are clustered to create prototypes. Here, the clustering is done through two
programs, namely mftraining and cntraining. The mftraining program is run using the
command: $mftraining <filename.tr> -F font_properties -U unicharset
Example: $mftraining eng.segoescriptb.exp2.tr -F font_properties -U unicharset
where -U specifies the unicharset previously generated through unicharset_extractor and
-F includes the font_properties file.
This will again output two data files: pffmtable and inttemp. pffmtable includes the
count of expected features for each character, and inttemp contains the shape prototypes,
which cannot be opened directly. A third file called Microfeat is also created by this
program but is not used further in this paper.
Next, the cntraining program is run using the following command:
$cntraining <filename.tr>
Example: $cntraining eng.segoescriptb.exp0.tr
This command produces the normproto data file, which holds the character
normalization training data for Tesseract, as shown below.
a1
significant elliptical 1
0.367188 0.403906 0.230469 0.242188
0.000400 0.000400 0.000400 0.000400
b1
significant elliptical 1
0.328125 0.326562 0.234375 0.175781
0.000400 0.000400 0.000400 0.000400
c1
significant elliptical 1
0.347656 0.386719 0.253906 0.195312
0.000400 0.000400 0.000400 0.000400
Tesseract uses as many as eight dictionary files for a language. Here, three files, namely
frequent_word_list, words_list, and user_char, are used for reference. Among the three,
one is a simple UTF-8 text file, while the others are coded as a Directed Acyclic
Word Graph (DAWG). To create the DAWG files, we first need a word list of the English
language formatted as a UTF-8 text file with one word per line. This word list is split into
two sets, namely frequent_word_list and words_list.
The resultant files are converted into their corresponding DAWG files using the
command: $wordlist2dawg <dictionary file> <dawg file>
Example: $wordlist2dawg words_list word-dawg
By this point the training procedure is nearly complete. The remaining task is to rename
the files with the desired language code and then fuse the multiple files generated during
the steps described in the earlier sections.
At the outset, the files are renamed by prefixing each file name with a three-letter
language code (lang.). Here, for English, lang.=“eng.” has been affixed to the file names
as listed in the two-column table (Figure 2. Modification of file names). Combining the
renamed files (typically via Tesseract’s combine_tessdata utility) results in an output file,
eng.traineddata, which is copied to the tessdata directory (usually
/usr/local/share/tessdata). This is the concluding stage of the training process, and
Tesseract should now be able to identify and distinguish any image file comprising basic
characters of the English script.
2.2. Proposed Method
This work is built on the Tess-Two library and tested on Android version 6.0 and
higher. Once the engine is ready to identify text from scripts, the following functional
methodology is adopted for producing the visual and audio output.
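As a point of reference, the snippet below is a minimal sketch (not the authors’ exact code) of how the trained eng.traineddata file can be loaded through the Tess-Two wrapper; dataPath and bitmap are assumed to be the directory containing the tessdata folder and the pre-processed image, respectively.

    import android.graphics.Bitmap;
    import com.googlecode.tesseract.android.TessBaseAPI;

    public class OcrHelper {
        // dataPath must contain a "tessdata" sub-folder holding eng.traineddata.
        public static String recognize(String dataPath, Bitmap bitmap) {
            TessBaseAPI tess = new TessBaseAPI();
            tess.init(dataPath, "eng");       // load the trained English data
            tess.setImage(bitmap);            // hand over the pre-processed bitmap
            String text = tess.getUTF8Text(); // run recognition and fetch the result
            tess.end();                       // release native resources
            return text;
        }
    }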
Since each Android application runs solely within its own restricted sandbox, the
application has to request appropriate permissions in order to access resources or
information beyond that sandbox. Permission declarations are therefore registered in the
app manifest, and the user’s consent is sought at runtime.
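A minimal sketch of such a runtime check for the camera and storage permissions is given below; the request code is an arbitrary illustrative value, and the corresponding <uses-permission> entries are assumed to be declared in the manifest.

    import android.Manifest;
    import android.content.pm.PackageManager;
    import androidx.appcompat.app.AppCompatActivity;
    import androidx.core.app.ActivityCompat;
    import androidx.core.content.ContextCompat;

    public class CaptureActivity extends AppCompatActivity {
        private static final int PERMISSION_REQUEST_CODE = 101; // illustrative value

        // Ask for camera and storage access if the user has not granted them yet.
        private void ensurePermissions() {
            if (ContextCompat.checkSelfPermission(this, Manifest.permission.CAMERA)
                    != PackageManager.PERMISSION_GRANTED) {
                ActivityCompat.requestPermissions(this,
                        new String[]{Manifest.permission.CAMERA,
                                     Manifest.permission.WRITE_EXTERNAL_STORAGE},
                        PERMISSION_REQUEST_CODE);
            }
        }
    }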
The image is then captured with the camera’s auto-focus enabled and a frame rate of
2.0 frames per second. The captured image is saved in the phone’s internal storage in .JPG
format, and the storage path thus created is preserved for future reference and
accessibility. There is also provision for the user to select an image already saved in the
phone gallery instead of capturing it with the app camera. The workflow of the proposed
method is described in (Figure 3. Workflow).
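The two acquisition paths can be sketched with standard Android intents as follows; the request codes are illustrative, and the resulting bitmap or storage path is received in onActivityResult() and retained for the later stages.

    import android.content.Intent;
    import android.provider.MediaStore;
    import androidx.appcompat.app.AppCompatActivity;

    public class ImageSourceActivity extends AppCompatActivity {
        // Illustrative request codes used to tell the two sources apart in onActivityResult().
        private static final int REQUEST_CAMERA = 1;
        private static final int REQUEST_GALLERY = 2;

        // Capture a new photo with the device camera.
        private void openCamera() {
            Intent capture = new Intent(MediaStore.ACTION_IMAGE_CAPTURE);
            startActivityForResult(capture, REQUEST_CAMERA);
        }

        // Let the user pick an existing image from the phone gallery (local storage).
        private void openGallery() {
            Intent pick = new Intent(Intent.ACTION_PICK,
                    MediaStore.Images.Media.EXTERNAL_CONTENT_URI);
            startActivityForResult(pick, REQUEST_GALLERY);
        }
    }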
Figure 3. Workflow of the proposed method: permission check, image acquisition
(camera or local storage), activity request-code and validity check, test data
preparation, OCR and text extraction, text output, and optional voice conversion with
modulated speed and pitch.
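The final stage of the workflow, voice conversion with adjustable speed and pitch, can be realized with Android’s TextToSpeech API. The snippet below is a hedged sketch, with rate and pitch values that are only illustrative defaults rather than the values used in the application.

    import android.content.Context;
    import android.speech.tts.TextToSpeech;
    import java.util.Locale;

    public class SpeechOutput {
        private TextToSpeech tts;

        // Speak the recognized text with user-chosen rate and pitch.
        public void speak(Context context, final String recognizedText) {
            tts = new TextToSpeech(context, new TextToSpeech.OnInitListener() {
                @Override
                public void onInit(int status) {
                    if (status == TextToSpeech.SUCCESS) {
                        tts.setLanguage(Locale.ENGLISH);
                        tts.setSpeechRate(0.9f); // illustrative speed value
                        tts.setPitch(1.1f);      // illustrative pitch value
                        tts.speak(recognizedText, TextToSpeech.QUEUE_FLUSH, null, "ocrOutput");
                    }
                }
            });
        }
    }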
These images may be resized as per the user’s convenience by dynamic cropping,
removing irrelevant regions while retaining the primary text region. The application also
allows the user to adjust skewed images by rotating or straightening the captured image so
that the baselines of the text lie parallel to the horizontal axis of the image
(Figure 4. Cropping).
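On Android, this cropping and deskewing step can be sketched with the standard Bitmap and Matrix classes; the crop rectangle and rotation angle below stand in for whatever the user selects on screen and are not the authors’ exact code.

    import android.graphics.Bitmap;
    import android.graphics.Matrix;

    public final class CropUtil {
        // Crop the user-selected rectangle and rotate it so the text baselines
        // become parallel to the horizontal axis of the image.
        public static Bitmap cropAndDeskew(Bitmap source, int left, int top,
                                           int width, int height, float skewDegrees) {
            Bitmap cropped = Bitmap.createBitmap(source, left, top, width, height);
            Matrix matrix = new Matrix();
            matrix.postRotate(skewDegrees);
            return Bitmap.createBitmap(cropped, 0, 0,
                    cropped.getWidth(), cropped.getHeight(), matrix, true);
        }
    }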
This image, however, may be distorted owing to various degrading factors such as
thermal noise, poor lighting, dust particles, temperature shifts, and other environmental
aspects. Such distortion of image quality may adversely affect the OCR engine’s detection
capability and give rise to errors. To overcome this setback, the image in question
undergoes a pre-processing stage of rescaling, contrast enhancement, and noise removal
for superior visual interpretation. This paper employs Algorithm 2 for the desired
augmentation.
Algorithm 2: Pre-processing of Input Image
1: procedure MYPROCEDURE
// Rescaling
2: input: cropped image
3: if (image resolution ≥ 300 dpi) then
4: retain the image specification
5: else
6: convert the image into grayscale and save it in a byte array
7: setDpi(): byte-wise operation to raise the image resolution to 300 dpi
8: end if
9: output: image of 300 dpi or more
// Image Enhancement
10: input: image of 300 dpi or more
11: let x(i, j) be the intensity of the M×N input image at position (i, j),
    where (i, j) ∈ {1, 2, ..., M} × {1, 2, ..., N}
12: if (x(i, j) < 160) then
13: setPixel = 0 // close to black, so convert it into black
14: else
15: setPixel = 255
16: end if
17: output: enhanced image
// Noise Removal
18: input: enhanced image
19: generate a zero-valued flag image F of the same size M×N, where f(i, j) is the flag value
    at location (i, j)
20: loop over all pixels x(i, j):
21: consider a 5×5 window centered at x(i, j), i.e. rows i-2 to i+2 and columns j-2 to j+2,
    and construct the sets
    S1 = { x(i, j) : 0 ≤ x(i, j) ≤ 160 }
    S2 = { x(i, j) : 161 ≤ x(i, j) ≤ 255 }
22: if (n(S1) > n(S2)) then N1 = 0 else N2 = 1 end if
23: consider a 3×3 window centered at the same x(i, j), i.e. rows i-1 to i+1 and columns
    j-1 to j+1, and construct the sets
    S3 = { x(i, j) : 0 ≤ x(i, j) ≤ 160 }
    S4 = { x(i, j) : 161 ≤ x(i, j) ≤ 255 }
24: if (n(S3) > n(S4)) then N3 = 0 else N4 = 1 end if
25: if ((N1 == N3 && x(i, j) == 0) || (N1 == N3 && x(i, j) == 255)) then f(i, j) = 0 end if
26: if ((N2 == N4 && x(i, j) == 255) || (N2 == N4 && x(i, j) == 0)) then f(i, j) = 1 end if
27: end loop
28: loop over all pixels:
29: if (f(i, j) == 0) then x(i, j) = 0 else x(i, j) = 255 end if
30: end loop
31: output: enhanced and noise-free image
For better quality output it is suggested that the resolution of the cropped image be at
least 300 dpi. If the detected resolution is less than this, the image is rescaled; a case in
point is presented in (Figure 5. 300dpi). Further, the contrast of the image is adjusted to
transform the grayscale image into black and white: parts of the image with pixel values
below 160 are set to 0 and rendered black, while the remaining pixel values are set to 255.
Figure 5. (a) Input image with resolution below 300 dpi; (b) rescaled image at 300 dpi
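As an illustrative sketch (not the authors’ exact implementation), the grayscale conversion and the 160-intensity threshold described above can be written for an Android Bitmap as follows:

    import android.graphics.Bitmap;
    import android.graphics.Color;

    public final class Binarizer {
        // Convert the image to grayscale and map intensities below 160 to black and all
        // remaining intensities to white, as in the enhancement step of Algorithm 2.
        public static Bitmap binarize(Bitmap src) {
            Bitmap out = src.copy(Bitmap.Config.ARGB_8888, true);
            for (int y = 0; y < out.getHeight(); y++) {
                for (int x = 0; x < out.getWidth(); x++) {
                    int p = out.getPixel(x, y);
                    int gray = (int) (0.299 * Color.red(p)
                                    + 0.587 * Color.green(p)
                                    + 0.114 * Color.blue(p));
                    out.setPixel(x, y, gray < 160 ? Color.BLACK : Color.WHITE);
                }
            }
            return out;
        }
    }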
Lastly, the image is filtered to remove noise that may have crept in owing to the factors
mentioned above. The noise removal technique described in the algorithm above enables
segregation of noise from the actual text matter. Samples of the input image and the
filtered output image are given in (Figure 6. Image enhancement). Tesseract is better able
to distinguish characters from the rest of the content when the image is of superior quality,
which largely determines the accuracy level of the engine.
Figure 6. (a) Captured image; (b) intensified, noise-free image
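The windowed voting of the noise-removal step can be sketched as follows. This is a simplified reading of Algorithm 2, with a hypothetical helper name (darkMajority): each pixel is set to the neighbourhood majority value only when the 5×5 and 3×3 windows agree.

    import android.graphics.Bitmap;
    import android.graphics.Color;

    public final class NoiseFilter {
        // Simplified sketch of the noise-removal step: the 5x5 and 3x3 windows around
        // each pixel vote on whether the neighbourhood is predominantly dark; when the
        // two votes agree, the pixel is set to that majority value.
        public static Bitmap denoise(Bitmap bin) {
            int w = bin.getWidth(), h = bin.getHeight();
            Bitmap out = bin.copy(Bitmap.Config.ARGB_8888, true);
            for (int y = 2; y < h - 2; y++) {
                for (int x = 2; x < w - 2; x++) {
                    boolean dark5 = darkMajority(bin, x, y, 2); // 5x5 window
                    boolean dark3 = darkMajority(bin, x, y, 1); // 3x3 window
                    if (dark5 == dark3) {
                        out.setPixel(x, y, dark5 ? Color.BLACK : Color.WHITE);
                    }
                }
            }
            return out;
        }

        // True when pixels with intensity <= 160 form the majority of the
        // (2*radius+1) x (2*radius+1) window centred at (cx, cy).
        private static boolean darkMajority(Bitmap img, int cx, int cy, int radius) {
            int dark = 0, bright = 0;
            for (int dy = -radius; dy <= radius; dy++) {
                for (int dx = -radius; dx <= radius; dx++) {
                    int gray = Color.red(img.getPixel(cx + dx, cy + dy)); // binarized: R = G = B
                    if (gray <= 160) dark++; else bright++;
                }
            }
            return dark > bright;
        }
    }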
Figure 7. (a) Input image; (b), (c) Google Translate output; (d), (e) proposed
application visual and audio output
Figure 8. (a) Input image; (b) Google Translate output; (c) proposed application
visual and audio output
The second illustration, on typed characters embedded in an image (Figure 8. Typed
characters with image), as summarized in (Table 2. Comparison of Figure 8), shows a
negligible percentage of variation, with the proposed application scoring slightly above
Google Translate. This may be because the latter could not identify the correct words
owing to the meandering character of the input image.
Figure 9. (a) Input image; (b) Google Translate output; (c) proposed application
visual and audio output
Figure 10. (a) Input image; (b) Google Translate output; (c) proposed application
visual and audio output
Based on another illustration of handwritten text (Figure 10. Handwritten texts sample
two) and its results in (Table 4. Comparison of Figure 10), it is established that the
proposed application yields better output on handwritten scripts than Google Translate.
The differential percentage, even in this case, is above 30%, with Google Translate
generating no more than 60.63% accuracy.
Table 4. Comparison of results of Figure 10

Application              No. of Words    No. of Errors    Accuracy
Google Translate         30              12               60.63%
Proposed Application     30              2                93.34%
4. Conclusion
The paper presents an elaborate overview of the process of text digitization, laying
emphasis on English handwritten text over typographic text. Handwritten text being the
focal area, this work had to deal with a practically unbounded sample set with diverse
characteristics. To optimize the data input, a personalized data set for handwritten text was
constructed and developed. Google’s open source OCR engine Tesseract was taken
through the different phases of training so as to identify words and characters from these
chosen data sets. Apart from the customary Tesseract training process, this paper
introduces certain distinctive attributes of its own, such as image augmentation for better
quality input, image resizing for selecting the preferred section, and audio output whose
speed and pitch can be controlled as desired. Its ability to run on the Android platform also
promotes its applicability and accessibility to a wider range of users. While the resultant
output of the overall process illustrates considerable progress in the field of handwritten
manuscript detection, there remains much room for achieving better accuracy. Potential
research directions in this area include incorporating advanced machine learning methods
for acquiring maximum accuracy, as well as digital translation of regional and multilingual
scripts in both typed and handwritten forms.
References
[1] Mantas, J. An Overview of Character Recognition Methodologies. Pattern Recognition. 1986, 19, 425–430.
[2] Fragoso, V.; Gauglitz, S.; Zamora, S.; Kleban, J.; Turk, M. TranslatAR: A Mobile Augmented Reality Translator. IEEE Workshop on Applications of Computer Vision (WACV). 2011, 497–502.
[3] Nagy, G. At the Frontiers of OCR. Proceedings of the IEEE. 1992, 80, 1093–1100.
[4] Shinde, A.A. Text Pre-processing and Text Segmentation for OCR. International Journal of Computer Science Engineering and Technology. 810–812.
[5] Pal, U.; Roy, R.K.; Kimura, F. Multi-lingual City Name Recognition for Indian Postal Automation. International Conference on Frontiers in Handwriting Recognition. 2012.
[6] Jiao, J.; Ye, Q.; Huang, Q. A Configurable Method for Multi-Style License Plate Recognition. Pattern Recognition. 2009, 42(3), 358–369.
[7] Graef, R.; Morsy, M.M.N. A Novel Hybrid Optical Character Recognition Approach for Digitizing Text in Forms. Extending the Boundaries of Design Science Theory and Practice. 2019, 11491, 206–220.
[8] Marosi, I. Industrial OCR Approaches: Architecture, Algorithms and Adaptation Techniques. Document Recognition and Retrieval XIV, SPIE. 2007, 6500-01.
[9] Smith, R. An Overview of the Tesseract OCR Engine. International Conference on Document Analysis and Recognition. 2007.
[10] Rice, S.V.; Nagy, G.; Nartker, T.A. Optical Character Recognition: An Illustrated Guide to the Frontier. Kluwer Academic Publishers, USA. 1999, 57–60.
[11] Rehman, A.; Naz, S.; Razzak, M.I. Writer Identification Using Machine Learning Approaches: A Comprehensive Review. Multimedia Tools and Applications. 2018, 78, 10889–10931.
[12] Lu, S.; Liu, L.; Lu, Y.; Wang, P.S.P. Cost-Sensitive Neural Network Classifiers for Postcode Recognition. International Journal of Pattern Recognition and Artificial Intelligence. 2012, 26, 1–14.
[13] Gheorghita, S.; Munteanu, R.; Graur, A. An Effect of Noise in Printed Character Recognition System Using Neural Network. Advances in Electrical and Computer Engineering. 2013, 13, 65–68.
[14] Unnikrishnan, R.; Smith, R. Combined Script and Page Orientation Estimation Using the Tesseract OCR Engine. Proceedings of the International Workshop on Multilingual OCR, ACM. 2009, p. 6.
[15] Huang, W.; He, D.; Yang, X.; Zhou, Z.; Kifer, D.; Giles, C.L. Detecting Arbitrary Oriented Text in the Wild with a Visual Attention Model. Proceedings of the 24th ACM International Conference on Multimedia. 2016, 551–555.
[16] Singh, S.; Sharma, A.; Chhabra, I. A Dominant Points-Based Feature Extraction Approach to Recognize Online Handwritten Strokes. International Journal on Document Analysis and Recognition. 2017, 20, 37–58.
[17] Liang, J.; DeMenthon, D.; Doermann, D. Geometric Rectification of Camera-Captured Document Images. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2008, 30(4), 591–605.
[18] Pawar, N.; Shaikh, Z.; Shinde, P.; Warke, Y. Image to Text Conversion Using Tesseract. International Research Journal of Engineering and Technology. 2019, 6(2), 516–519.
[19] Smith, R.; Antonova, D.; Lee, D.S. Adapting the Tesseract Open Source OCR Engine for Multilingual OCR. Proceedings of the International Workshop on Multilingual OCR, ACM. 2009, p. 1.
[20] Otsu, N. A Threshold Selection Method from Gray-Level Histograms. IEEE Transactions on Systems, Man and Cybernetics. 1975, 9, 62–66.