Report For OCR Project
A PROJECT REPORT
Submitted by
JAYASURYA.K - 822320104007
SARANYA.M - 822320104309
PAVITHRA.G - 822320104009
JAYASRI.S - 822320104006
of
BACHELOR OF ENGINEERING
IN
MAY 2023
ANNA UNIVERSITY : CHENNAI 600 025
BONAFIDE CERTIFICATE
Certified that this project report “Text Analysis and Information Retrieval of
Historical Tamil Ancient Documents Using Machine Translation in Image
Zoning” is the bonafide work of K.JAYASURYA, G.PAVITHRA,
M.SARANYA, S.JAYASRI, who carried out the project work under my
supervision.
ACKNOWLEDGEMENT
First of all, we thank God for His blessings and grace on every success in our lives, and on this project as well.
The success of any work lies in the involvement and commitment of its makers, and this project is no exception. At this juncture, we would like to acknowledge the many minds that made this project of ours a reality.
TABLE OF CONTENTS
List of Figures
Fig 4: Software process
Fig 5: Recognized sample image
Fig 6: Recognized paragraph
Fig 7: Consonants and Vowels
Fig 8: Brahmi characters
List of Abbreviations
ABSTRACT
Ancient Tamil documents can be preserved by digitalizing them. The aim of this project is to take a handwritten set of Tamil characters as input, in the form of an image, process the characters, and train a neural network; the network then attempts to determine whether the input matches a pattern it has memorized. Optical Character Recognition deals with the recognition and classification of such characters. We have developed a CNN model from scratch by training it on Tamil characters in offline mode, and we have achieved good recognition accuracy.
CHAPTER 1
INTRODUCTION
1.1 INTRODUCTION
Although research work has been carried out in the recent past to perfect character recognition for some ancient Tamil scripts, 100% accurate results have still not been obtained.
This OCR scanner consists of a bottom plate affixed with an LED strip panel, which is then covered with a clear glass slab that acts both as the surface for placing the image and as a clear pathway for illumination.
The support rods on either side of the box serve two purposes. First, their ridges are spaced to hold a wooden slab on which the webcam is mounted. Second, this slab can be adjusted to increase or decrease the level of zoom required to scan the image clearly.
The final component that brings this scanner to life is the webcam attached to the slab; it captures the image placed on the glass panel and transfers it to the laptop for training the neural network. A Bluetooth speaker connected to the laptop is also used to produce the audio output after the text-to-speech conversion.
CHAPTER 2
SOFTWARE PROCESS
Dark pixels are stored as 1s and light pixels as 0s using the image zoning technique, as sketched below.
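As a minimal sketch of this binarization step (assuming the OpenCV and NumPy libraries, which the report does not name, and an illustrative file name), dark pixels can be mapped to 1s and light pixels to 0s as follows:

    import cv2
    import numpy as np

    # Load the scanned page in greyscale (the file name is illustrative).
    img = cv2.imread("page.png", cv2.IMREAD_GRAYSCALE)

    # Otsu's method picks a threshold separating dark ink from the light background.
    _, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

    # Dark (ink) pixels become 1 and light (background) pixels become 0.
    zones = (binary // 255).astype(np.uint8)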
The binarized image obtained is then sliced into letter blocks, each containing an ancient Tamil character.
Next, this set of cropped images is fed to the convolutional neural network trained for image classification and recognition.
After a character is matched, it is converted to the equivalent modern digital Tamil text using Unicode.
Transfer Learning [15], along with data augmentation, is used to train the CNN to classify Tamil letters using Keras and TensorFlow.
The advantage of using Transfer Learning comes from the fact that pre-trained
networks are adept at identifying lower level features such as edges, lines, and
curves.
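A minimal sketch of such a training setup is given below. The report does not name the pre-trained network, so the MobileNetV2 base, input size, class count, augmentation parameters, and directory layout here are illustrative assumptions:

    import tensorflow as tf

    # Pre-trained base network; its early layers already detect edges, lines, and curves.
    base = tf.keras.applications.MobileNetV2(
        input_shape=(96, 96, 3), include_top=False, weights="imagenet", pooling="avg")
    base.trainable = False  # keep the pre-trained features frozen

    num_classes = 247  # illustrative count of Tamil character classes
    model = tf.keras.Sequential([
        base,
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])

    # Data augmentation enlarges the limited character data set.
    datagen = tf.keras.preprocessing.image.ImageDataGenerator(
        rescale=1.0 / 255, rotation_range=5, zoom_range=0.1)
    train = datagen.flow_from_directory(
        "tamil_chars/", target_size=(96, 96), batch_size=32)  # path is illustrative
    model.fit(train, epochs=10)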
This process is known as image encoding and is the primary step in training the neural network to recognize the images. The encoded image is passed through a convolution layer of dimensions 28 × 28 × 1, where 28 represents the size of the image and 1 stands for the number of channels, the image here being binary.
These layers convolve around the image to detect edges, lines, blobs of color, and other visual elements.
Keeping optimizer techniques in mind, the parameters were updated after every iteration.
This dimensioning of the CNN was arrived at through trial and error, keeping in mind the convergence speed as well as the required accuracy.
Next, the pooling layers reduce the dimensionality of the images by removing a certain number of pixels from the image.
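The convolution and pooling stages described above can be sketched as a small Keras model. The 28 × 28 × 1 input follows the report, while the filter counts and layer depths are illustrative, since the exact dimensioning found by trial and error is not stated:

    import tensorflow as tf
    from tensorflow.keras import layers

    num_classes = 247  # illustrative count of character classes

    # 28 x 28 x 1 input: 28 is the image size and 1 the channel count (binary image).
    model = tf.keras.Sequential([
        layers.Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)),
        layers.MaxPooling2D((2, 2)),   # pooling reduces the spatial dimensions
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dense(num_classes, activation="softmax"),
    ])

    # The optimizer updates the parameters after every iteration (mini-batch).
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])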
Based on the Euclidean distance principle [21], the letter block in the image is matched to its closest font family.
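This nearest-match step can be sketched as follows, assuming each letter block and each stored font-family template are flattened into equal-length feature vectors (the report does not specify the representation):

    import numpy as np

    def closest_font_family(block, templates):
        """Return the font-family name whose template is nearest to the letter block.

        block: 1-D feature vector of the letter block.
        templates: dict mapping font-family name to a 1-D template vector.
        """
        # Euclidean distance between the block and every stored template.
        distances = {name: np.linalg.norm(block - vec)
                     for name, vec in templates.items()}
        return min(distances, key=distances.get)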
To implement OCR, the tile is fed to the Pytesseract library trained for Tamil.
Finally, for the conversion of the digitized text to an audio output, the “gTTS” library [23] of Python is used.
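The OCR and text-to-speech steps can be sketched together as below, assuming Tesseract's Tamil language data is installed (file names are illustrative):

    from PIL import Image
    import pytesseract
    from gtts import gTTS

    # Run the Tamil-trained Tesseract model on a letter-block tile.
    text = pytesseract.image_to_string(Image.open("tile.png"), lang="tam")

    # Convert the digitized Tamil text to speech and save the audio output.
    gTTS(text=text, lang="ta").save("output.mp3")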
Using the image slicer tool from Python [26], the cropped characters were fed through a two-dimensional CNN to train it and to produce the equivalent digitized text output on the Python shell screen and the equivalent audio output on the speaker.
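A minimal sketch of this slicing step with the image_slicer package is shown below; the tile count and file names are illustrative, as the report does not state them:

    import image_slicer

    # Slice the scanned page into equal tiles, one character block per tile.
    tiles = image_slicer.slice("page.png", 16, save=False)

    # Save the cropped tiles so they can be fed to the CNN.
    image_slicer.save_tiles(tiles, directory="tiles", prefix="char")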
2.2 FLOWCHART
2.3 RESULT
Since the accuracy of the audio output is dependent on the accuracy of the OCR, the efficiency of the integrated system relies primarily on the efficiency of the OCR technique (Ref. 4, “Unsupervised Transcription of Historical Documents”).
The accuracy of the system for printed Tamil literature averaged around 91%.
The accuracy of the OCR system for handwritten modern Tamil script was around 70%, and the text-to-speech accuracy was 68%.
Fig 5: Recognized sample image
2.4 REASONS FOR ACCURACY VARIATION
The erosion on some surfaces (e.g., rocks, pillars) further leads to certain characters remaining unrecognizable or lost. These errors could be minimized by using images of higher resolution (Ref. 7, “Comparing the OCR Accuracy Levels of Bitonal and Greyscale Images” [online]).
There are three further reasons for the variation in accuracy. One is that the consonantal vowel characters are similar to the consonants. The second is that the writers did not write the Brahmi characters properly; moreover, the characters are stored in different strokes and styles, with ambiguity, as revealed below. The third is that the data set holds only a minimal amount of data, i.e., only 6000 characters are stored in the database for the training set.
CHAPTER 3
CONCLUSION
In the future, the system can be modified with the incorporation of better feature extraction methods to improve the accuracy rate.
In addition, the system can in future be extended to recognize the Brahmi letters and Vattezhuthu from stone inscriptions, improving the results by using some hybrid algorithms.
References