Report For OCR Project


TEXT ANALYSIS AND INFORMATION RETRIEVAL OF
HISTORICAL TAMIL ANCIENT DOCUMENTS USING MACHINE
TRANSLATION IN IMAGE ZONING

A PROJECT REPORT

Submitted by

JAYASURYA.K - 822320104007

SARANYA.M - 822320104309

PAVITHRA.G - 822320104009

JAYASRI.S - 822320104006

in partial fulfillment for the award of the degree

of

BACHELOR OF ENGINEERING

IN

COMPUTER SCIENCE AND ENGINEERING

VANDAYAR ENGINEERING COLLEGE, THANJAVUR

ANNA UNIVERSITY : CHENNAI 600 025

MAY 2023

ANNA UNIVERSITY : CHENNAI 600 025

BONAFIDE CERTIFICATE

Certified that this project report “Text Analysis and Information Retrieval of
Historical Tamil Ancient Documents Using Machine Translation in Image
Zoning” is the bonafide work of K.JAYASURYA, G.PAVITHRA,
M.SARANYA, S.JAYASRI, who carried out the project work under my
supervision.

Mrs. G. SUBBULAKSHMI, M.E., Mr. M. SARAVANAKUMAR, M.E.,

HEAD OF THE DEPARTMENT SUPERVISOR
Computer Science and Engineering, Computer Science and Engineering,
Vandayar Engineering College, Vandayar Engineering College,
Pulavarnatham, Thanjavur-613501 Pulavarnatham, Thanjavur-613501

Submitted for the project viva voce examination held on ………….

Internal Examiner External Examiner

ACKNOWLEDGEMENT

First of all, I thank God for His blessings and grace on every success in
life, and also on this project.

The success of any work lies in the involvement and commitment of its
makers, this being no exception. At this juncture I would like to acknowledge
the many minds that made this project of ours a reality.

My grateful thanks to our beloved Chairman Mr. S. GUNASEKARA
VANDAYAR, M.Com., and Correspondent Mr. G. VIJAY PRAKASH, B.E.,
M.Tech., M.B.A., for their continuous help during our course period by
arranging various activities.

Next, my grateful thanks to our Principal
Dr. SAMUNDEESWARI, M.Tech., Ph.D., for lending a helping hand to us to
successfully complete the course.

The encouragement and support of our Head of the Department,
Mrs. G. SUBBULAKSHMI, M.E., were always there, guiding us with spirit
throughout the course period. Whenever clouds of disappointment hung above,
I always had one never-failing ray of hope that showed the path, time and
again, in the form of our Head of the Department.

My special thanks for the invaluable help and guidance given by my
internal guide, Mr. M. SARAVANAKUMAR, M.E., Asst. Professor, Department
of Computer Science and Engineering, Vandayar Engineering College,
Thanjavur.

I would be failing in my duty if I did not mention the wholehearted
support and technical assistance extended to me by all the FACULTY
MEMBERS AND TECHNICAL STAFF of the Computer Science and
Engineering Department, Vandayar Engineering College, Thanjavur.

TABLE OF CONTENTS

CHAPTER NO.    TITLE

               List of Figures
               List of Abbreviations
               Abstract
1              INTRODUCTION
               1.1 Introduction
               1.2 OCR for Historical Documents
               1.3 Hardware Requirements
2              SOFTWARE PROCESS
               2.1 Software Implementation
               2.2 Flowchart
               2.3 Result
               2.4 Reasons for Accuracy Variations
3              CONCLUSION
               3.1 Conclusion

List of Figures

Fig 1: Script families
Fig 2: Inscription
Fig 3: Hardware requirements
Fig 4: Software process
Fig 5: Recognized sample image
Fig 6: Recognized paragraph
Fig 7: Consonants and vowels
Fig 8: Brahmi characters

List of abbreviations

OCR - Optical Character Recognition


LED - Light Emitting Diode
CNN - Convolutional Neural Network
API - Application Programming Interface

ABSTRACT

Nowadays, digitization has become important for document preservation.
Some Tamil handwritten documents, such as land records, need preservation,
so we try to overcome the difficulty of paper preservation by digitizing
them. The aim of this project is to take a set of handwritten Tamil
characters as input in the form of an image, process the characters, train a
Convolutional Neural Network (CNN) to recognize the patterns, and convert
the recognized characters into a printed document. The CNN then attempts to
determine whether the input data matches a pattern that the network has
memorized. Optical Character Recognition deals with the crucial issue of
handwritten character classification. To overcome the difficulty of
recognizing data among similar characters, a CNN provides higher accuracy of
character recognition. CNNs play an important role nowadays in every aspect
of computer vision applications, and here a CNN is used to recognize Tamil
handwritten characters in offline mode. CNNs differ from the traditional
approach to Tamil Handwritten Character Recognition (THCR), which extracts
features through separate stages of preprocessing, normalization, feature
extraction, and classification. We have developed a CNN model from scratch
by training it on Tamil characters in offline mode, and we have achieved
good accuracy on the obtained datasets. This work digitizes offline THCR
using a deep learning technique.


CHAPTER 1

INTRODUCTION

1.1 INTRODUCTION

The primary objective of this work is to aid in the development of an Optical
Character Recognition (OCR) system to digitize ancient Tamil inscriptions and
produce an audible speech output. The notable challenge is that the
typesetting process was extremely noisy: the hand-carved blocks were uneven
and often failed to sit perfectly on the baseline.

Fig 1: Script families    Fig 2: Inscription

1.2 OCR for Historical Documents

Ref. [2] (R. J. Kannan and R. Phrabhakar, "A Comparative Study of Optical
Character Recognition for Tamil Script"): OCR is the recognition of printed
and/or handwritten characters by a computer [5]. OCR has become a widely
exploited tool by those interested in digitizing written records.

Although research work has been carried out in the recent past to perfect
character recognition for some ancient Tamil scripts, 100% accurate results
have still not been obtained.

1.3 Hardware Requirements:

This OCR scanner consists of a bottom plate that is affixed with an LED strip
panel that is then covered with a clear glass slab to act as the surface for
placing the image as well as a clear pathway for illumination.

The support rods on either side of the box serve two purposes. The first is
that the ridges are spaced to fit a wooden slab on which the webcam is
mounted. The second is that this slab can be adjusted to increase or decrease
the level of zoom required to scan the image clearly.

The final component that brings this scanner to life is the webcam attached
to the slab, which captures the image placed on the glass panel and transfers
it to the laptop for training the neural network. A Bluetooth speaker
connected to the laptop is also required to produce the audio output after
the text-to-speech conversion.

Fig 3: Hardware requirements

CHAPTER 2

SOFTWARE PROCESS

2.1 Software Implementation

Ref. [3] (K. Punitharaja and P. Elango, "Tamil Handwritten Character
Recognition: Progress and Challenges"): the input image is first converted to
grayscale, and the binarization process is applied to the grayscale form of
the image.

Next, Otsu's thresholding algorithm is implemented to perform the
binarization, which involves separating the foreground text from the noisy
background.
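The thresholding step can be sketched in pure NumPy. This is an illustrative re-implementation of Otsu's method, not the project's actual code, and the tiny sample array is invented for demonstration.

```python
import numpy as np

def otsu_threshold(gray):
    """Return the Otsu threshold for a grayscale image (values 0-255).

    Chooses the threshold that maximizes the between-class variance,
    separating dark foreground text from a light background.
    """
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    total = gray.size
    sum_all = np.dot(np.arange(256), hist)
    best_t, best_var = 0, -1.0
    w_bg, sum_bg = 0.0, 0.0
    for t in range(256):
        w_bg += hist[t]                 # pixels at or below threshold t
        if w_bg == 0:
            continue
        w_fg = total - w_bg             # pixels above threshold t
        if w_fg == 0:
            break
        sum_bg += t * hist[t]
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

# Toy "scanned" patch: dark ink pixels (~10) against a light background (~250).
gray = np.array([[10, 12, 240], [11, 250, 245], [9, 13, 255]], dtype=np.uint8)
t = otsu_threshold(gray)
binary = (gray <= t).astype(np.uint8)   # 1 = text, 0 = background
```

On a real page the same two lines (`otsu_threshold` then the comparison) would be applied to the full grayscale scan.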

Dark pixels are stored as 1s and light pixels as 0s using the image zoning
technique.

Then, the binarized image is sliced into letter blocks, each containing one
ancient Tamil character.

Next, this set of cropped images is fed to the convolutional neural network,
which is trained for image classification and recognition.

After a character is matched, it is converted to the equivalent modern
digital Tamil text using Unicode.

Transfer learning [15], along with data augmentation, is used to train the
CNN to classify Tamil letters using Keras and TensorFlow.

The advantage of using transfer learning comes from the fact that pre-trained
networks are adept at identifying lower-level features such as edges, lines,
and curves.
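The augmentation idea can be sketched in NumPy (in practice Keras's own augmentation utilities would handle this). `shift` and `augment` are hypothetical helpers, and the jitter range is an assumed choice; small random translations mimic carved characters that do not sit exactly on the baseline.

```python
import numpy as np

rng = np.random.default_rng(0)

def shift(img, dy, dx):
    """Translate a binary character image by (dy, dx), padding with zeros."""
    h, w = img.shape
    padded = np.pad(img, ((abs(dy),), (abs(dx),)))
    return padded[abs(dy) - dy: abs(dy) - dy + h,
                  abs(dx) - dx: abs(dx) - dx + w]

def augment(img, n=5, max_shift=2):
    """Yield n randomly jittered copies of a character image."""
    for _ in range(n):
        dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
        yield shift(img, int(dy), int(dx))

img = np.zeros((8, 8), dtype=np.uint8)
img[2:6, 2:6] = 1                       # a 4x4 "glyph" in the center
batch = list(augment(img))              # five translated training variants
```

Note that horizontal flips, a common augmentation for natural images, are deliberately absent: mirrored characters are different (or invalid) glyphs.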

The processing of an image in a machine happens in terms of numbers.


Hence, every pixel in the image is given a value ranging between 0 and 255.

This process is known as image encoding and is the primary step in training
the neural network to recognize images. The encoded image is passed through a
convolution layer of dimensions 28 × 28 × 1, where 28 represents the size of
the image and 1 stands for the number of channels, the image here being
binary.

Using an optimizer, the parameters were updated after every iteration.

These layers convolve over the image to detect edges, lines, blobs of color,
and other visual elements.
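What "detecting edges" means can be shown with a hand-rolled convolution. The Sobel kernel below stands in for the filters a CNN would learn on its own, and `conv2d` is an illustrative helper, not library code.

```python
import numpy as np

def conv2d(img, kernel):
    """Valid-mode 2D convolution (cross-correlation, as in CNN layers)."""
    kh, kw = kernel.shape
    h, w = img.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

# Sobel-x kernel: responds where intensity changes from left to right.
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]])

img = np.zeros((5, 5))
img[:, 2:] = 1.0                        # a vertical edge down the middle
edges = conv2d(img, sobel_x)            # strong response along the edge
```

A trained convolution layer applies many such kernels in parallel, one per output channel.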

For the 2D CNN, after two layers of MaxPooling, the dimensions work out to
be 32 × 4 × 4.

This dimensioning of the CNN was arrived at by trial and error, keeping in
mind the convergence speed as well as the accuracy required.

Next, the pooling layers reduce the dimensionality of the images by removing
a certain number of pixels from the image.
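The pooling step can be sketched as follows; `max_pool` is an illustrative helper, and the 2 × 2 window is the common default, assumed here rather than stated in the text.

```python
import numpy as np

def max_pool(img, size=2):
    """Max pooling: keep the strongest response in each size x size block,
    shrinking each spatial dimension by that factor."""
    h, w = img.shape
    h, w = h - h % size, w - w % size   # drop ragged edge rows/columns
    blocks = img[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

feat = np.array([[1, 3, 0, 2],
                 [4, 2, 1, 0],
                 [0, 1, 5, 6],
                 [2, 0, 7, 1]])
pooled = max_pool(feat)                 # 4x4 feature map shrinks to 2x2
```

Keeping only the maximum per block preserves the strongest filter responses while quartering the number of values the next layer must process.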

To implement this work, the MaxPooling technique was preferred. The convolved
image is compared with every image in the dataset.

Based on the Euclidean distance principle [21], the letter block in the image
is matched to its closest font family.
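The matching step amounts to a nearest-neighbor search under Euclidean distance, sketched below. The template names and arrays are invented for illustration; real templates would be rendered glyphs from each font family.

```python
import numpy as np

def nearest_template(block, templates):
    """Return the label of the template closest to the letter block
    under Euclidean (L2) pixel distance."""
    best_label, best_dist = None, float("inf")
    for label, tmpl in templates.items():
        d = np.linalg.norm(block.astype(float) - tmpl.astype(float))
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label

templates = {
    "glyph_a": np.eye(4),               # hypothetical font templates
    "glyph_b": np.ones((4, 4)),
}
block = np.eye(4)
block[0, 1] = 1                         # a noisy copy of glyph_a
match = nearest_template(block, templates)
```

One stray pixel moves the block only slightly in pixel space, so it still lands nearest its true template.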

To implement OCR, the tile is fed to the Pytesseract library trained for Tamil.

Finally, for the conversion of the digitized text to an audio output, the
"gTTS" Python library [23] is used.

gTTS is a tool to interface with Google Translate's text-to-speech API.

Using the image-slicer tool from Python [26], the cropped characters were fed
through a two-dimensional CNN to train it and produce the equivalent
digitized text output on the Python shell screen and its equivalent audio
output on the speaker.
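Slicing a page into character tiles can be sketched in NumPy; `slice_tiles` is a simplified stand-in for the image-slicer tool, assuming characters lie on an even grid (real pages would need segmentation around uneven carved blocks).

```python
import numpy as np

def slice_tiles(page, rows, cols):
    """Cut a binarized page into a rows x cols grid of equal tiles,
    one tile per character block."""
    h, w = page.shape
    th, tw = h // rows, w // cols
    return [page[r * th:(r + 1) * th, c * tw:(c + 1) * tw]
            for r in range(rows) for c in range(cols)]

page = np.arange(64).reshape(8, 8)      # stand-in for a binarized page
tiles = slice_tiles(page, 2, 4)         # 8 tiles, each 4 rows x 2 columns
```

Each tile is then resized to the network's 28 × 28 input before classification.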


2.2 Flowchart

2.3 RESULT

As observed in Ref. [4] ("Unsupervised Transcription of Historical
Documents"), since the accuracy of the audio output depends on the accuracy
of the OCR, the efficiency of the integrated system relies primarily on the
efficiency of the OCR technique.

The accuracy of the system for printed Tamil literature averaged around 91%.

The accuracy of the OCR system for handwritten modern Tamil script was
lower, around 70%. The text-to-speech accuracy was 68%.

Fig 4: Software process

Fig 5: Recognized sample image

Fig 6: Recognized paragraph

2.4 REASONS FOR ACCURACY VARIATIONS

Ref. [5] ("A Comprehensive Guide to Convolutional Neural Networks"): like
handwritten scripts, the scripts found on inscriptions are not uniform, and
this also causes recognition problems.

The erosion of some surfaces (rocks, pillars, etc.) further leads to certain
characters remaining unrecognizable or being lost. These errors could be
minimized by using images of higher resolution.

The failures of character recognition are due to three reasons.

First, the consonantal vowel characters are similar to the consonants
(see Ref. [7], "Comparing the OCR Accuracy Levels of Bitonal and Greyscale
Images").

Fig 7: Consonants and vowels

Second, the writers did not always write the Brahmi characters properly.
Moreover, the characters are recorded in different strokes and styles, with
the ambiguity revealed below.

Fig 8: Brahmi Characters

Third, the dataset holds only a minimal amount of data: only 6,000
characters are stored in the database as the training set.

CHAPTER 3

CONCLUSION

3.1 CONCLUSION

OCR techniques for ancient Tamil scripts are a rich research topic.

In future, the system can be modified with the incorporation of better
feature extraction methods to improve the accuracy rate.

In addition, the system will in future recognize Brahmi letters and
Vattezhuthu from stone inscriptions, improving the results by using some
hybrid algorithms.

References

1. S. Rajakumar and S. V. Bharathi, "Century Identification and Recognition
of Ancient Tamil Character Recognition," International Journal of Computer
Applications, vol. 26, no. 4, pp. 32-35, July 2011.

2. R. J. Kannan and R. Phrabhakar, "A Comparative Study of Optical Character
Recognition for Tamil Script," European Journal of Scientific Research,
ISSN 1450-216X, vol. 35, no. 4, pp. 570-582, 2009.

3. K. Punitharaja and P. Elango, "Tamil Handwritten Character Recognition:
Progress and Challenges," IJCTA, vol. 9, no. 3, pp. 143-151, 2016.

4. T. B. Kirkpatrick, G. Durrett and D. Klein, "Unsupervised Transcription
of Historical Documents," Proceedings of the 51st Annual Meeting of the
Association for Computational Linguistics, pp. 207-217, Sofia, Bulgaria,
August 4-9, 2013.

5. "A Comprehensive Guide to Convolutional Neural Networks - the ELI5 way"
[online]. Available: https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53

6. "Tesseract OCR" [online]. Available:
https://opensource.google.com/projects/tesseract

7. "Comparing the OCR Accuracy Levels of Bitonal and Greyscale Images"
[online]. Available: http://www.dlib.org/dlib/march09/powell/03powell.html
