Report For OCR Project


TEXT ANALYSIS AND INFORMATION RETRIEVAL OF
HISTORICAL TAMIL ANCIENT DOCUMENTS USING MACHINE
TRANSLATION IN IMAGE ZONING

A PROJECT REPORT

Submitted by

JAYASURYA.K - 822320104007

SARANYA.M - 822320104309

PAVITHRA.G - 822320104009

JAYASRI.S - 822320104006

in partial fulfillment for the award of the degree

of

BACHELOR OF ENGINEERING

IN

COMPUTER SCIENCE AND ENGINEERING

VANDAYAR ENGINEERING COLLEGE, THANJAVUR

ANNA UNIVERSITY : CHENNAI 600 025

MAY 2023

ANNA UNIVERSITY : CHENNAI 600 025

BONAFIDE CERTIFICATE

Certified that this project report “Text Analysis and Information Retrieval of
Historical Tamil Ancient Documents Using Machine Translation in Image
Zoning” is the bonafide work of K.JAYASURYA, G.PAVITHRA,
M.SARANYA, S.JAYASRI, who carried out the project work under my
supervision.

Mrs. G. SUBBULAKSHMI, M.E., Mr. M. SARAVANAKUMAR, M.E.,

HEAD OF THE DEPARTMENT SUPERVISOR
Computer Science and Engineering, Computer Science and Engineering,
Vandayar Engineering College, Vandayar Engineering College,
Pulavarnatham, Thanjavur-613501 Pulavarnatham, Thanjavur-613501

Submitted for the project viva voce examination held on ………….

Internal Examiner External Examiner

ACKNOWLEDGEMENT

First of all, I thank God for His blessings and grace on every success in
life, and also on this project.

The success of any work lies in the involvement and commitment of its
makers, this being no exception. At this juncture I would like to acknowledge
the many minds that made this project of ours a reality.

My grateful thanks to our beloved Chairman Mr. S. GUNASEKARA
VANDAYAR, M.Com., and Correspondent Mr. G. VIJAY PRAKASH, B.E.,
M.Tech., M.B.A., for their continuous help during our course period by
arranging various activities.

Next, my grateful thanks to our Principal
Dr. SAMUNDEESWARI, M.Tech., Ph.D., for lending a helping hand to us to
successfully complete the course.

The encouragement and support of our Head of the Department,
Mrs. G. SUBBULAKSHMI, M.E., were always there, guiding us with spirit
throughout the course period. Whenever clouds of disappointment hung above,
I always had one never-failing ray of hope that showed the path, time and
again, in the form of our Head of the Department.

My special thanks for the invaluable help and guidance given by my
internal guide, Mr. M. SARAVANAKUMAR, M.E., Asst. Professor, Department
of Computer Science and Engineering, Vandayar Engineering College,
Thanjavur.

I would be failing in my duty if I did not mention the wholehearted
support and technical assistance extended to me by all the FACULTY
MEMBERS AND TECHNICAL STAFF of the Computer Science and
Engineering Department, Vandayar Engineering College, Thanjavur.

TABLE OF CONTENTS

CHAPTER NO.    TITLE

               List of Figures
               List of Abbreviations
               Abstract
1              INTRODUCTION
               1.1 Introduction
               1.2 OCR for Historical Documents
               1.3 Hardware Requirements
2              SOFTWARE PROCESS
               2.1 Software Implementation
               2.2 Flowchart
               2.3 Result
               2.4 Reasons for Accuracy Variations
3              CONCLUSION
               3.1 Conclusion

List of Figures

Fig 1: Script families
Fig 2: Inscription
Fig 3: Hardware requirements
Fig 4: Software process
Fig 5: Recognized sample image
Fig 6: Recognized paragraph
Fig 7: Consonants and vowels
Fig 8: Brahmi characters

List of abbreviations

OCR - Optical Character Recognition


LED - Light Emitting Diode
CNN - Convolutional Neural Network
API - Application Programming Interface

ABSTRACT

Nowadays, digitization has become important for document preservation.
Some Tamil handwritten documents, such as land records, need preservation,
so we try to overcome the difficulty of paper preservation by digitizing
them. The aim of this project is to take a set of handwritten Tamil
characters as input in the form of an image, process the characters, train a
Convolutional Neural Network (CNN) to recognize the patterns, and convert
the recognized characters into a printed document. The CNN then attempts to
determine whether the input data matches a pattern that the network has
memorized. Optical Character Recognition deals with the crucial issue of
handwritten character classification. To overcome the difficulty of
recognizing data among similar characters, a CNN provides higher accuracy of
character recognition. CNNs play an important role nowadays in every aspect
of computer vision applications, and here a CNN is used to recognize Tamil
handwritten characters in offline mode. CNNs differ from the traditional
approach to Tamil Handwritten Character Recognition (THCR), which extracts
features through separate stages of preprocessing, normalization, feature
extraction, and classification. We have developed a CNN model from scratch
by training it on Tamil characters in offline mode, and we have achieved
good accuracy on the obtained datasets. This work digitizes offline THCR
using a deep learning technique.


CHAPTER 1

INTRODUCTION

1.1 INTRODUCTION

The primary objective of this work is to aid in the development of an Optical
Character Recognition (OCR) system to digitize ancient Tamil inscriptions and
produce an audible speech output. The notable challenge is that the
typesetting process was extremely noisy: the hand-carved blocks were uneven
and often failed to sit perfectly on the baseline.

Fig 1: Script families    Fig 2: Inscription

1.2 OCR for Historical Documents

Ref. [2] (R. J. Kannan and R. Phrabhakar, "A Comparative Study of Optical
Character Recognition for Tamil Script"): OCR is the recognition of printed
and/or handwritten characters by a computer [5]. OCR has become a widely
exploited tool by those interested in digitizing written records.

Although research work has been carried out in the recent past to perfect
character recognition for some ancient Tamil scripts, 100% accurate results
have still not been obtained.

1.3 Hardware Requirements:

This OCR scanner consists of a bottom plate that is affixed with an LED strip
panel that is then covered with a clear glass slab to act as the surface for
placing the image as well as a clear pathway for illumination.

The support rods on either side of the box serve two purposes. The first is
that the ridges are spaced to fit a wooden slab on which the webcam is
mounted. The second is that this slab can be adjusted to increase or decrease
the level of zoom required to scan the image clearly.

The final component that brings this scanner to life is the webcam attached
to the slab, which captures the image placed on the glass panel and transfers
it to the laptop for training the neural network. A Bluetooth speaker
connected to the laptop is also required to produce the audio output after
the text-to-speech conversion.

Fig 3: Hardware requirements

CHAPTER 2

SOFTWARE PROCESS

2.1 Software Implementation

Ref. [3] (K. Punitharaja and P. Elango, "Tamil Handwritten Character
Recognition: Progress and Challenges"): the input image is first converted to
grayscale, and the binarization process is applied to the grayscale form of
the image.

Next, Otsu's thresholding algorithm is implemented to perform the
binarization, which involves separating the foreground text from the noisy
background.
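The thresholding step can be sketched in pure NumPy. This is an illustrative re-implementation of Otsu's method, not the project's actual code, and the tiny sample array is invented for demonstration.

```python
import numpy as np

def otsu_threshold(gray):
    """Return the Otsu threshold for a grayscale image (values 0-255).

    Chooses the threshold that maximizes the between-class variance,
    separating dark foreground text from a light background.
    """
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    total = gray.size
    sum_all = np.dot(np.arange(256), hist)
    best_t, best_var = 0, -1.0
    w_bg, sum_bg = 0.0, 0.0
    for t in range(256):
        w_bg += hist[t]                 # pixels at or below threshold t
        if w_bg == 0:
            continue
        w_fg = total - w_bg             # pixels above threshold t
        if w_fg == 0:
            break
        sum_bg += t * hist[t]
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

# Toy "scanned" patch: dark ink pixels (~10) against a light background (~250).
gray = np.array([[10, 12, 240], [11, 250, 245], [9, 13, 255]], dtype=np.uint8)
t = otsu_threshold(gray)
binary = (gray <= t).astype(np.uint8)   # 1 = text, 0 = background
```

On a real page the same two lines (`otsu_threshold` then the comparison) would be applied to the full grayscale scan.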

Dark pixels are stored as 1s and light pixels as 0s using the image zoning
technique.

Then, the binarized image is sliced into letter blocks, each containing one
ancient Tamil character.

Next, this set of cropped images is fed to the convolutional neural network,
which is trained for image classification and recognition.

After a character is matched, it is converted to the equivalent modern
digital Tamil text using Unicode.

Transfer learning [15], along with data augmentation, is used to train the
CNN to classify Tamil letters using Keras and TensorFlow.

The advantage of using transfer learning comes from the fact that pre-trained
networks are adept at identifying lower-level features such as edges, lines,
and curves.
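The augmentation idea can be sketched in NumPy (in practice Keras's own augmentation utilities would handle this). `shift` and `augment` are hypothetical helpers, and the jitter range is an assumed choice; small random translations mimic carved characters that do not sit exactly on the baseline.

```python
import numpy as np

rng = np.random.default_rng(0)

def shift(img, dy, dx):
    """Translate a binary character image by (dy, dx), padding with zeros."""
    h, w = img.shape
    padded = np.pad(img, ((abs(dy),), (abs(dx),)))
    return padded[abs(dy) - dy: abs(dy) - dy + h,
                  abs(dx) - dx: abs(dx) - dx + w]

def augment(img, n=5, max_shift=2):
    """Yield n randomly jittered copies of a character image."""
    for _ in range(n):
        dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
        yield shift(img, int(dy), int(dx))

img = np.zeros((8, 8), dtype=np.uint8)
img[2:6, 2:6] = 1                       # a 4x4 "glyph" in the center
batch = list(augment(img))              # five translated training variants
```

Note that horizontal flips, a common augmentation for natural images, are deliberately absent: mirrored characters are different (or invalid) glyphs.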

The processing of an image in a machine happens in terms of numbers.


Hence, every pixel in the image is given a value ranging between 0 and 255.

This process is known as image encoding and is the primary step in training
the neural network to recognize images. The encoded image is passed through a
convolution layer of dimensions 28 × 28 × 1, where 28 represents the size of
the image and 1 stands for the number of channels, the image here being
binary.

Using an optimizer, the parameters were updated after every iteration.

These layers convolve over the image to detect edges, lines, blobs of color,
and other visual elements.
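What "detecting edges" means can be shown with a hand-rolled convolution. The Sobel kernel below stands in for the filters a CNN would learn on its own, and `conv2d` is an illustrative helper, not library code.

```python
import numpy as np

def conv2d(img, kernel):
    """Valid-mode 2D convolution (cross-correlation, as in CNN layers)."""
    kh, kw = kernel.shape
    h, w = img.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

# Sobel-x kernel: responds where intensity changes from left to right.
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]])

img = np.zeros((5, 5))
img[:, 2:] = 1.0                        # a vertical edge down the middle
edges = conv2d(img, sobel_x)            # strong response along the edge
```

A trained convolution layer applies many such kernels in parallel, one per output channel.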

For the 2D CNN, after two layers of MaxPooling, the dimensions work out to
be 32 × 4 × 4.

This dimensioning of the CNN was arrived at by trial and error, keeping in
mind the convergence speed as well as the accuracy required.

Next, the pooling layers reduce the dimensionality of the images by removing
a certain number of pixels from the image.
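The pooling step can be sketched as follows; `max_pool` is an illustrative helper, and the 2 × 2 window is the common default, assumed here rather than stated in the text.

```python
import numpy as np

def max_pool(img, size=2):
    """Max pooling: keep the strongest response in each size x size block,
    shrinking each spatial dimension by that factor."""
    h, w = img.shape
    h, w = h - h % size, w - w % size   # drop ragged edge rows/columns
    blocks = img[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

feat = np.array([[1, 3, 0, 2],
                 [4, 2, 1, 0],
                 [0, 1, 5, 6],
                 [2, 0, 7, 1]])
pooled = max_pool(feat)                 # 4x4 feature map shrinks to 2x2
```

Keeping only the maximum per block preserves the strongest filter responses while quartering the number of values the next layer must process.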

To implement this work, the MaxPooling technique was preferred. The convolved
image is compared with every image in the dataset.

Based on the Euclidean distance principle [21], the letter block in the image
is matched to its closest font family.
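The matching step amounts to a nearest-neighbor search under Euclidean distance, sketched below. The template names and arrays are invented for illustration; real templates would be rendered glyphs from each font family.

```python
import numpy as np

def nearest_template(block, templates):
    """Return the label of the template closest to the letter block
    under Euclidean (L2) pixel distance."""
    best_label, best_dist = None, float("inf")
    for label, tmpl in templates.items():
        d = np.linalg.norm(block.astype(float) - tmpl.astype(float))
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label

templates = {
    "glyph_a": np.eye(4),               # hypothetical font templates
    "glyph_b": np.ones((4, 4)),
}
block = np.eye(4)
block[0, 1] = 1                         # a noisy copy of glyph_a
match = nearest_template(block, templates)
```

One stray pixel moves the block only slightly in pixel space, so it still lands nearest its true template.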

To implement OCR, the tile is fed to the Pytesseract library trained for Tamil.

Finally, for the conversion of the digitized text to an audio output, the
"gTTS" Python library [23] is used.

gTTS is a tool to interface with Google Translate's text-to-speech API.

Using the image-slicer tool from Python [26], the cropped characters were fed
through a two-dimensional CNN to train it and produce the equivalent
digitized text output on the Python shell screen and its equivalent audio
output on the speaker.
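Slicing a page into character tiles can be sketched in NumPy; `slice_tiles` is a simplified stand-in for the image-slicer tool, assuming characters lie on an even grid (real pages would need segmentation around uneven carved blocks).

```python
import numpy as np

def slice_tiles(page, rows, cols):
    """Cut a binarized page into a rows x cols grid of equal tiles,
    one tile per character block."""
    h, w = page.shape
    th, tw = h // rows, w // cols
    return [page[r * th:(r + 1) * th, c * tw:(c + 1) * tw]
            for r in range(rows) for c in range(cols)]

page = np.arange(64).reshape(8, 8)      # stand-in for a binarized page
tiles = slice_tiles(page, 2, 4)         # 8 tiles, each 4 rows x 2 columns
```

Each tile is then resized to the network's 28 × 28 input before classification.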


2.2 Flowchart

2.3 RESULT

As observed in Ref. [4] ("Unsupervised Transcription of Historical
Documents"), since the accuracy of the audio output depends on the accuracy
of the OCR, the efficiency of the integrated system relies primarily on the
efficiency of the OCR technique.

The accuracy of the system for printed Tamil literature averaged around 91%.

The accuracy of the OCR system for handwritten modern Tamil script was
lower, around 70%. The text-to-speech accuracy was 68%.

Fig 4: Software process

Fig 5: Recognized sample image

Fig 6: Recognized paragraph

2.4 REASONS FOR ACCURACY VARIATIONS

Ref. [5] ("A Comprehensive Guide to Convolutional Neural Networks"): like
handwritten scripts, the scripts found on inscriptions are not uniform, and
this also causes recognition problems.

The erosion of some surfaces (rocks, pillars, etc.) further leads to certain
characters remaining unrecognizable or being lost. These errors could be
minimized by using images of higher resolution.

The failures of character recognition are due to three reasons.

First, the consonantal vowel characters are similar to the consonants
(see Ref. [7], "Comparing the OCR Accuracy Levels of Bitonal and Greyscale
Images").

Fig 7: Consonants and vowels

Second, the writers did not always write the Brahmi characters properly.
Moreover, the characters are recorded in different strokes and styles, with
the ambiguity revealed below.

Fig 8: Brahmi Characters

Third, the dataset holds only a minimal amount of data: only 6,000
characters are stored in the database as the training set.

CHAPTER 3

CONCLUSION

3.1 CONCLUSION

OCR techniques for ancient Tamil scripts are a rich research topic.

In future, the system can be modified with the incorporation of better
feature extraction methods to improve the accuracy rate.

In addition, the system will in future recognize Brahmi letters and
Vattezhuthu from stone inscriptions, improving the results by using some
hybrid algorithms.

References

1. S. Rajakumar and S. V. Bharathi, "Century Identification and Recognition
of Ancient Tamil Character Recognition," International Journal of Computer
Applications, vol. 26, no. 4, pp. 32-35, July 2011.

2. R. J. Kannan and R. Phrabhakar, "A Comparative Study of Optical Character
Recognition for Tamil Script," European Journal of Scientific Research,
ISSN 1450-216X, vol. 35, no. 4, pp. 570-582, 2009.

3. K. Punitharaja and P. Elango, "Tamil Handwritten Character Recognition:
Progress and Challenges," IJCTA, vol. 9, no. 3, pp. 143-151, 2016.

4. T. B. Kirkpatrick, G. Durrett and D. Klein, "Unsupervised Transcription
of Historical Documents," Proceedings of the 51st Annual Meeting of the
Association for Computational Linguistics, pp. 207-217, Sofia, Bulgaria,
August 4-9, 2013.

5. "A Comprehensive Guide to Convolutional Neural Networks - the ELI5 way"
[online]. Available: https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53

6. "Tesseract OCR" [online]. Available:
https://opensource.google.com/projects/tesseract

7. "Comparing the OCR Accuracy Levels of Bitonal and Greyscale Images"
[online]. Available: http://www.dlib.org/dlib/march09/powell/03powell.html
