Project of Matlab
CHARACTERS IN TAMIL
A PROJECT REPORT
SUBMITTED BY
JAIGANESH S (112718104010)
YUVARAJ B (112718104030)
BACHELOR OF ENGINEERING
in
APRIL 2022
1
ANNA UNIVERSITY: CHENNAI 600 025
BONAFIDE CERTIFICATE
SIGNATURE                                        SIGNATURE

Submitted for viva-voce held on 06.08.2021 at St. Peter's College of
Engineering and Technology.

INTERNAL EXAMINER                                EXTERNAL EXAMINER
ACKNOWLEDGEMENT
ABSTRACT
Image segmentation is the process of dividing an image into regions that are
similar or different in characteristics such as colour intensity and texture; it
is used in almost all modern computers and mobile phones for image recognition.
Here, Tamil handwritten documents are converted into grayscale images and then
segmented into characters. In this algorithm, segmentation is done in both the
vertical and horizontal directions. First, the algorithm checks for touching
characters in the horizontal zone and then in the vertical zone. If the touching
characters are in the vertical zone, the cut is made horizontally; if they are in
the horizontal zone, the cut is made vertically.
ABSTRACT (TAMIL)
படப் பிரிவு என்பது ஒரு படத்தை வெவ்வேறு பகுதிகளாகப்
மற்றும் அமைப்புப் படப் பிரிவு போன்ற சில பண்புகளில் ஒரே
முன்மொழிகிறது.
TABLE OF CONTENTS

LIST OF FIGURES
1 INTRODUCTION
  1.1 Problem Definition
  1.2 Segmentation Methods
  1.3 Identification of Touching Characters
2 LITERATURE SURVEY
3 SYSTEM SPECIFICATION
4 SYSTEM ANALYSIS
5 SYSTEM DESIGN
6 TESTING
  SAMPLE CODE
7 CONCLUSION
8 REFERENCES
INTRODUCTION
The Tamil language has one of the world's oldest scripts, which dates back to the
6th century BC, and it is thought that these characters originated in Keezhadi,
Tamil Nadu, India. Tamil characters have evolved from the Grantha script; the
vowel set has 12 characters. Characters have had certain strokes as their basis,
and their forms were changed, using different strokes and shapes, to distinguish
the short (Kuril) and long (Nedil) vowels during the 17th century AD. The forms
of the characters that are still in use today were settled in the 19th century
AD. Because palm leaves were readily available in the region prior to the
development of paper, the characters took rounded shapes suited to writing on
them.
1.1 PROBLEM DEFINITION:
The difference in the forms of the scripts is caused by the way the pen is held.
This has a significant impact on how the scripts are depicted, and segmentation
is a critical step in providing the best answer in the recognition phase. Single
and multiple touching characters are the two forms of touching characters. Single
touching characters are divided into two categories: horizontal touching and
vertical touching. Horizontal touching occurs with characters of the same line,
while vertical touching occurs when a character touches those of the preceding or
subsequent line, which complicates the segmentation process. An Adaptive Partial
Projection (APP) method can be used to identify the line numbers and the space
between the text lines by a piecewise projection method.
The second method, known as A* Path Planning (A*PP), is used to identify the
touching and partially overlapping characters in text lines heuristically; it
works by means of various cost functions and has been applied to Thai and Khmer
language manuscripts.
With the Dynamic Labelling Algorithm (DLA), text lines can be segmented with a
recognition accuracy (RA) of 96 percent, even when the characters strongly
contact and overlap the preceding or subsequent line characters. Character
segmentation is the second key phase in OCR of Tamil manuscripts. Image filtering
techniques remove undesired particles other than the characters present in the
lines; they also make the character strokes visible. The flow diagram depicts the
suggested character segmentation process.
1.3 Identification of Touching Characters
A threshold value has been set to assess the weight; when the lower value falls
below the threshold, the characters are treated as single characters and are
immediately split without any complications. The bigger the value, the more
likely the characters are touching.
The touching character has a greater aspect ratio (ar) than single characters,
which are automatically divided. The factor (far) indicated above is used to
detect touching characters, which need greater effort to split into single
characters. The touching characters are determined by the formula
far = e^a / (1 + e^a), where a = w/h and w and h are the character's width and
height parameters. After the touching character has been identified, segmentation
of the touching characters is achievable. The cutting edge is defined as the
point at which the characters are split.
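As an illustrative sketch (not the report's code, and with hypothetical names),
the factor above can be computed directly from a character's bounding box:

```python
import math

def touching_factor(w, h):
    """far = e^a / (1 + e^a), with aspect ratio a = w / h.

    Wider boxes (large w relative to h) give a larger factor,
    flagging a likely pair of horizontally touching characters.
    """
    a = w / h
    return math.exp(a) / (1.0 + math.exp(a))

# A wide bounding box scores higher than a roughly square one:
print(touching_factor(80, 40) > touching_factor(40, 40))  # → True
```

Boxes whose factor exceeds the chosen threshold are passed on to the cutting
step; the rest are treated as already-isolated characters.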
The figures demonstrate the many ways two characters can touch horizontally. For
the above categories, the algorithm computes the weight of each column; the
column with the lightest weight is designated as the character's likely contact
point, and a cutting edge is placed there to segment the character. Because the
weight is derived from the character's columns, the cutting edge for this
category of horizontal touching characters is vertical. The height of the
character can be used to identify vertical touching characters. For vertical
touching characters, types 7 to 10 indicate the various ways of touching. Rows
determine the cutting edge between the characters at the least weight. Vertical
touching characters have a horizontal cutting edge, while horizontally touching
characters have a vertical cutting edge. A character may touch the same-line
characters horizontally and the subsequent line characters vertically. The
horizontal and vertical cutting-edge method separates touching characters so that
they can be used for recognition.
LITERATURE SURVEY
In this paper, Anupama B. and Seenivasa Reddy propose a method for extracting
text lines, words, and characters. The text image is horizontally projected, and
the peaks in the horizontal projection are used to identify line segments. To
divide the text image into segments, a threshold is used; another threshold is
used to eliminate false lines. For the line segments, vertical histogram
projections are used, which are then decomposed into words and further into
characters using thresholds. Several document images are used to test the
algorithm. According to the experimental results, the proposed method is fast and
reliable for handwritten documents with non-overlapping lines. Its drawback is
that it causes segmentation errors for touching characters.
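The projection-profile idea used by these methods can be sketched as follows; an
illustrative sketch with hypothetical names, not any paper's actual code:

```python
def horizontal_projection(binary):
    """Row-wise ink counts for a binary image given as lists of 0/1 pixels."""
    return [sum(row) for row in binary]

def line_runs(profile, threshold=0):
    """Maximal runs of rows whose projection exceeds the threshold;
    each run corresponds to one text line."""
    runs, start = [], None
    for i, value in enumerate(profile):
        if value > threshold and start is None:
            start = i
        elif value <= threshold and start is not None:
            runs.append((start, i - 1))
            start = None
    if start is not None:
        runs.append((start, len(profile) - 1))
    return runs

# Two "lines" of ink separated by a blank row:
page = [[0, 0, 0],
        [1, 1, 0],
        [1, 0, 1],
        [0, 0, 0],
        [0, 1, 1]]
print(line_runs(horizontal_projection(page)))  # → [(1, 2), (4, 4)]
```

Applying the same computation column-wise to a single line's image splits it
into words and characters, which is the decomposition both papers describe.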
Shalini M. and Indira Reddy B. are the authors of this study. One of the most
important aspects of document image analysis is text line segmentation;
separating the text lines is critical for recognising all text regions in the
document image. They offer an algorithm based on multiple histogram projections.
The text image is horizontally projected, and line segments are recognised from
the peaks in the horizontal projection. A threshold is used to divide the text
image into pieces, and false lines are eliminated using a different threshold.
For the line segments, vertical histogram projections are used, which are then
fragmented into words and further into characters using thresholds. Segmentation
errors occur for touching characters. For good-quality archives, the line
division precision (DR) is 99 percent.
Another study proposes a strategy for character segmentation in Arabic. The fact
that Arabic is primarily written cursively is used to establish where each
character begins and ends, which is a crucial stage in character recognition. To
separate lines, words, and characters, the suggested method addresses the problem
of separating connected characters. The horizontal and vertical axis profiles
were used to discover character separations, using the profile's amplitude filter
and a simple edge tool. When tested on several printed papers using various
Arabic typefaces, this method shows promise. The word segmentation algorithm has
a 99 percent accuracy rating, and the character segmentation algorithm also
reached a satisfactory accuracy ratio of 98 percent.
Partha Bhowmick and Gaurav Harit present a novel character segmentation approach
for Bangla handwritten text documents in this study. The approach depends on
characterising the vertices of the outer isothetic covers corresponding to the
words in a text. Using the vertex characterisation, each cover is divided into
several sub-polygons that represent the characters making up a word, without
using any heuristics. Using this character segmentation, they obtain an accuracy
of 96.04 percent.
In this publication, Vijay and Madan Kharat present their findings: in
handwritten Hindi text documents, a global threshold performs better for
character segmentation, while Otsu's threshold technique segments words and
lines. Overall accuracy is determined not just by the lines and words, but also
by the precision of character segmentation. The method must be evaluated on
documents with slope and congested lines, where accuracy has been degraded. The
word segmentation method in this study also achieves a good accuracy rate.
The precise character-level segmentation of printed or handwritten text is a
critical pre-processing step for optical character recognition (OCR), according
to Soumen Bag and Ankit Krishna in this paper. It has been observed that
languages with cursive writing make the segmentation problem significantly more
difficult. This research presents a segmentation method for handwritten Hindi
words, carried out based on structural patterns seen in this language's writing
style. The proposed approach can handle a wide range of writing styles as well as
skewed header lines as input. The approach has been tested on the authors' own
database of both printed and handwritten words; the average success rate is 96.93
percent. Compared to other methods, it produces reasonably good results for this
database.
Several steps are used in this method. The image is initially converted to a
grayscale format, then processed using Otsu's thresholding method to boost the
text's intensity and make it stand out from the background. The data is then
analysed to detect and rectify skew. The image is then converted to a histogram,
with each low point representing the line space between lines, and is segmented
line by line. A vertical histogram is used for character segmentation.
Kathirvalavakumar and Karthigaiselvi's major goal in this study is to use
vertical and horizontal projections to segment horizontally overlapping lines and
touching characters found in all zones of machine-printed Tamil script. Documents
of various categories are gathered and tested. The suggested method successfully
segments all of the images used in the experiment. Even when the lines and
characters are touching, all of the lines and words are appropriately segmented,
and characters are segmented more precisely. The suggested algorithms have the
benefit of being able to segment more than two touching letters in a word, and
because they are based on projection values, the procedures required for line,
word, and character segmentation are simple.
SYSTEM SPECIFICATION

A hardware compatibility list (HCL) is important, especially in the case of
operating systems. An HCL lists tested, compatible, and sometimes incompatible
hardware devices for a particular operating system or application, along with its
requirements. Some components are not part of the software installation package
and need to be installed separately before the software is installed.
OPERATING SYSTEM: Windows 7 or higher, Mac, Linux
IDE: MATLAB
SYSTEM ANALYSIS

4.1.1 EXISTING SYSTEM:
There are many image segmentation techniques proposed to segment images in order
to retrieve essential knowledge and information from them. The techniques vary in
the method used for segmenting the images. Some of the popular techniques are:
➢ Thresholding
➢ Clustering techniques
➢ Watershed segmentation
4.1.2 PROPOSED SYSTEM:
Image segmentation is the process of dividing an image into regions that are
similar or different in characteristics such as colour intensity and texture; it
is used in almost all modern computers and mobile phones for image recognition.
Here, Tamil handwritten documents are converted into grayscale images and then
segmented into characters. In this algorithm, segmentation is done in both the
vertical and horizontal directions. First, the algorithm checks for touching
characters in the horizontal zone and then in the vertical zone. If the touching
characters are in the vertical zone, the cut is made horizontally; if they are in
the horizontal zone, the cut is made vertically.
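The cut-direction rule above can be sketched as follows; this is an illustrative
sketch with hypothetical names, not the report's implementation. For a
horizontally touching pair, the lightest-weight (least-ink) column becomes the
vertical cut:

```python
def vertical_cut_column(binary):
    """Pick the interior column with the least ink as the vertical cut
    point for a horizontally touching pair of characters.

    binary: list of rows, each a list of 0/1 ink pixels.
    """
    ncols = len(binary[0])
    weights = [sum(row[c] for row in binary) for c in range(ncols)]
    # Search only the middle half so the cut is not placed on a margin.
    lo, hi = ncols // 4, 3 * ncols // 4
    return min(range(lo, hi), key=lambda c: weights[c])

# Two 3-column blobs joined by a single bridge pixel in column 3:
pair = [[1, 1, 1, 0, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 0, 1, 1, 1]]
print(vertical_cut_column(pair))  # → 3
```

A horizontal cut for vertically touching characters is the same computation
applied to rows instead of columns.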
4.2 SYSTEM ARCHITECTURE
4.3 SOFTWARE DESCRIPTION
4.3.1 What is Matlab
MATLAB is a programming and numeric computing platform used by engineers and
scientists to analyse and design systems and products that transform our world.
Using MATLAB, you can:
• Analyse data
• Develop algorithms
• Create models and applications
MATLAB lets you take your ideas from research to production by deploying to
enterprise applications and embedded devices, as well as integrating with
Simulink® and Model-Based Design.
4.4 MODULE DESCRIPTION
4.4.1 What are Modules?
systems, and by interoperability, which allows them to function with the
components of other systems.
1.uigetfile
file = uigetfile opens a modal dialog box that lists files in the current folder
and enables a user to select or enter the name of a file. If the file exists and
is valid, uigetfile returns the file name when the user clicks Open. If the user
clicks Cancel or closes the dialog window, uigetfile returns 0.
2. imread
• A = imread(filename) reads the image from the file specified by filename,
  inferring the format of the file from its contents. If filename is a
  multi-image file, then imread reads the first image in the file.
• A = imread(filename,fmt) additionally specifies the format of the file with the
  standard file extension indicated by fmt. If imread cannot find a file with the
  name specified by filename, it looks for a file named filename.fmt.
• A = imread(___,idx) reads the specified image or images from a multi-image
  file. This syntax applies only to GIF, PGM, PBM, PPM, CUR, ICO, TIF, SVS, and
  HDF4 files. You must specify a filename input, and you can optionally specify
  fmt.
• A = imread(___,Name,Value) specifies format-specific options using one or more
  name-value pair arguments, in addition to any of the input arguments in the
  previous syntaxes.
• [A,map] = imread(___) reads the indexed image in filename into A and reads its
  associated colormap into map. Colormap values in the image file are
  automatically rescaled into the range [0,1].
• [A,map,transparency] = imread(___) additionally returns the image transparency.
  This syntax applies only to PNG, CUR, and ICO files. For PNG files,
  transparency is the alpha channel, if one is present. For CUR and ICO files, it
  is the AND (opacity) mask.
3. imshow
• imshow(I) displays the grayscale image I in a figure. imshow uses the default
  display range for the image data type and optimizes figure, axes, and image
  object properties for image display.
• imshow(I,[low high]) displays the grayscale image I, specifying the display
  range as a two-element vector, [low high]. For more information, see the
  DisplayRange argument.
• imshow(I,[]) displays the grayscale image I, scaling the display based on the
  range of pixel values in I. imshow uses [min(I(:)) max(I(:))] as the display
  range, displaying the minimum value in I as black and the maximum value as
  white. For more information, see the DisplayRange argument.
• imshow(BW) displays the binary image BW in a figure. For binary images, imshow
  displays pixels with the value 0 (zero) as black and 1 as white.
• imshow(filename) displays the image stored in the graphics file specified by
  filename.
• imshow(I,RI) displays the image I with associated 2-D spatial referencing
  object RI.
4. rgb2gray
• I = rgb2gray(RGB) converts the truecolor image RGB to the grayscale image I.
  The rgb2gray function converts RGB images to grayscale by eliminating the hue
  and saturation information while retaining the luminance.
5. graythresh
• level = graythresh(I) computes a global threshold from grayscale image I using
  Otsu's method [1]. Otsu's method chooses a threshold that minimizes the
  intraclass variance of the thresholded black and white pixels. The global
  threshold can be used with im2bw to convert a grayscale image to a binary
  image.
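Otsu's method itself can be sketched outside MATLAB. The following illustrative
sketch (with hypothetical names) chooses the threshold maximizing between-class
variance, which is equivalent to minimizing the intraclass variance that
graythresh minimizes:

```python
def otsu_threshold(hist):
    """Return the 0-255 threshold maximizing between-class variance
    for a 256-bin grayscale histogram."""
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))
    w0 = sum0 = 0
    best_t, best_between = 0, -1.0
    for t in range(256):
        w0 += hist[t]            # pixels at or below t (class 0)
        sum0 += t * hist[t]
        w1 = total - w0          # pixels above t (class 1)
        if w0 == 0 or w1 == 0:
            continue
        mu0, mu1 = sum0 / w0, (sum_all - sum0) / w1
        between = w0 * w1 * (mu0 - mu1) ** 2
        if between > best_between:
            best_t, best_between = t, between
    return best_t

# A bimodal histogram with modes at 50 and 200 splits between them:
hist = [0] * 256
hist[50], hist[200] = 100, 100
print(50 <= otsu_threshold(hist) < 200)  # → True
```

graythresh returns this threshold normalized to [0, 1], ready to be passed to
im2bw, as the sample code later in this report does.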
6. im2bw
• BW = im2bw(I,level) converts the grayscale image I to binary image BW by
  replacing all pixels in the input image with luminance greater than level with
  the value 1 (white) and replacing all other pixels with the value 0 (black).
• This range is relative to the signal levels possible for the image's class.
  Therefore, a level value of 0.5 corresponds to an intensity value halfway
  between the minimum and maximum values of the class.
• BW = im2bw(X,cmap,level) converts the indexed image X with colormap cmap to
a binary image.
7. bwareaopen
• BW2 = bwareaopen(BW,P) removes all connected components (objects) that have
  fewer than P pixels from the binary image BW, producing another binary image,
  BW2.
8. regionprops
• stats = regionprops(BW,properties) returns measurements for the set of
  properties for each 8-connected component (object) in the binary image BW. You
  can use regionprops on contiguous and discontiguous regions.
• stats = regionprops(L,I,properties) returns measurements for the set of
  properties specified by properties for each labeled region in the image I. The
  first input (BW or L) identifies the regions to be measured.
SYSTEM DESIGN
5.2 CLASS DIAGRAM
5.3 ACTIVITY DIAGRAM
TESTING
6.1 SAMPLE CODE:
[filename,pathname]=uigetfile('*','Load an Image');
%% Read Image
imagen=imread(fullfile(pathname,filename));
%% Show image
figure(1)
imshow(imagen);
%% Complement
% Include this line of code when segmenting an image with white text on a black background.
%imagen=imcomplement(imagen);
%% Convert to grayscale (only if the image is RGB)
if size(imagen,3)==3
    imagen=rgb2gray(imagen);
end
threshold = graythresh(imagen);
imagen =~im2bw(imagen,threshold);
imagen = bwareaopen(imagen,30);
pause(1)
figure(2)
imshow(~imagen);
[L,Ne]=bwlabel(imagen); % label connected components
propied=regionprops(L,'BoundingBox');
hold on
for n=1:size(propied,1)
rectangle('Position',propied(n).BoundingBox,'EdgeColor','g','LineWidth',2)
end
hold off
pause(1)
%% Objects extraction
figure(3)
for n=1:Ne
[r,c] = find(L==n);
n1=imagen(min(r):max(r),min(c):max(c));
imshow(~n1);
pause(0.5)
end
CONCLUSION
This work segments touching characters using the categories of 'horizontal
touching' and 'vertical touching', applying the methods on line-segmented
pictures. The Dynamic Labelling is combined with two more recent line
segmentation approaches, the APP and A*PP algorithms; the APP and A*PP
algorithms provide 87% and 89% recognition accuracy (RA) respectively, while the
Dynamic Labelling Algorithm achieves 96% RA.
REFERENCES
[1] Anupama B. & Seenivasa Reddy, "Character Segmentation for Telugu Handwritten
Documents".
[2] Shalini M. & Indira Reddy B., "Character Segmentation for Telugu Image
Document".
[4] Partha Bhowmick & Gaurav Harit, "Character Segmentation of Bengali Handwritten
Text".
[5] Vijay & Madan Kharat, "Segmentation of Devanagari Handwritten Text Using a
Thresholding Approach".
[6] Soumen Bag & Ankit Krishna, "Character Segmentation of Hindi Unconstrained
Handwritten Words".
[7] Kiruba & Nivethitha, "Segmentation of Handwritten Tamil Character from Palm
Leaves".