
International Journal of Computer Applications (0975 – 8887)

International Conference on Advancements in Engineering and Technology (ICAET 2017)

Optical Character Recognition Techniques in Urdu - A Survey

Vippon Preet Kour
Department of Computer Science & Engineering
SMVDU, Katra

Naveen Kumar Gondhi
Department of Computer Science & Engineering
SMVDU, Katra

ABSTRACT
This survey of optical character readers for Urdu-like cursive languages is based on the various techniques and studies reported on the design and implementation of such readers. Because Urdu is written in the Nastaliq font, different approaches have been applied to this font to obtain the desired results. The survey covers all of these techniques, whether segmentation based, on-line or off-line, and the gathered data is presented in tabular form so that the concepts can be understood at a glance. The non-existence of an Urdu OCR has limited the concept of a digital Urdu library, and this non-existence opens a pathway for immense research in this field.

General Terms
Pattern Recognition, Nastaliq font, Urdu language, Pushto language.

Keywords
Image Segmentation, Optical Character Reader, Feature Extraction, Classification.

1. INTRODUCTION
Character recognition, or optical character recognition (OCR), is the mechanical or electronic conversion of images of typed, handwritten or printed text into machine-encoded text, whether the text is taken from a handwritten document or a photo of a document. It is a method of digitizing printed texts so that they can be edited, searched, stored more compactly and displayed online using machine processes. OCR is mostly applied in aids for blind and visually impaired users, data entry for business documents, number plate recognition and defeating CAPTCHA in anti-bot systems. An OCR works by implementing various techniques and tools; on the basis of these an accuracy level is obtained and a result is generated.

A variety of languages are spoken all over the world. Many people are multilingual, but certain languages have had immense cultural influence since prehistoric times. The main languages of the era are Hindi, Sanskrit, Urdu, Punjabi, Sindhi, Pushto etc. Urdu is the national language of Pakistan and an official language of six states of India. It is also one of the official languages recognized in the Constitution of India. Urdu has some specialized vocabulary, and apart from this it is intelligible with standard Hindi. In the Asian continent, Urdu came under the influence of British rule when the British replaced the Persian language with Urdu. The Urdu writing style comes under the 'Cursive Languages Writing Style'. Urdu is written in the Nastaliq format. A lot of old, popular literature written in Urdu exists in handwritten form, but it has not been digitized. So, due to the desire, demand and popularity of Urdu literature, there is a need to design and develop an optical character reader for the Urdu language. The objective of character recognition is to imitate the human reading ability, with human accuracy but at far higher speed.

2. CHALLENGES IN URDU SCRIPT
Urdu-like cursive scripts pose several challenges:

2.1 Bi-directionality
Urdu is written bidirectionally: the characters are written from right to left, while the numerals are written from left to right. This makes OCR design considerably more complex.

2.2 Non-Monotonicity
The writing pattern of Urdu is quite different from that of other cursive languages. In Urdu-like scripts, the writer frequently goes back to an already written character, because certain letters consist of a stroke that goes back and beyond the previous character, e.g. JEEM, HAAY.

2.3 Context sensitivity
Each character in Urdu changes its shape in accordance with the neighboring characters. Thus a character can have different shapes.

2.4 Complex dot placement
Dots in Urdu are often displaced, depending on the characters that neighbor the dotted character. As a result the dots shift from their standard positions.

2.5 Spacing
Unlike the English language, Urdu has no definite procedure for understanding the intra-word spacing. This makes readability somewhat difficult.

2.6 Looping
In Urdu certain characters have negative and positive loops. The negative loops form a shape resembling a circle or an oval, and vice versa.

کہتے ہیں اگلے زمانہ میں

This example contains both positive and negative loops, for example in meem and gaaf.

3. URDU WRITING STYLES
Several Urdu writing styles exist, based on the styles followed in the past by different writers and people around the globe. They are:

3.1 Urdu Script
The Urdu script is an extension of the Persian alphabet, which is itself an extension of the Arabic alphabet, and is written from right to left. This tradition is known as Persian calligraphy, and Urdu follows the Nastaliq style of this calligraphy. It is the most popular and commonly used style of the Urdu language so far.

3.2 Kaithi Script
This script was used in the British administration courts of Bengal, Bihar, the North-West Provinces and Oudh. It is a highly Persianized and technical form of the Urdu language and was used prominently in the 19th century. Most of the legal operations and paperwork of British India were executed in this script.

3.3 Devanagari Script
The introduction of orthographic features by publishers into the Devanagari script served the purpose of representing the Perso-Arabic etymology of Urdu words. This is the most popular script adopted for publishing journals and other technical tasks.

3.4 Roman script
Due to the ease and availability of Roman movable type for the printing press, Urdu was occasionally written in Roman script. It is also prominently used on the internet by young people.

Fig. 1: The Urdu Nastaliq alphabet with letter names in the Devanagari and Latin alphabets.

4. OPTICAL CHARACTER RECOGNITION TYPES
Based on the type of input mode, OCR can be classified as on-line OCR and off-line OCR. The online character recognition system consists of five components.

4.1 Image acquisition
This step involves binarization, filtering, smoothing, slant correction, skew detection, thinning and baseline detection to improve performance. This step affects reliability and efficiency.

4.2 Preprocessing
This step involves tasks such as the separation of dots touching the base of the ligature.

4.3 Segmentation
In this step, the text of a paragraph is segmented into lines, the lines into words, and the words into sub-words/ligatures. There are two kinds of approaches, viz. "segmentation free or holistic methods" and "segmentation based or analytical methods".

4.5 Feature Extraction
This step extracts unique and salient patterns from the input image to enhance the discrimination power and reduce the data for classification. The extracted features can be classified as statistical features, global features and structural features.

4.6 Recognition/classification
Classification/recognition is the main decision-making stage and operates on the extracted features. The pattern is identified and recognized from the input features.

5. STEPS OF AN OCR PROCESS
The flow diagram shows the main steps taken into consideration while designing an OCR:

SCANNED INPUT -> PREPROCESSING -> SEGMENTATION -> FEATURE EXTRACTION -> RECOGNIZED TEXT

Fig. 2: Steps of OCR for cursive languages

A brief description of the diagram is given below:

5.1 Input or scanned pages
This is the data which needs to be recognized by the optical character reader. It can be in any form, i.e., either text or an image.

5.2 Pre-processing
This step is performed on raw data to prepare it for further processing. It is the preliminary step that transforms the data into a format that can be processed more effectively and easily. It is an important step and usually consists of binarization, filtering, smoothing, slant correction, skew detection, thinning, baseline detection etc. These tasks must be carried out with care, as errors here can severely and adversely affect the upcoming steps.

5.3 Segmentation
It is defined as the process by which given data is divided into sub-data; or we can say that, in order to detect what is actually contained in the image or input, the division of the
given image or data is done into subparts, and these subparts, after processing, are either merged together to determine what was actually in the input or fed to the next step for processing.

5.4 Feature extraction
This step starts from the initial set of measured data and extracts derived features or values which are intended to be informative and non-redundant. The extracted or selected features are assumed to contain the desired information. The extracted features can be classified as structural features, statistical features and global transformations.

5.5 Recognized text
This is the final output, i.e. the desired data obtained by applying the previous steps.

6. APPROACHES
For the design of an optical character reader for the Urdu language, a wide variety of techniques/approaches have been used by different researchers.

6.1 Segmentation based approaches
Segmentation is the process in which the given data is divided, i.e., subparts of the data are formed. The data to be segmented can be in any form, e.g., a paragraph, a line, a word etc. The subparts are formed such that they can be processed easily in the upcoming steps. Segmentation proceeds sequentially: the paragraph is segmented into individual lines, each line is then segmented into words, and at last the words are segmented into characters.

PARAGRAPH -> LINE -> WORD -> CHARACTER

Fig. 3: Step-wise representation of segmentation

The two common types of approaches in segmentation are the segmentation free or holistic approach and the segmentation based or analytical approach. The analytical approach is further divided into two methods, indirect and direct. In the direct method, the ligatures are not further segmented, while in the indirect method a word is separated directly into letters using a number of heuristics that identify all of the segmentation points of a character. A ligature is resolved by splitting it into smaller elements that might be letters or less than letters, such as sub-letters or small strokes, which need further segmentation. Hence, segmentation proves to be an extensive and important approach in OCR.

Fig. 4: Segmented Urdu word

6.2 Hidden Markov Models (HMM)
An HMM is a statistical model in which the system being modeled is assumed to be a Markov process with hidden states. An HMM can be represented as the simplest form of dynamic Bayesian network. It is related to the forward-backward procedure of the optimal non-linear filtering problem. In this model the state is not directly visible to the observer, but the output dependent on the state is visible. Each state has a probability distribution over the possible output tokens, so the sequence of tokens generated by the model gives information about the sequence of states. The word 'hidden' refers to the states through which the model passes, not to the parameters of the model. HMMs are generally applied in temporal pattern recognition such as speech, handwriting, gesture recognition, musical score following etc. The diagrammatic representation of the model is shown as:

[Diagram: states X1, X2, X3 linked by transitions a12, a22, a23; output probabilities b11, b12, b13, b14 lead to observations Y1, Y2, Y3, Y4]

Fig. 5: Hidden Markov Model with X: states, Y: possible observations, a: state transition probabilities, b: output probabilities.

6.3 Template matching
In this approach we find the small parts of an image which match the template image. It is a feature based approach, i.e., strong features are taken into consideration while matching. If strong features are not present, the template based approach is followed, and it proves to be effective. It requires sampling a large number of points and then matching the points from that sample. If the template does not provide a direct match, other methods like motion tracking and occlusion handling are performed. This approach has various applications such as face detection, visual object recognition, car plate recognition etc. It is a simple method used for classification and pattern recognition. A database of templates, called the training data, is used for matching. It can identify scanned or computer-written characters, numbers, and the secondary characters known as diacritics. The template image is moved to all possible positions of the source image, and an exact match with the nearest representation is extracted and taken into consideration. In most cases all matching is done on a pixel-by-pixel basis.

6.4 Unicode Mapping
Unicode is a computing industry standard for the consistent encoding, representation and handling of text in most writing systems. At the time of writing, the latest version of Unicode contains a repository of more than 120,000 characters covering about 129 modern and historic scripts along with multiple symbol sets. Unicode can be implemented by different character encodings. All of the possible sequences of segments are generated and stored in a file. After the segments are generated, we produce the Unicode of each segment. One character can have multiple segments (over segmentation), while others can combine to
construct one segment (under segmentation). One character can have a different number or type of diacritics to distinguish it from other characters. So, a state machine has been developed to cope with all the combinations. One sequence of states exhausts the inputs and returns the Unicode value of one character. When a ligature is tested, a long sequence for all segments is generated and then looked up in the state table. If a matching sequence is found then the value is acquired; otherwise one state is dropped from the sequence and the table is searched again. So, it is the longest sequence matching algorithm.

[State transition table: sequences of segment states mapped to Unicode values such as U0653, U0670, U0647 and U0632]

Fig. 6: State Transition Table

Table 1: Comparison of various OCR approaches

Approach | Authors | Data Sets | Classification | Accuracy
Segmentation | Hussain et al. [1] | 200 ligatures | Connected component labelling and centroid-to-centroid distance | 100%
Segmentation | Pal et al. [2] | Small variety of characters | Horizontal and vertical profile, component labeling | 96.90%
Segmentation | Akram and Hussain [3] | 150 sentences composed of 6075 ligatures and 2156 words | Ligature used as a structural method, trigram trained on co-occurrence information of ligatures and words in the corpus | 99.40%
Isolated Character Recognition | Pal et al. [2] | 3050 characters | Topological, contour and water reservoir, template matching | 98%
Isolated Character Recognition | Zaman et al. [4] | 106 ligatures | Pixel values using row major and column major order, template matching | 95.00%
Machine Printed Ligature Or Word Recognition | Hussain et al. [1] | 200 ligatures | FFNN Back, solidity, number of holes, axis eccentricity, moments | 100%
Handwritten Isolated Character, Ligature, Word And Numeral Recognition | Sagheer et al. [5] | 60329 | Numeral, gradient, SVM | 99%
Handwritten Isolated Character, Ligature, Word And Numeral Recognition | Basu et al. [6] | 3000 | Numeral, QTLR | 96.20%
Online Isolated Character, Ligature Or Word And Numeral Recognition | Sardar [7] | 1050 | Sliding window and HMM, KNN | 97%
Online Isolated Character, Ligature Or Word And Numeral Recognition | Razzak et al. [8] | 900 images | Numeral, Structural, Rule Based | 96.30%
Online Isolated Character, Ligature Or Word And Numeral Recognition | Razzak et al. [9] | 900 ligatures | Numeral, Fuzzy Logic, Fuzzy rule, Hybrid and HMM | 97.80%

7. CONCLUSION
A reliable Urdu script OCR is still a far cry due to immense challenges. In particular, the Nastaliq style of writing and its geometrical difference from the Naskh style of writing make this more challenging. Researchers have tried both online and offline forms of handwritten text, but have not yet been fully successful in either of them. Isolated character and ligature based recognition has been the most actively pursued direction in research on the Urdu script so far. To date there is no multilingual OCR available, but there is a need to develop algorithms that can incorporate an unlimited database, as there is high similarity among Arabic script languages.

8. REFERENCES
[1] S.A. Husain, A multi-tier holistic approach for urdu Nastaliq recognition, in: Proceedings of the 6th International Multitopic IEEE Conference (INMIC'02), 2002.
[2] U. Pal, A. Sarkar, Recognition of printed Urdu script, in: Proceedings of the Seventh International Conference on Document Analysis and Recognition (ICDAR 2003), 2003.
[3] M. Akram, S. Hussain, Word segmentation for urdu OCR system, in: Proceedings of the 8th Workshop on Asian Language Resources. Asian Federation for Natural Language Processing, Beijing, China, 2010.
[4] S. Zaman, W. Slany, F. Sahito, Recognition of segmented Arabic/Urdu characters using pixel values as their features, in: Proceedings of the 1st International
Conference on Computer and Information Technology (ICCIT'2012), 2012.
[5] M.W. Sagheer, C.L. He, N. Nobile, C.Y. Suen, A new large Urdu database for off-line handwriting recognition 5716 (2009).
[6] S. Basu, N. Das, R. Sarkar, M. Kundu, M. Nasipuri, D.K. Basu, A novel framework for automatic sorting of postal documents with multi-script address blocks, Pattern Recognition 43 (10) (2010).
[7] S. Sardar, A. Wahab, Optical character recognition system for Urdu: online and offline OCR irrespective of fonts, in: Proceedings of the International Conference on Information and Emerging Technologies (ICIET), Karachi, Pakistan, 2010.
[8] M.I. Razzak, A. Belaïd, S.A. Hussain, Effect of ghost character theory on arabic script based languages character recognition, in: Proceedings of the WASE Global Conference on Image Processing and Analysis (GCIA'09), Taiwan, China, 2009.
[9] M.I. Razzak, F. Anwar, S.A. Husain, A. Belaïd, M. Sher, HMM and fuzzy logic: a hybrid approach for online urdu script-based languages' character recognition, Knowledge Based Systems 23 (8) (2010).
[10] S.T. Javed, Investigation into a segmentation based OCR for the Nastaleeq writing system (Master's thesis). National University of Computer & Emerging Sciences, Lahore, Pakistan, 2007.
[11] Z.A. Shah, Ligature based optical character recognition of Urdu-Nastaleeq font, in: Proceedings of the 6th International Multitopic IEEE Conference (INMIC'02), 2002.
[12] S.T. Javed, S. Hussain, Improving Nastalique-specific pre-recognition process for Urdu OCR, in: Proceedings of the 13th International Multitopic IEEE Conference (INMIC'09), 2009.
[13] S.F. Rashid, S.S. Bukhari, F. Shafait, T.M. Breuel, A discriminative learning approach for orientation detection of urdu document images, in: Proceedings of the 13th International Multitopic IEEE Conference (INMIC'09), 2009.
[14] M. Riley, Beyond quasi-stationarity: designing time-frequency representation for speech signals, in: Proceedings of the International Conference on Acoustics Speech and Signal Processing (ICASSP87), vol. 12, 1987, pp. 657-660.
[15] Nabeel Shahzad, Brandon Paulson and Tracy Hammond, Urdu Qaeda: Recognition System for Isolated Urdu Characters, IUI 2009 Workshop on Sketch Recognition, February 8, 2009, Sanibel Island, Florida. Chair: Tracy Hammond.
[16] Tabassam Nawaz, Syed Ammar Hassan Shah Naqvi, Habib ur Rehman & Anoshia Faiz, Optical Character Recognition System for Urdu (Naskh Font) Using Pattern Matching Technique, International Journal of Image Processing (IJIP), Volume (3) : Issue (3).
[17] Sohail Abdul Sattar, Shams-ul Haque, Mahmood Khan Pathan, "A Finite State Model for Urdu Nastalique Optical Character Recognition", IJCSNS International Journal of Computer Science and Network Security, VOL.9 No.9, September 2009.
[18] Faisal Shafait, Adnan-ul-Hasan, Daniel Keysers, and Thomas M. Breuel, "Layout Analysis of Urdu Document Images," Multitopic Conference, 2006. INMIC '06. IEEE, pp. 293-298.
[19] S.A. Hussain, Anwar F., Asma, "Online Urdu Character Recognition System." MVA2007 IAPR Conference on Machine Vision Applications.
[20] Liana M & Venu G. (2006). Offline Arabic Handwriting Recognition: A Survey. IEEE Transactions On Pattern Analysis and Machine Intelligence, vol. 28, No. 5, pp. 712-724.
[21] R. Safabakhsh and P. Adibi. (2005). Nastaaligh Handwritten Word Recognition Using a Continuous-Density Variable-Duration HMM. The Arabian J. Science and Eng., vol. 30, pp. 95-118.
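As an illustration of the hidden Markov model described in Section 6.2, the sketch below computes the probability of an observation sequence with the forward algorithm. It is only a minimal sketch under stated assumptions: the two-state model, the state names X1 and X2, the observation names Y1 and Y2, and every probability value are made up for this example and are not taken from any of the surveyed systems.

```python
from typing import Dict, List


def forward_probability(
    observations: List[str],
    states: List[str],
    start_p: Dict[str, float],
    trans_p: Dict[str, Dict[str, float]],
    emit_p: Dict[str, Dict[str, float]],
) -> float:
    """Return P(observations) under the HMM using the forward algorithm."""
    # alpha[s]: probability of the observations seen so far, ending in state s
    alpha = {s: start_p[s] * emit_p[s][observations[0]] for s in states}
    for obs in observations[1:]:
        alpha = {
            s: emit_p[s][obs]
            * sum(alpha[prev] * trans_p[prev][s] for prev in states)
            for s in states
        }
    # Summing over the final state marginalizes out the hidden path
    return sum(alpha.values())


# Hypothetical toy model: all numbers below are illustrative assumptions.
STATES = ["X1", "X2"]
START = {"X1": 0.6, "X2": 0.4}                    # initial state probabilities
TRANS = {"X1": {"X1": 0.7, "X2": 0.3},            # a: state transition probabilities
         "X2": {"X1": 0.4, "X2": 0.6}}
EMIT = {"X1": {"Y1": 0.5, "Y2": 0.5},             # b: output probabilities
        "X2": {"Y1": 0.1, "Y2": 0.9}}

likelihood = forward_probability(["Y1", "Y2"], STATES, START, TRANS, EMIT)
print(f"P(Y1, Y2) = {likelihood:.4f}")
```

In a recognizer, one such model would typically be trained per character or ligature class, and the class whose model gives the highest likelihood for the observed feature sequence would be chosen.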

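The pixel-by-pixel template matching of Section 6.3 can be sketched as follows. This is a simplified illustration, assuming binary images stored as nested lists (1 for ink, 0 for background); the names `match_score` and `best_match`, the agreement-count score and the sample data are assumptions of this sketch, not the method of any surveyed paper.

```python
from typing import List, Tuple

Image = List[List[int]]  # binary image: 1 = ink, 0 = background


def match_score(source: Image, template: Image, top: int, left: int) -> int:
    """Count the pixels that agree when the template is placed at (top, left)."""
    return sum(
        1
        for r, row in enumerate(template)
        for c, pixel in enumerate(row)
        if source[top + r][left + c] == pixel
    )


def best_match(source: Image, template: Image) -> Tuple[int, int, int]:
    """Slide the template over every position; return (score, top, left)."""
    t_rows, t_cols = len(template), len(template[0])
    best = (-1, 0, 0)
    for top in range(len(source) - t_rows + 1):
        for left in range(len(source[0]) - t_cols + 1):
            score = match_score(source, template, top, left)
            if score > best[0]:  # keep the first position with the highest score
                best = (score, top, left)
    return best


# Hypothetical 4x5 source image containing one copy of the 2x2 template.
SOURCE = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
]
TEMPLATE = [
    [1, 1],
    [1, 0],
]

score, top, left = best_match(SOURCE, TEMPLATE)
print(score, top, left)
```

A score equal to the number of template pixels means an exact match; in practice a lower acceptance threshold would be chosen to tolerate noise.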

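The longest sequence matching described in Section 6.4 can be sketched as a table lookup that drops one state at a time until a stored sequence is found. The segment identifiers and the contents of `STATE_TABLE` below are hypothetical, since the actual state table is not recoverable from the text; only the Unicode code points themselves (U+0631 REH, U+0632 ZAIN, U+0647 HEH) are real.

```python
from typing import Dict, List, Tuple

# Hypothetical state table: sequence of segment ids -> Unicode character.
STATE_TABLE: Dict[Tuple[int, ...], str] = {
    (3,): "\u0631",      # ARABIC LETTER REH
    (3, 17): "\u0632",   # ARABIC LETTER ZAIN (assumed: REH shape plus a dot segment)
    (47,): "\u0647",     # ARABIC LETTER HEH
}


def decode_segments(
    segments: List[int], table: Dict[Tuple[int, ...], str]
) -> List[str]:
    """Map a ligature's segment sequence to characters, longest match first."""
    decoded: List[str] = []
    i = 0
    while i < len(segments):
        # Try the longest remaining sequence, then drop one state at a time
        for end in range(len(segments), i, -1):
            key = tuple(segments[i:end])
            if key in table:
                decoded.append(table[key])
                i = end
                break
        else:
            raise ValueError(f"no table entry starts at segment index {i}")
    return decoded


text = "".join(decode_segments([3, 17, 47, 3], STATE_TABLE))
print([f"U+{ord(ch):04X}" for ch in text])
```

Trying the longest sequence first is what lets an over-segmented character (here the two-segment ZAIN) be resolved before its shorter prefix (REH) is considered.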