Dalal 2008
not involving any learning machine, e.g. a relevance index
based on correlation coefficients or test statistics, whereas
wrappers use the performance of a learning machine
trained using a given feature subset. Both filter and
wrapper methods can make use of search strategies to
explore the space of all possible feature combinations, which
is usually too large to be explored exhaustively. Yet filters
are sometimes assimilated to feature ranking methods, for
which feature subset generation is trivial since only single
features are evaluated. Hybrid methods exist, in which a
filter is used to generate a ranked list of features. On the
basis of the order thus defined, nested subsets of features
are generated and evaluated by a learning machine, i.e.
following a wrapper approach. Another class, embedded
methods, incorporates feature subset generation and
evaluation in the training algorithm. For the last item on the
list, criterion estimation, the difficulty to overcome is that a
defined criterion (a relevance index or the performance of
a learning machine) must be estimated from a limited
amount of training data. Two strategies are possible:
“in-sample” or “out-of-sample”. The first one (in-sample) is the
“classical statistics” approach. It refers to using all the
training data to compute an empirical estimate. That
estimate is then tested with a statistical test to assess its
significance, or a performance bound is used to give a
guaranteed estimate. The second one (out-of-sample) is the
“machine learning” approach. It refers to splitting the
training data into a training set, used to estimate the
parameters of a predictive model (learning machine), and a
validation set, used to estimate the predictive performance
of the learning machine. Averaging the results of multiple
splits (or “cross-validation”) is commonly used to
decrease the variance of the estimator.

1.3. Methods for feature extraction

1.3.1. Horizontal and Vertical Histogram

The horizontal and vertical histogram technique is one of the
handwritten script identification techniques that uses the
matra/Shirorekha-based feature. In this technique, the longest
horizontal run of black pixels on the rows of a Bangla text
word is expected to be much longer than that of English script(3).
This is so because the characters in a Bangla word are
generally connected by the matra/Shirorekha (see Fig. 2). Here the
row-wise histogram of the longest horizontal run is shown
in the right part of the words. This information has been
used to separate English from Bangla script. The matra feature
is considered to be present in a word if the length of the
longest horizontal run of the word satisfies the following
two conditions: (a) it is greater than 45% of the width of
the word, and (b) it is greater than thrice the height of the
busy-zone.

Fig. 2. The matra/Shirorekha feature in Bangla and
English words

1.3.2. Curvature Information and Local Extrema of
Curvature

Curvature information gives a unique, viewpoint-independent
description of local shape(12). In differential
geometry, it is well known that a surface can be
reconstructed up to second order (except for a constant
term) if the two principal curvatures at each point are
known, by using the first and second fundamental forms.
Therefore, curvature information provides a useful shape
descriptor for various tasks in computer vision, ranging
from image segmentation and feature extraction to scene
analysis and object recognition. In the field of handwritten
alphanumeric character recognition, many researchers use
different methods for recognition, such as geometrical and
topological features, statistical features, and other algorithms
that perform recognition based on character shape. A
good handwriting recognition system, however, depends on two
main attributes: first, the features selected from a
handwritten character; second, the recognizer that is trained
to remember the features of each character in order to cluster
and recognize each input character. In the curvature
information approach, a handwritten character is treated as a
sequence of curve segments. Each curve segment is characterized by
its degree of curvature, which is measured by the cumulative
angle difference over all sampling points within the
segment. If the cumulative angle is negative, the segment is a
clockwise curve. Since some characters may consist of the same
number of segments with the same curvature (e.g. one segment
with a clockwise curve), other features of the segments are
gathered to distinguish these characters.

1.3.3. Topological Features

Topological features include loops (a group of white
pixels surrounded by black ones), end points (a point with
exactly 1 neighboring point), dots (a cluster of, say, 1-3 pixels)
and junctions (a point with more than 2 neighbors), all in
thinned black and white images(11); they are shown in Fig. 3.
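The neighbor-count definitions of end points and junctions given for the topological features can be sketched as follows. This is only a minimal illustration, not the implementation used in the cited work: it assumes the thinned image is a binary NumPy array with foreground pixels set to 1, it assumes 8-connectivity (the text does not state which neighborhood is meant), and the function names are hypothetical.

```python
import numpy as np

def neighbor_counts(skeleton):
    """Count the 8-connected foreground neighbors of every foreground pixel.

    Assumption: `skeleton` is a thinned binary image, foreground > 0.
    Background pixels get a count of 0.
    """
    s = (np.asarray(skeleton) > 0).astype(int)
    padded = np.pad(s, 1)  # zero border so edge pixels need no special case
    h, w = s.shape
    total = np.zeros_like(s)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy == 0 and dx == 0:
                continue  # a pixel is not its own neighbor
            total += padded[1 + dy:1 + dy + h, 1 + dx:1 + dx + w]
    return total * s

def end_points_and_junctions(skeleton):
    """Return (end_points, junctions) as lists of (row, col) tuples."""
    n = neighbor_counts(skeleton)
    rows, cols = np.nonzero(n == 1)   # exactly 1 neighbor: end point
    end_points = list(zip(rows.tolist(), cols.tolist()))
    rows, cols = np.nonzero(n > 2)    # more than 2 neighbors: junction
    junctions = list(zip(rows.tolist(), cols.tolist()))
    return end_points, junctions

# A small Y-shaped skeleton: two diagonal arms meeting a vertical stem.
y_shape = np.array([
    [1, 0, 0, 0, 1],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
])
end_points, junctions = end_points_and_junctions(y_shape)
```

On skeletons that are not strictly one pixel wide, pixels adjacent to a true junction can also have more than two 8-neighbors, so a practical system would probably merge nearby junction candidates.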
Fig. 3. Loop, End points, a Dot and Junctions.

Dots: Dots above the letters “i” and “j” can be identified
with a simple set of rules. Short, isolated strokes occurring
on or above the half-line are marked as potential dots.

Junctions: Junctions occur where two strokes meet or
cross and are easily found in the skeleton as points with
more than two neighbors.

Endpoints: Endpoints are points in the skeleton with only
one neighbor and mark the ends of strokes, though some
are artifacts of the skeletonization algorithm.

Loops: Loops are found from connected-component
analysis on the smoothed image, to find areas of
background color not connected to the region surrounding
the word. A loop is coded by a number representing its
area.

$= \int_{-1}^{1} f(t)\, g(t)\, dt$. The Chebyshev series for
X and Y is

$$X(t) = \sum_{i=0}^{\infty} \alpha_i T_i(t), \qquad Y(t) = \sum_{i=0}^{\infty} \beta_i T_i(t)$$
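As a rough illustration of the Chebyshev series for X and Y, the coefficients α_i and β_i of a truncated series can be estimated from sampled pen coordinates by least squares. The sketch below uses NumPy's `numpy.polynomial.chebyshev.chebfit`. The function name, the truncation degree and the uniform mapping of the sample index onto t in [-1, 1] are assumptions made for this example and are not taken from the text.

```python
import numpy as np
from numpy.polynomial import chebyshev

def chebyshev_coefficients(x, y, degree=5):
    """Least-squares Chebyshev coefficients of the coordinate functions.

    Assumption: the samples are spread uniformly over the parameter
    interval t in [-1, 1]. Returns (alpha, beta), where alpha[i] and
    beta[i] multiply T_i(t) in the truncated series for X(t) and Y(t).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    t = np.linspace(-1.0, 1.0, len(x))
    alpha = chebyshev.chebfit(t, x, degree)
    beta = chebyshev.chebfit(t, y, degree)
    return alpha, beta

# Synthetic stroke: X(t) = t = T_1(t) and Y(t) = 2t^2 - 1 = T_2(t),
# so the fit should recover alpha = [0, 1, 0, 0] and beta = [0, 0, 1, 0].
t = np.linspace(-1.0, 1.0, 50)
alpha, beta = chebyshev_coefficients(t, 2.0 * t**2 - 1.0, degree=3)
```

The leading coefficients could then serve as a fixed-length feature vector for a stroke, although the truncation degree would have to be chosen for the data at hand.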
leading ligature and the right end of the ending ligature
constitute turnover points.

Fig. 6. Splitting a contour into upper and lower parts. The
upper contour is shown in dotted lines and the lower
contour is shown in bold lines.

2. Local Contour Extrema: Local extrema represent
a powerful abstraction of the shape of the word. Local
Y-extrema are simply derived from the y-coordinates of the
pixels on the chaincoded contour. The primary challenge is
the detection and rejection of spurious extrema arising
from irregularities in the contour. Heuristic filters are used
to eliminate spurious extrema. These include checks on the
distance of the detected extremum from previously detected
extrema, spatial position relative to other extrema, the slope
of the contour in the neighborhood of the extremum, and
so forth. The majority of spurious extrema in discrete writing
are due to ends of strokes used to form characters, and
generally occur in the middle zone of the word.
Fragmentation in the binary image due to suboptimal
binarization or poor resolution often leads to broken
strokes and spurious extrema. Since the upper contour
flows effectively from right to left, left-to-right excursions
of the contour must be followed by right-to-left retraces,
and this leads to additional extrema being detected.
Extrema found on such uncharacteristic excursions must
be discarded lest they lead to spurious ascenders. A similar
problem arises with right-to-left stretches of the lower
contour in the context of spurious descenders. Let us refer
to extrema on such uncharacteristic excursions of the upper
and lower contours as being directionally invalid.
Directional validity is conveniently determined from the
slopes of the contour elements into and out of the
extremum and is sufficient to discard most spurious
extrema.

3. Reference Line:
A time-tested technique for global reference line
determination is based on the vertical histogram of pixels.
The method makes the implicit assumption that the word
was written horizontally and scanned in without skew.
Consequently, the method fails for words with significant
baseline skew. Unfortunately, baseline skew is present to
varying degrees in most freeform handwriting. Baseline
skew can be corrected by estimating the skew of the word
and applying a rotation or shear transformation on the
image. An alternative to skew correction is the use of
reference lines of the form $r_i(x) = m x + c$, where m is the
slope corresponding to the skew angle and c is the offset of
the corresponding reference line from the x-axis. Angled
reference lines may be computed from the angular histogram
at the skew angle. The best-fit line through the minima points
may be determined by a least-squares linear regression
procedure (Figure 7). Minima that do not fall in the vicinity
of the implicit baseline either correspond to descenders or are
spurious. The halfline may be determined as the regression
line through upper contour maxima. However, upper
contour maxima are often poorly aligned, and spurious
points are difficult to detect and remove, especially in
more discrete (non-cursive) writing styles. Consequently,
the resulting halfline is often erroneous. Reference lines
computed from the angular histogram at the skew angle of
the baseline have proved to be more reliable (Figure 7(c)).

Fig. 7. (a) Exterior contour of word showing lower-
contour minima. (b) Baseline determined as regression
line through minima. (c) Angular reference lines from
angular histogram at skew angle.

4. Word Length, Ascenders, and Descenders: The
number of minima on the lower contour of a word image is
a good estimate of the length of a word. It is expected that
such a length would be proportional to the number of
letters in the word. Figure 4 shows the length estimate of
the word as 11 and also shows one ascender and one
descender. The positions of ascenders and descenders
along with the length of the word are used as features. An
ascender refers to the portion of a letter in a word that rises
above the general body of the word. A descender refers to
the portion of the word that falls below the general body of
the word. Ascenders and descenders can be appropriately
selected from the extrema of the upper and lower contours
of the word.

Fig. 8. Local contour extrema: maxima and minima
on the exterior and interior contours are marked along
with the reference lines. The loop associated with
‘b’ that rises above the reference lines in the middle is
called an ascender and the loop associated with
‘q’ that falls below the reference lines in the middle is
called a descender.
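The baseline estimation described under Reference Line and the word-length estimate described under Word Length can be sketched together as follows. This is a hedged illustration only. It assumes the lower contour is given as one y-value per x position with y increasing upward (so contour minima are the lowest points), it uses `numpy.polyfit` for the least-squares line, and it rejects minima far from the line with a simple iterative rule. The rejection rule, the `tolerance` value and the function names are assumptions for this example and are not taken from the source.

```python
import numpy as np

def local_minima(ys):
    """Indices whose y-value is strictly lower than both neighbors."""
    ys = np.asarray(ys, dtype=float)
    return [i for i in range(1, len(ys) - 1)
            if ys[i] < ys[i - 1] and ys[i] < ys[i + 1]]

def fit_baseline(xs, ys, tolerance=3.0):
    """Fit r(x) = m*x + c through the lower-contour minima.

    Repeatedly drops the minimum with the largest residual while that
    residual exceeds `tolerance` (assumed heuristic). Returns
    (m, c, rejected), where `rejected` lists the indices of minima
    treated as descenders or spurious points.
    """
    xs = np.asarray(xs, dtype=float)
    ys = np.asarray(ys, dtype=float)
    kept = local_minima(ys)
    if len(kept) < 2:
        raise ValueError("need at least two contour minima to fit a line")
    rejected = []
    while True:
        m, c = np.polyfit(xs[kept], ys[kept], 1)  # least-squares line
        residuals = np.abs(ys[kept] - (m * xs[kept] + c))
        worst = int(np.argmax(residuals))
        if residuals[worst] <= tolerance or len(kept) <= 2:
            return m, c, sorted(rejected)
        rejected.append(kept.pop(worst))

# Synthetic lower contour: minima resting on the skewed line y = 0.1x + 5,
# with one descender at x = 10 that dips well below it.
xs = np.arange(21)
ys = 0.1 * xs + 9.0
for x in (2, 6, 10, 14, 18):
    ys[x] = 0.1 * x + 5.0
ys[10] -= 8.0

word_length = len(local_minima(ys))  # number of lower-contour minima
m, c, rejected = fit_baseline(xs, ys)
```

In this synthetic case the fit recovers the skewed baseline and flags the minimum at x = 10 as lying off it. On real contours the directional-validity and distance checks described in the text would need to run before the regression.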
1.4. Concluding Remark

There are various methods of feature extraction. The
technique for extracting features is developed according to
the features of interest, and classification is then carried
out on the extracted features. Horizontal and vertical
histograms are used to extract features; dots and end
points can also be used to recognize handwritten script;
curvature information yields features that describe a
character as a sequence of curves; and so on. In general,
the properties or features of the script are used to devise
the techniques for extracting them.

1.5. References

[1] R. Plamondon and S. N. Srihari, “On-line and off-line
handwriting recognition: A comprehensive survey”, IEEE Trans.
on Pattern Analysis and Machine Intelligence, Vol. 22, 2000, pp.
62-84.

[2] U. Mahadevan and S. N. Srihari, “Parsing and Recognition
of City, State, and ZIP Codes in Handwritten Addresses”, In
Proc. of 5th Int. Conf. on Document Analysis and Recognition,
1999, pp.

[3] K. Roy, S. Vajda, U. Pal, and B. B. Chaudhuri, “A System
towards Indian Postal Automation”, In Proc. of International
Workshop on Frontier of Handwriting Recognition-9, 2004.

[4] U. Pal and B. B. Chaudhuri, “Script line separation from
Indian multi-script documents”, IETE Journal of Research, Vol.
49, 325-328, 2003, pp. 3-11.

[5] Z. Shi and V. Govindaraju, “Skew Detection for Complex
Document Images using Fuzzy Runlength”, In Proc. of 7th Int.
Conf. on Document Analysis and Recognition, 2003, pp. 715-
719.

[7] U. Pal and P. P. Roy, “Multi-oriented and curved text lines
extraction from Indian documents”, IEEE Trans. on Systems,
Man and Cybernetics - Part B, Vol. 34, 2004, pp. 1676-1684.

... Pattern Analysis and Machine Intelligence, vol. 20, no. 3,
March 1998.

[10] S. Madhvanath, G. Kim, and V. Govindaraju, “Chaincode
Contour Processing for Handwritten Word Recognition”.

[11] Nawwaf N. Kharma and Rabab K. Ward, “Character
recognition system for non experts”, tutorial, University of
British Columbia.

[12] Chi-Keung Tang and Gérard Medioni, “Robust Estimation
of Curvature Information from Noisy 3D Data for Shape
Description”, Institute for Robotics and Intelligent Systems,
University of Southern California, Los Angeles, California
90089-0273.

[13] Isabelle Guyon and André Elisseeff, “An Introduction to
Feature Extraction”, ClopiNet, 955 Creston Rd., Berkeley,
CA 94708, USA.

[14] Reena Bajaj, Lipika Dey and Santanu Chaudhury,
“Devnagari numeral recognition by combining decision of
multiple connectionist classifiers”, Department of Electrical
Engineering, Indian Institute of Technology, Department of
Mathematics, Indian Institute of Technology, New Delhi 110016,
India, Sadhana, Vol. 27, Part 1, February 2002, pp. 59-72.

[15] Daniel I. Morariu, Lucian N. Vintan, and Volker Tresp,
“Evolutionary Feature Selection for Text Documents using the
SVM”, International Journal of Applied Mathematics and
Computer Sciences, Volume 1, Number 1.

[16] Khampheth Bounnady, Boontee Kruatrachue, and Somkiat
Wangsiripitak, “On-Line Lao Handwritten Recognition with
Proportional Invariant Feature”, Proceedings of World Academy
of Science, Engineering and Technology, Volume 5, April 2005,
ISSN 1307-6884.