
First International Conference on Emerging Trends in Engineering and Technology

A Survey of Methods and Strategies for Feature Extraction in


Handwritten Script Identification

Ms. Snehal Dalal, P.C.E., Nagpur, [email protected]
Mrs. Latesh Malik, G.H.R.C.E., Nagpur, [email protected]

Abstract

Feature extraction is one of the basic functions of handwritten script identification. It involves measuring those features of the input pattern that are relevant to classification. This paper provides a review of these advances. The aim is to provide an appreciation for the range of techniques that have been developed rather than to simply list sources. Various types of features proposed for handwritten script identification include: horizontal and vertical histograms; curvature information and local extrema of curvature; topological features such as loops (a group of white pixels surrounded by black ones), end points (points with exactly one neighbouring point), dots (a cluster of, say, 1-3 pixels) and junctions (points with more than two neighbours), all in thinned black and white images; parameters of polynomial or curve fitting functions; and contour information, where the contour is the outside boundary of a pattern.

Keywords: matra/shirorekha based feature, topological features, contour information, feature extraction, handwritten recognition.

1. Introduction

1.1. The Role of Feature Extraction

Feature is synonymous with input variable or attribute. Finding a good data representation is very domain specific and related to the available measurements. In a medical diagnosis example, the features may be symptoms, that is, a set of variables categorizing the health status of a patient (e.g. fever, glucose level, etc.). Human expertise, which is often required to convert "raw" data into a set of useful features, can be complemented by automatic feature construction methods. In some approaches, feature construction is integrated in the modeling process. For example, the "hidden units" of artificial neural networks compute internal representations analogous to constructed features. In other approaches, feature construction is a preprocessing step. To describe preprocessing steps, let us introduce some notation. Let x be a pattern vector of dimension n, x = [x1, x2, ..., xn]. The components xi of this vector are the original features; x′ denotes a vector of transformed features of dimension n′. Preprocessing transformations may include:

Standardization: Features can have different scales although they refer to comparable objects. Consider, for instance, a pattern x = [x1, x2] where x1 is a width measured in meters and x2 is a height measured in centimeters. Both can be compared, added or subtracted, but it would be unreasonable to do so before appropriate normalization. The following classical centering and scaling of the data is often used: x′i = (xi − µi)/σi, where µi and σi are the mean and the standard deviation of feature xi over the training examples.

Normalization: Consider for example the case where x is an image and the xi's are the number of pixels with color i. It makes sense to normalize x by dividing it by the total number of counts in order to encode the distribution and remove the dependence on the size of the image. This translates into the formula: x′ = x/||x||.

Signal enhancement: The signal-to-noise ratio may be improved by applying signal- or image-processing filters. These operations include baseline or background removal, de-noising, smoothing, or sharpening. The Fourier transform and wavelet transforms are popular methods.

Extraction of local features: For sequential, spatial or other structured data, specific techniques like convolution methods using hand-crafted kernels or syntactic and structural methods are used. These techniques encode problem-specific knowledge into the features.

Linear and non-linear space embedding methods: When the dimensionality of the data is very high, some techniques might be used to project or embed the data into a lower dimensional space while retaining as much information as possible. Classical examples are Principal Component Analysis (PCA) and Multidimensional Scaling (MDS). The coordinates of the data points in the lower dimensional space might be used as features or simply as a means of data visualization.

Non-linear expansions: Although dimensionality reduction is often summoned when speaking about complex data, it is sometimes better to increase the dimensionality.

978-0-7695-3267-7/08 $25.00 © 2008 IEEE
DOI 10.1109/ICETET.2008.44
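As an illustration, the standardization and normalization transformations above can be sketched as follows (a minimal NumPy sketch added for clarity; the example data values are invented):

```python
import numpy as np

def standardize(X):
    """Center and scale each feature: x'_i = (x_i - mu_i) / sigma_i,
    with mu_i and sigma_i estimated over the training examples (rows)."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma

def normalize(x):
    """Divide a pattern by its total count so that it encodes a
    distribution, removing the dependence on image size: x' = x / ||x||."""
    return x / np.sum(x)

# A width in meters and a height in centimeters become comparable:
X = np.array([[1.8, 170.0], [1.6, 160.0], [2.0, 180.0]])
Xs = standardize(X)          # each column now has mean 0 and std 1

# A color-count histogram normalized to a distribution:
h = np.array([10.0, 30.0, 60.0])
p = normalize(h)             # sums to 1
```

After standardization both columns are on the same scale, so they can be meaningfully compared, added, or subtracted.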
This happens when the problem is very complex and first-order interactions are not enough to derive good results. It consists, for instance, in computing products of the original features xi to create monomials xk1 xk2 ... xkp.

Feature discretization: Some algorithms do not handle continuous data well. It then makes sense to discretize continuous values into a finite discrete set. This step not only facilitates the use of certain algorithms, it may also simplify the data description and improve data understanding.

1.2. Feature Extraction

We decompose the problem of feature extraction into two steps: first feature construction, then feature selection.

1.2.1. Feature Construction
Feature construction is one of the key steps in the data analysis process, largely conditioning the success of any subsequent statistics or machine learning endeavor. In particular, one should beware of not losing information at the feature construction stage. It may be a good idea to add the raw features to the preprocessed data, or at least to compare the performance obtained with either representation. The medical diagnosis example that we used before illustrates this point. Many factors might influence the health status of a patient. To the usual clinical variables (temperature, blood pressure, glucose level, weight, height, etc.), one might want to add diet information (low fat, low carbohydrate, etc.), family history, or even weather conditions. Adding all those features seems reasonable, but it comes at a price: it increases the dimensionality of the patterns and thereby immerses the relevant information in a sea of possibly irrelevant, noisy or redundant features.

1.2.2. Feature Selection
Feature selection is primarily performed to select relevant and informative features, but it can have other motivations, including:
1. general data reduction, to limit storage requirements and increase algorithm speed;
2. feature set reduction, to save resources in the next round of data collection or during utilization;
3. performance improvement, to gain in predictive accuracy;
4. data understanding, to gain knowledge about the process that generated the data or simply to visualize the data.
There are four aspects of feature extraction:
- feature construction;
- feature subset generation (or search strategy);
- evaluation criterion definition (e.g. relevance index or predictive power);
- evaluation criterion estimation (or assessment method).
The last three aspects are relevant to feature selection and are schematically summarized in Figure 1.

Fig. 1. The three principal approaches of feature selection. The shades show the components used by the three approaches: a) filters, b) wrappers and c) embedded methods.
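To make the filter idea concrete, the following sketch (an illustrative addition, not taken from the paper) ranks features with a filter-style relevance index, the absolute Pearson correlation with the target, without training any learning machine:

```python
import numpy as np

def filter_rank(X, y):
    """Rank features by a filter criterion: the absolute Pearson
    correlation of each feature column with the target y.
    No learning machine is involved, only a relevance index."""
    scores = []
    for j in range(X.shape[1]):
        c = np.corrcoef(X[:, j], y)[0, 1]
        scores.append(abs(c))
    # indices of features, best (highest |correlation|) first
    return np.argsort(scores)[::-1]

# Toy data: feature 0 is informative, feature 1 is pure noise.
rng = np.random.default_rng(0)
y = rng.normal(size=100)
X = np.column_stack([y + 0.1 * rng.normal(size=100),   # correlated
                     rng.normal(size=100)])            # irrelevant
order = filter_rank(X, y)   # feature 0 should rank first
```

A wrapper would instead score each candidate subset by the validation performance of a learning machine trained on it, which is far more expensive but directly targets predictive accuracy.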
Filters and wrappers differ mostly by the evaluation criterion. It is usually understood that filters use criteria not involving any learning machine, e.g. a relevance index based on correlation coefficients or test statistics, whereas wrappers use the performance of a learning machine trained using a given feature subset. Both filter and wrapper methods can make use of search strategies to explore the space of all possible feature combinations, which is usually too large to be explored exhaustively. Yet filters are sometimes assimilated to feature ranking methods, for which feature subset generation is trivial since only single features are evaluated. Hybrid methods exist, in which a filter is used to generate a ranked list of features. On the basis of the order thus defined, nested subsets of features are generated and evaluated by a learning machine, i.e. following a wrapper approach. Another class, embedded methods, incorporates feature subset generation and evaluation in the training algorithm. The last item on the list is criterion estimation: the difficulty to overcome is that the defined criterion (a relevance index or the performance of a learning machine) must be estimated from a limited amount of training data. Two strategies are possible: "in-sample" or "out-of-sample". The first one (in-sample) is the "classical statistics" approach. It refers to using all the training data to compute an empirical estimate. That estimate is then tested with a statistical test to assess its significance, or a performance bound is used to give a guaranteed estimate. The second one (out-of-sample) is the "machine learning" approach. It refers to splitting the training data into a training set used to estimate the parameters of a predictive model (learning machine) and a validation set used to estimate the learning machine's predictive performance. Averaging the results of multiple splits (or "cross-validation") is commonly used to decrease the variance of the estimator.

1.3. Methods for feature extraction

1.3.1. Horizontal and Vertical Histogram

The horizontal and vertical histogram technique is a handwritten script identification technique which uses the matra/shirorekha based feature. In this technique, the longest horizontal run of black pixels on the rows of a Bangla text word will be much longer than that of English script [3]. This is so because the characters in a Bangla word are generally connected by the matra/shirorekha (see Fig. 2). Here the row-wise histogram of the longest horizontal run is shown to the right of the words. This information has been used to separate English from Bangla script. The matra feature is considered to be present in a word if the length of the longest horizontal run of the word satisfies the following two conditions: (a) it is greater than 45% of the width of the word, and (b) it is greater than thrice the height of the busy-zone.

Fig. 2. The matra/shirorekha feature in Bangla and English words.

1.3.2. Curvature Information and Local Extrema of Curvature

Curvature information gives a unique, viewpoint-independent description of local shape [12]. In differential geometry, it is well known that a surface can be reconstructed up to second order (except for a constant term) if the two principal curvatures at each point are known, by using the first and second fundamental forms. Therefore, curvature information provides a useful shape descriptor for various tasks in computer vision, ranging from image segmentation and feature extraction to scene analysis and object recognition. In the field of handwritten alphanumeric character recognition, many researchers use different methods for recognition, such as geometrical and topological features, statistical features, and other algorithms that perform recognition based on character shape. But a good handwritten recognition system depends on two main attributes: first, the selected features gathered from a handwritten character; second, the recognizers trained to remember the features of each character in order to cluster and recognize each input character. In curvature information, a handwritten character is treated as a sequence of curve segments. Each curve segment is characterized by its degree of curvature, which is measured by the cumulative angle difference over all sampling points within the segment. If the cumulative angle is negative, it is a clockwise curve. Since some characters may consist of the same number of segments with the same curve (e.g. one segment with a clockwise curve), other features of the segments are gathered to distinguish these characters.

1.3.3. Topological Features

Topological features, shown in Fig. 3, include loops (a group of white pixels surrounded by black ones), end points (points with exactly one neighboring point), dots (a cluster of, say, 1-3 pixels) and junctions (points with more than two neighbors), all in thinned black and white images [11].
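The cumulative angle measure of Section 1.3.2 can be sketched as follows (an illustrative NumPy sketch; the sampled half-circle is invented test data):

```python
import numpy as np

def cumulative_angle(points):
    """Cumulative turning angle (radians) along a sampled curve segment:
    the sum of signed angle differences between successive direction
    vectors.  A negative total indicates a clockwise curve."""
    p = np.asarray(points, dtype=float)
    d = np.diff(p, axis=0)                 # direction vectors
    ang = np.arctan2(d[:, 1], d[:, 0])     # their angles
    turn = np.diff(ang)
    # wrap each angle difference into (-pi, pi]
    turn = (turn + np.pi) % (2 * np.pi) - np.pi
    return turn.sum()

# A half circle sampled counterclockwise: total turn is close to +pi.
t = np.linspace(0.0, np.pi, 50)
ccw = np.column_stack([np.cos(t), np.sin(t)])
total = cumulative_angle(ccw)        # positive: counterclockwise
cw = cumulative_angle(ccw[::-1])     # negative: clockwise
```

Characters with the same number of segments and the same turning sign would need additional segment features to be distinguished, as noted above.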
Fig. 3. Loops, end points, a dot and junctions.

Dots: Dots above the letters "i" and "j" can be identified with a simple set of rules. Short, isolated strokes occurring on or above the half-line are marked as potential dots.
Junctions: Junctions occur where two strokes meet or cross and are easily found in the skeleton as points with more than two neighbors.
Endpoints: Endpoints are points in the skeleton with only one neighbor and mark the ends of strokes, though some are artifacts of the skeletonization algorithm.
Loops: Loops are found by connected-component analysis on the smoothed image, to find areas of background color not connected to the region surrounding the word. A loop is coded by a number representing its area.

1.3.4. Parameters of Polynomials

Curve fitting is finding a curve which matches a series of data points and possibly other constraints. It uses the concepts of both interpolation (where an exact fit to the constraints is expected) and regression analysis. Both are sometimes used for extrapolation. Regression analysis allows for an approximate fit by minimizing the difference between the data points and the curve.

Start with a first-degree polynomial equation, y = ax + b. This is a line with slope a, and a line will connect any two points, so a first-degree polynomial is an exact fit through any two points. If the order of the equation is increased to a second-degree polynomial, y = ax2 + bx + c, it will exactly fit three points. Increasing the order again to a third-degree polynomial, y = ax3 + bx2 + cx + d, will exactly fit four points. A more general statement is that it will exactly fit four constraints. Each constraint can be a point, an angle, or a curvature (which is the reciprocal of the radius, 1/R). Angle and curvature constraints are most often added to the ends of a curve, and in such cases are called end conditions. Identical end conditions are frequently used to ensure a smooth transition between polynomial curves contained within a single spline. Higher-order constraints, such as "the change in the rate of curvature", could also be added. The parameters-of-polynomial technique is used for feature extraction. This concept is used in the Chebyshev series, which accurately captures the shape of handwritten mathematical characters.

For analysis, assume that handwriting traces are provided as sequences of (xi, yi, ti) tuples that have been normalized so that t0 = 0 and tn = 1. The (x, y) trace of a character is shown in Figure 4. For basis functions, use Chebyshev polynomials of the first kind, defined by Tn(t) ≡ cos(n arccos t) and orthonormal for the inner product (f, g) = ∫ f(t)g(t) dt taken over the interval (−1, 1). The Chebyshev series for X and Y are

X(t) = Σi αi Ti(t),  Y(t) = Σi βi Ti(t),

with the sums running over i = 0, 1, 2, ...

Fig. 4. The (x, y) trace of G.

1.3.5. Contour Information

Given a binary image, it is scanned from top to bottom and right to left, and transitions from white (background) to black (foreground) are detected. The contour is traced counterclockwise outside the pattern (clockwise for interior contours) and expressed as an array of contour elements, shown in Figure 5. Each contour element represents a pixel on the contour and contains fields for the x, y coordinates of the pixel, the slope or direction of the contour into the pixel, and auxiliary information such as curvature.

Fig. 5. Contour element.

Techniques that are useful in the recognition of words are described in this subsection [10]: (i) determination of the upper and lower contours of the word, (ii) determination of significant local extrema on the contour, (iii) determination of reference lines, and (iv) determination of word length.

1. Upper and lower contours: Division of an exterior contour into upper and lower parts (Figure 6) involves the detection of two "turnover" points on the contour: the points at which the lower contour changes to the upper contour and vice versa. Given that the exterior contour is traced in the counterclockwise direction, the upper contour runs predominantly from right to left and the lower predominantly from left to right. The left end of the leading ligature and the right end of the ending ligature constitute the turnover points.
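The truncated Chebyshev-series coefficients αi and βi of Section 1.3.4 can be estimated by least squares; a minimal sketch using NumPy's `chebfit` (the circular toy trace is invented for illustration):

```python
import numpy as np
from numpy.polynomial import chebyshev as C

def cheb_features(x, y, t, degree=8):
    """Approximate a handwriting trace (x(t), y(t)) by truncated
    Chebyshev series and return the coefficients (alpha_i, beta_i)
    as a fixed-size feature vector.  t is assumed normalized to
    [0, 1]; it is mapped to [-1, 1], the natural Chebyshev interval."""
    u = 2.0 * np.asarray(t) - 1.0
    alpha = C.chebfit(u, x, degree)   # X(t) ~ sum_i alpha_i T_i
    beta = C.chebfit(u, y, degree)    # Y(t) ~ sum_i beta_i T_i
    return np.concatenate([alpha, beta])

# Toy trace: a closed loop sampled at 100 normalized time steps.
t = np.linspace(0.0, 1.0, 100)
x = np.cos(2 * np.pi * t)
y = np.sin(2 * np.pi * t)
f = cheb_features(x, y, t)

# The truncated series reconstructs the trace closely:
x_hat = C.chebval(2 * t - 1, f[:9])
```

Because the coefficient vector has fixed length regardless of how many points were sampled, it can be compared directly across characters, which is what makes it usable as a feature.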
Fig. 6. Splitting a contour into upper and lower parts. The upper contour is shown in dotted lines and the lower contour is shown in bold lines.

2. Local Contour Extrema: Local extrema represent a powerful abstraction of the shape of the word. Local y-extrema are simply derived from the y-coordinates of the pixels on the chaincoded contour. The primary challenge is the detection and rejection of spurious extrema arising from irregularities in the contour. Heuristic filters are used to eliminate spurious extrema. These include checks on the distance of the detected extremum from previously detected extrema, its spatial position relative to other extrema, the slope of the contour in the neighborhood of the extremum, and so forth. The majority of spurious extrema in discrete writing are due to the ends of strokes used to form characters, and generally occur in the middle zone of the word. Fragmentation in the binary image due to suboptimal binarization or poor resolution often leads to broken strokes and spurious extrema. Since the upper contour flows effectively from right to left, left-to-right excursions of the contour must be followed by right-to-left retraces, and this leads to additional extrema being detected. Extrema found on such uncharacteristic excursions must be discarded lest they lead to spurious ascenders. A similar problem arises with right-to-left stretches of the lower contour in the context of spurious descenders. Let us refer to extrema on such uncharacteristic excursions of the upper and lower contours as directionally invalid. Directional validity is conveniently determined from the slopes of the contour elements into and out of the extremum and is sufficient to discard most spurious extrema.

3. Reference Lines: A time-tested technique for global reference line determination is based on the vertical histogram of pixels. The method makes the implicit assumption that the word was written horizontally and scanned in without skew. Consequently, the method fails for words with significant baseline skew. Unfortunately, baseline skew is present to varying degrees in most freeform handwriting. Baseline skew can be corrected by estimating the skew of the word and applying a rotation or shear transformation on the image. An alternative to skew correction is the use of reference lines of the form ri(x) = m·x + c, where m is the skew angle and c is the offset of the corresponding reference line from the x-axis. Angled reference lines may be computed from the angular histogram at the skew angle. The best-fit line through the minima points may be determined by a least-squares linear regression procedure (Figure 7). Minima that do not fall in the vicinity of the implicit baseline either correspond to descenders or are spurious. The halfline may be determined as the regression line through the upper contour maxima. However, upper contour maxima are often poorly aligned, and spurious points are difficult to detect and remove, especially in more discrete (non-cursive) writing styles. Consequently, the resulting halfline is often erroneous. Reference lines computed from the angular histogram at the skew angle of the baseline have proved to be more reliable (Figure 7(c)).

Fig. 7. (a) Exterior contour of a word showing lower-contour minima. (b) Baseline determined as the regression line through the minima. (c) Angular reference lines from the angular histogram at the skew angle.

4. Word Length, Ascenders, and Descenders: The number of minima on the lower contour of a word image is a good estimate of the length of the word. It is expected that such a length would be proportional to the number of letters in the word. Figure 8 shows the length estimate of the word as 11 and also shows one ascender and one descender. The positions of ascenders and descenders, along with the length of the word, are used as features. An ascender refers to the portion of a letter in a word that rises above the general body of the word. A descender refers to the portion of the word that falls below the general body of the word. Ascenders and descenders can be appropriately selected from the extrema of the upper and lower contours of the word.

Fig. 8. Local contour extrema: maxima and minima on the exterior and interior contours are marked along with the reference lines. The loop associated with the 'b' that rises above the reference lines in the middle is called an ascender, and the loop associated with the 'q' that falls below the reference lines in the middle is called a descender.
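The least-squares baseline fit through lower-contour minima, with rejection of minima far from the implied baseline, can be sketched as follows (an illustrative sketch; the minima coordinates and the 2-sigma rejection threshold are assumptions, not taken from the paper):

```python
import numpy as np

def fit_reference_line(minima):
    """Fit the baseline as the least-squares regression line through
    lower-contour minima: r(x) = m*x + c, where m is the skew and c
    the offset from the x-axis.  Minima far from the implied baseline
    (descenders or spurious points) are discarded and the line refit."""
    pts = np.asarray(minima, dtype=float)
    m, c = np.polyfit(pts[:, 0], pts[:, 1], 1)
    # reject minima far from the implied baseline, then refit once
    resid = np.abs(pts[:, 1] - (m * pts[:, 0] + c))
    keep = pts[resid < 2.0 * resid.std() + 1e-9]
    if len(keep) >= 2:
        m, c = np.polyfit(keep[:, 0], keep[:, 1], 1)
    return m, c

# Minima along a slightly skewed baseline, plus one descender outlier.
minima = [(0, 0.1), (10, 1.0), (20, 2.1), (30, 2.9), (15, 9.0)]
m, c = fit_reference_line(minima)   # slope close to the true skew 0.1
```

The same procedure applied to upper-contour maxima would give the halfline, though, as noted above, the maxima are often too poorly aligned for this to be reliable.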
1.4. Concluding Remark

There are various methods of feature extraction; the technique for extracting features is developed according to the features of interest, and the extracted features are then classified. The horizontal and vertical histogram is used to extract the matra feature; dots and end points can also be used to recognize handwritten script; curvature information extracts features as curves; and so on. In general, the properties or features of the script are used to devise the techniques for extracting them.

1.5. References

[1] R. Plamondon and S. N. Srihari, "On-line and off-line handwriting recognition: A comprehensive survey", IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 22, 2000, pp. 62-84.

[2] U. Mahadevan and S. N. Srihari, "Parsing and Recognition of City, State, and ZIP Codes in Handwritten Addresses", In Proc. of 5th Int. Conf. on Document Analysis and Recognition, 1999, pp.

[3] K. Roy, S. Vajda, U. Pal, and B. B. Chaudhuri, "A System towards Indian Postal Automation", In Proc. of International Workshop on Frontiers of Handwriting Recognition-9, 2004.

[4] U. Pal and B. B. Chaudhuri, "Script line separation from Indian multi-script documents", IETE Journal of Research, Vol. 49, 2003, pp. 3-11.

[5] Z. Shi and V. Govindaraju, "Skew Detection for Complex Document Images using Fuzzy Runlength", In Proc. of 7th Int. Conf. on Document Analysis and Recognition, 2003, pp. 715-719.

[6] F. M. Wahl, K. Y. Wong, and R. G. Casey, "Block segmentation and text extraction in mixed text/image documents", Computer Graphics and Image Processing, Vol. 20, 1982, pp. 375-390.

[17] U. Pal and S. Datta, "Segmentation of Bangla Unconstrained Handwritten Text", In Proc. of 7th Int. Conf. on Document Analysis and Recognition, 2003, pp. 1128-1132.

[7] U. Pal and P. P. Roy, "Multi-oriented and curved text lines extraction from Indian documents", IEEE Trans. on Systems, Man and Cybernetics, Part B, Vol. 34, 2004, pp. 1676-1684.

[8] B. W. Char and S. M. Watt, "Representing and Characterizing Handwritten Mathematical Symbols through Succinct Functional Approximation", Drexel University, Philadelphia, PA, USA, and University of Western Ontario, London, Ontario, Canada, 2007.

[9] A. W. Senior and A. J. Robinson, "An Off-Line Cursive Handwriting Recognition System", IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 20, No. 3, March 1998.

[10] S. Madhvanath, G. Kim, and V. Govindaraju, "Chaincode Contour Processing for Handwritten Word Recognition".

[11] N. N. Kharma and R. K. Ward, "Character Recognition Systems for Non-Experts" (tutorial), University of British Columbia.

[12] C.-K. Tang and G. Medioni, "Robust Estimation of Curvature Information from Noisy 3D Data for Shape Description", Institute for Robotics and Intelligent Systems, University of Southern California, Los Angeles, CA 90089-0273.

[13] I. Guyon and A. Elisseeff, "An Introduction to Feature Extraction", ClopiNet, 955 Creston Rd., Berkeley, CA 94708, USA.

[14] R. Bajaj, L. Dey, and S. Chaudhury, "Devnagari numeral recognition by combining decision of multiple connectionist classifiers", Sadhana, Vol. 27, Part 1, February 2002, pp. 59-72.

[15] D. I. Morariu, L. N. Vintan, and V. Tresp, "Evolutionary Feature Selection for Text Documents using the SVM", International Journal of Applied Mathematics and Computer Sciences, Vol. 1, No. 1.

[16] K. Bounnady, B. Kruatrachue, and S. Wangsiripitak, "On-Line Lao Handwritten Recognition with Proportional Invariant Feature", Proceedings of World Academy of Science, Engineering and Technology, Vol. 5, April 2005, ISSN 1307-6884.