Determine Characters by Mathematical Model For Segmentation Arabic Words
Indian Journal of Science and Technology, Vol 9(40), DOI: 10.17485/ijst/2016/v9i40/84801, October 2016 ISSN (Online) : 0974-5645
Abstract
Objectives: To use a mathematical model to define a region-based segmentation method that determines whether a Connected Component (CC) is one character or more than one character. Method: Other methods tend to ignore the solid foundation of describing characters and connection points. The proposed method adapts mathematics to the character segmentation process in stages: i) peak detection from the vertical histogram of the CC, and ii) enhancement by a mathematical model that uses the number of peaks to improve the segmentation method based on the Voronoi Diagram (VD). Findings: Characters such as س and ص are confusing to segmentation methods; the resulting errors include separating connection strokes from both sides to produce a separated one. Other errors must be handled at a later stage, such as segmenting the character ـح at an acute angle. The mathematical model depends on the peaks, their number, and the direction and length of the CC. The model was tested on segmentation using the Arabic datasets AHDB, IFN-ENIT, AHDB-FTR, APTI, Zeki, and Al Hamad DB. Preliminary results show that the EDMS feature with a multilayer perceptron neural network (MLP-NN) classifier is preferable. In the comparison with the Zeki method, the accuracy is 96.81% for the ACTOR printed dataset and 85.81% for the Zeki dataset; in the comparison with the Al Hamad method, it is 95.09%, and 89.10% for the ACDAR handwritten dataset. For the other datasets, the accuracies are 95.09% for IFN-ENIT, 98.27% for APTI, 91.63% for AHDB, and 90.69% for AHDB-FTR with the same feature (EDMS) and classifier (MLP-NN). Novelty: Mathematics is adapted to the segmentation process to determine whether the CC is one or more than one character, using a mathematical model based on the VD to avoid over-segmentation.
Keywords: Mathematical Model, Arabic words, More than one character, Segmentation, Voronoi Diagrams
Arabic is currently one of the most widespread languages in the world, and several robust studies on the recognition of its handwritten script have been reported. However, to date, none of these studies has been comprehensive because Arabic is considerably more complex than Latin languages. [Figure 2] illustrates some of the complexity of Arabic characters. The problems associated with the segmentation of Arabic scripts into characters and their classification make Arabic even more difficult to study3.
There are several issues associated with segmentation and methods thereof. In some languages, certain characters contain strokes or bulges that can be segmented easily and effectively by many of the available segmentation methods, a property referred to as over-segmentation1,17:
• Characters that have the primary shape ـح and acute angles within a character tend to be segmented incorrectly.
• Characters with several bulges, such as ـس and ـص, may be segmented as one unit.
• The primary issue associated with the character ـف is that the segmentation of the character frequently occurs at its corner below the loop.
• Segmentation is considered to separate the tails of the primary shapes س, ص, and ق, producing the ن shape in addition to the first part(s) of the character.
• A hanging segment is the result of segmenting tails of certain characters, such as ب, ف, ك, and ل.
… method, known as the divide-and-conquer algorithm, has been chosen2.
According to3, the primary purpose of over-segmentation is to divide characters into single entities such that no merged characters remain after the process. Discarding invalid points of segmentation and retaining valid points of segmentation are considered to be two of the primary problems following over-segmentation.
This paper demonstrates novel techniques for detecting whether connected components represent one character or more than one character to avoid over-segmentation and to reduce the computation time required for the segmentation of letters in the Arabic language. The proposed method is supported by mathematical models and examples that address the scarcity of such information in the literature, despite the use of existing handwritten databases. [Figure 1] shows the types of datasets used in this article.
Figure 1. Examples of datasets: (a) IFN/ENIT, (b) AHDB/FTR, (c) Al Hamad, (d) AHDB, (e) APTI, and (f) Zeki.
Figure 2. Shapes of Arabic characters in different positions, fonts, overlapping, shared horizontal space: (a) two characters occupying a shared horizontal space, (b) Arabic ligatures, their constituent characters, (c) four characters written in completely different ways, (d) shapes of the character “ع,” (e) letter '' may be missed, and (f) characters with similar contours.
2. Character Segmentation Problem and Weaknesses
Among the present methods applied for over-segmentation, that of3 is clearly optimal. However, considering the different failures of the available techniques, which involve thinning, a new technique that boosts the heuristic algorithm to decrease “bad” errors and increase overall accuracy is desirable. A more insightful and helpful technique for identifying the range of a component as a separate segment of a bigger character may be obtained by utilizing a connected component’s width-to-height ratio. To ensure a higher rate of accuracy in classifying components, more complex heuristics must be developed based on features other than the aspect ratios and positioning of these components.
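As a rough illustration of the width-to-height heuristic discussed in this section, the following Python sketch labels the connected components of a small binary image and computes the ratio for each one. The function names, the choice of 8-connectivity, and the toy image are assumptions made for illustration only; they are not taken from the method of3 or from the proposed method.

```python
from collections import deque


def connected_component_boxes(image):
    """Return bounding boxes (top, left, bottom, right) of 8-connected
    foreground regions in a binary image given as a list of rows of 0/1."""
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill starting from an unvisited foreground pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, left = min(top, y), min(left, x)
                bottom, right = max(bottom, y), max(right, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


def width_to_height_ratio(box):
    """Width divided by height of a bounding box, both measured in pixels."""
    top, left, bottom, right = box
    return (right - left + 1) / (bottom - top + 1)


# Toy image (hypothetical): a wide loop-like component and a thin vertical stroke.
sample = [
    [1, 1, 1, 1, 1, 1, 0, 0],
    [1, 0, 0, 0, 0, 1, 0, 1],
    [1, 1, 1, 1, 1, 1, 0, 1],
    [0, 0, 0, 0, 0, 0, 0, 1],
]
for box in connected_component_boxes(sample):
    print(box, round(width_to_height_ratio(box), 2))
```

How the ratio is thresholded, and which other features are combined with it, is left open here, in line with the remark above that more complex heuristics are needed.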
2 Vol 9 (40) | October 2016 | www.indjst.org Indian Journal of Science and Technology
Tabril Ramdan, Khairuddin Omar and Mohammad Faidzul
3. Proposed Method
The purpose of the proposed method is to overcome the problems and failures of the prior methods. Therefore, we developed a high-performance method that is generally capable of handling all challenges related to over-segmentation, and is particularly effective in cases in which the prior methods failed. The proposed method takes a local approach and combines two methods: an orientation method for defining connected components in two directions (left, right) based on the features of an image, and a novel method that can define the proper threshold value for every peak along the y-axis and the length of the connected components along the x-axis.
3.1 Architecture
The proposed method follows the steps shown in [Figure 3].
Figure 3. Illustration of determination of CC of one or more than one character.
Figure 4. Illustration of vertical histogram for connected component (اعس).
3.3 Peak Detection
According to13, the peak detection function takes two arguments: first, the values to be evaluated, and second, the threshold. The first argument is the y-values for the maximum of each column in the curve, and the threshold value depends on the specific requirements. The peak detection function checks, for each value, the values to the left and to the right of the considered value. If the considered value exceeds the values to the left and right by at least the threshold value, then the considered value is counted as a peak.
Based on this logic, the function should return three maximum values for the graph created if a proper threshold value is applied. The peak detection function then finds two vectors consisting of all maximum points and all minimum points, as shown in [Figure 5]. If no peak is detected, or the count is zero in the worst case, then the connected component is one character, a ligature, or dots.
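The peak detection logic described in Section 3.3 can be sketched as follows. This is a minimal Python illustration, not the authors' implementation. It assumes that the vertical histogram is the per-column count of foreground pixels, and it uses a delta-style rule: a running maximum is reported as a peak once the curve drops by more than `threshold` after it, and a running minimum is reported once the curve rises by more than `threshold` after it. The names `vertical_histogram`, `detect_peaks`, and `has_no_peaks`, and the sample values, are hypothetical.

```python
def vertical_histogram(image):
    """Count foreground pixels (value 1) in every column of a binary image."""
    return [sum(column) for column in zip(*image)]


def detect_peaks(values, threshold):
    """Return (maxima, minima) as lists of (position, value) pairs.

    A running maximum becomes a peak when a later value falls more than
    `threshold` below it; a running minimum becomes a valley when a later
    value rises more than `threshold` above it.
    """
    maxima, minima = [], []
    run_max, run_min = float("-inf"), float("inf")
    max_pos = min_pos = None
    looking_for_max = True
    for i, v in enumerate(values):
        if v > run_max:
            run_max, max_pos = v, i
        if v < run_min:
            run_min, min_pos = v, i
        if looking_for_max:
            if v < run_max - threshold:
                maxima.append((max_pos, run_max))
                run_min, min_pos = v, i
                looking_for_max = False
        elif v > run_min + threshold:
            minima.append((min_pos, run_min))
            run_max, max_pos = v, i
            looking_for_max = True
    return maxima, minima


def has_no_peaks(maxima):
    """True when no peak was found, the case the text treats as a single
    character, ligature, or dots."""
    return len(maxima) == 0


# Hypothetical histogram of a connected component with three bumps.
histogram = [0, 2, 5, 2, 1, 4, 6, 3, 0, 3, 5, 1]
maxima, minima = detect_peaks(histogram, threshold=2)
print(maxima)  # [(2, 5), (6, 6), (10, 5)]
print(minima)  # [(4, 1), (8, 0)]
print(has_no_peaks(maxima))  # False
```

With the sample histogram and a threshold of 2, the sketch reports three maxima and two minima, whereas a flat histogram such as `[3, 3, 3, 3]` yields no peaks. A fixed threshold like the one used here is sensitive to the scale of the writing, so the value would need to be chosen per dataset.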
Determine Characters by Mathematical Model for Segmentation Arabic Words by Voronoi Diagrams
(f)]. To evaluate the performance of the proposed method, we compared its results with those of Zeki’s method. The author outlined the factor values of the method based on his own values. It should be noted that the proposed method demonstrated the best results for all chosen challenges, as shown in the first image in [Figure 6(a)]. The most comprehensive results were obtained using the proposed method with the handwritten image, as shown in [Figure 6(f)], as well as words, as shown in [Figure 6(b)]. Finally, [Figure 6(c)] shows that the different datasets provided better results compared with those shown in [Figure 6(e)].
Figure 6. Sample results for (a) segmentation after determining connected components as either one or more than one character from the IFNENIT dataset, (b) a word from the APTI dataset, (c) a word from the IFNENIT dataset with the proposed method’s results, and (d) words from the AHDB dataset from the proposed method, and (e) and (f) the primary shape dataset and the Ahmed Zeki algorithm.
5.2 Analytical Experiment
Basic features may be recognized as character strokes, character openings, or other character properties, for instance, concavities and convexities, end centers, crossing points, extremes, union with straight lines, et cetera14. This test was conducted mainly to explain the analytical performance of the proposed method for various types of segmentation challenges. Therefore, the benchmark datasets used were IFNENIT, AHDB, and APTI. The results were obtained by applying the proposed methodology, which determines whether a component is one or more than one character, to several datasets after the segmentation algorithm. As previously stated, the proposed method appears to produce good results when identifying Arabic characters. [Table 2] presents the findings obtained through the application of three types of feature extraction methods, including edge direction matrices (EDMS), which is suggested for statistical texture analysis of binary images. Several equations were suggested to derive characteristics from the EDM1 and EDM2 values. By calculating their correlation, the proposed equations extracted 22 features, covering homogeneity, pixel regularity, weight, edge direction, and edge regularity, based on the study by15. Grey-level co-occurrence matrix (GLCM) texture measurements, as proposed by Haralick in the 1970s14, were also used to describe image texture. The invariant moment technique is another local feature extraction method, proposed by Hu in 1962. Hu’s method features seven invariant moments17 and was applied to three types of classifiers: a random forest classifier, a multilayer perceptron neural network, and Rules Ridor. Note that this method used an inverted filter with GLCM and MOMENT features.
Using experiments to evaluate the results obtained from several types of databases and the application of three types of feature extraction methods with three types of classifiers, we found that the best results were obtained with the EDMS feature extraction method and the multilayer neural network classifier, as shown in [Table 2]. The use of EDMS with MLP-NN was observed to be more effective than the use of GLCM and MOMENT in the other category (95.09% in the IFN-ENIT handwritten dataset and 98.27% in the APTI printed dataset). The GLCM feature extraction method showed lower detection percentages for rule classifiers than did the EDMS and MOMENT methods (69.43% in the IFN-ENIT dataset and 71.42% in the APTI dataset).
As with any method, some cases are segmented badly or incorrectly. The proposed method also performs badly in some cases, for reasons that affect the segmentation result: different font sizes in handwriting, which cause problems because the threshold for peak detection is fixed; overlapping characters; and characters that were split into segments at writing time although the writer was supposed to write them as one body. Comparisons show that the EDMS feature extraction method identifies character components as either one or more than one character using two methods: the Zeki method and histogram. The results demonstrate that the proposed method is effective in avoiding over-segmentation and yields satisfactory results. [Table 3] compares the success rates, locations of segmentation points, over-segmentation, and error rates.
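For readers unfamiliar with the GLCM texture measurements mentioned in this section, the sketch below builds a normalized co-occurrence matrix for a tiny binary image and derives one homogeneity value from it. It is only an illustration under stated assumptions: a single horizontal offset (each pixel paired with its right-hand neighbour), two grey levels, and the inverse-difference-moment form of homogeneity, the sum of P(i, j) / (1 + (i - j)^2). The paper does not state which offsets or formulas were used, and the function names are hypothetical.

```python
def glcm(image, levels, dy=0, dx=1):
    """Normalized grey-level co-occurrence matrix for one pixel offset.

    `image` is a list of rows holding integer grey levels in range(levels);
    (dy, dx) is the offset from a pixel to the neighbour it is paired with.
    """
    counts = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for y in range(rows):
        for x in range(cols):
            ny, nx = y + dy, x + dx
            if 0 <= ny < rows and 0 <= nx < cols:
                counts[image[y][x]][image[ny][nx]] += 1
    total = sum(sum(row) for row in counts)
    if total == 0:
        return [[0.0] * levels for _ in range(levels)]
    return [[count / total for count in row] for row in counts]


def homogeneity(p):
    """Inverse difference moment of a normalized co-occurrence matrix."""
    n = len(p)
    return sum(p[i][j] / (1 + (i - j) ** 2) for i in range(n) for j in range(n))


# Toy binary image (hypothetical), grey levels 0 and 1.
sample = [
    [0, 0, 1],
    [0, 1, 1],
    [1, 1, 1],
]
matrix = glcm(sample, levels=2)
print(matrix)
print(homogeneity(matrix))
```

A feature vector for a classifier would normally combine several such measurements over several offsets; this sketch shows a single value only.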
A simple analysis of variance of the printed and text datasets was performed for the three methods: Zeki, histogram, and the proposed method. This statistic was used to test the hypothesis formulated in this study, that there is no significant difference among the three methods, against the alternate hypothesis that there is indeed a marked difference among the three methods based on the data generated. [Table 4] and [Table 5] summarize the results of the single-factor ANOVA.
Based on the p-value shown in [Table 5], it can be concluded that the three methods used differ significantly with respect to their generated values.
6. Conclusion
The primary focus of this study was to determine whether an identified component is a single Arabic character or more than one character, to reduce the time required for segmentation and to avoid over-segmentation of characters, as mentioned previously in the theses of1,3,4,17. A mathematical model was used to determine whether a connected component is a character that does not require segmentation or more than one character that requires segmentation. The results obtained using this method appear to be promising and reliable for both handwriting and print datasets. The proposed method also possesses none of the weaknesses of previously developed methods described herein. To evaluate the proposed method, we compared it with many other established methods, including those proposed by3,17. Both visual and analytical experiments were applied to selected datasets, namely IFN/ENIT, APTI, ACDAR, AHDB, Zeki DB, and AHDB/FTR. Based on these experiments, the proposed method was observed to yield better performance compared to that of the Zeki and Al Hamad methods. The proposed method solved the problem of identifying connected components and whether they formed one or more than one character. The proposed method is highly adaptable for managing any over-segmentation problems and can solve special challenges, such as those described in Section 2.
7. Acknowledgements
The authors would like to thank Prof Dr. Khairuddin Omar of the National University of Malaysia for his assistance and cooperation. This study was funded by the FRGS/1/2014/ICT07/UKM/01/1 grant entitled “Improving Segmentation of Arabic Handwriting by Determination of Neighborhood Using Voronoi Diagrams.”
8. References
1. Zeki A. Segmentation of Arabic characters using Voronoi diagrams. University Kebangsaan Malaysia, Malaysia. 2008.
2. Ramdan J, Omar K. Comparative Study of Algorithms for Voronoi Diagram Construction on Segmentation of Arabic Hand Writing. Australian Journal of Basic and Applied Sciences. 2011; 5(11):1653–67.
3. Al Hamad H. Over-segmentation of handwriting Arabic scripts using an efficient heuristic technique. International Conference on Wavelet Analysis and Pattern Recognition. 2012; 180–5.
4. Zeki A, Zakaria M, Liong C. The use of Area-Voronoi Diagram in Separating Arabic Text Connected Components. Proceedings of the 3rd International Conference on Electrical Engineering and Informatics. 2007.
5. Jyothi J, Manjusha K, Kumar MA, Soman KP. Innovative Feature Sets for Machine Learning based Telugu Character Recognition. Indian Journal of Science and Technology. 2015 Sep; 8(24). Doi: 10.17485/ijst/2015/v8i24/79996.
6. Billauer E. Peak detection using MATLAB [Internet]. 2013. Available from: http://www.billauer.co.il.
7. Pechwitz M, Maddouri SS, Margner V, Ellouze N, Amiri H. IFN/ENIT-database of handwritten Arabic words. Proceedings of CIFED. Citeseer. 2002. p. 127–36.
8. Slimane F, Kanoun S, Alimi AM. Database and Evaluation Protocols for Arabic Printed Text Recognition. Proceedings of 10th International Conference on Document Analysis and Recognition. 2012.
9. Al Hamad H, Hamdi-Cherif A. The Arabic Center for Document Analysis and Recognition (ACDAR) - Structure and Perspectives. Recent Advances in Information Science. 2010; (1):85–91.
10. Al-ma S, Elliman D, Higgins C. A Data Base for Arabic Handwritten Text Recognition Research. Proceedings of the 8th International Workshop on Frontiers in Handwriting Recognition. 2002; 1(1):117–21.
11. Ramdan J, Omar K, Mady A. Arabic Handwriting Data Base for Text Recognition. Proceedings of the 4th International Conference on Electrical Engineering and Informatics. 2013; 11(13). p. 580–4.
12. Shukla MK, Banka H, Yadav KP. Structural Features Extraction for Devnagari and Bangla Language Documents. Indian Journal of Science and Technology. 2015 Jul; 8(13). Doi: 10.17485/ijst/2015/v8i13/56453.
13. Bataineh B, Abdullah SNHS, Omar K. A statistical global feature extraction method for optical font recognition. Intelligent Information and Database Systems. Springer. 2011; 257–67.
14. Thakare V. Survey On Image Texture Classification Techniques. International Journal of Advancements in Technology. 2013; 4(1):97–104. Available from: http://ijict.org/index.php/ijoat/article/viewArticle/imagetexture
15. Zitova B. Image registration methods: a survey. Image Vis Comput [Internet]. 2003; 21(11):977–1000. Available from: http://dx.doi.org/10.1016/S0262-8856(03)00137-9 http://www.sciencedirect.com/science/article/pii/
16. Zeki A. The segmentation problem in arabic character recognition the state of the art. 1st International Conference on Information and Communication Technologies. 2005; 11–26.
17. Zeki A, Zakaria MS, Liong CY. Isolation of dots for arabic ocr using voronoi diagrams. Proceedings of the International Conference on Electrical Engineering and Informatics. 2007. p. 199–202.