Determine Characters by Mathematical Model For Segmentation Arabic Words

1) The document proposes a new mathematical model-based method to segment Arabic words by determining whether connected components represent single or multiple characters to avoid over-segmentation. 2) Existing methods often ignore the structural foundations of Arabic characters and tend to over-segment characters with bulges like seen, lam-alif, and sad. 3) The proposed method uses mathematical models based on peaks, numbers, directions, and lengths of connected components along with Voronoi diagrams to better determine segmentation points and avoid errors from separating connected strokes.


Indian Journal of Science and Technology, Vol 9(40), DOI: 10.17485/ijst/2016/v9i40/84801, October 2016
ISSN (Print): 0974-6846
ISSN (Online): 0974-5645

Determine Characters by Mathematical Model for Segmentation Arabic Words by Voronoi Diagrams
Jabril Ramdan, Khairuddin Omar and Mohammad Faidzul
School of Computer Science, Faculty of Information Science and Technology, Centre of Artificial Intelligence
Universiti Kebangsaan Malaysia, Bangi - 43200, Selangor, Malaysia

Abstract
Objectives: To use a mathematical model to define a region-based segmentation method that determines whether a Connected Component (CC) is one character or more than one character. Method: Other methods tend to ignore the structural foundation of characters and their connection points. The proposed method applies mathematics to the character segmentation process in stages: i) peak detection from the vertical histogram of the CC, and ii) enhancement of the segmentation method based on the Voronoi Diagram (VD) using a mathematical model driven by the number of peaks. Findings: Characters such as س and ص confuse segmentation methods; the resulting errors include separating connection strokes on both sides to produce a separate character. Other errors, such as segmenting the character ـح at an acute angle, must be handled at a later stage. The mathematical model depends on the peaks, number, direction, and length of the CC. The model is tested on segmentation using six Arabic datasets: AHDB, IFN-ENIT, AHDB-FTR, APTI, Zeki, and Al Hamad DB. Preliminary results show that the EDMS feature with a multilayer perceptron neural network (MLP-NN) classifier is preferable. In the comparison with the Zeki method, its accuracy is 96.81% for the ACTOR printed dataset, against 85.81% for the Zeki method on the Zeki dataset; in the comparison with the Al Hamad method, the accuracies are 95.09% and 89.10% for the ACDAR handwritten dataset. The accuracies on the other datasets are 95.09% for IFN-ENIT, 98.27% for APTI, 91.63% for AHDB, and 90.69% for AHDB-FTR with the same feature (EDMS) and classifier (MLP-NN). Novelty: Mathematics is adapted to the segmentation process to determine whether the CC is one character or more than one character, using a mathematical model based on the VD to avoid over-segmentation.

Keywords: Mathematical Model, Arabic words, More than one character, Segmentation, Voronoi Diagrams

1. Introduction

Due to the uniqueness of the characters of the Arabic/Jawi language, there is a high incidence of error when pixel-based methods of character recognition are employed. When applying these methods, it is common to ignore the foundation of a character's description in these languages because Arabic/Jawi is typically segmented in multiple stages in recognition systems. The language is difficult to analyze using pixel-based methods; indeed, segmentation has proven to be particularly challenging, and characters are likely to be over-segmented due to the many bulges in certain characters, including seen (س), lam-alif (لا), and sad (ص), producing errors when they are separated. Such errors are generated in the Voronoi-based recognition method, even though this method makes great effort to minimize over-segmentation using several rules, such as in the case of the character س [1,4]. The challenges of the recognition system arise in this critical segmentation phase.

This challenge exists primarily because of the complexity of the shapes of the characters that constitute this language. The VD algorithm involves two primary parameters, the speed and the quality of VD construction. Because this method produces the best structures during segmentation, a construction algorithm is viewed as the most feasible link between the static and dynamic styles of algorithm development; thus, this method, known as the divide-and-conquer algorithm, has been chosen [2].

Al Hamad [3] states that the primary purpose of over-segmentation is to divide characters into single entities such that no merged characters remain after the process. Discarding invalid segmentation points and retaining valid ones are considered to be two of the primary problems following over-segmentation.

This paper demonstrates novel techniques for detecting whether connected components represent one character or more than one character, to avoid over-segmentation and to reduce the computation time required for the segmentation of letters in the Arabic language. The proposed method is supported by mathematical models and examples that address the scarcity of such information in the literature, despite the use of existing handwritten databases. [Figure 1] shows the types of datasets used in this article.

Figure 1. Examples of datasets: (a) IFN/ENIT, (b) AHDB/FTR, (c) Al Hamad, (d) AHDB, (e) APTI, and (f) Zeki.

2. Character Segmentation Problem and Weaknesses

Among the present methods applied for over-segmentation, that of [3] is clearly the strongest. However, considering the various failures of the available techniques, which involve thinning, a new technique that improves the heuristic algorithm to decrease "bad" errors and increase overall accuracy is desirable. A more insightful and helpful way to identify whether a component is a separate segment of a larger character is to use the connected component's width-to-height ratio. To ensure a higher rate of accuracy in classifying components, more complex heuristics must be developed based on features that go beyond the aspect ratios and positioning of these components.

Arabic is currently one of the most widespread languages in the world, and several robust studies on the recognition of its handwritten script have been reported. However, to date, none of these studies has been comprehensive because Arabic is considerably more complex than Latin languages. [Figure 2] illustrates some of the complexity of Arabic characters. The problems associated with the segmentation of Arabic scripts into characters and their classification make Arabic even more difficult to study [3].

There are several issues associated with segmentation and its methods. In some languages, certain characters contain strokes or bulges that many of the available segmentation methods readily split, a problem referred to as over-segmentation [1,17]:
• Characters that have the primary shape ـح and acute angles within a character tend to be segmented incorrectly.
• Characters with several bulges, such as ـس and ـص, may be segmented as one unit.
• The primary issue associated with the character ـف is that the segmentation of the character frequently occurs at its corner below the loop.
• Segmentation tends to separate the tails of the primary shapes س, ص, and ق, producing the ن shape in addition to the first part(s) of the character.
• A hanging segment is the result of segmenting tails of certain characters, such as ب, ف, ك, and ل.

Figure 2. Shapes of Arabic characters in different positions, fonts, overlapping, shared horizontal space: (a) two characters occupying a shared horizontal space, (b) Arabic ligatures, their constituent characters, (c) four characters written in completely different ways, (d) shapes of the character "ع," (e) letter '' may be missed, and (f) characters with similar contours.
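The width-to-height heuristic mentioned in Section 2 can be illustrated with a minimal Python sketch. It assumes the connected component is available as a binary NumPy array; the function names and the 0.5 cutoff are hypothetical choices made for illustration and are not taken from the cited methods.

```python
import numpy as np

def width_to_height_ratio(component):
    """Return bounding-box width divided by height for a binary component."""
    rows, cols = np.nonzero(component)
    if rows.size == 0:
        raise ValueError("component has no ON pixels")
    width = cols.max() - cols.min() + 1
    height = rows.max() - rows.min() + 1
    return width / height

def looks_like_narrow_fragment(component, max_ratio=0.5):
    """Flag tall, narrow components (hypothetical cutoff, for illustration only)."""
    return width_to_height_ratio(component) <= max_ratio

# A 4-row by 2-column block of ON pixels inside a larger blank image.
demo = np.zeros((6, 6), dtype=np.uint8)
demo[1:5, 2:4] = 1
print(width_to_height_ratio(demo), looks_like_narrow_fragment(demo))
```

As the section notes, a ratio alone is a weak cue, so a sketch like this would only serve as one input to a richer set of heuristics.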


3. Proposed Method

The purpose of the proposed method is to overcome the problems and failures of the prior methods. Therefore, we developed a high-performance method that is generally capable of handling the challenges related to over-segmentation, and is particularly effective in cases in which the prior methods failed. The proposed method takes a local approach and combines two methods: an orientation method for defining connected components in two directions (left, right) based on the features of an image, and a novel method that defines the proper threshold value for the peaks along the y-axis and the length of the connected components along the x-axis.

3.1 Architecture

The proposed method follows the steps shown in [Figure 3].

Figure 3. Illustration of determination of CC of one or more than one character.

3.2 Vertical Projection Profile (Histogram)

The vertical projection profile is computed by counting the number of black pixels in every image column using the following formula:

$$VPP(j) = \sum_{i=1}^{H} I(i, j), \qquad j = 1, \dots, W$$

where $I(i, j)$ is 1 for an ON pixel and 0 otherwise, and $H$ and $W$ are the numbers of rows and columns of the image.

The Vertical Projection Profile (VPP) is a vector containing the total number of ON pixels in every column of the image. The Horizontal Projection Profile (HPP) is the one-dimensional vector containing the total number of ON pixels in every row of the image [12]. The vertical histogram for the connected component (اعس) is shown in [Figure 4].

Figure 4. Illustration of vertical histogram for connected component (اعس).

3.3 Peak Detection

According to [13], the peak detection function takes two arguments: the first is the values to be evaluated and the second is the threshold. The first argument is the y-values for the maximum of each column in the curve, and the threshold value depends on the specific application. For each value, the peak detection function checks the values to its left and to its right. If the considered value exceeds the values to the left and right by at least the threshold value, then the considered value is counted as a peak.

Based on this logic, the function should return three maximum values for the graph created if a proper threshold value is applied. The peak detection function then finds two vectors consisting of all maximum points and all minimum points, as shown in [Figure 5]. If no peak is detected (zero peaks, the worst case), then the connected component is one character, a ligature, or dots.

Figure 5. (a) Illustration of the maximum and minimum peaks.

3.4 Mathematical Model

Real-world applications of mathematics can be interpreted through mathematical modeling.
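As a concrete reference for the model, the histogram and peak-detection steps of Sections 3.2 and 3.3 can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the image is assumed to be a binary NumPy array with ON pixels stored as nonzero values, `peakdet` is a simplified delta-based detector written for this sketch, and `is_multi_character` applies the "fewer than two maximum peaks means one character" rule from the pseudo code in Section 3.5. All function names are hypothetical.

```python
import numpy as np

def vertical_projection(binary):
    """Count ON (nonzero) pixels in every column of a binary image."""
    return (np.asarray(binary) > 0).sum(axis=0)

def peakdet(values, delta):
    """Return (maxima, minima) as lists of (index, value) pairs.

    A maximum is recorded once the signal has dropped by more than `delta`
    below it; a minimum is recorded once the signal has risen by more than
    `delta` above it. A peak at the very end of the signal with no drop
    after it is therefore not reported.
    """
    maxima, minima = [], []
    mn, mx = float("inf"), float("-inf")
    mn_pos = mx_pos = 0
    look_for_max = True
    for i, v in enumerate(values):
        if v > mx:
            mx, mx_pos = v, i
        if v < mn:
            mn, mn_pos = v, i
        if look_for_max:
            if v < mx - delta:
                maxima.append((mx_pos, mx))
                mn, mn_pos = v, i
                look_for_max = False
        elif v > mn + delta:
            minima.append((mn_pos, mn))
            mx, mx_pos = v, i
            look_for_max = True
    return maxima, minima

def is_multi_character(binary, delta):
    """True when the component's histogram has at least two maximum peaks."""
    maxima, _ = peakdet(vertical_projection(binary), delta)
    return len(maxima) >= 2
```

The threshold `delta` plays the role of the application-specific threshold described in Section 3.3; a fixed value will not suit every font size, which is one of the limitations the paper itself reports.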


Table 1. Specifications of the datasets used in this study


IFNENIT APTI ACDAR AHDB Zeki DB AHDB/FTR
No. of images 596 50 500 3045 50 497
No. of writers N/A N/A N/A 115 N/A 5
Printed /Handwrite Handwrite Printed Handwrite Handwrite Printed Handwrite
Contents Tunisian Most common Most common Most common Most common Libyan Towns
Towns Names Names
words per image Max.3 words One word One word One word One word Max. 3 words
Noise Free No/need No/need Free Free No/need
filtering filtering filtering
Resolution (dpi) 300 300 300 600 300 300
File format Mono-chrome Mono-chrome Mono-chrome Mono-chrome Mono-chrome Mono-chrome
BMP PNG JEPG TIFF BMP BMP

Mathematical modeling is also used to examine significant questions about the real world, to explain them, to test ideas, and to predict future conditions in practical applications. The real world in this sense spans sports, wildlife management, chemistry, engineering, computers, physics, ecology, economics, physiology, and so on.

Let $N$ be the number of maximum peaks. Thus, the following is true:

$$f(CC) = \begin{cases} 1, & N \ge 2 \\ 0, & N < 2 \end{cases}$$

where 1 is a connected component that is composed of more than one character and 0 is a connected component that comprises only one character.

3.5 Pseudo Code of Algorithm

While c < ConnectedComponents(image)
    [maxpeak, minpeak] = peakdet(VerticalHistogram, Threshold);
    If count(maxpeak) < 2 Then
        Connected Component is One Character;
    Else
        Connected Component is more than One Character;
        Go to next algorithm for segmentation path
    EndIf
    c = c + 1;
EndWhile

4. Datasets

In this research, six databases were used to evaluate the proposed algorithm and the segmentation algorithm, including IFN/ENIT [6], APTI [10], ACDAR [8], AHDB [9], the Zeki primary shapes dataset Al-Jazeera.net [1], and AHDB/FTR [7]. [Table 1] shows the specifications of the different datasets.

5. Result and Discussion

Two types of tests were conducted for this research. The first, a visual test, assessed the images and the outcomes of segmentation. This test was carried out on selected images that presented different types of segmentation challenges. The second was an analytical test that provided a statistical measurement based on benchmark datasets and estimation.

5.1 Visual experiment

To demonstrate fully the visual performance of this approach, problems such as variance in text size, considerably varying illumination, low image resolution, thin pen strokes, and minimal contrast between the text and background were included in the test. Six samples were collected from different sources for this test. [Figures 6(a), (b), (c), and (d)] show various sizes and fonts of the handwritten and printed text samples collected from IFNENIT, AHDB, and APTI; high-quality printed samples collected from the primary shapes dataset [1] are shown in [Figures 6(e) and


(f)]. To evaluate the performance of the proposed method, we compared its results with those of Zeki's method. The author outlined the parameter values of the method based on his own values. It should be noted that the proposed method demonstrated the best results for all chosen challenges, as shown in the first image in [Figure 6(a)]. The most comprehensive results were obtained using the proposed method with the handwritten image, as shown in [Figure 6(f)], as well as words, as shown in [Figure 6(b)]. Finally, [Figure 6(c)] shows that the different datasets provided better results compared with those shown in [Figure 6(e)].

Figure 6. Sample results for (a) segmentation after determining connected components as either one or more than one character from the IFNENIT dataset, (b) a word from the APTI dataset, (c) a word from the IFNENIT dataset with the proposed method's results, and (d) words from the AHDB dataset from the proposed method, and (e) and (f) the primary shape dataset and the Ahmed Zeki algorithm.

5.2 Analytical Experiment

Basic features may be recognized as character strokes, character openings, or other character properties, for instance, concavities and convexities, end points, crossing points, extremes, junctions with straight lines, and so on [14]. This test was carried out mainly to show the analytical performance of the proposed method for various types of segmentation challenges. Therefore, the benchmark datasets used were IFNENIT, AHDB, and APTI. The proposed methodology was applied to several datasets after the segmentation algorithm, and the results were used to determine whether components were one character or more than one character. As previously stated, the proposed method appears to produce good results when identifying Arabic characters. [Table 2] presents the findings obtained through the application of three types of feature extraction methods, including edge direction matrixes (EDMS), which is suggested for analysis in terms of the statistical texture method via binary images. Numerous equations were suggested to include characteristics from EDM1 and EDM2 values. By calculating their correlation, the proposed equations extracted 22 features. Homogeneity, pixel regularity, weight, edge direction, and edge regularity, based on the study by [15], and grey-level co-occurrence matrix (GLCM) texture measurements were used to describe image texture, as proposed by Haralick in the 1970s [14]. The invariant moment technique is another local feature extraction method, proposed by Hu in 1962. Hu's method features seven invariant moments [17], and was applied to three types of classifiers: a random forest classifier, a multilayer perceptron neural network, and Rules Ridor. Note that this method used an inverted filter with the GLCM and MOMENT features.

Using experiments to evaluate the results obtained from several types of databases and the application of three types of feature extraction methods with three types of classifiers, we found that the best results were obtained with the EDMS feature extraction method and the multilayer perceptron neural network classifier, as shown in [Table 2]. The use of EDMS with MLP-NN was observed to be more effective than the use of GLCM and MOMENT (95.09% on the IFN-ENIT handwritten dataset and 98.27% on the APTI printed dataset). The GLCM feature extraction method showed lower detection percentages for the rule classifier than did the EDMS and MOMENT methods (69.43% on the IFN-ENIT dataset and 71.42% on the APTI dataset).

As with any method, some cases are segmented badly or incorrectly. The proposed method performs badly in some cases for several reasons: different font sizes in handwritten text, which cause problems because the peak detection threshold is fixed; overlapping characters; and characters that the writer split into segments while writing, although they were supposed to be written as one body. Comparisons of identifying character components as either one or more than one character, using the EDMS feature extraction method, were made against two methods: the Zeki method and the histogram method. The results demonstrate that the proposed method is effective in avoiding over-segmentation and yields satisfactory results. [Table 3] compares the success rates, locations of segmentation points, over-segmentation, and error rates.


A simple analysis of variance of the printed and text datasets for the three methods (Zeki, histogram, and the proposed method) was performed. The use of this statistic to test the hypothesis formulated in this study revealed that there was no significant difference between these three methods when the data generated were compared, against the alternate hypothesis that there is indeed a marked difference between the three methods based on the data generated. [Table 4] and [Table 5] summarize the results of the single-factor ANOVA.

Based on the p-value shown in [Table 5] (0.9543, with F = 0.0470 below F crit = 4.2564), it can be concluded that the three methods used do not differ significantly with respect to their generated values.

6. Conclusion

The primary focus of this study was to determine whether an identified component is a single Arabic character or more than one character, to reduce the time required for segmentation and to avoid over-segmentation of characters, as mentioned previously in the theses of [1,3,4,17]. A mathematical model was used to determine whether a connected component is a character that does not require segmentation or more than one character that requires segmentation. The results obtained using this method appear to be promising and reliable for both handwriting and print datasets. The proposed method also possesses none of the weaknesses of previously developed methods described herein. To evaluate the proposed method, we compared it with other established methods, including those proposed by [3,17]. Both visual and analytical experiments were applied to selected datasets, namely IFN/ENIT, APTI, ACDAR, AHDB, Zeki DB, and AHDB/FTR. Based on these experiments, the proposed method was observed to yield better performance than the Zeki and Al Hamad methods. The proposed method solved the problem of identifying connected components and whether they formed one or more than one character. The proposed method is highly adaptable

Table 2. Experimental results from the proposed method

                                                   Datasets (accuracy, %)
Classifier type                    Feature type    IFNENIT DB   APTI DB   ACDAR DB   AHDB DB   Zeki DB   AHDB_FTR DB
Trees. Random Forest               EDMS            93.962       97.413    85.643     90.909    96.015    89.767
Trees. Random Forest               GLCM            73.437       80.952    72.727     73.684    85.001    69.829
Trees. Random Forest               MOMENT          68.679       88.185    72.449     68.750    86.345    69.544
Functions. Multilayer Perceptron   EDMS            95.094       98.275    89.108     91.636    96.812    90.697
Functions. Multilayer Perceptron   GLCM            71.698       72.222    62.505     60.004    72.002    59.005
Functions. Multilayer Perceptron   MOMENT          70.892       76.793    65.007     65.625    72.690    63.177
Rules. Ridor                       EDMS            90.188       96.120    77.722     89.090    91.235    85.116
Rules. Ridor                       GLCM            69.434       71.428    65.217     60.312    72.413    69.586
Rules. Ridor                       MOMENT          70.754       80.168    76.530     73.010    78.313    75.570
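The claim that EDMS with the multilayer perceptron gives the best result on every dataset can be checked directly against Table 2. The short Python sketch below transcribes the table values and picks the best classifier and feature pair per dataset; the variable and function names are hypothetical, and the shortened classifier labels are used only for illustration.

```python
# Accuracies (%) transcribed from Table 2; list order follows DATASETS.
DATASETS = ["IFNENIT", "APTI", "ACDAR", "AHDB", "Zeki", "AHDB_FTR"]
RESULTS = {
    ("RandomForest", "EDMS"): [93.962, 97.413, 85.643, 90.909, 96.015, 89.767],
    ("RandomForest", "GLCM"): [73.437, 80.952, 72.727, 73.684, 85.001, 69.829],
    ("RandomForest", "MOMENT"): [68.679, 88.185, 72.449, 68.750, 86.345, 69.544],
    ("MultilayerPerceptron", "EDMS"): [95.094, 98.275, 89.108, 91.636, 96.812, 90.697],
    ("MultilayerPerceptron", "GLCM"): [71.698, 72.222, 62.505, 60.004, 72.002, 59.005],
    ("MultilayerPerceptron", "MOMENT"): [70.892, 76.793, 65.007, 65.625, 72.690, 63.177],
    ("Ridor", "EDMS"): [90.188, 96.120, 77.722, 89.090, 91.235, 85.116],
    ("Ridor", "GLCM"): [69.434, 71.428, 65.217, 60.312, 72.413, 69.586],
    ("Ridor", "MOMENT"): [70.754, 80.168, 76.530, 73.010, 78.313, 75.570],
}

def best_per_dataset(results, datasets):
    """Map each dataset name to ((classifier, feature), accuracy) of the top row."""
    best = {}
    for col, name in enumerate(datasets):
        combo = max(results, key=lambda key: results[key][col])
        best[name] = (combo, results[combo][col])
    return best

for name, (combo, accuracy) in best_per_dataset(RESULTS, DATASETS).items():
    print(name, combo, accuracy)
```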

Table 3. Result values for three methods

                                 Proposed Method   Zeki Method   Projection Method
Location of segmentation point   94.88             80.04         57.24
Over-segmentation                3.71              7.28          17.67
The error rate                   1.22              6.91          7.02
Overall success rate             96.81             85.81         75.22


Table 4. Summary of single factor value

Groups   Count   Sum      Average   Variance
Col 1    4       195.4    48.85     2947.62
Col 2    4       180.04   45.01     1922.30
Col 3    4       157.15   39.287    1040.64

Table 5. ANOVA single-factor value

Source of Variation   SS         df   MS        F        P-Value   F crit
Between Groups        185.2454   2    92.6226   0.0470   0.9543    4.2564
Within Groups         17731.71   9    1970.19
Total                 17916.96   11
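The figures in Table 5 can be reproduced from the group summaries in Table 4 with the standard one-way ANOVA formulas. The Python sketch below is a worked check under that assumption; the function and variable names are hypothetical, and the p-value is not recomputed because that would require an F-distribution routine from an external library.

```python
def one_way_anova_from_summary(groups):
    """Compute (SS between, SS within, F) from (count, sum, variance) per group."""
    n_total = sum(n for n, _, _ in groups)
    grand_mean = sum(total for _, total, _ in groups) / n_total
    # Between-group sum of squares: spread of group means around the grand mean.
    ss_between = sum(n * (total / n - grand_mean) ** 2 for n, total, _ in groups)
    # Within-group sum of squares: recovered from each sample variance.
    ss_within = sum((n - 1) * var for n, _, var in groups)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return ss_between, ss_within, f_stat

# (Count, Sum, Variance) rows transcribed from Table 4.
TABLE_4 = [(4, 195.4, 2947.62), (4, 180.04, 1922.30), (4, 157.15, 1040.64)]
F_CRIT = 4.2564  # critical value reported in Table 5

ss_b, ss_w, f_stat = one_way_anova_from_summary(TABLE_4)
print(round(ss_b, 2), round(ss_w, 2), round(f_stat, 4), f_stat < F_CRIT)
```

An F statistic well below the critical value is consistent with the large p-value reported in Table 5.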

for managing any over-segmentation problems and can solve special challenges, such as those described in Section 2.

7. Acknowledgements

The authors would like to thank Prof Dr. Khairuddin Omar of the National University of Malaysia for his assistance and cooperation. This study was funded by the FRGS/1/2014/ICT07/UKM/01/1 grant entitled "Improving Segmentation of Arabic Handwriting by Determination of Neighborhood Using Voronoi Diagrams."

8. References

1. Zeki A. Segmentation of Arabic characters using Voronoi diagrams. Universiti Kebangsaan Malaysia, Malaysia. 2008.
2. Ramdan J, Omar K. Comparative Study of Algorithms for Voronoi Diagram Construction on Segmentation of Arabic Hand Writing. Australian Journal of Basic and Applied Sciences. 2011; 5(11):1653–67.
3. Al Hamad H. Over-segmentation of handwriting Arabic scripts using an efficient heuristic technique. International Conference on Wavelet Analysis and Pattern Recognition. 2012; 180–5.
4. Zeki A, Zakaria M, Liong C. The use of Area-Voronoi Diagram in Separating Arabic Text Connected Components. Proceedings of the 3rd International Conference on Electrical Engineering and Informatics. 2007.
5. Jyothi J, Manjusha K, Kumar MA, Soman KP. Innovative Feature Sets for Machine Learning based Telugu Character Recognition. Indian Journal of Science and Technology. 2015 Sep; 8(24). Doi: 10.17485/ijst/2015/v8i24/79996.
6. Billauer E. Peak detection using MATLAB [Internet]. 2013. Available from: http://www.billauer.co.il.
7. Pechwitz M, Maddouri SS, Margner V, Ellouze N, Amiri H. IFN/ENIT-database of handwritten Arabic words. Proceedings of CIFED. Citeseer. 2002. p. 127–36.
8. Slimane F, Kanoun S, Alimi AM. Database and Evaluation Protocols for Arabic Printed Text Recognition. Proceedings of 10th International Conference on Document Analysis and Recognition. 2012.
9. Al Hamad H, Hamdi-Cherif A. The Arabic Center for Document Analysis and Recognition (ACDAR) - Structure and Perspectives. Recent Advances in Information Science. 2010; (1):85–91.
10. Al-ma S, Elliman D, Higgins C. A Data Base for Arabic Handwritten Text Recognition Research. Proceedings of the 8th International Workshop on Frontiers in Handwriting Recognition. 2002; 1(1):117–21.
11. Ramdan J, Omar K, Mady A. Arabic Handwriting Data Base for Text Recognition. Proceedings of the 4th International Conference on Electrical Engineering and Informatics. 2013; 11(13). p. 580–4.
12. Shukla MK, Banka H, Yadav KP. Structural Features Extraction for Devnagari and Bangla Language Documents. Indian Journal of Science and Technology. 2015 Jul; 8(13). Doi: 10.17485/ijst/2015/v8i13/56453.
13. Bataineh B, Abdullah SNHS, Omar K. A statistical global feature extraction method for optical font recognition. Intelligent Information and Database Systems. Springer. 2011; 257–67.
14. Thakare V. Survey On Image Texture Classification Techniques. International Journal of Advancements in Technology. 2013; 4(1):97–104. Available from: http://ijict.org/index.php/ijoat/article/viewArticle/imagetexture
15. Zitova B. Image registration methods: a survey. Image Vis Comput [Internet]. 2003; 21(11):977–1000. Available from: http://dx.doi.org/10.1016/S0262-8856(03)00137-9 http://www.sciencedirect.com/science/article/pii/
16. Zeki A. The segmentation problem in arabic character recognition the state of the art. 1st International Conference on Information and Communication Technologies. 2005; 11–26.
17. Zeki A, Zakaria MS, Liong CY. Isolation of dots for arabic ocr using voronoi diagrams. Proceedings of the International Conference on Electrical Engineering and Informatics. 2007. p. 199–202.
