Survey On Handwritten Gujarati Word Image Matching

The document discusses different approaches for retrieving documents from handwritten text, focusing on the Gujarati script. It describes recognition-based approaches using OCR and recognition-free approaches using word spotting. It analyzes related work on word spotting techniques for various scripts and identifies gaps for research on handwritten Gujarati text.


2020 6th International Conference on Advanced Computing & Communication Systems (ICACCS)

Survey on Handwritten Gujarati Word Image Matching

Kinjal S. Gautam
Department of Information Technology
Dharmsinh Desai University
Nadiad, India

Mukesh M. Goswami
Department of Information Technology
Dharmsinh Desai University
Nadiad, India

Abstract: Over the past decade, there has been rising interest in knowledge indexing using word matching, as shown by the ever-increasing number of approaches. Word matching in handwritten documents has been an active research field, and substantial progress has been made for the dominant scripts in recent years. However, a regional script like Gujarati still lacks research attention. This paper provides an overview of the published word spotting research efforts in handwritten as well as printed document images for Indian scripts. The major contribution of this paper is to provide a bird's eye view of the field and to identify potential research gaps specifically for handwritten Gujarati word image matching.

Keywords: Word spotting, Word Matching, Handwritten text, Indian Script, Gujarati

I. INTRODUCTION
Numerous handwritten document-image repositories are available in the public as well as the private domain. Retrieving the relevant documents from an image repository is a challenging task. Word spotting in document images is one of the most widely used techniques for information retrieval. Plenty of work is reported on word spotting in printed text for Western as well as Indian scripts. However, only a few references are found in the literature for word spotting in handwritten text, especially for the Gujarati script. There are two main approaches to document image retrieval, namely recognition-based (OCR-based) and recognition-free (word spotting) [1]. The first approach applies standard text search methods and therefore requires an efficient OCR system, which is currently unavailable for most handwritten documents. OCR-based techniques are not a suitable choice for handwritten documents because of poor document quality, writing style variability, overlapping and touching symbols, the difficulty of segmenting words into symbols, etc. The second approach is the recognition-free approach, which directly matches the query word image against the document word images stored in the database. The current paper focuses on reviewing the existing word-spotting approaches suggested for printed as well as handwritten Indian scripts.
The rest of the paper is organized as follows: approaches for document retrieval are discussed in Section II. Related work is discussed in Section III. Analysis and findings are reported in Section IV, followed by the conclusion and discussion in Section V.

II. APPROACHES FOR RETRIEVAL OF DOCUMENT

A. Recognition Based Approaches (OCR)
OCR stands for Optical Character Recognition [2]. OCR technology scans images containing text and transforms them into editable documents. Fig. 1 shows the major components of an OCR system, which include pre-processing steps like skew correction, binarization, noise removal, etc., followed by line, word and symbol level segmentation. Segmented symbols are recognized using suitable features and classifiers and finally converted into machine-editable text. The accuracy of OCR depends on the accuracy of segmentation, which is a challenging task specifically in handwritten document images due to the large and complex set of characters, including base characters, modifiers, conjuncts, and joint and touching symbols.

Fig. 1. Architecture of Recognition Based Approach
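The pre-processing and segmentation stages of a recognition-based pipeline can be illustrated with a minimal sketch. It is not the system of Fig. 1: the fixed global threshold, the helper names `binarize` and `segment_lines`, and the tiny synthetic page are assumptions made only for illustration. It shows binarization followed by line segmentation using the horizontal projection of ink pixels.

```python
from typing import List, Tuple

Image = List[List[int]]  # grayscale rows, 0 (black) to 255 (white)


def binarize(gray: Image, threshold: int = 128) -> Image:
    """Map dark (ink) pixels to 1 and light (background) pixels to 0.

    The fixed threshold is an assumption; real systems usually estimate it.
    """
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def segment_lines(binary: Image) -> List[Tuple[int, int]]:
    """Return (start_row, end_row_exclusive) for each run of rows containing ink."""
    lines: List[Tuple[int, int]] = []
    start = None
    for y, row in enumerate(binary):
        has_ink = sum(row) > 0
        if has_ink and start is None:
            start = y                      # a text line begins
        elif not has_ink and start is not None:
            lines.append((start, y))       # a blank row closes the line
            start = None
    if start is not None:                  # text line touching the bottom edge
        lines.append((start, len(binary)))
    return lines


# Tiny synthetic "page": two text lines separated by one blank row.
page = [
    [255, 255, 255, 255],
    [255, 10, 20, 255],
    [255, 30, 255, 255],
    [255, 255, 255, 255],
    [40, 255, 255, 50],
]
binary_page = binarize(page)
text_lines = segment_lines(binary_page)
print(text_lines)
```

The same projection idea applied column-wise gives word and symbol boundaries, which is exactly the step that breaks down when handwritten symbols touch or overlap.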

978-1-7281-5197-7/20/$31.00 ©2020 IEEE 534

B. Recognition Free Approach
The recognition-free approach treats a word image as a single unit of recognition and performs Information Retrieval (IR) by comparing the query word image directly with document word images, without segmenting the symbols. Hence, it is the preferred choice for handwritten document images, where symbol segmentation is challenging due to touching and overlapping characters. Fig. 2 shows the outline of the recognition-free approach. Feature extraction and matching, along with indexing, form the crucial components of a recognition-free system.

Fig. 2. Architecture of Recognition Free Approach

III. RELATED WORK
Some research has been reported on word matching in printed Indian-language documents. Himanshu M. Kathiriya and Mukesh M. Goswami [2] evaluated two well-known feature extraction techniques, namely HOG (Histogram of Oriented Gradients) and SD (Shape Descriptors). The features were matched using Dynamic Time Warping (DTW) on a small word-image database obtained from machine-printed book images. The method gives a precision of 59% for HOG and 71% for Shape Descriptor.
Some research on Indian languages is also recorded in the handwritten word spotting literature. The most significant handwritten word spotting work for non-Indian languages is that of Toni M. Rath and R. Manmatha [3]. A brief overview of some important work on handwritten as well as printed word spotting for Indian and non-Indic languages is given below.
In the work of Muhammad Ismail Shah and Ching Y. Suen [4], the database consisted of handwritten Pashto terms that were consistent in length and represented as binary features. They used profile features for feature extraction and cosine similarity, a complete matching method, for feature matching. The method gives a precision of 94.57%.
Toni M. Rath and R. Manmatha [6] presented several features suitable for word image matching with the dynamic time warping algorithm. They used a database of 2381 English words. The overall precision and recall are 72.56% and 65.17% for single features, and 67.31% and 64.02% for the feature set.
Kefali and Chemmam [7] suggested an approach based on word coding using topological characteristics including diacritics, ascenders, descenders, and loops. For each sub-word a set of 4 features is extracted, and each sub-word is represented as a sequence of codes. Experiments were conducted on a compiled database of 1,100 photographs of Algerian postal envelopes written in Arabic, giving a precision of 75% at a recall of 87%.
Ali Abidi et al. [8] defined a word spotting method for handwritten Urdu text that uses the height, width and convex area of each partial word as scalar features, together with the vertical and horizontal word profiles. The recorded F-measure was 72% on 115 queries and 90 handwritten documents written by 90 authors.
Kathiriya Khushali and Mukesh M. Goswami [9] proposed a deep learning-based method. They used a convolutional autoencoder for feature extraction and retrieved the matching images from Gujarati handwritten word images using complete matching. The experiments were performed on a dataset of 8500 word images from 106 word groups. The method gives 67.95% precision at 50% recall on Gujarati word images.
Mukesh Goswami and Suman Mitra [10] introduced a shape similarity measure for high-level strokes with K-NN. The experiment was done on a small word-group database containing 280 word images. The best precision and recall values were 77.61% and 80.91%, respectively, at a similarity threshold of 0.62.
Anand Kumar, C. V. Jawahar, and R. Manmatha [11] presented an efficient search over collections of document images. They used a combination of scalar, profile, structural and transform domain features for feature extraction and Content Sensitive Hashing (a hashing technique for complete matching) for feature matching. The technique achieves high precision and recall (between 94% and 96%) on a large image corpus consisting of seven of Kalidasa's books (such as Abhijnanasakuntalam, Ritusamhara, etc.) in Telugu.
A line-oriented approach to word spotting in handwritten documents is presented in the work of Kolcz et al. [5]. They used 13 pages of Spanish text. Features are extracted per line: the line is not further divided into words, so the query word must be searched for at any location in the line. Dynamic time warping (DTW) was used to match the word. The results are based on hand-picked queries with multiple examples; the best result for any single-word example seems to have a precision of 0.4 or less.
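The studies surveyed in this section report precision (PR), recall (RE), and in one case F-measure (FM). As a reminder of how these retrieval measures relate, here is a minimal sketch using their standard definitions. The function name `precision_recall_f` and the word-image identifiers are hypothetical and do not come from any of the cited papers.

```python
from typing import Sequence, Set, Tuple


def precision_recall_f(retrieved: Sequence[str],
                       relevant: Set[str]) -> Tuple[float, float, float]:
    """Compute precision, recall and F-measure for one query.

    retrieved: identifiers of the word images returned by the system.
    relevant:  identifiers of the word images that truly match the query.
    """
    retrieved_set = set(retrieved)
    hits = len(retrieved_set & relevant)            # correctly retrieved images
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical query: 4 images returned, 3 images are truly relevant.
retrieved_ids = ["w1", "w2", "w3", "w4"]
relevant_ids = {"w1", "w3", "w5"}
p, r, f = precision_recall_f(retrieved_ids, relevant_ids)
print(f"precision={p:.3f} recall={r:.3f} f_measure={f:.3f}")
```

Because precision and recall trade off against each other as the similarity threshold moves, figures such as "67.95% precision at 50% recall" are only comparable across papers when both values are given.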


TABLE I. NON-INDIC SCRIPT HANDWRITTEN DOCUMENT WORD SPOTTING METHODS

Reference | Language | Dataset | Feature Extraction Technique | Feature Matching Technique | Evaluation Measure
Muhammad Ismail Shah and Ching Y. Suen [4] | Pashto | 4200 words | Profile features | Cosine similarity (complete matching) | PR: 94.57%
Kolcz et al. [5] | Spanish | 13 pages | Line-oriented approach | Dynamic Time Warping | -
Toni M. Rath and R. Manmatha [6] | English | Single author, 15-image collection, 2381 words | Single features (UPP, MPP, LPP, UWP, LWP, B/I, GV); feature sets (GSD, GHD, GVD) | Dynamic Time Warping | PR, RE: overall single features: 72.56%, 65.17%; overall feature set: 67.31%, 64.02%
Abderrahmane Kefali and Chaouki Chemmam [7] | Arabic | 1100 images of postal envelopes | Shape coding | Dynamic Time Warping | PR, RE: 75%, 87%

TABLE II. INDIAN LANGUAGE HANDWRITTEN WORD SPOTTING METHODS

Reference | Language | Dataset | Feature Extraction Technique | Feature Matching Technique | Evaluation Measure
Ali Abidi et al. [8] | Urdu | 90 pages | Profile features | Dynamic Time Warping | FM: 72%
Kathiriya Khushali and Mukesh M. Goswami [9] | Gujarati | 8500 word images from 106 word groups | Convolutional autoencoder (CAE) | Complete matching using KNN | PR: 67.95%, RE: 50%

TABLE III. INDIAN LANGUAGE PRINTED WORD SPOTTING METHODS

Reference | Language | Dataset | Feature Extraction Technique | Feature Matching Technique | Evaluation Measure
Himanshu M. Kathiriya and Mukesh M. Goswami [2] | Gujarati | 15 books from DLI | HOG, Shape Descriptor (block word) | Dynamic Time Warping | PR: HOG: 59%, Shape Descriptor: 71%
Mukesh Goswami and Suman Mitra [10] | Gujarati | 280 word images of 48 word groups | Stroke extraction algorithm | Shape similarity computation using HLS (high-level stroke) | PR: 77.61%, RE: 80.91%
Sargur N. Srihari et al. [15] | Devanagari | Devanagari: 18 | GSC (Gradient, Structural and Concavity) | Correlation similarity (complete matching) | PR: Sanskrit: 90%, English: 60%
C. V. Jawahar, A. Balasubramanian, Million Meshesha [16] | Hindi | Hindi: 3354 | Profiling features | DTW (partial matching) | PR, RE: Hindi: 92%, 93%
Anand et al. [11] | Telugu | Books of Kalidasa (Indian poet of antiquity), 20 words in each book | Combination of scalar, profile, structural and transform domain | Content Sensitive Hashing (hashing technique, complete matching) | PR, RE: Abhijnanasakuntalam: 96.79%, 91.27%; Ritusamhara: 94.65%, 93.67%
Million Meshesha and C. V. Jawahar [17] | Hindi | Each language 4000 words | Word profiles, moments based and transform domain representation | DTW (partial matching similarity), Morphological (extract matching image) | PR, RE: DTW: 90.81%, 89.58%; Morphological: 78.83%, 76.43%

IV. ANALYSIS AND FINDINGS

Tables I, II and III summarize the work reported in the literature review. The summary covers feature extraction, feature matching, database description and evaluation measure. From the analysis, we find that little work has been reported for the Gujarati language, especially for handwritten text. Further analysis of feature extraction, feature matching and databases is reported in subsections A, B and C.

A. Feature Extraction

From the literature tables, we list some of the feature extraction techniques: Profile Features [9], Shape Coding [7], Stroke Extraction Algorithm [10], Word Profile [17], Histogram of Oriented Gradients (HOG), and Gradient, Structural and Concavity (GSC) [15]. Among these, profile features and GSC features give the best reported results. Fig. 3 shows the classification of feature extraction techniques.

Fig. 3. Feature Extraction Techniques
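Profile features are among the most frequently used features in the surveyed work. The sketch below shows one common way to compute three column-wise profiles from a binarized word image. The exact definitions vary between papers, so the function names and the conventions chosen here (ink = 1, empty columns mapped to the image height) are assumptions for illustration only.

```python
from typing import List

Binary = List[List[int]]  # rows of 0/1 pixels, 1 = ink


def projection_profile(img: Binary) -> List[int]:
    """Number of ink pixels in each column."""
    height, width = len(img), len(img[0])
    return [sum(img[y][x] for y in range(height)) for x in range(width)]


def upper_profile(img: Binary) -> List[int]:
    """Distance from the top edge to the first ink pixel in each column.

    Columns without ink get the image height (an assumed convention).
    """
    height, width = len(img), len(img[0])
    profile = []
    for x in range(width):
        ink_rows = [y for y in range(height) if img[y][x]]
        profile.append(ink_rows[0] if ink_rows else height)
    return profile


def lower_profile(img: Binary) -> List[int]:
    """Distance from the bottom edge to the last ink pixel in each column."""
    height, width = len(img), len(img[0])
    profile = []
    for x in range(width):
        ink_rows = [y for y in range(height) if img[y][x]]
        profile.append(height - 1 - ink_rows[-1] if ink_rows else height)
    return profile


# Tiny synthetic word image (3 rows x 4 columns).
word = [
    [0, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 1, 0, 1],
]
proj = projection_profile(word)
upper = upper_profile(word)
lower = lower_profile(word)
print(proj, upper, lower)
```

Each profile is a sequence with one value per image column, so words of different widths yield sequences of different lengths. This is why the choice of matching technique matters.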

B. Feature Matching Techniques

The matching process falls into two categories: training-based and training-free. Training-based methods include hidden Markov models and K-nearest neighbors, whereas training-free methods are further divided into complete matching and incremental matching. Euclidean distance and cosine similarity [4] are complete matching techniques. Dynamic time warping [16] is an incremental matching technique. Fig. 4 shows the major classification of feature matching techniques.

Fig. 4. Feature Matching Techniques
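The difference between complete and incremental matching can be made concrete with a small sketch. Cosine similarity compares two feature vectors of equal length in one step, while dynamic time warping aligns two sequences of possibly different lengths element by element. The code below uses the textbook forms of both. The absolute-difference local cost and the example sequences are assumptions and are not taken from [4] or [16].

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Complete matching: compare two equal-length feature vectors at once."""
    if len(a) != len(b):
        raise ValueError("cosine similarity needs vectors of equal length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Incremental matching: classic dynamic time warping distance.

    cost[i][j] is the cheapest alignment of a[:i] with b[:j]; the local
    cost is the absolute difference (an assumed choice).
    """
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # a[i-1] repeats
                                     cost[i][j - 1],      # b[j-1] repeats
                                     cost[i - 1][j - 1])  # both advance
    return cost[n][m]


query = [1.0, 3.0, 0.0, 1.0]                    # e.g. a projection profile
same_width = [1.0, 3.0, 0.0, 1.0]               # same word, same width
stretched = [1.0, 1.0, 3.0, 3.0, 0.0, 1.0]      # same shape, written wider

cos_same = cosine_similarity(query, same_width)
dtw_stretched = dtw_distance(query, stretched)
print(cos_same, dtw_stretched)
```

Cosine similarity cannot be applied to `query` and `stretched` without first resampling them to a common length, whereas DTW absorbs the stretching at a cost of O(n·m) operations per comparison.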

C. Handwritten Databases Description
A database of handwritten document or word images is the key requirement for successful research in the field. The factors affecting the quality of a handwritten image database include the writers or writing styles, font size, ink variations, etc. The size of the database (i.e. the number of word groups and samples per word group) is also important, specifically for deep learning based methods. Table IV lists some of the most commonly used repositories for handwritten document word matching experiments in the literature. It also describes their size, unique characteristics and the script they belong to.

V. CONCLUSION
The current paper presents a bird's eye view of the word spotting techniques used for handwritten as well as printed document images for Indian scripts. It also summarizes the most commonly used feature extraction and feature matching techniques and lists some of the existing datasets available in the literature. We also conclude that very little work has been reported on handwritten word matching for the Gujarati script, and that incremental matching has not been tried for it so far.

TABLE IV. DATASET OF HANDWRITTEN DOCUMENTS

Data set | Script | Description
George Washington database (GW) [1] | English | 20 pages of handwritten letters written in 1755 by George Washington.
Parzival database [12] | German | The Parzival file consists of 45 handwritten pages by three different German poets in the thirteenth century.
CENPARMI [13] | English | 137 handwritten documents by 13 writers divided into two sets: evaluation (112 documents) and validation (25 documents).
IAM database [14] | English | 1539 lines of handwritten records written by 657 writers in 1978. It is divided into three sets: training, testing and validation. A good feature of this database is that each collection contains text lines written by several authors, making it a good source for word searching across different writing styles.
Gujarati handwritten database [9] | Gujarati | A database of 8500 handwritten Gujarati word images divided into 106 word groups. The database was collected from a variety of subjects including school children, parents, university students, professors, staff, as well as random subjects. It contains rich writing style and ink variations.

References

[1] Ahmed, Rashad, Wasfi G. Al-Khatib, and Sabri Mahmoud. "A survey on handwritten documents word spotting." International Journal of Multimedia Information Retrieval 6.1 (2017): 31-47.
[2] Kathiriya, Himanshu M., and Mukesh M. Goswami. "Performance Analysis of Word Spotting Techniques Using HOG and Shape Descriptor on Gujarati Script." Proceedings of the International Conference on Intelligent Systems and Signal Processing. Springer, Singapore, 2018.
[3] Rath, T. M., et al. "Indexing for a digital library of George Washington's manuscripts: a study of word matching techniques." CIIR Technical Report (2002).
[4] Shah, Muhammad Ismail, and Ching Y. Suen. "Word spotting in gray scale handwritten pashto documents." 2010 12th International Conference on Frontiers in Handwriting Recognition. IEEE, 2010.
[5] Kolcz, Aleksander, et al. "A line-oriented approach to word spotting in handwritten documents." Pattern Analysis & Applications 3.2 (2000): 153-168.
[6] Rath, Toni M., and Raghavan Manmatha. "Features for word spotting in historical manuscripts." Seventh International Conference on Document Analysis and Recognition, 2003. Proceedings. IEEE, 2003.
[7] Kefali, Abderrahmane, and Chaouki Chemmam. "A Semi-Automatic Approach of old Arabic Documents Indexing." CIIA. 2011.
[8] Abidi, Ali, et al. "Word spotting based retrieval of Urdu handwritten documents." 2012 International Conference on Frontiers in Handwriting Recognition. IEEE, 2012.
[9] Kathiriya Khushali and Mukesh M. Goswami. "Hand written Gujarati word Image Matching using Convolutional Auto encoder." M.Tech. Dissertation Thesis, Dharmsinh Desai University, 2019.
[10] Goswami, Mukesh M., and Suman K. Mitra. "High Level Shape Representation in Printed Gujarati Character." ICPRAM. 2017.
[11] Kumar, Anand, C. V. Jawahar, and R. Manmatha. "Efficient search in document image collections." Asian Conference on Computer Vision. Springer, Berlin, Heidelberg, 2007.
[12] Wüthrich, Markus, et al. "Language model integration for the recognition of handwritten medieval documents." 2009 10th International Conference on Document Analysis and Recognition. IEEE, 2009.
[13] Alamri, Huda, et al. "A novel comprehensive database for Arabic off-line handwriting recognition." Proceedings of 11th International Conference on Frontiers in Handwriting Recognition, ICFHR. Vol. 8. 2008.
[14] Shekhar, Ravi, and C. V. Jawahar. "Word image retrieval using bag of visual words." 2012 10th IAPR International Workshop on Document Analysis Systems. IEEE, 2012.
[15] Srihari, Sargur N., et al. "Spotting words in Latin, Devanagari and Arabic scripts." Vivek-Bombay- 16.3 (2006): 2
[16] Jawahar, C. V., A. Balasubramanian, and Million Meshesha. "Word-level access to document image datasets." Proceedings of the workshop on computer vision, graphics and image processing. 2004.
[17] Meshesha, Million, and C. V. Jawahar. "Matching word images for content-based retrieval from printed document images." International Journal of Document Analysis and Recognition (IJDAR) 11.1 (2008): 29-38.
