Survey On Handwritten Gujarati Word Image Matching
Abstract—Over the past decade, there has been a rising interest in addressing knowledge indexing using word matching, demonstrated by the ever-increasing number of approaches. Research on word matching in handwritten documents has been an active field, and substantial progress has been made for the dominant scripts in recent years. However, a regional script like Gujarati still lacks research attention. This paper provides an overview of the published word spotting research efforts in handwritten as well as printed document images for Indian scripts. The major contribution of this paper is to provide a bird’s eye view of the field and to identify potential research gaps specifically for handwritten Gujarati word image matching.
Keywords—Word spotting, Word Matching, Handwritten text, Indian Script, Gujarati
I. INTRODUCTION
There are numerous handwritten document-image repositories available in the public as well as the private domain. Retrieval of the relevant documents from an image repository is a challenging task. Word spotting in document images is one of the popular techniques for information retrieval. Plenty of work is reported for word spotting in printed text for Western as well as Indian scripts. However, only a few references are found in the literature for word spotting in handwritten text, especially for the Gujarati script. There are two main approaches to document image retrieval, namely recognition-based (OCR-based) and recognition-free (word spotting) [1]. The first approach applies standard text search methods and requires an efficient OCR system, which is currently unavailable for most handwritten documents. OCR-based techniques are not a proper choice for handwritten documents due to poor document quality, writing style variability, overlapping and touching symbols, the difficulty of segmenting words into symbols, etc. The second approach is the recognition-free approach, which directly matches a query word image with the document word images stored in the database. The current paper reviews existing word-spotting approaches suggested for printed as well as handwritten Indian scripts.
The rest of the paper is organized as follows: Approaches for document retrieval are discussed in Section II. Related work is discussed in Section III. Analysis and findings are reported in Section IV, followed by conclusion and discussion in Section V.
II. APPROACHES FOR RETRIEVAL OF DOCUMENT
A. Recognition Based Approaches (OCR)
OCR stands for Optical Character Recognition [2]. OCR technology scans images containing text and transforms them into editable documents. Fig. 1 shows the major components of an OCR system, which include pre-processing steps like skew correction, binarization, noise removal, etc., followed by line, word and symbol level segmentation. Segmented symbols are recognized using suitable features and classifiers and finally converted into machine-editable text. The accuracy of OCR depends on the accuracy of segmentation, which is a challenging task specifically in handwritten document images due to the large and complex set of characters, including base characters, modifiers, conjuncts, and joint and touching symbols.
Fig. 1. Architecture of Recognition Based Approach
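As a rough illustration of the pre-processing and segmentation stages of a recognition-based pipeline, the sketch below binarizes a tiny grayscale image and splits it into column runs using a vertical projection profile. It is not the pipeline of any surveyed system: Otsu thresholding and projection-profile segmentation are assumed stand-ins for the "binarization" and "segmentation" steps, and all function names are hypothetical.

```python
from typing import List, Tuple

Image = List[List[int]]  # grayscale rows, 0 (dark ink) .. 255 (white paper)


def otsu_threshold(img: Image) -> int:
    """Return the gray level that maximizes the between-class variance."""
    hist = [0] * 256
    for row in img:
        for px in row:
            hist[px] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))
    best_t, best_var = 0, -1.0
    w_low, sum_low = 0, 0  # pixel count and intensity sum of the class <= t
    for t in range(256):
        w_low += hist[t]
        sum_low += t * hist[t]
        if w_low == 0:
            continue  # no pixels in the dark class yet
        w_high = total - w_low
        if w_high == 0:
            break  # every pixel is already in the dark class
        mean_low = sum_low / w_low
        mean_high = (sum_all - sum_low) / w_high
        var_between = w_low * w_high * (mean_low - mean_high) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(img: Image, threshold: int) -> List[List[int]]:
    """Mark pixels at or below the threshold as ink (1), the rest as 0."""
    return [[1 if px <= threshold else 0 for px in row] for row in img]


def column_profile(binary: List[List[int]]) -> List[int]:
    """Vertical projection profile: number of ink pixels in each column."""
    return [sum(col) for col in zip(*binary)]


def segment_runs(profile: List[int], min_ink: int = 1) -> List[Tuple[int, int]]:
    """Return half-open (start, end) column ranges that contain ink."""
    runs, start = [], None
    for x, value in enumerate(profile):
        if value >= min_ink and start is None:
            start = x
        elif value < min_ink and start is not None:
            runs.append((start, x))
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs


if __name__ == "__main__":
    ink, paper = 30, 220  # assumed toy gray levels
    row = [paper, ink, ink, paper, paper, ink, ink, ink, paper]
    image = [list(row) for _ in range(4)]
    binary = binarize(image, otsu_threshold(image))
    print(segment_runs(column_profile(binary)))
```

A gap-based split like this only works when symbols are separated by empty columns. Touching and overlapping handwritten symbols produce no such gap, which is the segmentation difficulty this section describes.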
Authorized licensed use limited to: ULAKBIM UASL - Suleyman Demirel Universitesi. Downloaded on February 02,2022 at 12:36:37 UTC from IEEE Xplore. Restrictions apply.
2020 6th International Conference on Advanced Computing & Communication Systems (ICACCS)
B. Recognition Free Approach
The recognition-free approach treats a word image as a single unit of recognition and performs Information Retrieval (IR) by comparing the query word image directly with document word images, without segmenting the symbols. Hence, it is a preferred choice for handwritten document images, where symbol segmentation is challenging due to touching and overlapping characters. Fig. 2 shows the outline of the recognition-free approach. Feature extraction and matching, along with indexing, form the crucial components of a recognition-free system.
Fig. 2. Architecture of Recognition Free Approach
III. RELATED WORK
Some research has been reported on word matching for printed Indian languages. Two well-known feature extraction techniques, namely HOG (Histogram of Oriented Gradients) and SD (Shape Descriptors), were used by Himanshu M. Kathiriya and Mukesh M. Goswami [2]. Features were matched using Dynamic Time Warping (DTW) on a small word image database obtained from machine-printed book images. The method achieves a precision of 59% with HOG and 71% with the Shape Descriptor.
Some research on Indian languages is recorded in the handwritten word spotting literature. Among the most significant handwritten word spotting work for non-Indian languages is that of Toni M. Rath and R. Manmatha [3]. A brief overview of some important work on handwritten as well as printed word spotting for Indian and non-Indian languages is given below.
In the work of Muhammad Ismail Shah and Ching Y. Suen [4], the database consisted of handwritten Pashto terms that were consistent in length and represented as binary features. They used profile features for feature extraction and cosine similarity for feature matching, which is a complete matching method. The method gives a precision of 94.57%.
Toni M. Rath and R. Manmatha [6] presented several features appropriate for word image matching with the dynamic time warping algorithm. They used a database of 2381 English word images. The precision and recall for single features overall are 72.56% and 65.17%. The precision and recall for the overall feature set are 67.31% and 64.02%.
Kefali and Chemmam [7] suggested an approach based on word coding using topological characteristics including diacritics, ascenders, descenders, and loops. For each sub-word a set of 4 features is extracted, and each sub-word is represented as a sequence of codes. Experimentation was conducted on a compiled database of 1,100 Algerian postal envelope photographs written in Arabic, producing an accuracy of 75% while attaining an 87% recall.
Ali Abidi et al. [8] defined a word spotting method for handwritten Urdu text and used the height, width and convex area of each partial word, together with the vertical and horizontal word profiles, as scalar characteristics. In terms of F-Measure, the recorded performance level was 72% on 115 queries and 90 handwritten documents written by 90 authors.
Kathiriya Khushali and Mukesh M. Goswami [9] proposed a deep learning-based method. They used a convolutional autoencoder for feature extraction and retrieved the matching images from Gujarati handwritten word images using complete matching. The experiments were performed on a dataset of 8500 word images from 106 word groups. The method gives 67.95% precision at 50% recall on Gujarati word images.
Mukesh Goswami and Suman Mitra [10] introduced a shape similarity measure for high-level strokes with K-NN. The experiment was done on a small word-group database containing 280 word images. The optimum precision and recall values were 77.61% and 80.91%, respectively, at a similarity threshold of 0.62.
Anand Kumar, C.V. Jawahar, and R. Manmatha [11] present efficient search in collections of document images. They used a combination of scalar, profile, structural and transform-domain features for feature extraction and Content Sensitive Hashing (a complete matching technique based on hashing) for feature matching. The technique achieves high precision and recall (between 94% and 96%) on a large image corpus consisting of seven of Kalidasa's books (such as Abhijnanasakuntalam, Ritusamhara, etc.) in Telugu.
A line-oriented approach to word spotting in handwritten documents is presented in the work of Kolcz et al. [5]. They used 13 pages of Spanish-language text. Features are extracted line by line: a line is not further divided into words, so the query word must be searched for at every location in the line. Dynamic time warping (DTW) was used to match the word. The results are based on hand-picked queries with multiple examples; the best result for any single-word example appears to be a precision of 0.4 or less.
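Several of the works surveyed here combine column-wise profile features with DTW matching and report precision and recall. The sketch below illustrates that general recipe on binary word images. It is a minimal illustration under stated assumptions, not a reimplementation of any cited method: the three per-column features, the squared Euclidean local cost, the normalization by the summed sequence lengths, and all function names are assumptions made for this example.

```python
from typing import Dict, List, Sequence, Set, Tuple

Binary = List[List[int]]  # word image rows, 1 = ink, 0 = background
Feature = Tuple[float, float, float]


def profile_features(word: Binary) -> List[Feature]:
    """One feature vector per column: ink fraction, gap above, gap below."""
    height = len(word)
    feats = []
    for col in zip(*word):
        ink_rows = [y for y, v in enumerate(col) if v]
        if ink_rows:
            upper = ink_rows[0] / height                 # top edge to first ink
            lower = (height - 1 - ink_rows[-1]) / height  # last ink to bottom edge
        else:
            upper = lower = 1.0  # assumed convention for empty columns
        feats.append((len(ink_rows) / height, upper, lower))
    return feats


def dtw_distance(a: Sequence[Feature], b: Sequence[Feature]) -> float:
    """Dynamic time warping cost between two feature sequences."""
    if not a or not b:
        raise ValueError("both sequences must be non-empty")
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = sum((p - q) ** 2 for p, q in zip(a[i - 1], b[j - 1]))
            cost[i][j] = local + min(cost[i - 1][j],      # repeat column of b
                                     cost[i][j - 1],      # repeat column of a
                                     cost[i - 1][j - 1])  # advance both
    return cost[n][m] / (n + m)  # assumed length normalization


def rank_matches(query: Binary, database: Dict[str, Binary]) -> List[Tuple[str, float]]:
    """Sort database word images by increasing DTW distance to the query."""
    q = profile_features(query)
    scored = [(name, dtw_distance(q, profile_features(img)))
              for name, img in database.items()]
    return sorted(scored, key=lambda item: item[1])


def precision_recall(retrieved: Sequence[str], relevant: Set[str]) -> Tuple[float, float]:
    """Precision and recall of a retrieved list against the relevant set."""
    hits = sum(1 for name in retrieved if name in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


if __name__ == "__main__":
    query = [[0, 1, 0], [1, 1, 1], [0, 1, 0]]
    database = {
        "stretched": [[0, 1, 1, 0], [1, 1, 1, 1], [0, 1, 1, 0]],
        "other": [[1, 1, 1], [0, 0, 0], [0, 0, 0]],
    }
    ranking = rank_matches(query, database)
    top1 = [name for name, _ in ranking[:1]]
    print(ranking, precision_recall(top1, {"stretched"}))
```

Because DTW may repeat columns, a horizontally stretched copy of a word can still align with the query at low cost, which is why it suits handwriting with variable character widths. This quadratic-time comparison corresponds to what the survey calls complete matching, where the query is compared against every stored word image.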
C. Handwritten Databases Description
A database of handwritten document or word images is the key requirement for successful research in the field. The various factors affecting the quality of a handwritten image database include writers or writing styles, font size, ink variations, etc. The size of the database (i.e. number of word groups and samples per word group) is also important, specifically for deep learning based methods. Table IV lists some of the most commonly used repositories for handwritten document word matching experiments in the literature. It also describes their size, unique characteristics, and the script they belong to.
TABLE IV. DATASET OF HANDWRITTEN DOCUMENTS
Data set | Script | Description
George Washington database (GW) [1] | English | 20 pages of handwritten letters written in 1755 by George Washington.
Parzival database [12] | German | The Parzival file consists of 45 handwritten pages by three different German poets in the thirteenth century.
CENPARMI [13] | English | 137 handwritten documents by 13 writers divided into two sets: evaluation (112 documents) and validation (25 documents).
IAM database [14] | English | 1539 lines of handwritten records written by 657 writers in 1978. It is divided into three sets: training, testing and validation. A good feature of this database is that each collection contains text lines written by several authors, making it a good source for word searching for different writing styles.
Gujarati handwritten database [9] | Gujarati | The database of 8500 handwritten Gujarati word images divided into 106 word-groups. The database was collected from a variety of subjects including school children, parents, university students, professors, staff, as well as random subjects. It contains rich writing style and ink variations.
V. CONCLUSION
The current paper presents a bird’s eye view of the word spotting techniques used for handwritten as well as printed document images for Indian scripts. It also summarizes the most commonly used feature extraction and feature matching techniques and lists some of the existing datasets available in the literature. It is also concluded that very little work has been reported on handwritten word matching for the Gujarati script and that incremental matching has not been experimented with so far.
References
[1] Ahmed, Rashad, Wasfi G. Al-Khatib, and Sabri Mahmoud. "A survey on handwritten documents word spotting." International Journal of Multimedia Information Retrieval 6.1 (2017): 31-47.
[2] Kathiriya, Himanshu M., and Mukesh M. Goswami. "Performance Analysis of Word Spotting Techniques Using HOG and Shape Descriptor on Gujarati Script." Proceedings of the International Conference on Intelligent Systems and Signal Processing. Springer, Singapore, 2018.
[3] Rath, T. M., et al. "Indexing for a digital library of George Washington’s manuscripts: a study of word matching techniques." CIIR Technical Report (2002).
[4] Shah, Muhammad Ismail, and Ching Y. Suen. "Word spotting in gray scale handwritten pashto documents." 2010 12th International Conference on Frontiers in Handwriting Recognition. IEEE, 2010.
[5] Kolcz, Aleksander, et al. "A line-oriented approach to word spotting in handwritten documents." Pattern Analysis & Applications 3.2 (2000): 153-168.
[6] Rath, Toni M., and Raghavan Manmatha. "Features for word spotting in historical manuscripts." Seventh International Conference on Document Analysis and Recognition, 2003. Proceedings. IEEE, 2003.
[7] Kefali, Abderrahmane, and Chaouki Chemmam. "A Semi-Automatic Approach of old Arabic Documents Indexing." CIIA. 2011.
[8] Abidi, Ali, et al. "Word spotting based retrieval of Urdu handwritten documents." 2012 International Conference on Frontiers in Handwriting Recognition. IEEE, 2012.
[9] Kathiriya Khushali and Mukesh M. Goswami. "Hand written Gujarati word Image Matching using Convolutional Auto encoder." M.Tech. Dissertation Thesis published at Dharamsinh Desai University 2019.
[10] Goswami, Mukesh M., and Suman K. Mitra. "High Level Shape Representation in Printed Gujarati Character." ICPRAM. 2017.
[11] Kumar, Anand, C. V. Jawahar, and R. Manmatha. "Efficient search in document image collections." Asian Conference on Computer Vision. Springer, Berlin, Heidelberg, 2007.
[12] Wüthrich, Markus, et al. "Language model integration for the recognition of handwritten medieval documents." 2009 10th International Conference on Document Analysis and Recognition. IEEE, 2009.
[13] Alamri, Huda, et al. "A novel comprehensive database for Arabic off-line handwriting recognition." Proceedings of 11th International Conference on Frontiers in Handwriting Recognition, ICFHR. Vol. 8. 2008.
[14] Shekhar, Ravi, and C. V. Jawahar. "Word image retrieval using bag of visual words." 2012 10th IAPR International Workshop on Document Analysis Systems. IEEE, 2012.
[15] Srihari, Sargur N., et al. "Spotting words in Latin, Devanagari and Arabic scripts." Vivek-Bombay- 16.3 (2006): 2
[16] Jawahar, C. V., A. Balasubramanian, and Million Meshesha. "Word-level access to document image datasets." Proceedings of the workshop on computer vision, graphics and image processing. 2004.
[17] Meshesha, Million, and C. V. Jawahar. "Matching word images for content-based retrieval from printed document images." International Journal of Document Analysis and Recognition (IJDAR) 11.1 (2008): 29-38.