
2017 Fourth International Conference on Image Information Processing (ICIIP)

Text Detection and Localization in Natural Scene Images Using MSER and Fast Guided Filter
Rituraj Soni∗, Bijendra Kumar†, Satish Chand‡
∗ Scholar, Department of Computer Engineering, NSIT, New Delhi, India
† Professor, Department of Computer Engineering, NSIT, New Delhi, India
‡ Professor, School of Computer and Systems Science, JNU, New Delhi, India
[email protected], † [email protected], ‡ [email protected]

Abstract—Textual matter present in a natural scene image provides indispensable information about it. The semantics and information present in natural scene images can be perceived by extracting the text regions in them. Detection and localization of text from natural scene images is a challenging task for the analysis of images due to varying font size, font type, and illumination. In this paper, we propose a hybrid approach for text detection and localization based on a text confidence score using three attributes, namely stroke width dissimilarity, color dissimilarity and occupy rate convex area, to discern text and non-text constituents. The aim of this paper is to achieve fast detection and localization of text regions in low-resolution and blurred images. To accomplish this, the possible candidate regions are extracted using edge smoothing by a fast guided filter followed by MSER. The text confidence score on these constituents is calculated using the Bayesian framework with the help of the above-mentioned three attributes. Experimental results on the benchmark ICDAR 2013 testing dataset show the efficacy of our method in terms of precision, recall, and f-measure.

Index Terms—Text Detection and Localization, Text Confidence Score, Edge Smoothing MSER, Fast Guided Filter.

I. INTRODUCTION

Extraction of text regions from natural scene images is one of the crucial tasks in computer vision. The information about scene images like notice boards, advertisement boards, road signs, etc. is embedded in the form of text. These texts provide a rich amount of information about such images and can be used in heterogeneous applications such as license plate localization, robot navigation, content-based image retrieval, guidance for visually impaired people [1], etc. Our proposed work in this paper revolves around the text detection and localization process, which is focused on estimating text locations in the image and creating bounding boxes around them. Researchers over the years have made significant progress in this field; however, this domain is still open due to various challenges like variable font size, alignment of the text, color, complex natural scenes, occlusion, noise, blur [1], illumination variations, viewpoint, distortion of the image, etc. Further, some false positives detected due to the presence of background objects like bricks, windows, leaves, etc. may decrease precision.

In this paper, we present a hybrid approach for scene text detection and localization which consists of: a) First, an edge smoothing process using the fast guided filter followed by extracting prospective text areas using MSER. b) Second, three attributes, namely Stroke Width Dissimilarity, Color Dissimilarity and Occupy Rate Convex Area, are calculated on these areas. c) Third, we blend these attributes using a Bayesian classifier to estimate the TCS (Text Confidence Score), which determines the feasibility of a region being text. d) Last, the labeling of constituents as text and non-text is carried out using graph cut and Markov Random Fields (MRFs) [2], followed by text line integration with the help of the mean-shift clustering approach.

The arrangement of the paper is as follows: the introduction is discussed in Section I, related work is reviewed in Section II, Section III defines the working of the proposed method, and experiments and results are discussed in Section IV, with concluding remarks in Section V.

II. RELATED WORK

Numerous methods have been developed and proposed in the past for scene text detection and localization; based on an extensive survey [1], these methods can be divided into edge, stroke, texture, connected component (CC) and hybrid based methods.

In the edge based method [3], edges are detected by an edge detector and text components are extracted by morphological operations. It is suitable for images with uniform gradient but gives poor results for images having a complicated background. The texture based method uses the Wavelet Transform, Discrete Cosine Transform (DCT) [1], Histogram of Gradients (HOG) and Local Binary Pattern (LBP) to acquire texture features, as text regions have different texture properties compared to non-text regions. Such methods are sensitive to text arrangement, but efficient for distinguishing crowded characters. The stroke based method [4] uses stroke width as the text-determining feature to extract text regions from an image, but it is unsuccessful on images with varying backgrounds. The connected component (CC) based method [4] uses color clustering or edge detection for separating text components from the image. CC based methods have lower computation cost due to the small number of segmented candidate components, but they require advance information about the scale and position of the text. Maximally Stable Extremal Regions (MSER) based methods [5], which are sensitive to image blurriness [6], can be included in the CC based methods.

The disadvantages present in each of these methods prompt

978-1-5090-6734-3/17/$31.00 ©2017 IEEE 351



researchers to choose hybrid methods [1] for text detection in order to achieve higher precision and recall. Yi and Tian [7] present a hybrid method to locate horizontal text in steadily colored images: the clustering of text at the pixel level is achieved by applying a bigram color uniformity based method, and the extraction of text is done by stroke segmentation. Li et al. [8] discuss a method using the integration of three cues, namely stroke width variation, perceptual divergence and histogram of gradients, to classify text and non-text components. Wang et al. [9] propose a method to design a confidence map by combining seed appearance and its relationship with adjoining candidates to extract text; missing texts are recaptured by utilizing context information. Fabrizio et al. [10] present a method depending on texture and connected component methods to detect letters after segmentation using a wavelet descriptor, and form text areas by applying graph modeling. Gomez and Karatzas [11] detect text with discriminative and probabilistic stopping rules by applying an agglomerative clustering process over individual regions.

Every method has some disadvantages associated with it. The methods based on MSER [12], [13], [7] are sensitive to blur and low-resolution images. The problem of strong reflection is not dealt with properly in [13]. Text with small, distorted, artistic and unconventional fonts cannot be detected properly by the methods of [8], [9], [11]. Texts cannot be properly segmented as they stick together in [10]. The method in [8] is slow, whereas the problem of low contrast between text and its background cannot be handled by [9]. These disadvantages inspire us to propose a new hybrid approach for text detection and localization in natural scene images to increase performance in terms of accuracy.

III. PROPOSED METHOD

In this work, we propose a hybrid method based on edge smoothing MSER using the fast guided filter to detect and localize text in natural scene images. The training is accomplished on the dataset for the text segmentation task (challenge 2, task 2.2) from the ICDAR 2013 robust reading competition [14] to generate the distributions of three attributes on text and non-text constituents, which are needed for the calculation of the TCS using the Bayesian framework. The proposed method is applied on the ICDAR 2013 test dataset. The flowchart in Fig. 1 depicts the working of the proposed method, and Fig. 2 exhibits its stages on a sample image.

A. Edge Smoothing MSER and Constituent Filtering

1) ESMSER: MSER [5], with a time complexity of O(n log log n), where n is the number of pixels in the image, was originally used to determine corresponding points between images. It is accepted in numerous disciplines like object tracking, image matching, object recognition, etc. The MSER algorithm generates stable regions across a range of thresholds which are either brighter or darker than their adjoining areas. Invariance to affine transformation, stability across the range of thresholds, and suitability for multiscale detection [5] are a few advantages of MSER that make it suitable for the scene text detection and localization process [12].

Although the original version of the MSER algorithm detects regions with consistent intensity enclosed by a strikingly different background, its efficiency decreases in the case of diversified contrast and blurriness in images, as a result of which certain text constituents remain undiscovered. To resolve this issue, Chen et al. [12] locate and eliminate MSER pixels outside the edge boundary using the Canny edge detector. Li and Lu [6] have extracted text components using contrast-enhanced MSER (CEMSER), whereas Li et al. [8] extracted text components by applying eMSER (edge-preserving MSER) using the guided filter [15], but [8] is slow and takes more time.

In this paper, we propose Edge Smoothing MSER (ESMSER) for detecting the possible text constituents using the fast guided filter [16] (see Algorithm 1). The eMSER of [8] (using the guided filter [15]) takes more time for smoothing of edges compared to the proposed ESMSER (using the fast guided filter [16]). The sensitivity of MSER to image blurriness, due to the diverse pixels discussed above, creates the need to get rid of these pixels so as to decrease the effect of the blurriness and improve the detection of text in low-resolution and blurred images. To perform this, firstly an edge smoothing process is carried out on the sample image in HSI color space using the fast guided filter, and then MSER detection is applied to the edge-smoothened image to extract possible text constituents. The miscellaneous pixels around the boundary of the characters are removed by this edge smoothing process, which thus separates the characters. Figure 3(a) shows a sample image; the result of the original MSER is shown in Fig. 3(b) (characters are connected), while Fig. 3(c) shows the effect of the proposed ESMSER using the fast guided filter (characters are detached properly). The fast guided filter [16], having time complexity O(n/s²) (s is the subsampling ratio), decreases the execution time for smoothing of edges, as discussed in Section IV-2. The time complexity of Algorithm 1 is O(n log log n) + O(n/s²). The space complexity is proportional to n (the number of pixels in the image).

2) Constituent Filtering: The text-like constituents such as bricks, windows, boundaries of sign boards, doors, etc. [13],

Fig. 1: Flowchart of the proposed method: testing images (ICDAR 2013 dataset) → ESMSER and constituent filtering → extract three attributes on constituents → get TCS using the attributes by a Bayesian classifier → text labelling process using MRF → text line integration by mean-shift clustering.


[6] may contribute to false positives in the detection process; therefore, it is required to remove them. The basic structural properties of text and non-text are different from each other, so certain heuristics and basic rules are implemented to eliminate these non-text elements. These rules are as follows: (a) The aspect ratio of constituents is kept in the range 0.3 to 3. (b) The occupation ratio of constituents is kept between 0 and 1. (c) The skeleton of constituents is kept less than 18, as texts are smaller compared to non-texts.

Fig. 2: Proposed method: a) Sample image. b) ESMSER. c) Constituent filtering. d) Labelling. e) Grouping (detected text).

Algorithm 1 ESMSER: Prospective Text Constituents
Input: Sample image Is and corresponding parameters.
Output: Prospective text constituents.
1: Transform image Is into an intensity image Ic using HSI.
2: Smoothen the edges of image Ic using the fast guided filter [16].
3: Calculate the gradient map ∇Ic.
4: Normalize ∇Ic to [0, 255] and get the edge-smoothened images Ies = Ic + λ∇Ic and Ies = Ic − λ∇Ic.
5: Extract text constituents by applying MSER on Ies.

Fig. 3: a) Sample image. b) Original MSER. c) Proposed ESMSER.

B. Attributes for text regions

We present three different attributes to discern between text and non-text constituents.

1) Stroke Width Dissimilarity (SWD): Text constituents have an almost unvarying stroke width, and this denotes the text region in the image. Stroke width is extensively used in the field of text detection as an initial step [4], [6], [8], [13]. The stroke width transform (SWT) is specified as the length of the straight line in the perpendicular direction between two edge pixels [4]. A stroke [7] can be interpreted as a connected region with consistent color, with a part of the text having approximately constant width. In [13], for a region r the width variation of a component c is defined as wv(r) = σ(c)/μ(c). Inspired by [13], we propose the attribute Stroke Width Dissimilarity (equation (1) and Algorithm 2) as follows:

SWD(r) = σ(l_sw) / μ(l_sw)    (1)

In Fig. 4, color is used to show the stroke width (l_sw) similarity in the text regions of 4(a), whereas in the non-text constituents of 4(b) there is a large dissimilarity of stroke width (l_sw).

Fig. 4: (a) Sample. (b) ESMSER. (c) SWD. (Best viewed in color.)

Algorithm 2 Stroke Width Dissimilarity (SWD)
Input: Prospective text region Ct.
Output: Stroke Width Dissimilarity.
1: Obtain the outline So(r) of the candidate region r of Ct.
2: For every pixel p ∈ So(r), use the distance transform to calculate its shortest path to the region boundary. This calculated shortest path l_sw is called the stroke width.
3: Calculate SWD(r) = σ(l_sw)/μ(l_sw).

2) Color Dissimilarity (CD): The color of text regions is disparate from their background in images, so text can be effortlessly pointed out by humans. We propose the attribute color dissimilarity (CD) as the color distinction between text and its adjacent area. The color dissimilarity of two regions can be estimated with the assistance of the Jensen-Shannon Divergence (JSD) [17]. The JSD is well defined in information theory, symmetric, and a measure of discernibility; its square root is a true metric on the space of probability distributions. The JSD between the probability distributions M and N is calculated as:

D_JSD(M ‖ N) = (1/2) D_KL(M ‖ A) + (1/2) D_KL(N ‖ A)    (2)

where D_KL(M ‖ A) is the KL divergence [18] between M and A, defined as

D_KL(M ‖ A) = Σ_{i=1..b} M(i) log( M(i) / A(i) )    (3)

where b denotes the number of bins and A = (1/2)(M + N). The color dissimilarity of a region L against its surrounding L* is calculated as:

CD(L) = √( Σ_{R,G,B} Σ_{i=1..b} D_JSD( C_i(L) ‖ C_i(L*) ) )    (4)

where C(L) and C(L*) are the color histograms of the two regions L and L* in a channel (R, G, B). Here L* denotes the region outside L but within its bounding box, and i is the index of the histogram bins. The decisive color dissimilarity attribute for the region L is acquired by summing the discrete color dissimilarities of each (R, G, B) channel (equation (4)).

3) Occupy Rate Convex Area attribute (OrCa): Inspired by [19], we propose OrCa to discern between text and non-text constituents. It is calculated as the ratio of the convex area Ca of a region r to the area of the bounding box enclosing the region r. The convex area is determined as the area of the smallest convex hull that comprises the region. The OrCa attribute [19] for region r has range [0, 1] and is given as:

OrCa = C_A(r) / A_BB    (5)

where C_A and A_BB are the convex area and the area of the bounding box of the region r, respectively. This feature is used in [12] to discern between text and non-text constituents. Figure 5 displays both the convex area 5(a) and the area of the bounding box 5(b).

Fig. 5: (a) Convex area. (b) Bounding box.

C. Text Confidence Score (TCS)

1) TCS: Among the three attributes, SWD discerns text and non-text constituents by structural dissimilarity, CD explores the color dissimilarity of a region r with its surrounding, and OrCa captures the dissimilarity of occupied area. As these three attributes examine different ingredients of text and non-text constituents, which are complementary and independent properties for a region r, this encourages us to blend them in a naive Bayesian framework to learn the feasibility TCS of a region r being text (t) as follows:

TCS(t|Ψ) = p(Ψ|t) p(t) / p(Ψ)    (6)
         = p(t) Π_{atr∈Ψ} p(atr|t) / Σ_{j∈{t,nt}} p(j) Π_{atr∈Ψ} p(atr|j)    (7)

where Ψ = (SWD, CD, OrCa), atr stands for attribute, and p(t), p(nt) denote the prior probabilities of text and non-text, respectively, calculated on the basis of relative frequency.

2) Training on ICDAR 2013 (Distribution of the Attributes): The dataset of the ICDAR 2013 text segmentation task (challenge 2) of the robust reading competition [14] is selected for training to compute the observation feasibilities p(atr|t) and p(atr|nt) via the distributions of the attributes on text and non-text constituents. This dataset is available with pixel-level ground-truth information. The distributions of text and non-text constituents are computed as follows:
1) The distribution of attributes on text constituents is computed directly by applying the ground truth information.
2) The distribution of attributes on non-text constituents is computed by first applying ESMSER to obtain the possible candidate regions. The text constituents are masked from them by using the ground truth information, and the three attributes are calculated on the non-text constituents.

In Figure 6, normalized histograms (50 bins) are used to show the distributions of the three attributes. Text constituents have smaller values (nearly within 10-15 bins) for the distribution of SWD as compared with non-text constituents. Text elements have higher values of CD as compared to non-text, and for OrCa, text constituents almost follow a half-Gaussian distribution, but such an estimation does not exist for non-text constituents [19].

D. Labeling and Grouping

1) Overview of Labeling Model: We propose to use binary constituent associations and unary constituent characteristics to classify constituents properly (into text and non-text categories). A standard image segmentation problem can be formulated using the graph cut model [2] for giving binary labels to text and

Fig. 6: Observation feasibility of text (green) in (a) and non-text (magenta) in (b) for the three attributes, i.e., SWD (first row), CD (second row) and OrCa (third row); the distributions differ from each other.


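With histogram distributions like those in Fig. 6 in hand, the TCS of equations (6)-(7) reduces to per-attribute table lookups multiplied together with the class priors. A minimal numpy sketch follows; the 5-bin histograms and the priors below are made-up illustrative numbers, not the distributions trained on ICDAR 2013:

```python
import numpy as np

def tcs(attr_values, hists_text, hists_nontext, bin_edges, p_text=0.4):
    """Naive-Bayes blend of equations (6)-(7): look up p(atr|t) and
    p(atr|nt) for each attribute value in its normalized histogram,
    multiply with the priors, and normalize over {text, non-text}."""
    lik_t, lik_nt = p_text, 1.0 - p_text
    for v, ht, hnt, edges in zip(attr_values, hists_text, hists_nontext, bin_edges):
        i = min(np.searchsorted(edges, v, side="right") - 1, len(ht) - 1)
        lik_t *= ht[i]                    # observation feasibility p(atr|t)
        lik_nt *= hnt[i]                  # observation feasibility p(atr|nt)
    return lik_t / (lik_t + lik_nt)       # posterior TCS(t | SWD, CD, OrCa)

edges = np.linspace(0.0, 1.0, 6)                        # 5 bins per attribute
h_text = [np.array([0.5, 0.3, 0.1, 0.07, 0.03])] * 3    # text mass in low bins
h_non = [np.array([0.05, 0.1, 0.15, 0.3, 0.4])] * 3     # non-text in high bins
score = tcs([0.10, 0.15, 0.12], h_text, h_non, [edges] * 3)
```

A region scoring close to 1 is feasible text; the graph-cut labeling of Section III-D then consumes this score as the unary potential (equation (10)).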
non-text constituents. A standard graph model G_I = (V_I, E_I) is constructed for every input image I, where the vertex set associated with the possible text regions is defined as V_I = {v_i} and the edge set associated with the interactions between vertices is defined as E_I = {e_i}. Giving each v_i a label of either text (k_i = 1) or non-text (k_i = 0), i.e. k_i ∈ {0, 1}, is known as the binary labeling problem. Text and non-text can be isolated by means of the text labeling set K = {k_i}. In this paper, inspired by [2], the energy function (equations (8)-(9)) is minimized to obtain the optimal labeling K*:

K* = arg min_K E(K)    (8)

E(K) = Σ_i u_i(k_i) + Σ_{i,j∈E} v_ij(k_i, k_j)    (9)

where u_i(k_i) is the unary potential, which determines the cost of giving label k_i to v_i, and v_ij(k_i, k_j) is the pairwise potential, which determines the cost of assigning different labels to v_i and v_j. The optimal labeling K* can be calculated efficiently using graph cut [2], as labeling is an energy minimization problem.

2) Estimation of unary potential: The Text Confidence Score of equation (7) can be used for the estimation of the unary potential for a region as:

u_i(k_i) = { TCS(k|Ψ) if k_i = 1;  1 − TCS(k|Ψ) if k_i = 0 }    (10)

3) Estimation of pairwise potential: Due to features like color, spatial distance, texture, geometry, etc., neighboring text constituents appear similar to each other. Two features are used to quantify the correspondence between regions.
Distance Feature (DF): The distance feature between two adjacent constituents of the extracted possible text regions is calculated as the Euclidean distance DF(tr_i, tr_j) between the (m, n) centroid coordinates of the constituents of the possible text regions tr_i and tr_j.
Color Distance Feature (CDF): The CDF [8] is defined as the average color distance between two regions tr_i and tr_j in the LAB color space using the L2 norm. The joint difference (JD) using DF and CDF can be estimated as follows [8]:

JD(tr_i, tr_j) = γ DF(tr_i, tr_j) + (1 − γ) CDF(tr_i, tr_j)    (11)

where γ specifies the relative weight of the two differences; its value is set to 0.5 to give equal weightage to DF and CDF. The joint difference is used to estimate the pairwise potential as follows:

v_ij(k_i, k_j) = { 1 − tanh(JD(k_i, k_j)) if k_i ≠ k_j;  0 otherwise }    (12)

4) Text Line Integration: The labeled text components can be integrated into text lines on the basis of homogeneous features such as average color, height, width, stroke width [4], etc. Text line integration in this paper is therefore achieved by using mean-shift clustering (bandwidth = 2.2) with the help of two normalized features for a given constituent, Eccentricity and Orientation [13]. The Eccentricity is the ratio of the distance between the foci of the ellipse to its major axis length. The Orientation is defined as the angle between the x-axis and the major axis of the ellipse that has the same second moments as the region. Integration of a text line is performed by taking at least two constituents on the basis of the spatial distance (calculated by the Euclidean norm) of the labeled constituents.

IV. EXPERIMENTAL RESULTS AND DISCUSSION

1) Performance Evaluation Measure and Dataset: To measure the usefulness of our proposed approach, we evaluate it on the ICDAR 2013 [14] dataset of the text localization task (challenge 2, task 2.1), which contains 233 and 229 images in the test and training sets respectively. The detected bounding boxes around the texts are used to assess the performance in terms of three parameters, namely precision (p), recall (r) and f-measure [20]. The deteval software [20] is used to calculate p and r by using many-to-one, one-to-many and one-to-one matches between the ground truth and the detected bounding boxes. The f-measure is calculated as the harmonic mean [20] of the recall and precision.

2) Comparison of Execution Time on Smoothing of Edges: The original MSER suffers due to the presence of varied pixels in the vicinity of edges, so it is imperative to smoothen the edges to extract the text properly. Edge smoothing can be achieved by the guided filter [15] due to its perceivable quality. In this paper, the fast guided filter [16] is used for smoothing of edges; it accelerates the filtering from O(n) time to O(n/s²) time (n is the number of pixels) for a subsampling ratio s. Table I shows that the fast guided filter (s = 2) reduces the average execution time for smoothing of edges by 67%. In both experiments, the value of the delta parameter of MSER (which controls how stability is calculated) is kept at 10.

3) Effect of Proposed ESMSER: As mentioned in Section III-A1, the original MSER algorithm is unable to deal with the blurriness present in images, which creates a hurdle in detecting text properly in natural scene images. Therefore, in this paper we prefer to use Edge Smoothing MSER (ESMSER) to reduce the effect of blurriness in such images for efficient scene text detection. In Figure 7, the first and second rows display the prospective candidate regions detected by the original MSER and the proposed ESMSER (Algorithm 1), respectively. It is evident from the results shown in Fig. 7 that characters are properly separated by the proposed ESMSER (second row) as compared to MSER (first row), in which characters are interconnected. Thus, the proposed ESMSER helps in detecting the text in images with blurriness.

4) Text Detection and Localization Results: The proposed method has been compared with methods like [21] and [9], and with some methods from the ICDAR 2013 [14] competition for scene text detection and localization, on the ICDAR 2013 dataset. It is evident from Table II that the proposed method attains a precision of 82%, a recall of 64% and an f-measure of 72%. Figure 8 displays a few outputs of our method as applied on the ICDAR 2013 test dataset in the form

TABLE I: Execution time for smoothing of edges.
Filter             | Avg. Execution Time (in seconds)
Guided filter      | 0.56
Fast Guided filter | 0.182


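The subsampling idea behind the speedup in Table I can be sketched as follows: compute the guided-filter statistics and linear coefficients a, b on an image subsampled by s (so the box-filter work drops from O(n) to O(n/s²)), then upsample a and b and apply q = a·I + b at full resolution. The code below is an illustrative grayscale sketch loosely following that scheme, not the implementation timed above, and the parameter values are arbitrary:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def box_mean(a, r):
    """Mean filter over a (2r+1)x(2r+1) window with edge padding."""
    k = 2 * r + 1
    ap = np.pad(a, r, mode="edge")
    return sliding_window_view(ap, (k, k)).mean(axis=(2, 3))

def fast_guided_filter(guide, src, r=2, eps=1e-3, s=2):
    """Sketch of the subsampled guided filter: estimate the linear model
    q = a*I + b on a grid subsampled by s, then upsample a and b
    (nearest neighbor here) and apply the model at full resolution."""
    gl, sl = guide[::s, ::s], src[::s, ::s]      # subsample by ratio s
    rl = max(r // s, 1)
    mg, ms = box_mean(gl, rl), box_mean(sl, rl)
    var_g = box_mean(gl * gl, rl) - mg * mg
    cov_gs = box_mean(gl * sl, rl) - mg * ms
    a = cov_gs / (var_g + eps)                   # guided-filter coefficients
    b = ms - a * mg
    ma, mb = box_mean(a, rl), box_mean(b, rl)    # smooth the coefficients
    h, w = guide.shape
    au = np.repeat(np.repeat(ma, s, 0), s, 1)[:h, :w]
    bu = np.repeat(np.repeat(mb, s, 0), s, 1)[:h, :w]
    return au * guide + bu

flat = np.full((8, 8), 5.0)
out = fast_guided_filter(flat, flat)             # a flat image is unchanged
```

With s = 2 the statistics run on a quarter of the pixels, which is consistent in spirit with the 67% time reduction reported in Table I.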
of detected text bounded by blue rectangles. The proposed method is able to detect text of varying font size, distinct fonts, color and orientation. It works effectively, as compared with other state-of-the-art work, under occlusion, cluttered scenes and divergent lighting conditions, but it needs improvement in a few cases (see Fig. 9), such as text color mixing with the background and text in uncommon fonts.

Fig. 7: Detection by the original MSER (first row) and by ESMSER (second row).

TABLE II: Outcome on the ICDAR 2013 dataset.
Method              | Year | P    | R    | F
TCS (Proposed)      | 2017 | 0.82 | 0.64 | 0.72
Wang et al. [21]    | 2015 | 0.80 | 0.73 | 0.76
Wang et al. [9]     | 2015 | 0.77 | 0.60 | 0.68
Text Detection [14] | 2013 | 0.74 | 0.53 | 0.62
TH-TextLoc [14]     | 2013 | 0.69 | 0.65 | 0.67
I2R-NUS-FAR [14]    | 2013 | 0.75 | 0.69 | 0.71
CASIA-NLPR [14]     | 2013 | 0.78 | 0.68 | 0.73

Fig. 8: Sample images from the ICDAR 2013 dataset.

Fig. 9: Failure cases.

V. CONCLUSION

The paper presents a hybrid method for text detection and localization based on the text confidence score, calculated by a combination of three complementary attributes blended in the Bayesian framework to discern between text and non-text areas. On the ICDAR 2013 dataset, our method achieves good precision and recall for scene text detection and localization. The fast guided filter reduces the processing time for edge smoothing. In future work, we intend to increase the recall rate by improving MSER.

REFERENCES

[1] H. Zhang, K. Zhao, Y.-Z. Song, and J. Guo, "Text extraction from natural scene image: A survey," Neurocomputing, vol. 122, pp. 310–323, 2013.
[2] Y. Boykov, O. Veksler, and R. Zabih, "Fast approximate energy minimization via graph cuts," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 11, pp. 1222–1239, 2001.
[3] S. Lee, M. S. Cho, K. Jung, and J. H. Kim, "Scene text extraction with edge constraint and text collinearity," in Pattern Recognition (ICPR), 2010 20th International Conference on. IEEE, 2010, pp. 3983–3986.
[4] B. Epshtein, E. Ofek, and Y. Wexler, "Detecting text in natural scenes with stroke width transform," in Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on. IEEE, 2010, pp. 2963–2970.
[5] J. Matas, O. Chum, M. Urban, and T. Pajdla, "Robust wide baseline stereo from maximally stable extremal regions," in British Machine Vision Conference, 2002.
[6] Y. Li and H. Lu, "Scene text detection via stroke width," in Pattern Recognition (ICPR), 2012 21st International Conference on. IEEE, 2012, pp. 681–684.
[7] C. Yi and Y. Tian, "Localizing text in scene images by boundary clustering, stroke segmentation, and string fragment classification," IEEE Transactions on Image Processing, vol. 21, no. 9, pp. 4256–4268, 2012.
[8] Y. Li, W. Jia, C. Shen, and A. van den Hengel, "Characterness: An indicator of text in the wild," IEEE Transactions on Image Processing, vol. 23, no. 4, pp. 1666–1677, 2014.
[9] R. Wang, N. Sang, and C. Gao, "Text detection approach based on confidence map and context information," Neurocomputing, vol. 157, pp. 153–165, 2015.
[10] J. Fabrizio, M. Robert-Seidowsky, S. Dubuisson, S. Calarasanu, and R. Boissel, "TextCatcher: a method to detect curved and challenging text in natural scenes," International Journal on Document Analysis and Recognition (IJDAR), vol. 19, no. 2, pp. 99–117, 2016.
[11] L. Gomez and D. Karatzas, "A fast hierarchical method for multi-script and arbitrary oriented scene text extraction," International Journal on Document Analysis and Recognition (IJDAR), vol. 19, no. 4, pp. 335–349, 2016.
[12] H. Chen, S. S. Tsai, G. Schroth, D. M. Chen, R. Grzeszczuk, and B. Girod, "Robust text detection in natural images with edge-enhanced maximally stable extremal regions," in Image Processing (ICIP), 2011 18th IEEE International Conference on. IEEE, 2011, pp. 2609–2612.
[13] C. Yao, X. Bai, W. Liu, Y. Ma, and Z. Tu, "Detecting texts of arbitrary orientations in natural images," in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012, pp. 1083–1090.
[14] D. Karatzas, F. Shafait, S. Uchida, M. Iwamura, L. G. i Bigorda, S. R. Mestre, J. Mas, D. F. Mota, J. A. Almazan, and L. P. de las Heras, "ICDAR 2013 robust reading competition," in Document Analysis and Recognition (ICDAR), 2013 12th International Conference on. IEEE, 2013, pp. 1484–1493.
[15] K. He, J. Sun, and X. Tang, "Guided image filtering," in European Conference on Computer Vision. Springer, 2010, pp. 1–14.
[16] K. He and J. Sun, "Fast guided filter," arXiv preprint arXiv:1505.00996, 2015.
[17] A. Majtey, P. Lamberti, and D. Prato, "Jensen-Shannon divergence as a measure of distinguishability between mixed quantum states," Physical Review A, vol. 72, no. 5, p. 052310, 2005.
[18] D. A. Klein and S. Frintrop, "Center-surround divergence of feature statistics for salient object detection," in Computer Vision (ICCV), 2011 IEEE International Conference on. IEEE, 2011, pp. 2214–2219.
[19] A. Gonzalez, L. M. Bergasa, J. J. Yebes, and S. Bronte, "Text location in complex images," in Pattern Recognition (ICPR), 2012 21st International Conference on. IEEE, 2012, pp. 617–620.
[20] C. Wolf and J.-M. Jolion, "Object count/area graphs for the evaluation of object detection and segmentation algorithms," International Journal on Document Analysis and Recognition (IJDAR), vol. 8, no. 4, pp. 280–296, 2006.
[21] Q. Wang, Y. Lu, and S. Sun, "Text detection in nature scene images using two-stage nontext filtering," in Document Analysis and Recognition (ICDAR), 2015 13th International Conference on. IEEE, 2015, pp. 106–110.

