Text Detection and Localization in Natural Scene Images Using MSER and Fast Guided Filter
Abstract—Textual matter present in a natural scene image provides indispensable information about it. The semantics and information present in natural scene images can be perceived by extracting the text regions in them. Detection and localization of text in natural scene images is a challenging image analysis task because of variations in font size, font type, and illumination. In this paper, we propose a hybrid approach for text detection and localization based on a text confidence score that uses three attributes, namely stroke width dissimilarity, color dissimilarity and occupy rate convex area, to discern text and non-text constituents. The aim of this paper is to achieve fast detection and localization of text regions in low-resolution and blurred images. To accomplish this, the possible candidate regions are extracted using edge smoothing by the fast guided filter followed by MSER. The text confidence score on these constituents is calculated using the Bayesian framework with the help of the three attributes mentioned above. Experimental results on the benchmark ICDAR 2013 testing dataset show the efficacy of our method in terms of precision, recall, and f-measure.

Index Terms—Text Detection and Localization, Text Confidence Score, Edge Smoothing MSER, Fast Guided Filter.

I. INTRODUCTION

Extraction of text regions from natural scene images is one of the crucial tasks in computer vision. The information about scene images such as notice boards, advertisement boards, road signs, etc. is embedded in the form of text. These texts provide a rich amount of information about such images and can be used in heterogeneous applications such as license plate localization, robot navigation, content-based image retrieval, guidance for visually impaired people [1], etc. Our proposed work in this paper revolves around the text detection and localization process, which is focused on estimating text locations in the image and creating bounding boxes around them. Researchers over the years have made significant progress in this field; however, this domain is still open due to various challenges such as variable font size, alignment of the text, color, complex natural scenes, occlusion, noise, blur [1], illumination variations, viewpoint, distortion of the image, etc. Further, false positives detected due to the presence of background objects such as bricks, windows, leaves, etc. may decrease precision.

In this paper, we present a hybrid approach for scene text detection and localization which consists of four steps. a) First, an edge smoothing process using the fast guided filter is followed by extraction of prospective text areas using MSER. b) Second, three attributes, namely Stroke Width Dissimilarity, Color Dissimilarity and Occupy Rate Convex Area, are calculated on these areas. c) Third, we blend these attributes using a Bayesian classifier to estimate the TCS (Text Confidence Score), which determines the feasibility of a region as text. d) Last, the labeling of constituents as text and non-text is carried out using graph cut and Markov Random Fields (MRFs) [2], followed by text line integration with the help of the mean-shift clustering approach.

The arrangement of the paper is as follows: the introduction is given in Section I, and related work is reviewed in Section II. Section III defines the working of the proposed method. Experiments and results are discussed in Section IV, with concluding remarks in Section V.

II. RELATED WORK

Numerous methods have been developed and proposed in the past for scene text detection and localization. Based on an extensive survey [1], these methods can be divided into edge based, stroke based, texture based, connected component (CC) based and hybrid methods.

In the edge based method [3], edges are detected by an edge detector and text components are extracted by morphological operations. It is suitable for images with uniform gradient and gives poor results for images having a complicated background.

The texture based method uses the Wavelet Transform, Discrete Cosine Transform (DCT) [1], Histogram of Gradients (HOG) and Local Binary Pattern (LBP) to acquire texture features, as text regions have different texture properties compared to non-text regions. These methods are sensitive to text arrangement, but efficient for distinguishing crowded characters.

The stroke based method [4] uses stroke width as the text-determining feature to extract text regions from an image, but it is unsuccessful on images with varying background.

The connected component (CC) based method [4] uses color clustering or edge detection for separating text components from the image. These methods have lower computation cost because fewer candidate components are segmented, but they require advance information about the scale and position of the text. Maximally Stable Extremal Regions (MSER) based methods [5], which are sensitive to image blurriness [6], can be included in CC based methods.

The disadvantages present in each of these methods prompt
researchers to choose hybrid methods [1] for text detection in order to achieve higher precision and recall. Yi and Tian [7] present a hybrid method to locate horizontal text in the steady colored image. The clustering of text at the pixel level is achieved by applying a bigram color uniformity based method, and extraction of text is done by stroke segmentation. Li et al. [8] discuss a method that integrates three cues, namely stroke width variation, perceptual divergence and histogram of gradients, to classify text and non-text components. Wang et al. [9] propose a method that designs a confidence map by combining seed appearance and its relationship with adjoining candidates to extract text. Missing texts are recaptured by utilizing the context information. Fabrizio et al. [10] present a method based on texture and connected component methods to detect letters after segmentation using a wavelet descriptor, and form text areas by applying graph modeling. Gomez and Karatzas [11] detect text with discriminatory and probabilistic stopping rules by applying an agglomerative clustering process over individual regions.

Every method has some disadvantages associated with it. The methods based on MSER [12], [13], [7] are sensitive to blur and low-resolution images. The problem of strong reflection is not dealt with properly in [13]. Text with small contrast and distorted, artistic and unconventional fonts in images cannot be detected properly by the methods in [8], [9], [11]. Texts cannot be properly segmented in [10] as they are stuck together. The method in [8] is slow, whereas the problem of low contrast between text and its background cannot be handled by [9]. These disadvantages inspire us to propose a new hybrid approach for text detection and localization in natural scene images to increase performance in terms of accuracy.

III. PROPOSED METHOD

In this work, we propose a hybrid method based on edge smoothing MSER using the fast guided filter to detect and localize text in natural scene images. The training is accomplished on the dataset for the text segmentation task (challenge 2, task 2.2) from the ICDAR 2013 robust reading competition [14] to generate the distribution of the three attributes on text and non-text constituents, which is needed for the calculation of TCS using the Bayesian framework. The proposed method is applied on the ICDAR 2013 test dataset. The flowchart in Fig. 1 depicts the working of the proposed method. Figure 2 exhibits the working of the proposed method.

[Flowchart boxes, in order: Testing Images 2013 dataset; ESMSER and Constituent Filtering; Extract three attributes on constituents; Get TCS using attributes by Bayesian Classifier; Text Labelling Process using MRF; Text Line Integration by Mean Shift Clustering]
Fig. 1: Flowchart of the proposed method

A. Edge Smoothing MSER and Constituent Filtering

1) ESMSER: MSER [5], with a time complexity of $O(n \log \log n)$, where $n$ is the number of pixels in the image, was originally used to determine correspondence points between images. It is accepted in numerous disciplines such as object tracking, image matching, object recognition, etc. The MSER algorithm generates regions that are stable across a range of thresholds and are either brighter or darker than their adjoining areas. Invariance to affine transformation, stability over the range of thresholds and support for multiscale detection [5] are a few advantages of MSER that make it suitable for the scene text detection and localization process [12].

Although the original version of the MSER algorithm detects regions with consistent intensity enclosed by a strikingly different background, its efficiency decreases in the case of diversified contrast and blurriness in images. As a result, certain text constituents remain undiscovered. To resolve this issue, Chen et al. [12] locate and eliminate MSER pixels outside the edge boundary using the Canny edge detector. Li and Lu [6] extracted text components using contrast-enhanced MSER (CEMSER), whereas Li et al. [8] extracted text components by applying eMSER (edge preserving MSER) using the guided filter [15], but the method in [8] is slow.

In this paper, we propose Edge Smoothing MSER (ESMSER) for detecting the possible text constituents using the fast guided filter [16] (see Algorithm 1). The eMSER [8] (using the guided filter [15]) takes more time for smoothing of edges compared to the proposed ESMSER (using the fast guided filter [16]). The sensitivity of MSER to image blurriness due to diverse pixels, as discussed above, creates the need to get rid of these pixels so as to decrease the effect of the blurriness and improve detection of text in low-resolution and blurred images. To perform this, an edge smoothing process is first carried out on the sample image in HSI color space using the fast guided filter, and then MSER detection is applied to the edge-smoothed image to extract possible text constituents. The miscellaneous pixels around the boundary of the characters are removed by this edge smoothing process, which separates the characters. Figure 3(a) shows the sample image, Fig. 3(b) shows the result of the original MSER (characters are connected), and Fig. 3(c) shows the effect of the proposed ESMSER using the fast guided filter (characters are detached properly). The fast guided filter [16], which has time complexity $O(n/s^2)$ ($s$ is the sub-sampling ratio), decreases the execution time for smoothing of edges as discussed in Section IV-2. The time complexity of Algorithm 1 is $O(n \log \log n) + O(n/s^2)$. The space complexity is proportional to $n$ (pixels in the image).

2) Constituent Filtering: The texts like constituents such as bricks, windows, boundaries of sign boards, doors, etc [13],
2017 Fourth International Conference on Image Information Processing (ICIIP)
$L^*$, is calculated as:
$CD(L) = \sum_{RGB} \sum_{i=1}^{b} D_{SJSD}\big(C_i(L) \,\|\, C_i(L^*)\big) \quad (4)$
where, $C(L)$ and $C(L^*)$ are color histograms of two region
histogram bins. The decisive color dissimilarity attribute for
constituents. It is calculated as the ratio of the convex area $C_a$ of a region $r$ to the area of the bounding box enclosing the region $r$.
[Figure: histogram plots; y-axis: Feasibility, x-axis: Bins]
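The ratio described above can be illustrated with a minimal sketch. It assumes that a region is given as a list of (x, y) pixel coordinates, that the convex area is approximated by the polygon area of the convex hull of those coordinates, and that the bounding box is axis-aligned. The function names `convex_hull`, `polygon_area` and `occupy_rate_convex_area` are hypothetical and are not taken from the paper, which may compute the convex area differently (for example by counting pixels).

```python
def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        # z-component of the cross product of vectors OA and OB
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other chain.
    return lower[:-1] + upper[:-1]


def polygon_area(vertices):
    """Shoelace formula for the area of a simple polygon."""
    total = 0.0
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def occupy_rate_convex_area(points):
    """Convex hull area divided by axis-aligned bounding box area (assumed definition)."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    bbox_area = (max(xs) - min(xs)) * (max(ys) - min(ys))
    if bbox_area == 0:
        return 0.0  # degenerate region (a single point or a straight line)
    return polygon_area(convex_hull(points)) / bbox_area


triangle = [(0, 0), (4, 0), (0, 4)]
square = [(0, 0), (2, 0), (2, 2), (0, 2), (1, 1)]
print(occupy_rate_convex_area(triangle))  # 0.5
print(occupy_rate_convex_area(square))    # 1.0
```

Under these assumptions, a region that fills its bounding box gives a ratio of 1, while a triangular region gives a lower value.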
non-text constituents. A standard graph model $G_I = (V_I, E_I)$ is constructed for every input image $I$, where the vertex set associated with possible text regions is defined as $V_I = \{v_i\}$ and the edge set associated with the interaction between vertices is defined as $E_I = \{e_i\}$. Giving a label to each $v_i$ as either text, $k_i = 1$, or non-text, $k_i = 0$, i.e. $k_i \in \{0, 1\}$, is known as a binary labeling problem. Text and non-text can be isolated by means of the text labeling set $K = \{k_i\}$. In this paper, inspired by [2], the energy function (see equation (8)) is minimized to obtain the optimal labeling $K^*$.

$K^* = \arg\min_{K} E(K) \quad (8)$

$E(K) = \sum_{i} u_i(k_i) + \sum_{(i,j) \in E_I} v_{ij}(k_i, k_j) \quad (9)$

where $u_i(k_i)$ is the unary potential, which determines the cost of giving label $k_i$ to $v_i$, and $v_{ij}(k_i, k_j)$ is the pairwise potential, which determines the cost of assigning different labels to $v_i$ and $v_j$. The optimal labeling $K^*$ can be calculated efficiently using graph cut [2], as labeling is an energy minimization problem.

2) Estimation of unary potential: The Text Confidence Score (TCS) in equation (7) can be used for estimation of the unary potential for the region as:

$u_i(k_i) = \begin{cases} TCS(k|\Psi), & k_i = 1 \\ 1 - TCS(k|\Psi), & k_i = 0 \end{cases} \quad (10)$

3) Estimation of pairwise potential: Due to features such as color, spatial distance, texture, geometry, etc., neighboring text constituents appear to be similar to each other. Two features are used to quantify correspondence between regions.

Distance Feature (DF): The distance feature between two adjacent constituents of extracted possible text regions is calculated as the Euclidean distance $DF(tr_i, tr_j)$ between the m and n coordinates of the centroids of the constituents of possible text regions $tr_i$ and $tr_j$.

Color Distance Feature (CDF): The CDF [8] is defined as the average color distance between two regions $tr_i$, $tr_j$ in the LAB color space using the L2 norm. The joint difference (JD) using DF and CDF can be estimated as follows [8]:

$JD(tr_i, tr_j) = \gamma\, DF(tr_i, tr_j) + (1 - \gamma)\, CDF(tr_i, tr_j) \quad (11)$

where $\gamma$ specifies the relative weight of the two differences; its value is set to 0.5 to give equal weight to DF and CDF. The joint difference is used to estimate the pairwise potential as follows:

$v_{ij}(k_i, k_j) = \begin{cases} 1 - \tanh(JD(tr_i, tr_j)), & k_i \neq k_j \\ 0, & \text{otherwise} \end{cases} \quad (12)$

4) Text Line Integration: The labeled text components can be integrated into text lines on the basis of homogeneous features such as average color, height, width, stroke width [4], etc. Therefore, text line integration in this paper is achieved by using mean-shift clustering (bandwidth = 2.2) on two normalized features of a given constituent: Eccentricity and Orientation [13]. The Eccentricity is the ratio of the distance between the foci of the ellipse and its major axis length. The Orientation is defined as the angle between the x-axis and the major axis of the ellipse that has the same second moments as the region. Integration of a text line is performed by taking at least two constituents on the basis of the spatial distance (calculated by the Euclidean norm) of labeled constituents.

IV. EXPERIMENTAL RESULTS AND DISCUSSION

1) Performance Evaluation Measure and Dataset: To measure the usefulness of our proposed approach, we evaluate it on the ICDAR 2013 [14] dataset of the text localization task (challenge 2, task 2.1), which contains 233 and 229 images for the test and training sets respectively. The detected bounding boxes around the texts are used to assess the performance in terms of three parameters, namely precision (p), recall (r) and f-measure [20]. The DetEval software [20] is used to calculate p and r by using many-to-one, one-to-many and one-to-one matches between ground truth and detected bounding boxes. The f-measure is calculated as the harmonic mean [20] of the recall and precision.

2) Comparison of Execution Time on Smoothing of Edges: The original MSER suffers due to the presence of varied pixels in the vicinity of edges, so it is imperative to smooth edges to extract the text properly. The edge smoothing can be achieved by the guided filter [15] due to its perceivable quality. In this paper, the fast guided filter [16] is used for smoothing of edges; it accelerates the filtering from $O(n)$ time to $O(n/s^2)$ time ($n$ is the number of pixels) for a sub-sampling ratio $s$. Table I shows that the fast guided filter (s = 2) reduces the average execution time for smoothing of edges by 67%. In both experiments the value of the delta parameter of MSER (which controls how stability is calculated) is kept at 10.

TABLE I: Execution time for smoothing of edges.
Filter               Avg. Execution Time (in seconds)
Guided filter        0.56
Fast guided filter   0.182

3) Effect of Proposed ESMSER: As mentioned in Section III-A1, the original MSER algorithm is unable to deal with blurriness present in the images, which creates a hurdle in detecting text properly in natural scene images. Therefore, in this paper we prefer to use Edge Smoothing MSER (ESMSER) to reduce the effect of blurriness in such images for efficient scene text detection. In Figure 7, the first and second rows display the prospective candidate regions detected by the original MSER and the proposed ESMSER (Algorithm 1) respectively. It is evident from the results shown in Fig. 7 that characters are properly separated by the proposed ESMSER (second row) as compared to MSER (first row), in which characters are interconnected. Thus, the proposed ESMSER helps in detecting text in blurred images.

4) Text Detection and Localization Results: The proposed method has been compared with a few methods such as [21], [9], and some methods from the ICDAR 2013 [14] competition for scene text detection and localization on the ICDAR 2013 dataset. It is evident from Table II that the proposed method attains a precision of 82%, a recall of 64% and an f-measure of 72%. Figure 8 displays a few outputs of our method as applied on the ICDAR 2013 test dataset in the form