
Deep Matching Prior Network: Toward Tighter Multi-oriented Text Detection

Yuliang Liu, Lianwen Jin+


College of Electronic Information Engineering
South China University of Technology
[email protected]
arXiv:1703.01425v1 [cs.CV] 4 Mar 2017

Abstract

Detecting incidental scene text is a challenging task because of multi-orientation, perspective distortion, and variation of text size, color and scale. Retrospective research has only focused on using rectangular bounding boxes or horizontal sliding windows to localize text, which may result in redundant background noise, unnecessary overlap or even information loss. To address these issues, we propose a new Convolutional Neural Networks (CNNs) based method, named Deep Matching Prior Network (DMPNet), to detect text with a tighter quadrangle. First, we use quadrilateral sliding windows in several specific intermediate convolutional layers to roughly recall the text with a higher overlapping area, and then a shared Monte-Carlo method is proposed for fast and accurate computing of the polygonal areas. After that, we design a sequential protocol for relative regression which can exactly predict text with a compact quadrangle. Moreover, an auxiliary smooth Ln loss is also proposed for further regressing the position of text, which has better overall performance than L2 loss and smooth L1 loss in terms of robustness and stability. The effectiveness of our approach is evaluated on a public word-level, multi-oriented scene text database, ICDAR 2015 Robust Reading Competition Challenge 4 "Incidental scene text localization". The performance of our method is evaluated by F-measure and found to be 70.64%, outperforming the existing state-of-the-art method with F-measure 63.76%.

(a) Rectangular bounding box causes unnecessary overlap. (c) Marginal text cannot be exactly localized with a rectangle. (d) Rectangular bounding box brings redundant noise.
Figure 1. Comparison of quadrilateral bounding box and rectangular bounding box for localizing texts.

1. Introduction

Scene text detection is an important prerequisite [32, 31, 37, 1, 34] for many content-based applications, e.g., multilingual translation, blind navigation and automotive assistance. Especially, the recognition stage always stands in need of localizing scene text in advance; thus it is a significant requirement for detection methods that can tightly and robustly localize scene text.

Camera-captured scene text is often of low quality; these texts may have multiple orientations, perspective distortions, and variation of text size, color or scale [40], which makes detection a very challenging task [39]. In the past few years, various existing methods have successfully been used for detecting horizontal or near-horizontal texts [2, 4, 23, 11, 10]. However, due to the horizontal rectangular constraints, multi-oriented text is hard to recall in practice, e.g. the low accuracies reported in ICDAR 2015 Competition Challenge 4 "Incidental scene text localization" [14].

Recently, numerous techniques [35, 36, 13, 39] have been devised for multi-oriented text detection; these methods used rotated rectangles to localize oriented text. However, Ye and Doermann [34] indicated that because of character distortion, the boundary of text may lose its rectangular shape, and the rectangular constraints may result in redundant background noise, unnecessary overlap or even information loss when detecting distorted incidental scene text, as shown in Figure 1. It can be seen from the figure that rectangle-based methods must face three kinds of circumstances: i) redundant information may reduce the reliability of the detected confidence [18] and make subsequent recognition harder [40]; ii) marginal text may not be localized completely; iii) when using non-maximum suppression [21], unnecessary overlap may eliminate true predictions.

To address these issues, in this paper we propose a new Convolutional Neural Networks (CNNs) based method, named Deep Matching Prior Network (DMPNet), toward tighter text detection. To the best of our knowledge, this is the first attempt to detect text with a quadrangle. Basically, our method consists of two steps: roughly recalling text and finely adjusting the predicted bounding box. First, based on the priori knowledge of textual intrinsic shape, we design different kinds of quadrilateral sliding windows in specific intermediate convolutional layers to roughly recall text by comparing the overlapping area with a predefined threshold. During this rough procedure, because numerous polygonal overlapping areas between the sliding window (SW) and ground truth (GT) need to be computed, we design a shared Monte-Carlo method to solve this issue, which is qualitatively proved more accurate than the previous computational method [30]. After roughly recalling text, those SWs with a higher overlapping area would be finely adjusted for better localization; different from existing methods [2, 4, 23, 11, 10, 35, 36, 39] that predict text with rectangles, our method can use quadrangles for tighter localization of scene text, owing to the sequential protocol we propose and the relative regression we use. Moreover, a new smooth Ln loss is also proposed for further regressing the position of text, which has better overall performance than L2 loss and smooth L1 loss in terms of robustness and stability. Experiments on the public word-level and multi-oriented dataset, ICDAR 2015 Robust Reading Competition Challenge 4 "Incidental scene text localization", demonstrate that our method outperforms previous state-of-the-art methods [33] in terms of F-measure.

We summarize our contributions as follows:

• We are the first to put forward prior quadrilateral sliding windows, which significantly improve the recall rate.

• We propose a sequential protocol for uniquely determining the order of the 4 points of an arbitrary plane convex quadrangle, which enables our method to use relative regression to predict quadrilateral bounding boxes.

• The proposed shared Monte-Carlo computational method can fast and accurately compute the polygonal overlapping area.

• The proposed smooth Ln loss has better overall performance than L2 loss and smooth L1 loss in terms of robustness and stability.

• Our approach shows state-of-the-art performance in detecting incidental scene text.

2. Related work

Reading text in the wild has been extensively studied in recent years because scene text conveys numerous valuable information that can be used in many intelligent applications, e.g. autonomous vehicles and blind navigation. Unlike generic objects, scene text has unconstrained lengths, shapes and especially perspective distortions, which make it hard for text detection to simply adopt techniques from other domains. Therefore, the mainstream of text detection methods has always focused on the structure of individual characters and the relationships between characters [40], e.g. connected component based methods [38, 27, 22]. These methods often use the stroke width transform (SWT) [9] or maximally stable extremal regions (MSER) [20, 24] to first extract character candidates, and then use a series of subsequent steps to eliminate non-text noise for exactly connecting the candidates. Although accurate, such methods are somewhat limited in preserving various true characters in practice [3].

Another mainstream method is based on sliding windows [2, 15, 8, 17], which shift a window over each position of an image at multiple scales to detect text. Although this method can effectively recall text, the classification of the locations can be sensitive to false positives because the sliding windows often carry various background noise.

Recently, Convolutional Neural Networks [28, 6, 26, 19, 25] have been proved powerful enough to suppress false positives, which enlightened researchers in the area of scene text detection; in [10], Huang et al. integrated MSER and CNN to significantly enhance performance over conventional methods; Zhang et al. utilized a Fully Convolutional Network [39] to efficiently generate a pixel-wise text/non-text salient map, which achieves state-of-the-art performance on public datasets. It is worth mentioning that the common ground of these successful methods is to utilize textual intrinsic information for training the CNN. Inspired by this promising idea, instead of using constrained rectangles, we design numerous quadrilateral sliding windows based on the textual intrinsic shape, which significantly improves the recall rate in practice.

3. Proposed methodology

This section presents details of the Deep Matching Prior Network (DMPNet). It includes the key contributions that make our method reliable and accurate for text localization: firstly, roughly recalling text with quadrilateral sliding windows; then, using a shared Monte-Carlo method for fast
(a) Comparison of recalling scene text. (b) Horizontal sliding windows. (c) Proposed quadrilateral sliding windows.
Figure 2. Comparison between horizontal sliding windows and quadrilateral sliding windows. (a): The black bounding box represents ground truth; red represents our method; blue represents the horizontal sliding window. It can be seen that a quadrilateral window can recall text more easily than a rectangular window, with a higher overlapping area. (b): Horizontal sliding windows used in [19]. (c): Proposed quadrilateral sliding windows. Different quadrilateral sliding windows are distinguished with different colors.

and accurate computing of polygonal areas; and finally, finely localizing text with a quadrangle and designing a smooth Ln loss for moderately adjusting the predicted bounding box.

3.1. Roughly recall text with quadrilateral sliding window

Previous approaches [19, 26] have successfully adopted sliding windows in the intermediate convolutional layers to roughly recall text. Although these methods [26] can accurately learn region proposals based on the sliding windows, they have been too slow for real-time or near real-time applications. To raise the speed, Liu [19] simply evaluates a small set of prior windows of different aspect ratios at each location in several feature maps with different scales, which can successfully detect both small and big objects. However, horizontal sliding windows are often hard-pressed to recall multi-oriented scene text in our practice. Inspired by the recent successful methods [10, 39] that integrate textual features and CNNs, we put forward numerous quadrilateral sliding windows based on the textual intrinsic shape to roughly recall text.

During this rough procedure, an overlapping threshold is used to judge whether a sliding window is positive or negative. If a sliding window is positive, it will be used to finely localize the text. Basically, a small threshold may bring a lot of background noise, reducing the precision, while a large threshold may make text harder to recall. But if we use quadrilateral sliding windows, the overlapping area between sliding window and ground truth can be large enough to reach a higher threshold, which is beneficial to improving both the recall rate and the precision, as shown in Figure 2. As the figure presents, we reserve the horizontal sliding windows while simultaneously designing several quadrangles inside them based on the prior knowledge of textual intrinsic shape: a) two rectangles at 45 degrees are added inside the square; b) two long parallelograms are added inside the long rectangle; c) two tall parallelograms are added inside the tall rectangle.

With these flexible sliding windows, the rough bounding boxes become more accurate, and thus the subsequent fine procedure can more easily localize text tightly. In addition, because of less background noise, the confidence of these quadrilateral sliding windows can be more reliable in practice, which can be used to eliminate false positives.

3.1.1 Shared Monte-Carlo method

As mentioned earlier, for each ground truth, we need to compute its overlapping area with every quadrilateral sliding window. However, the previous method [30] can only compute rectangular areas, with unsatisfactory computational accuracy; thus we propose a shared Monte-Carlo method that has both high speed and high accuracy when computing the polygonal area. Our method consists of two steps.

a) First, we uniformly sample 10,000 points in the circumscribed rectangle of the ground truth. The area of the ground truth (SGT) can be computed by calculating the ratio of overlapping points to total points, multiplied by the area of the circumscribed rectangle. In this step, all points inside the ground truth are reserved for sharing computation.

b) Second, if the circumscribed rectangle of a sliding window and the circumscribed rectangle of a ground truth do not intersect, the overlapping area is considered zero and no further computation is needed. If the overlapping area is not zero, we use the same sampling strategy to compute the area of the sliding window (SSW) and then calculate how many of the reserved points from the first step fall inside the sliding window. The ratio of inside points to total points, multiplied by the area of the circumscribed rectangle, is the overlapping area. Specially, this step is suitable for GPU parallelization, because we can make each thread responsible for calculating one sliding window against the specified ground truth, and thus we can handle thousands of sliding windows in a short time.

Note that we use a method proposed in [12] to judge whether a point is inside a polygon; this method is also known as the crossing number algorithm or the even-odd rule algorithm [5]. The comparison between the previous method and our algorithm is shown in Figure 3; our method shows satisfactory performance for computing polygonal areas in practice.

Figure 3. Comparison between previous method and our method in computing overlapping area.
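The two-step sharing scheme above can be sketched in a few lines of Python. This is a minimal single-threaded illustration under our reading of the procedure (function names are ours; the paper additionally assigns one GPU thread per sliding window), with the inside test implemented via the even-odd rule:

```python
import random

def point_in_quad(pt, quad):
    """Even-odd rule: toggle on each edge crossed by a rightward ray from pt."""
    x, y = pt
    inside = False
    n = len(quad)
    for i in range(n):
        (x1, y1), (x2, y2) = quad[i], quad[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            # x-coordinate where this edge crosses the horizontal line at y
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x_cross > x:
                inside = not inside
    return inside

def overlap_area(gt, sw, n_samples=10000, seed=0):
    """Estimate |GT intersect SW| by sampling in GT's circumscribed rectangle."""
    xs = [p[0] for p in gt]
    ys = [p[1] for p in gt]
    xmin, xmax, ymin, ymax = min(xs), max(xs), min(ys), max(ys)
    rect_area = (xmax - xmin) * (ymax - ymin)
    rng = random.Random(seed)
    pts = [(rng.uniform(xmin, xmax), rng.uniform(ymin, ymax))
           for _ in range(n_samples)]
    # step a): points inside GT are reserved once, so every window shares them
    reserved = [p for p in pts if point_in_quad(p, gt)]
    # step b): fraction of reserved points that also fall inside SW
    hits = sum(1 for p in reserved if point_in_quad(p, sw))
    return hits / n_samples * rect_area
```

Because the points inside the ground truth are reserved once, step b) only re-tests those points against each candidate window instead of resampling, which is what makes the method "shared".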

3.2. Finely localize text with quadrangle

The fine procedure focuses on using those sliding windows with a higher overlapping area to tightly localize text. Unlike a horizontal rectangle, which can be determined by two diagonal points, we need to predict the coordinates of four points to localize a quadrangle. However, simply using the 4 points to shape a quadrangle is prone to self-contradiction, because the subjective annotation may make the network ambiguous in deciding which is the first point. Therefore, before training, it is essential to order the 4 points in advance.

Sequential protocol of coordinates. The proposed protocol can be used to determine the sequence of the four points of a plane convex quadrangle, and contains four steps, as shown in Figure 4. First, we determine the first point as the one with minimum value x. If two points simultaneously have the minimum x, then we choose the point with the smaller value y as the first point. Second, we connect the first point to the other three points; the third point is the one found on the line with the middle slope. The second and the fourth points lie on opposite sides (defined as the "bigger" side and "smaller" side) of this middle line. Here, we write the middle line as Lm: ax + by + c = 0, and we define an undetermined point P(xp, yp). If Lm(P) > 0, we assume P is on the "bigger" side; if Lm(P) < 0, P is assumed to be on the "smaller" side. Based on this assumption, the point on the "bigger" side is assigned as the second point, and the last point is regarded as the fourth point. The last step is to compare the slopes of the two diagonals (line13 and line24). From the line with the bigger slope, we choose the point with smaller x as the new first point. Specially, if the bigger slope is infinite, the point with smaller y is chosen as the first point. Similarly, we find the third point, and then the second and fourth points can be determined again. After finishing these four steps, the final sequence of the four points of a given convex quadrangle can be uniquely determined.

Figure 4. Procedure of uniquely determining the sequence of four points from a plane convex quadrangle.

Based on the sequential protocol, DMPNet can clearly learn and regress the coordinates of each point by computing its relative position to the central point. Different from [26], which regresses two coordinates and two lengths for a rectangular prediction, our regressive method predicts two coordinates and eight lengths for a quadrilateral detection. For each ground truth, the coordinates of the four points are reformatted to (x, y, w1, h1, w2, h2, w3, h3, w4, h4), where x, y are the central coordinates of the minimum circumscribed horizontal rectangle, and wi, hi are the relative position of the i-th point (i = {1, 2, 3, 4}) to the central point. As Figure 5 shows, the coordinates of the four points are (x1, y1, x2, y2, x3, y3, x4, y4) = (x + w1, y + h1, x + w2, y + h2, x + w3, y + h3, x + w4, y + h4). Note that wi and hi can be negative. Actually, eight coordinates are enough to determine the position of a quadrangle; the reason why we use ten coordinates is that we can thereby avoid regressing 8 absolute coordinates, which do not contain relative information and are more difficult to learn in practice [6]. Inspired by [26], we also use Lreg(pi; p∗i) = R(pi − p∗i) for the multi-task loss, where R is our proposed loss function (smooth Ln), described in section 3.3. p∗ = (p∗x, p∗y, p∗w1, p∗h1, p∗w2, p∗h2, p∗w3, p∗h3, p∗w4, p∗h4) represents the ten parameterized coordinates of the predicted bounding box, and p = (px, py, pw1, ph1, pw2, ph2, pw3, ph3, pw4, ph4) represents the ground truth.

Figure 5. The position of each point of the quadrangle can be calculated from the central point and the relative lengths.

From the given coordinates, we can calculate the minimum x (xmin) and maximum x (xmax) of the circumscribed rectangle, and hence the width of the circumscribed horizontal rectangle wchr = xmax − xmin. Similarly, we can get the height hchr = ymax − ymin. We adopt the parameterizations of the 10 coordinates as follows:

dx = (p∗x − px)/wchr,  dy = (p∗y − py)/hchr,
dwi = (p∗wi − pwi)/wchr,  dhi = (p∗hi − phi)/hchr,  for i = 1, 2, 3, 4.

This can be thought of as fine regression from a quadrilateral sliding window to a nearby ground-truth box.

3.3. Smooth Ln loss

Different from [19, 26], our approach uses a proposed smooth Ln loss instead of smooth L1 loss to further localize scene text. Smooth L1 loss is less sensitive to outliers than the L2 loss used in R-CNN [7]; however, this loss is not stable enough in adjusting to the data, which means the regression line may jump a large amount for a small adjustment, or only a little modification is made for a big adjustment. For the proposed smooth Ln loss, the regressive parameters are continuous functions of the data, which means that for any small adjustment of a data point, the regression line will always move only slightly, improving the precision in localizing small text. For a bigger adjustment, the regression always moves a moderate step based on the smooth Ln loss, which can accelerate the convergence of the training procedure in practice. As mentioned in section 3.2, the regression loss, Lreg, is defined over a tuple of true bounding-box regression targets p∗ and a predicted tuple p for the text class. The smooth L1 loss proposed in [6] is given by:

Lreg(p; p∗) = Σ_{i∈S} smoothL1(pi, p∗i),   (1)

in which,

smoothL1(x) = 0.5x²  if |x| < 1;  |x| − 0.5  otherwise.   (2)

The x in the function represents the error between the predicted value and the ground truth (x = w · (p − p∗)). The deviation (derivative) function of smoothL1 is:

deviationL1(x) = x  if |x| < 1;  sign(x)  otherwise.   (3)

As equation 3 shows, the deviation function of smooth L1 is piecewise, while the smooth Ln loss is a continuously differentiable function. The proposed smooth Ln loss is given by:

Lreg(p; p∗) = Σ_{i∈S} smoothLn(pi, p∗i),   (4)

in which,

smoothLn(x) = (|x| + 1) ln(|x| + 1) − |x|,   (5)

and the deviation function of smoothLn is:

deviationLn(x) = sign(x) · ln(sign(x) · x + 1).   (6)
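Equations (5) and (6), together with the smooth L1 baseline of equation (2), can be checked numerically with a short sketch (function names are ours):

```python
import math

def smooth_l1(x):
    # Eq. (2): quadratic near zero, linear for |x| >= 1
    return 0.5 * x * x if abs(x) < 1 else abs(x) - 0.5

def smooth_ln(x):
    # Eq. (5): (|x| + 1) ln(|x| + 1) - |x|
    a = abs(x)
    return (a + 1.0) * math.log(a + 1.0) - a

def smooth_ln_deviation(x):
    # Eq. (6): sign(x) * ln(|x| + 1); continuous for all x, unlike eq. (3)
    s = 1.0 if x > 0 else (-1.0 if x < 0 else 0.0)
    return s * math.log(abs(x) + 1.0)
```

The deviation stays at or below |x| everywhere, which is the |x| ≥ |deviationLn(x)| property used to argue that the loss is less sensitive to outliers than L2.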
(a) forward loss functions. (b) backward deviation functions.
Figure 6. Visualization of differences among three loss functions (L2, smooth L1 and smooth Ln). Here, the L2 function uses the same coefficient 0.5 as the smooth L1 loss.

property     L2 loss   smooth L1 loss   smooth Ln loss
Robustness   Worst     Best             Good
Stability    Good      Worst            Best

Table 1. Different properties of different loss functions. Robustness represents the ability to resist outliers in the data, and stability represents the capability of adjusting the regressive step.

Equations 5 and 6 are both continuous functions given by a single expression. For equation 6, it is easy to prove that |x| ≥ |deviationLn(x)|, which means the smooth Ln loss is also less sensitive to outliers than the L2 loss used in R-CNN [7]. An intuitive representation of the differences among the three loss functions is shown in Figure 6. The comparisons of their properties in terms of robustness and stability are summarized in Table 1. The results demonstrate that the smooth Ln loss promises better text localization and relatively tighter bounding boxes around the texts.

4. Experiments

Our testing environment is a desktop running Ubuntu 14.04 64-bit with a TitanX GPU. In this section, we quantitatively evaluate our method on the public dataset ICDAR 2015 Competition Challenge 4 "Incidental Scene Text" [14]; as far as we know, this is the only dataset in which texts are both word-level and multi-oriented. All results of our method are evaluated by its online evaluation system, which calculates the recall rate, precision and F-measure to rank the submitted methods. The general criteria for these three indexes can be explained as below:

• Recall rate evaluates the ability to find text.

• Precision evaluates the reliability of the predicted bounding boxes.

• F-measure is the harmonic mean (Hmean) of recall rate and precision, and is always used for ranking the methods.

Particularly, we simply use the official 1000 training images as our training set without any extra data augmentation, but we have modified some rectangular labels to quadrilateral labels to adapt them to our method.

Dataset - ICDAR 2015 Competition Challenge 4 "Incidental Scene Text" [14]. Different from the previous ICDAR competitions, in which the text is well-captured, horizontal, and typically centered in images, this dataset includes 1000 training images and 500 testing incidental scene images, where text may appear in any orientation and any location, with small size or low resolution, and the annotations of all bounding boxes are marked at the word level.

Baseline network. The main structure of DMPNet is based on the VGG-16 model [28]. Similar to Single Shot Detector [19], we use the same intermediate convolutional layers to apply quadrilateral sliding windows. All input images are resized to 800x800 to preserve tiny texts.

Experimental results. For comprehensively evaluating our algorithm, we collect and list the competition results [14] in Table 2. The previous best method on this dataset, proposed by Yao et al. [33], achieved an F-measure of 63.76%, while our approach obtains 70.64%. The precision of the two methods is comparable, but the recall rate of our method has greatly increased, which is mainly due to the quadrilateral sliding windows described in section 3.1.

Figure 7 shows several detection results taken from the test set of ICDAR 2015 Challenge 4. DMPNet can robustly localize all kinds of scene text with less background noise. However, due to the complexity of incidental scenes, some false detections still exist, and our method may fail to recall some inconspicuous text, as shown in the last column of Figure 7.
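As a quick sanity check on the Hmean column of Table 2, the harmonic mean of the reported recall and precision reproduces the reported F-measure (function name is ours):

```python
def hmean(recall, precision):
    """F-measure: harmonic mean of recall and precision."""
    return 2.0 * recall * precision / (recall + precision)

# DMPNet's reported recall and precision on ICDAR 2015 Challenge 4
print(round(hmean(68.22, 73.23), 2))  # 70.64
```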
Table 2. Evaluation on the ICDAR 2015 competition on robust reading challenge 4 "Incidental Scene Text" localization.

Algorithm                    Recall (%)   Precision (%)   Hmean (%)
Baseline (SSD-VGGNet)        25.48        63.25           36.33
Proposed DMPNet              68.22        73.23           70.64
Megvii-Image++ [33]          56.96        72.40           63.76
CTPN [29]                    51.56        74.22           60.85
MCLAB FCN [14]               43.09        70.81           53.58
StardVision-2 [14]           36.74        77.46           49.84
StardVision-1 [14]           46.27        53.39           49.57
CASIA USTB-Cascaded [14]     39.53        61.68           48.18
NJU Text [14]                35.82        72.73           48.00
AJOU [16]                    46.94        47.26           47.10
HUST MCLAB [14]              37.79        44.00           40.66
Deep2Text-MO [36]            32.11        49.59           38.98
CNN Proposal [14]            34.42        34.71           34.57
TextCatcher-2 [14]           34.81        24.91           29.04

Figure 7. Experimental results of samples on ICDAR 2015 Challenge 4, including multi-scale and multi-language word-level text. Our method can tightly localize text with less background information, as shown in the first two columns. The top three images of the last column are failure cases of recalling by the proposed method. Specially, some labels are missed in some images, which may reduce our accuracy, as with the red bounding box in the fourth image of the last column.

5. Conclusion and future work

In this paper, we have proposed a CNN-based method, named Deep Matching Prior Network (DMPNet), that can effectively reduce background interference. The DMPNet is the first attempt to adopt quadrilateral sliding windows, which are designed based on the priori knowledge of textual intrinsic shape, to roughly recall text. We then use a proposed sequential protocol and a relative regression method to finely localize text without self-contradiction. Due to the requirement of computing numerous polygonal overlapping areas in the rough procedure, we proposed a shared Monte-Carlo method for fast and accurate calculation. In addition, a new smooth Ln loss is used for further adjusting the prediction, which shows better overall performance than L2 loss and smooth L1 loss in terms of robustness and stability. Experiments on the well-known ICDAR 2015 robust reading challenge 4 dataset demonstrate that DMPNet can achieve state-of-the-art performance in detecting incidental scene text. In the following, we discuss an issue related to our approach and briefly describe our future work.

Ground truth of the text. Texts in camera-captured images often exhibit perspective distortion. However, rectangular constraints on labeled data may bring a lot of background noise, and information may be lost when labels fail to contain all of the marginal text. As far as we know, ICDAR 2015 Challenge 4 is the first dataset to use quadrilateral labeling, and our method proves the effectiveness of utilizing quadrilateral labeling. Thus, quadrilateral labeling for scene text may be more reasonable.

Future work. The high recall rate of the DMPNet mainly depends on numerous prior-designed quadrilateral sliding windows. Although our method has been proved effective, the man-made shapes of the sliding windows may not be optimal. In the future, we will explore using shape-adaptive sliding windows toward tighter scene text detection.

References

[1] A. Bissacco, M. Cummins, Y. Netzer, and H. Neven. PhotoOCR: Reading text in uncontrolled conditions. In IEEE International Conference on Computer Vision, pages 785–792, 2013.
[2] X. Chen and A. L. Yuille. Detecting and reading text in natural scenes. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 366–373, 2004.
[3] H. Cho, M. Sung, and B. Jun. Canny text detector: Fast and robust scene text localization algorithm. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3566–3573, 2016.
[4] B. Epshtein, E. Ofek, and Y. Wexler. Detecting text in natural scenes with stroke width transform. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 2963–2970. IEEE, 2010.
[5] M. Galetzka and P. O. Glauner. A correct even-odd algorithm for the point-in-polygon (PIP) problem for complex polygons. CVPR, 2012.
[6] R. Girshick. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 1440–1448, 2015.
[7] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 580–587, 2014.
[8] S. M. Hanif and L. Prevost. Text detection and localization in complex scene images using constrained adaboost algorithm. In 2009 10th International Conference on Document Analysis and Recognition, pages 1–5. IEEE, 2009.
[9] W. Huang, Z. Lin, J. Yang, and J. Wang. Text localization in natural images using stroke feature transform and text covariance descriptors. In International Conference on Computer Vision, pages 1241–1248, 2013.
[10] W. Huang, Y. Qiao, and X. Tang. Robust scene text detection with convolution neural network induced MSER trees. In ECCV, pages 497–511, 2014.
[11] M. Jaderberg, A. Vedaldi, and A. Zisserman. Deep features for text spotting. In European Conference on Computer Vision, pages 512–528. Springer, 2014.
[12] K. Hormann and A. Agathos. The point in polygon problem for arbitrary polygons. Computational Geometry, 20(3):131–144, 2001.
[13] L. Kang, Y. Li, and D. Doermann. Orientation robust text line detection in natural images. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, pages 4034–4041. IEEE, 2014.
[14] D. Karatzas, S. Lu, F. Shafait, S. Uchida, E. Valveny, L. Gomez-Bigorda, A. Nicolaou, S. Ghosh, A. Bagdanov, and M. Iwamura. ICDAR 2015 competition on robust reading. In International Conference on Document Analysis and Recognition, 2015.
[15] K. I. Kim, K. Jung, and H. K. Jin. Texture-based approach for text detection in images using support vector machines and continuously adaptive mean shift algorithm. IEEE Transactions on Pattern Analysis & Machine Intelligence, 25(12):1631–1639, 2003.
[16] H. I. Koo and D. H. Kim. Scene text detection via connected component clustering and nontext filtering. IEEE Transactions on Image Processing, 22(6):2296–2305, 2013.
[17] J.-J. Lee, P.-H. Lee, S.-W. Lee, A. L. Yuille, and C. Koch. Adaboost for text detection in natural scene. In ICDAR, pages 429–434, 2011.
[18] M. Li and I. K. Sethi. Confidence-based active learning. IEEE Transactions on Pattern Analysis & Machine Intelligence, 28(8):1251–61, 2006.
[25] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. arXiv preprint arXiv:1506.02640, 2015.
[26] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis & Machine Intelligence, pages 1–1, 2016.
[27] C. Shi, C. Wang, B. Xiao, Y. Zhang, S. Gao, and Z. Zhang. Scene text recognition using part-based tree-structured character detection. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2961–2968, 2013.
[28] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[29] Z. Tian, W. Huang, T. He, P. He, and Y. Qiao. Detecting text in natural image with connectionist text proposal network. Springer International Publishing, 2016.
[30] Z. Tu, Y. Ma, W. Liu, X. Bai, and C. Yao. Detecting texts of arbitrary orientations in natural images. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1083–1090, 2012.
[31] J. J. Weinman, Z. Butler, D. Knoll, and J. Feild. Toward integrated scene text reading. IEEE Transactions on Software Engineering, 36(2):375–87, 2014.
[32] J. J. Weinman, E. Learned-Miller, and A. R. Hanson. Scene text recognition using similarity and a lexicon with sparse belief propagation. IEEE Transactions on Pattern Analysis & Machine Intelligence, 31(10):1733–46, 2009.
[33] C. Yao, J. Wu, X. Zhou, C. Zhang, S. Zhou, Z. Cao, and Q. Yin. Incidental scene text understanding: Recent progresses on ICDAR 2015 robust reading competition challenge 4. PAMI, 2015.
[34] Q. Ye and D. Doermann. Text detection and recognition in imagery: A survey. IEEE Transactions on Pattern Analysis & Machine Intelligence, 37(7):1480–1500, 2015.
[35] C. Yi and Y. Tian. Text string detection from natural scenes by structure-based partition and grouping. IEEE Transactions on Image Processing, 20(9):2594–605, 2011.
[19] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, and S. Reed. [36] X. C. Yin, W. Y. Pei, J. Zhang, and H. W. Hao. Multi-
Ssd: Single shot multibox detector. arXiv preprint orientation scene text detection with adaptive clustering.
arXiv:1512.02325, 2015. 2, 3, 5, 6 IEEE Transactions on Pattern Analysis & Machine Intelli-
[20] J. Matas, O. Chum, M. Urban, and T. Pajdla. Robust wide- gence, 37(9):1930–7, 2015. 1, 2, 7
baseline stereo from maximally stable extremal regions. Im- [37] X. C. Yin, X. Yin, K. Huang, and H. W. Hao. Robust text de-
age & Vision Computing, 22(10):761–767, 2004. 2 tection in natural scene images. IEEE Transactions on Pat-
[21] A. Neubeck and L. V. Gool. Efficient non-maximum sup- tern Analysis & Machine Intelligence, 36(5):970–83, 2014.
pression. In International Conference on Pattern Recogni- 1
tion, pages 850–855, 2006. 2 [38] A. Zamberletti, L. Noce, and I. Gallo. Text localization based
[22] L. Neumann and J. Matas. Real-time scene text localization on fast feature pyramids and multi-resolution maximally sta-
and recognition. In IEEE Conference on Computer Vision ble extremal regions. In Asian Conference on Computer Vi-
and Pattern Recognition, pages 3538–3545, 2012. 2 sion, pages 91–105. Springer, 2014. 2
[23] L. Neumann and J. Matas. Scene text localization and recog- [39] Z. Zhang, C. Zhang, W. Shen, C. Yao, W. Liu, and X. Bai.
nition with oriented stroke detection. In IEEE International Multi-oriented text detection with fully convolutional net-
Conference on Computer Vision, pages 97–104, 2013. 1, 2 works. arXiv preprint arXiv:1604.04018, 2016. 1, 2, 3
[24] D. Nistr and H. Stewnius. Linear time maximally stable ex- [40] Y. Zhu, C. Yao, and X. Bai. Scene text detection and recog-
tremal regions. In Computer Vision - ECCV 2008, European nition: recent advances and future trends. Frontiers of Com-
Conference on Computer Vision, Marseille, France, October puter Science, 10(1):19–36, 2016. 1, 2
12-18, 2008, Proceedings, pages 183–196, 2008. 2