
Synthetic Data for Text Localisation in Natural Images

Ankush Gupta Andrea Vedaldi Andrew Zisserman


Dept. of Engineering Science, University of Oxford
{ankush,vedaldi,az}@robots.ox.ac.uk

Abstract

In this paper we introduce a new method for text detection in natural images. The method comprises two contributions: First, a fast and scalable engine to generate synthetic images of text in clutter. This engine overlays synthetic text to existing background images in a natural way, accounting for the local 3D scene geometry. Second, we use the synthetic images to train a Fully-Convolutional Regression Network (FCRN) which efficiently performs text detection and bounding-box regression at all locations and multiple scales in an image. We discuss the relation of FCRN to the recently-introduced YOLO detector, as well as other end-to-end object detection systems based on deep learning. The resulting detection network significantly outperforms current methods for text detection in natural images, achieving an F-measure of 84.2% on the standard ICDAR 2013 benchmark. Furthermore, it can process 15 images per second on a GPU.

Figure 1. We propose a Fully-Convolutional Regression Network (FCRN) for high-performance text recognition in natural scenes (bottom), which detects text up to 45× faster than the current state-of-the-art text detectors and with better accuracy. FCRN is trained without any manual annotation using a new dataset of synthetic text in the wild. The latter is obtained by automatically adding text to natural scenes in a manner compatible with the scene geometry (top).

1. Introduction

Text spotting, namely the ability to read text in natural scenes, is a highly-desirable feature in anthropocentric applications of computer vision. State-of-the-art systems such as [20] achieved their high text spotting performance by combining two simple but powerful insights. The first is that complex recognition pipelines that recognise text by explicitly combining recognition and detection of individual characters can be replaced by very powerful classifiers that directly map an image patch to words [13, 20]. The second is that these powerful classifiers can be learned by generating the required training data synthetically [19, 44].

While [20] successfully addressed the problem of recognising text given an image patch containing a word, the process of obtaining these patches remains suboptimal. The pipeline combines general purpose features such as HoG [6], EdgeBoxes [48] and Aggregate Channel Features [7] and brings in text-specific (CNN) features only in the later stages, where patches are finally recognised as specific words. This state of affairs is highly undesirable for two reasons. First, the performance of the detection pipeline becomes the new bottleneck of text spotting: in [20] recognition accuracy for correctly cropped words is 98%, whereas the end-to-end text spotting F-score is only 69%, mainly due to incorrect and missed word region proposals. Second, the pipeline is slow and inelegant.

In this paper we propose improvements similar to [20] to the complementary problem of text detection. We make two key contributions. First, we propose a new method for generating synthetic images of text that naturally blends text in existing natural scenes, using off-the-shelf deep learning and segmentation techniques to align text to the geometry of a background image and respect scene boundaries. We use this method to automatically generate a new synthetic dataset of text in cluttered conditions (figure 1 (top) and section 2). This dataset, called SynthText in the Wild (figure 2), is suitable for training high-performance scene text detectors. The key difference with existing synthetic text datasets such as the one of [20] is that these only contain word-level image regions and are unsuitable for training detectors.

The second contribution is a text detection deep architecture which is both accurate and efficient (figure 1 (bottom) and section 3). We call this a fully-convolutional regression network. Similar to models such as the Fully-Convolutional Networks (FCN) for image segmentation, it performs prediction densely, at every image location. However, differently from FCN, the prediction is not just a class label (text/not text), but the parameters of a bounding box enclosing the word centred at that location. The latter idea is borrowed from the You Look Only Once (YOLO) technique of Redmon et al. [36], but with convolutional regressors giving a significant boost to performance.

The new data and detector achieve state-of-the-art text detection performance on standard benchmark datasets (section 4) while being an order of magnitude faster than traditional text detectors at test time (up to 15 images per second on a GPU). We also demonstrate the importance of verisimilitude in the dataset by showing that if the detector is trained on images with words inserted synthetically that do not take account of the scene layout, then the detection performance is substantially inferior. Finally, due to the more accurate detection step, end-to-end word recognition is also improved once the new detector is swapped in for existing ones in state-of-the-art pipelines. Our findings are summarised in section 5.

Figure 2. Sample images from our synthetically generated scene-text dataset. Ground-truth word-level axis-aligned bounding boxes are shown.

Table 1. Size of publicly available text localisation datasets — ICDAR [23, 24, 39] and the Street View Text (SVT) dataset [43]. Word counts for the entry "ICDAR{11,13,15}" are from the ICDAR15 Robust Reading Competition's Focused Scene Text Localisation dataset.

Dataset             Train images   Test images   Train words   Test words
ICDAR {11,13,15}    229            255           849           1095
SVT                 100            249           257           647

1.1. Related Work

Object Detection with CNNs. Our text detection network draws primarily on Long et al.'s Fully-Convolutional network [31] and Redmon et al.'s YOLO image-grid based bounding-box regression network [36]. YOLO is part of a broad line of work on using CNN features for object category detection dating back to Girshick et al.'s Region-CNN (R-CNN) framework [12], a combination of region proposals and CNN features. The R-CNN framework has three broad stages — (1) generating object proposals, (2) extracting CNN feature maps for each proposal, and (3) filtering the proposals through class-specific SVMs. Jaderberg et al.'s text spotting method also uses a similar pipeline for detection [20]. Extracting feature maps for each region independently was identified as the bottleneck by Girshick et al. in Fast R-CNN [11]. They obtain a 100× speed-up over R-CNN by computing the CNN features once and pooling them locally for each proposal; they also streamline the last two stages of R-CNN into a single multi-task learning problem. This work exposed the region-proposal stage as the new bottleneck. Lenc et al. [29] drop the region proposal stage altogether and use a constant set of regions learnt through K-means clustering on the PASCAL VOC data. Ren et al. [37] also start from a fixed set of proposals, but refine them prior to detection by using a Region Proposal Network which shares weights with the later detection network and streamlines the multi-stage R-CNN framework.

Synthetic Data. Synthetic datasets provide detailed ground-truth annotations, and are cheap and scalable alternatives to annotating images manually. They have been widely used to learn large CNN models — Wang et al. [44] and Jaderberg et al. [19] use synthetic text images to train word-image recognition networks; Dosovitskiy et al. [9] use floating chair renderings to train dense optical flow regression networks. Detailed synthetic data has also been used to learn generative models — Dosovitskiy et al. [8] train inverted CNN models to render images of chairs, while Yildirim et al. [46] use deep CNN features trained on synthetic face renderings to regress pose parameters from face images.

Augmenting Single Images. There is a large body of work on inserting objects photo-realistically, and inferring 3D structure from single images — Karsch et al. [25] develop an impressive semi-automatic method to render objects with correct lighting and perspective; they infer the actual size of objects based on the technique of Criminisi et al. [5]. Hoiem et al. [15] categorise image regions into ground-plane, vertical plane or sky from a single image and use this to generate "pop-ups" by decomposing the image into planes [14]. Similarly, we too decompose a single image into local planar regions, but use instead the dense depth prediction of Liu et al. [30].

Figure 3. (Top, left to right): (1) RGB input image with no text instance. (2) Predicted dense depth map (darker regions are closer). (3) Colour and texture gPb-UCM segments. (4) Filtered regions: regions suitable for text are coloured randomly; those unsuitable retain their original image pixels. (Bottom): Four synthetic scene-text images with axis-aligned bounding-box annotations at the word level.

2. Synthetic Text in the Wild

Supervised training of large models such as deep CNNs, which contain millions of parameters, requires a very significant amount of labelled training data [26], which is expensive to obtain manually. Furthermore, as summarised in Table 1, publicly available text spotting or detection datasets are quite small. Such datasets are not only insufficient to train large CNN models, but also inadequate to represent the space of possible text variations in natural scenes — fonts, colours, sizes, positions. Hence, in this section we develop a synthetic text-scene image generation engine for building a large annotated dataset for text localisation.

Our synthetic engine (1) produces realistic scene-text images so that the trained models can generalise to real (non-synthetic) images, (2) is fully automated, and (3) is fast, which enables the generation of large quantities of data without supervision. The text generation pipeline can be summarised as follows (see also Figure 3). After acquiring suitable text and image samples (section 2.1), the image is segmented into contiguous regions based on local colour and texture cues [2], and a dense pixel-wise depth map is obtained using the CNN of [30] (section 2.2). Then, for each contiguous region a local surface normal is estimated. Next, a colour for the text and, optionally, for its outline is chosen based on the region's colour (section 2.3). Finally, a text sample is rendered using a randomly selected font and transformed according to the local surface orientation; the text is blended into the scene using Poisson image editing [35]. Our engine takes about half a second to generate a new scene-text image.

This method is used to generate 800,000 scene-text images, each with multiple instances of words rendered in different styles, as seen in Figure 2. The dataset is available at: http://www.robots.ox.ac.uk/~vgg/data/scenetext

2.1. Text and Image Sources

The synthetic text generation process starts by sampling some text and a background image. The text is extracted from the Newsgroup20 dataset [27] in three ways — words, lines (up to 3 lines) and paragraphs (up to 7 lines). Words are defined as tokens separated by whitespace characters, and lines are delimited by the newline character. This is a rich dataset, with a natural distribution of English text interspersed with symbols, punctuation marks, nouns and numbers.

To favour variety, 8,000 background images are extracted from Google Image Search through queries related to different objects/scenes and indoor/outdoor and natural/artificial locales. To guarantee that all text occurrences are fully annotated, these images must not contain text of their own (a limitation of the Street View Text dataset [43] is that its annotations are not exhaustive). Hence, keywords which would recall a large amount of text in the images (e.g. "street-sign", "menu" etc.) are avoided; images containing text are discarded through manual inspection.
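As a rough illustration of the text sampling just described, the following Python sketch draws either a single word, a block of up to 3 lines, or a paragraph of up to 7 lines from a plain-text corpus; the corpus loading, the random seed and the helper name are illustrative assumptions rather than part of the released pipeline.

```python
import random

def sample_text(documents, rng=random.Random(0)):
    """Sample text as a word, a line block (up to 3 lines) or a paragraph
    (up to 7 lines), mirroring the three modes of section 2.1.
    `documents` is assumed to be a list of plain-text strings (e.g. the
    Newsgroup20 posts); this loader-agnostic helper is illustrative only."""
    doc = rng.choice(documents)
    lines = [ln for ln in doc.splitlines() if ln.strip()]
    mode = rng.choice(["word", "lines", "paragraph"])
    if mode == "word":
        words = doc.split()
        return rng.choice(words) if words else ""
    n = rng.randint(1, 3) if mode == "lines" else rng.randint(1, 7)
    start = rng.randrange(max(1, len(lines) - n + 1))
    return "\n".join(lines[start:start + n])

# Example usage with a toy corpus:
corpus = ["The quick brown fox\njumps over the lazy dog.\nNumbers like 42 appear too."]
print(sample_text(corpus))
```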

Figure 4. Local colour/texture sensitive placement. (Left) Example image from the synthetic text dataset. Notice that the text is restricted within the boundaries of the step in the street. (Right) For comparison, the placement of text in this image does not respect the local region cues.

2.2. Segmentation and Geometry Estimation

In real images, text tends to be contained in well-defined regions (e.g. a sign). We approximate this constraint by requiring text to be contained in regions characterised by a uniform colour and texture. This also prevents text from crossing strong image discontinuities, which is unlikely to occur in practice. Regions are obtained by thresholding the gPb-UCM contour hierarchies [2] at 0.11 using the efficient graph-cut implementation of [3]. Figure 4 shows an example of text respecting local region cues.

In natural images, text tends to be painted on top of surfaces (e.g. a sign or a cup). In order to approximate a similar effect in our synthetic data, the text is perspectively transformed according to local surface normals. The normals are estimated automatically by first predicting a dense depth map using the CNN of [30] for the regions segmented above, and then fitting a planar facet to it using RANSAC [10].

Text is aligned to the estimated region orientations as follows: first, the image region contour is warped to a fronto-parallel view using the estimated plane normal; then, a rectangle is fitted to the fronto-parallel region; finally, the text is aligned to the larger side ("width") of this rectangle. When placing multiple instances of text in the same region, text masks are checked for collision against each other to avoid placing them on top of each other. Not all segmentation regions are suitable for text placement — regions should not be too small, have an extreme aspect ratio, or have a surface normal orthogonal to the viewing direction; all such regions are filtered out at this stage. Further, regions with too much texture are also filtered out, where the degree of texture is measured by the strength of third derivatives in the RGB image.

Discussion. An alternative to using a CNN to estimate depth, which is an error-prone process, is to use a dataset of RGBD images. We prefer to estimate an imperfect depth map instead because: (1) it allows essentially any scene type background image to be used, instead of only the ones for which RGBD data are available, and (2) publicly available RGBD datasets such as NYUDv2 [40], B3DO [22], Sintel [4], and Make3D [38] have several limitations in our context: small size (1,500 images in NYUDv2, 400 frames in Make3D, and a small number of videos in B3DO and Sintel), low resolution and motion blur, restriction to indoor images (in NYUDv2 and B3DO), and limited variability in the images for video-based datasets (B3DO and Sintel).

2.3. Text Rendering and Image Composition

Once the location and orientation of text has been decided, text is assigned a colour. The colour palette for text is learned from cropped word images in the IIIT5K word dataset [32]. Pixels in each cropped word image are partitioned into two sets using K-means, resulting in a colour pair, with one colour approximating the foreground (text) colour and the other the background. When rendering new text, the colour pair whose background colour best matches the target image region (using the L2-norm in the Lab colour space) is selected, and the corresponding foreground colour is used to render the text.

About 20% of the text instances are randomly chosen to have a border. The border colour is chosen to be either the same as the foreground colour with its value channel increased or decreased, or the mean of the foreground and background colours.

To maintain the illumination gradient in the synthetic text image, we blend the text onto the base image using Poisson image editing [35], with the guidance field defined as in their equation (12). We solve this efficiently using the implementation provided by Raskar (fast Poisson image editing code, based on the Discrete Sine Transform, available at: http://web.media.mit.edu/~raskar/photo/code.pdf).
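As a concrete sketch of the geometry step in section 2.2, the snippet below fits a planar facet to the back-projected 3D points of one segmented region with a basic RANSAC loop and returns the plane normal; the inlier threshold, iteration count and point back-projection are assumptions, not values used by the authors.

```python
import numpy as np

def ransac_plane_normal(points, n_iters=200, inlier_thresh=0.02,
                        rng=np.random.default_rng(0)):
    """Fit a plane to an (N, 3) array of 3D points from one image region and
    return its unit normal. Classic 3-point RANSAC followed by a least-squares
    refit on the inliers; the thresholds are illustrative."""
    best_inliers = None
    for _ in range(n_iters):
        p0, p1, p2 = points[rng.choice(len(points), 3, replace=False)]
        n = np.cross(p1 - p0, p2 - p0)
        if np.linalg.norm(n) < 1e-8:          # degenerate (collinear) sample
            continue
        n = n / np.linalg.norm(n)
        dist = np.abs((points - p0) @ n)      # point-to-plane distances
        inliers = dist < inlier_thresh
        if best_inliers is None or inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # Refit: the normal is the singular vector of the centred inliers
    # associated with the smallest singular value.
    P = points[best_inliers]
    P = P - P.mean(axis=0)
    normal = np.linalg.svd(P)[2][-1]
    return normal / np.linalg.norm(normal)

# Toy usage: noisy points on the plane z = 0.5 give a normal close to (0, 0, ±1).
pts = np.c_[np.random.rand(500, 2), 0.5 + 0.005 * np.random.randn(500)]
print(ransac_plane_normal(pts))
```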

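Similarly, the colour model of section 2.3 can be sketched as a two-cluster K-means split of each IIIT5K word crop followed by an L2 match in Lab space against the target region; scikit-learn and scikit-image are used here for brevity, and the foreground/background assignment heuristic is an assumption.

```python
import numpy as np
from sklearn.cluster import KMeans
from skimage.color import rgb2lab

def colour_pair(word_crop):
    """Split the pixels of a cropped word image (H, W, 3, floats in [0, 1])
    into two clusters and return (foreground, background) mean colours in Lab.
    Calling the larger cluster 'background' is a simplifying assumption."""
    lab = rgb2lab(word_crop).reshape(-1, 3)
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(lab)
    c0, c1 = lab[labels == 0].mean(0), lab[labels == 1].mean(0)
    if (labels == 0).sum() >= (labels == 1).sum():
        return c1, c0      # (foreground, background)
    return c0, c1

def pick_text_colour(region_rgb, palette):
    """From a list of (fg, bg) Lab pairs learned from word crops, pick the pair
    whose background best matches the target region (L2 in Lab) and return the
    corresponding foreground colour to render the text with."""
    region_lab = rgb2lab(region_rgb).reshape(-1, 3).mean(0)
    dists = [np.linalg.norm(bg - region_lab) for _, bg in palette]
    fg, _ = palette[int(np.argmin(dists))]
    return fg
```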
3. A Fast Text Detection Network

In this section we introduce our CNN architecture for text detection in natural scenes. While existing text detection pipelines combine several ad-hoc steps and are slow, we propose a detector which is highly accurate, fast, and trainable end-to-end.

Let x denote an image. The most common approach for CNN-based detection is to propose a number of image regions R that may contain the target object (text in our case), crop the image, and use a CNN c = φ(crop_R(x)) ∈ {0, 1} to score them as correct or not. This approach, which has been popularised by R-CNN [12], works well but is slow as it entails evaluating the CNN thousands of times per image.

An alternative and much faster strategy for object detection is to construct a fixed field of predictors (c, p) = φ_uv(x), each of which specialises in predicting the presence c ∈ R and pose p = (x − u, y − v, w, h) of an object around a specific image location (u, v). Here the pose parameters (x, y) and (w, h) denote respectively the location and size of a bounding box tightly enclosing the object. Each predictor φ_uv is tasked with predicting objects which occur in some ball (x, y) ∈ B_ρ(u, v) of the predictor location.

While this construction may sound abstract, it is actually a common one, implemented for example by Implicit Shape Models (ISM) [28] and Hough voting [16]. There a predictor φ_uv looks at a local image patch, centred at (u, v), and tries to predict whether there is an object around (u, v), and where the object is located relative to it.

In this paper we propose an extreme variant of Hough voting, inspired by the Fully-Convolutional Network (FCN) of Long et al. [31] and the You Look Only Once (YOLO) technique of Redmon et al. [36]. In ISM and Hough voting, individual predictions are aggregated across the image in a voting scheme. YOLO is similar, but avoids voting and uses individual predictions directly; since this idea can accelerate detection, we adopt it here.

The other key conceptual difference between YOLO and Hough voting is that in Hough voting predictors φ_uv(x) are local and translation invariant, whereas in YOLO they are not: first, in YOLO each predictor is allowed to pool evidence from the whole image, not just an image patch centred at (u, v); second, in YOLO predictors at different locations (u, v) ≠ (u′, v′) are different functions φ_uv ≠ φ_u′v′ learned independently.

While YOLO's approach allows the method to pick up contextual information useful in the detection of PASCAL or ImageNet objects, we found this unsuitable for smaller and more variable text occurrences. Instead, we propose here a method which is in between YOLO and Hough voting. As in YOLO, each detector φ_uv(x) still predicts object occurrences directly, without undergoing an expensive voting accumulation process; however, as in Hough voting, detectors φ_uv(x) are local and translation invariant, sharing parameters. We implement this field of translation-invariant and local predictors as the output of the last layer of a deep CNN, obtaining a fully-convolutional regression network (FCRN).

3.1. Architecture

This section describes the structure of the FCRN. First, we describe the first several layers of the architecture, which compute text-specific image features. Then, we describe the dense regression network built on top of these features and finally its application at multiple scales.

Single-scale features. Our architecture is inspired by VGG-16 [41], using several layers of small dense filters; however, we found that a much smaller model works just as well and more efficiently for text. The architecture comprises nine convolutional layers, each followed by the Rectified Linear Unit non-linearity, and, occasionally, by a max-pooling layer. All linear filters have a stride of 1 sample, and preserve the resolution of feature maps through zero padding. Max-pooling is performed over 2×2 windows with a stride of 2 samples, therefore halving the feature map resolution. (The sequence of layers is: 64 5×5 convolutional filters + ReLU (CR-64-5×5), max pooling (MP), CR-128-5×5, MP, CR-128-3×3, CR-128-3×3, MP, CR-256-3×3, CR-256-3×3, MP, CR-512-3×3, CR-512-3×3, CR-512-5×5.)

Class and bounding box prediction. The single-scale features terminate with a dense feature field. Given that there are four downsampling max-pooling layers, the stride of these features is ∆ = 16 pixels, each location containing 512 feature channels φ^f_uv(x) (we express uv in pixels for convenience).

Given the features φ^f_uv(x), we can now discuss the construction of the dense text predictors φ_uv(x) = φ^r_uv ◦ φ^f(x). These predictors are implemented as a further seven 5×5 linear filters (C-7-5×5) φ^r_uv, each regressing one of seven numbers: the object presence confidence c, and up to six object pose parameters p = (x − u, y − v, w, h, cos θ, sin θ), where x, y, w, h have been discussed before and θ is the bounding box rotation.

Hence, for an input image of size H×W, we obtain a grid of H/∆ × W/∆ predictions, one for each image cell of size ∆×∆ pixels. Each predictor is responsible for detecting a word if the word centre falls within the corresponding cell. (For regression, it was found beneficial to normalise the pose parameters as p̄ = ((x − u)/∆, (y − v)/∆, w/W, h/H, cos θ, sin θ).) YOLO is similar but operates at about half this resolution; a denser predictor sampling is important to reduce collisions (multiple words falling in the same cell) and therefore to increase recall (since at most one word can be detected per cell). In practice, for a 224×224 image, we obtain 14×14 cells/predictors.

Multi-scale detection. The limited receptive field of our convolutional filters prohibits detection of large text instances. Hence, we obtain detections at multiple down-scaled versions of the input image and merge them through non-maximal suppression. In more detail, the input image is scaled down by factors {1, 1/2, 1/4, 1/8} (scaling up is overkill as the baseline features are already computed very densely). Then, the resulting detections are combined by suppressing those with a lower score than the score of an overlapping detection.

Training loss. We use a squared loss term for each of the H/∆ × W/∆ × 7 outputs of the CNN, as in YOLO [36]. If a cell does not contain a ground-truth word, the loss ignores all parameters but c (text/no-text).

Comparison with YOLO. Our fully-convolutional regression network (FCRN) has 30× fewer parameters than the YOLO network (which has ∼90% of its parameters in the last two fully-connected layers). Due to its global nature, standard YOLO must be retrained for each image size, including multiple scales, further increasing the model size (while our model requires 44MB, YOLO would require 2GB). This makes YOLO not only harder to train, but also less efficient (2× slower than FCRN).
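One plausible PyTorch rendering of the nine-layer single-scale trunk listed above, with the C-7-5×5 regression head stacked on top, is given below; it reproduces the stated layer sequence and the ∆ = 16 stride, but it is a sketch and not the authors' released model.

```python
import torch
import torch.nn as nn

def conv(c_in, c_out, k):
    # 'CR' block: convolution (stride 1, resolution-preserving padding) + ReLU.
    return nn.Sequential(nn.Conv2d(c_in, c_out, k, stride=1, padding=k // 2),
                         nn.ReLU(inplace=True))

# Trunk: nine conv layers and four 2x2 max-poolings, giving a stride of 16.
fcrn = nn.Sequential(
    conv(3, 64, 5), nn.MaxPool2d(2),
    conv(64, 128, 5), nn.MaxPool2d(2),
    conv(128, 128, 3), conv(128, 128, 3), nn.MaxPool2d(2),
    conv(128, 256, 3), conv(256, 256, 3), nn.MaxPool2d(2),
    conv(256, 512, 3), conv(512, 512, 3), conv(512, 512, 5),
    # Dense regression head (C-7-5x5): confidence c plus the six pose numbers.
    nn.Conv2d(512, 7, 5, stride=1, padding=2),
)

x = torch.randn(1, 3, 512, 512)   # a 512x512 input, as used for training
print(fcrn(x).shape)              # -> torch.Size([1, 7, 32, 32]), i.e. H/16 x W/16 cells
```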

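A minimal sketch of the training loss just described: squared error on all seven outputs in cells containing a ground-truth word, and only the (optionally down-weighted) confidence term elsewhere. The tensor layout and the 0.01 initial weight, taken from the training details in section 4.2, are the only assumptions.

```python
import torch

def fcrn_loss(pred, target, has_word, neg_weight=0.01):
    """pred, target: (B, 7, H, W) grids with channel 0 = confidence c and
    channels 1:7 = normalised pose; has_word: (B, H, W) boolean mask of cells
    whose ground-truth word centre falls inside them."""
    conf_err = (pred[:, 0] - target[:, 0]) ** 2
    pose_err = ((pred[:, 1:] - target[:, 1:]) ** 2).sum(dim=1)
    pos = has_word.float()
    # Full squared loss in positive cells; only the (down-weighted) confidence
    # term in cells that contain no ground-truth word.
    return (pos * (conf_err + pose_err) + (1 - pos) * neg_weight * conf_err).mean()

# Toy usage:
pred = torch.randn(2, 7, 32, 32)
target = torch.zeros(2, 7, 32, 32)
mask = torch.zeros(2, 32, 32, dtype=torch.bool)
print(fcrn_loss(pred, target, mask).item())
```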
Table 2. Comparison with previous methods on text localisation. Precision (P) and Recall (R) at maximum F-measure (F), and the maximum recall (RM), are reported.

PASCAL Eval                        IC11                       IC13                       SVT
                                   F     P     R     RM       F     P     R     RM       F     P     R     RM
Huang [17]                         -     -     -     -        -     -     -     -        -     -     -     -
Jaderberg [20]                     77.2  87.5  69.2  70.6     76.2  86.7  68.0  69.3     53.6  62.8  46.8  55.4
Jaderberg (trained on SynthText)   77.3  89.2  68.4  72.3     76.7  88.9  67.5  71.4     53.6  58.9  49.1  56.1
Neumann [33]                       -     -     -     -        -     -     -     -        -     -     -     -
Neumann [34]                       -     -     -     -        -     -     -     -        -     -     -     -
Zhang [47]                         -     -     -     -        -     -     -     -        -     -     -     -
FCRN single-scale                  60.6  78.8  49.2  49.2     61.0  77.7  48.9  48.9     45.6  50.9  41.2  41.2
FCRN multi-scale                   70.0  78.4  63.2  64.6     69.5  78.1  62.6  67.0     46.2  47.0  45.4  53.0
FCRN + multi-filt                  78.7  95.3  67.0  67.5     78.0  94.8  66.3  66.7     56.3  61.5  51.9  54.1
FCRNall + multi-filt               84.7  94.3  76.9  79.6     84.2  93.8  76.4  79.6     62.4  65.1  59.9  75.0

DetEval                            IC11                IC13                SVT
                                   F     P     R       F     P     R       F     P     R
Huang [17]                         78    88    71      -     -     -       -     -     -
Jaderberg [20]                     76.8  88.2  68.0    76.8  88.5  67.8    24.7  27.7  22.3
Jaderberg (trained on SynthText)   75.5  87.5  66.4    75.5  87.9  66.3    24.7  27.8  22.3
Neumann [33]                       68.7  73.1  64.7    -     -     -       -     -     -
Neumann [34]                       72.3  79.3  66.4    -     -     -       -     -     -
Zhang [47]                         80    84    76      80    88    74      -     -     -
FCRN single-scale                  64.5  81.9  53.2    64.3  81.3  53.1    31.4  34.5  28.9
FCRN multi-scale                   73.0  77.9  68.9    73.4  80.3  67.7    34.5  29.9  40.7
FCRN + multi-filt                  78.0  94.5  66.4    78.0  94.8  66.3    25.5  26.8  24.3
FCRNall + multi-filt               82.3  91.5  74.8    83.0  92.0  75.5    26.7  26.2  27.4
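The table reports precision and recall at the maximum F-measure operating point together with the maximum recall; the small numpy sketch below shows how such a summary can be read off a precision-recall curve, assuming the curve itself comes from an external matcher such as DetEval or a PASCAL-style evaluator.

```python
import numpy as np

def summarise_pr_curve(precision, recall):
    """Given matched precision/recall arrays (one entry per score threshold),
    return (max F-measure, P and R at that point, maximum recall)."""
    precision, recall = np.asarray(precision), np.asarray(recall)
    f = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    i = int(np.argmax(f))
    return f[i], precision[i], recall[i], recall.max()

# Toy curve: sweeping the detection threshold from strict to permissive.
P = [0.95, 0.90, 0.80, 0.60]
R = [0.40, 0.60, 0.75, 0.80]
print(summarise_pr_curve(P, R))   # F peaks at P=0.80, R=0.75 here; R_max = 0.80
```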

4. Evaluation

First, in section 4.1 we describe the text datasets on which we evaluate our model. Next, we evaluate our model on the text localisation task in section 4.2. In section 4.3, to investigate which components of the synthetic data generation pipeline are important, we perform detailed ablation experiments. In section 4.4, we use the results from our localisation model for end-to-end text spotting. We show substantial improvements over the state-of-the-art in both text localisation and end-to-end text spotting. Finally, in section 4.5 we discuss the speed-up gained by using our models for text localisation.

4.1. Datasets

We evaluate our text detection networks on standard benchmarks: the ICDAR 2011 and 2013 datasets [24, 39] and the Street View Text dataset [43]. These datasets are reviewed next and their statistics are given in Table 1.

SynthText in the Wild. This is a dataset of 800,000 training images generated using our synthetic engine from section 2. Each image has about ten word instances annotated with character and word-level bounding boxes.

ICDAR Datasets. The ICDAR datasets (IC11, IC13) are obtained from the Robust Reading Challenges held in 2011 and 2013 respectively. They contain real-world images of text on sign boards, books, posters and other objects with word-level axis-aligned bounding box annotations. The datasets largely contain the same images, but shuffle the test and training splits. We do not evaluate on the more recent ICDAR 2015 dataset as it is almost identical to the 2013 dataset.

Street View Text. This dataset, abbreviated SVT, consists of images harvested from Google Street View annotated with word-level axis-aligned bounding boxes. SVT is more challenging than the ICDAR data as it contains smaller and lower-resolution text. Furthermore, not all instances of text are annotated. In practice, this means that precision is heavily underestimated in evaluation. Lexicons consisting of 50 distractor words along with the ground-truth words are provided for each image; we refer to testing on SVT with these lexicons as SVT-50.

4.2. Text Localisation Experiments

We evaluate our detection networks to (1) compare the performance when applied to single-scale and multiple down-scaled versions of the image, and (2) improve upon the state-of-the-art results in text detection when used as high-quality proposals.

Training. FCRN is trained on 800,000 images from our SynthText in the Wild dataset. Each image is resized to 512×512 pixels. We optimise using SGD with momentum and batch-normalisation [18] after every convolutional layer (except the last one). We use mini-batches of 16 images each, set the momentum to 0.9, and use a weight decay of 5^−4. The learning rate is set to 10^−4 initially and is reduced to 10^−5 when the training loss plateaus.

As only a small number (1–2%) of grid cells contain text, we initially weigh down the non-text probability error terms by multiplying them with 0.01; this weight is gradually increased to 1 as the training progresses. Due to the class imbalance, all the probability scores collapse to zero if such a weighting scheme is not used.

Inference. We obtain the class probabilities and bounding-box predictions from our FCRN model. The predictions are filtered by thresholding the class probabilities (at a threshold t). Finally, multiple detections from nearby cells are suppressed using non-maximal suppression, whereby amongst two overlapping detections the one with the lower probability is suppressed. In the following we first give results for a conservative threshold of t = 0.3, for higher precision, and then relax this to t = 0.0 (i.e., all proposals accepted) for higher recall.

Evaluation protocol. We report text detection performance using two protocols commonly used in the literature — (1) DetEval [45], popularly used in ICDAR competitions for evaluating localisation methods, and (2) the PASCAL VOC style intersection-over-union overlap method (≥ 0.5 IoU for a positive detection).
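The PASCAL-style criterion above reduces to a simple box-overlap test; a sketch of the intersection-over-union computation, assuming boxes given as (x1, y1, x2, y2) corners, is shown below.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# A detection counts as positive when IoU with a ground-truth box is >= 0.5.
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))          # ~0.33 -> not a positive match
print(iou((0, 0, 10, 10), (1, 0, 11, 10)) >= 0.5)   # True
```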

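Building on the same overlap test, the greedy non-maximal suppression used at inference time can be sketched as follows: detections pooled over cells (and scales) are thresholded at t, sorted by confidence, and any box overlapping an already-kept box is dropped; the 0.5 overlap threshold is an assumption.

```python
def iou(a, b):
    # Same helper as in the previous sketch.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def non_max_suppress(detections, t=0.3, overlap=0.5):
    """detections: list of (score, box) pairs pooled over cells and scales.
    Keep boxes scoring at least t; greedily drop any box that overlaps an
    already-kept, higher-scoring box by IoU >= `overlap`."""
    kept = []
    for score, box in sorted(detections, key=lambda d: d[0], reverse=True):
        if score < t:
            break
        if all(iou(box, kb) < overlap for _, kb in kept):
            kept.append((score, box))
    return kept

dets = [(0.9, (0, 0, 100, 30)), (0.8, (5, 2, 104, 31)), (0.7, (200, 50, 260, 80))]
print(non_max_suppress(dets))   # the second box is suppressed by the first
```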
Figure 5. Precision-Recall curves for various text detection methods on IC13. The methods are: (1) multi-scale application of FCRN ("FCRN-multi": F 69.5, AP 60.4, Rmax 67.0); (2) the original curve of Jaderberg et al. [20] (F 76.2, AP 67.1, Rmax 69.3); (3) Jaderberg et al. [20] retrained on the SynthText in the Wild dataset (F 76.7, AP 68.5, Rmax 71.4); and (4) the "FCRNall + multi-filt" method (F 84.2, AP 78.2, Rmax 79.6). Maximum F-score (F), Average Precision (AP) and maximum Recall (Rmax) are also given. The gray curve at the bottom is of multi-scale detections from our FCRN network (max. recall = 85.9%), which is fed into the multi-filtering post-processing to get the refined "FCRNall + multi-filt" detections.

Figure 6. Precision-Recall curves for text localisation on the SVT dataset using the model "FCRNall+multi-filt" when trained on increasingly sophisticated training sets (section 4.3): random placement (F 60.3, AP 50.6, Rmax 68.2), colour/texture regions (F 61.9, AP 53.7, Rmax 75.0), and perspective distortion + regions (F 62.4, AP 54.5, Rmax 75.0).

Single & multi-scale detection. The "FCRN single-scale" entry in Table 2 shows the performance of our FCRN model on the test datasets. The precision at maximum F-measure of single-scale FCRN is comparable to the methods of Neumann et al. [33, 34], while the recall is significantly worse, by 12%.

The "FCRN multi-scale" entry in Table 2 shows the performance of the multi-scale application of our network. This method improves maximum recall by more than 12% over the single-scale method and outperforms the methods of Neumann et al.

Post-processing proposals. Current end-to-end text spotting (detection and recognition) methods [1, 20, 44] boost performance by combining detection with text recognition. To further improve FCRN detections, we use the multi-scale detections from FCRN as proposals and refine them using the post-processing stages of Jaderberg et al. [20]. There are three stages: first, filtering using a binary text/no-text random-forest classifier; second, regressing an improved bounding box using a CNN; and third, recognition-based NMS, where the word images are recognised using a large fixed-lexicon based CNN and the detections are merged through non-maximal suppression based on word identities. Details are given in [20]. We use code provided by the authors for fair comparison.

We test this in two modes — (1) low-recall: where only high-scoring (probability > 0.3) multi-scale FCRN detections are used (the threshold previously used in the single- and multi-scale inference); this typically yields fewer than 30 proposals. And (2) high-recall: where all the multi-scale FCRN detections (typically about a thousand in number) are used. The performance of these methods on text detection is shown by the entries named "FCRN + multi-filt" and "FCRNall + multi-filt" respectively in Table 2. Note that the low-recall method achieves better than state-of-the-art performance on text detection, whereas the high-recall method significantly improves the state-of-the-art, with an improvement of 6% in the F-measure for all the datasets.

Figure 5 shows the Precision-Recall curves for text detection on the IC13 dataset. Note the high recall (85.9%) of the multi-scale detections output from FCRN before refinement using the multi-filtering post-processing. Also note the drastic increase in maximum recall (+10.3%) and in Average Precision (+11.1%) for "FCRNall + multi-filt" as compared to Jaderberg et al.

Further, to establish that the improvement in text detection is due to the new detection model, and not merely due to the large size of our synthetic dataset, we trained Jaderberg et al.'s method on our SynthText in the Wild dataset, in particular the ACF component of their region proposal stage (their other region proposal method, EdgeBoxes, was not re-trained, as it is learnt from low-level edge features from the Berkeley Segmentation Dataset, which is not text specific). Figure 5 and Table 2 show that, even with 10× more (synthetic) training data, Jaderberg et al.'s model improves only marginally (+0.8% in AP, +2.1% in maximum recall).

A common failure mode is text in unusual fonts which are not present in the training set. The detector is also confused by symbols or patterns of constant stroke width which look like text, for example road signs, stick figures etc. Since the detector does not scale the image up, extremely small text instances are not detected. Finally, words get broken into multiple instances or merged into one instance due to large or small spacing between the characters.
4.3. Synthetic Dataset Evaluation

We investigate the contribution that the various stages of the synthetic text-scene data generation pipeline bring to localisation accuracy. We generate three synthetic training datasets with increasing levels of sophistication, where the text is (1) placed at random positions within the image, (2) restricted to the local colour and texture boundaries, and (3) distorted perspectively to match the local scene depth (while also respecting the local colour and texture boundaries as in (2) above). All other aspects of the datasets were kept the same — e.g. the text lexicon, background images, and colour distribution.

Figure 6 shows the localisation results on the SVT dataset of our method "FCRNall+multi-filt". Compared to random placement, restricting text to the local colour and texture regions significantly increases the maximum recall (+6.8%), AP (+3.85%), and the maximum F-measure (+2.1%). Marginal improvements are seen with the addition of perspective distortion: +0.75% in AP, +0.55% in maximum F-measure, and no change in the maximum recall. This is likely due to the fact that most text instances in the SVT dataset are in a fronto-parallel orientation. Similar trends are observed with the ICDAR 2013 dataset, but with more contained differences, probably because ICDAR's text instances are much simpler than SVT's and benefit less from the more advanced datasets.

4.4. End-to-End Text Spotting

Text spotting is limited by the detection stage, as state-of-the-art cropped word image recognition accuracy is over 98% [19]. We utilise our improvements in text localisation to obtain state-of-the-art results in text spotting.

Evaluation protocol. Unless otherwise stated, we follow the standard evaluation protocol of Wang et al. [42], where all words that are either less than three characters long or contain non-alphanumeric characters are ignored. An overlap (IoU) of at least 0.5 is required for a positive detection.

Table 3 shows the results on the end-to-end text spotting task using the "FCRN + multi-filt" and "FCRNall + multi-filt" methods. For recognition we use the output of the intermediary recognition stage of the pipeline, based on the lexicon-encoding CNN of Jaderberg et al. [19]. We improve upon previously reported results (F-measure): +8% on the ICDAR datasets, and +3% on the SVT dataset. Given the high recall of our method (as noted before in Figure 5), the fact that many text instances are unlabelled in SVT causes precision to drop; hence, we see smaller gains on SVT and do worse on SVT-50.

Table 3. Comparison with previous methods on end-to-end text spotting. Maximum F-measure (%) is reported. IC11* is evaluated according to the protocol described in [34]. Numbers in parentheses are obtained if words containing non-alphanumeric characters are not ignored — SVT does not have any of these.

Model                  IC11          IC11*         IC13          SVT    SVT-50
Wang [42]              -             -             -             -      38
Wang & Wu [44]         -             -             -             -      46
Alsharif [1]           -             -             -             -      48
Neumann [34]           -             45.2          -             -      -
Jaderberg [21]         -             -             -             -      56
Jaderberg [20]         76            69            76            53     76
FCRN + multi-filt      80.5 (77.8)   75.8 (73.5)   80.3 (77.8)   54.7   68.0
FCRNall + multi-filt   84.3 (81.2)   81.0 (78.4)   84.7 (81.8)   55.7   67.7

4.5. Timings

At test time FCRN can process 20 images per second (of size 512×512 px) at a single scale and about 15 images per second when run at multiple scales (1, 1/2, 1/4, 1/8) on a GPU. When used as high-quality proposals in the text localisation pipeline of Jaderberg et al. [20], it replaces the region proposal stage, which typically takes about 3 seconds per image. Hence, we gain a speed-up of about 45 times in the region proposal stage. Further, the "FCRN + multi-filt" method, which uses only the high-scoring detections from multi-scale FCRN and achieves state-of-the-art results in detection and end-to-end text spotting, cuts down the number of proposals in the later stages of the pipeline by a factor of 10: the region proposal stage of Jaderberg et al. proposes about 2000 boxes, which are quickly filtered using a random-forest classifier to a manageable set of about 200 proposals, whereas the high-scoring detections from multi-scale FCRN typically number fewer than 30. Table 4 compares the time taken for end-to-end text spotting; our method is between 3× and 23× faster than Jaderberg et al.'s, depending on the variant.

Table 4. Comparison of end-to-end text-spotting time (in seconds).

Method                 Total time   Region proposal   Proposal filtering & BB-regression   Recognition
FCRN + multi-filt      0.30         0.07              0.03                                  0.20
FCRNall + multi-filt   2.47         0.07              1.20                                  1.20
Jaderberg et al.       7.00         3.00              3.00                                  1.00

5. Conclusion

We have developed a new CNN architecture for generating text proposals in images. It would not have been possible to train this architecture on the available annotated datasets, as they contain far too few samples, but we have shown that training images of sufficient verisimilitude can be generated synthetically, and that the CNN trained only on these images exceeds the state-of-the-art performance for both detection and end-to-end text spotting on real images.

Acknowledgements. We are grateful for comments from Jiri Matas. Financial support was provided by the UK EPSRC CDT in Autonomous Intelligent Machines and Systems Grant EP/L015987/2, EPSRC Programme Grant Seebibyte EP/M013774/1, and the Clarendon Fund scholarship.
References

[1] O. Alsharif and J. Pineau. End-to-end text recognition with hybrid HMM maxout models. ArXiv e-prints, Oct 2013.
[2] P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik. Contour detection and hierarchical image segmentation. IEEE PAMI, 33:898–916, 2011.
[3] P. Arbelaez, J. Pont-Tuset, J. Barron, F. Marques, and J. Malik. Multiscale combinatorial grouping. In Proc. CVPR, 2014.
[4] D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black. A naturalistic open source movie for optical flow evaluation. In Proc. ECCV, 2014.
[5] A. Criminisi, I. D. Reid, and A. Zisserman. Single view metrology. In Proc. ICCV, pages 434–442, 1999.
[6] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In Proc. CVPR, volume 2, pages 886–893, 2005.
[7] P. Dollar, R. Appel, and S. Belongie. Fast feature pyramids for object detection. IEEE PAMI, 36(8):1532–1545, 2014.
[8] A. Dosovitskiy and T. Brox. Inverting visual representations with convolutional networks. In Proc. CVPR, 2016. To appear.
[9] A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, P. Smagt, D. Cremers, and T. Brox. FlowNet: Learning optical flow with convolutional networks. In Proc. ICCV, 2015.
[10] M. A. Fischler and R. C. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Comm. ACM, 24(6):381–395, 1981.
[11] R. B. Girshick. Fast R-CNN. In Proc. ICCV, 2015.
[12] R. B. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proc. CVPR, 2014.
[13] I. J. Goodfellow, Y. Bulatov, J. Ibarz, S. Arnoud, and V. Shet. Multi-digit number recognition from street view imagery using deep convolutional neural networks. In Proc. ICLR, 2014.
[14] D. Hoiem, A. A. Efros, and M. Hebert. Automatic photo pop-up. In Proc. ACM SIGGRAPH, 2005.
[15] D. Hoiem, A. A. Efros, and M. Hebert. Geometric context from a single image. In Proc. ICCV, 2005.
[16] P. V. C. Hough. Method and means for recognizing complex patterns. US Patent 3,069,654, 1962.
[17] W. Huang, Y. Qiao, and X. Tang. Robust scene text detection with convolution neural network induced MSER trees. In Proc. ECCV, 2014.
[18] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proc. ICML, 2015.
[19] M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman. Synthetic data and artificial neural networks for natural scene text recognition. In Workshop on Deep Learning, NIPS, 2014.
[20] M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman. Reading text in the wild with convolutional neural networks. IJCV, 2015.
[21] M. Jaderberg, A. Vedaldi, and A. Zisserman. Deep features for text spotting. In Proc. ECCV, 2014.
[22] A. Janoch, S. Karayev, Y. Jia, J. T. Barron, M. Fritz, K. Saenko, and T. Darrell. A category-level 3-D object dataset: Putting the Kinect to work. In ICCV Workshop on Consumer Depth Cameras in Computer Vision, 2011.
[23] D. Karatzas, L. Gomez-Bigorda, A. Nicolaou, S. Ghosh, A. Bagdanov, M. Iwamura, J. Matas, L. Neumann, V. R. Chandrasekhar, S. Lu, et al. ICDAR 2015 robust reading competition. In Proc. ICDAR, pages 1156–1160, 2015.
[24] D. Karatzas, F. Shafait, S. Uchida, M. Iwamura, S. R. Mestre, J. Mas, D. F. Mota, J. A. Almazan, L. P. de las Heras, et al. ICDAR 2013 robust reading competition. In Proc. ICDAR, pages 1484–1493, 2013.
[25] K. Karsch, V. Hedau, D. Forsyth, and D. Hoiem. Rendering synthetic objects into legacy photographs. ACM Transactions on Graphics, 30(6):157, 2011.
[26] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, pages 1106–1114, 2012.
[27] K. Lang and T. Mitchell. Newsgroup 20 dataset, 1999.
[28] B. Leibe, A. Leonardis, and B. Schiele. Combined object categorization and segmentation with an implicit shape model. In Workshop on Statistical Learning in Computer Vision, ECCV, May 2004.
[29] K. Lenc and A. Vedaldi. R-CNN minus R. In Proc. BMVC, 2015.
[30] F. Liu, C. Shen, and G. Lin. Deep convolutional neural fields for depth estimation from a single image. In Proc. CVPR, 2015.
[31] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proc. CVPR, 2015.
[32] A. Mishra, K. Alahari, and C. Jawahar. Scene text recognition using higher order language priors. In Proc. BMVC, 2012.
[33] L. Neumann and J. Matas. Real-time scene text localization and recognition. In Proc. CVPR, volume 3, pages 1187–1190, 2012.
[34] L. Neumann and J. Matas. Scene text localization and recognition with oriented stroke detection. In Proc. ICCV, pages 97–104, December 2013.
[35] P. Perez, M. Gangnet, and A. Blake. Poisson image editing. ACM Transactions on Graphics, 22(3):313–318, 2003.
[36] J. Redmon, S. K. Divvala, R. B. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In Proc. CVPR, 2016. To appear.
[37] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2016.
[38] A. Saxena, M. Sun, and A. Y. Ng. Make3D: Learning 3D scene structure from a single still image. IEEE PAMI, 31(5):824–840, 2009.
[39] A. Shahab, F. Shafait, and A. Dengel. ICDAR 2011 robust reading competition challenge 2: Reading text in scene images. In Proc. ICDAR, pages 1491–1496, 2011.
[40] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus. Indoor segmentation and support inference from RGBD images. In Proc. ECCV, 2012.
[41] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In Proc. ICLR, 2015.
[42] K. Wang, B. Babenko, and S. Belongie. End-to-end scene text recognition. In Proc. ICCV, pages 1457–1464, 2011.
[43] K. Wang and S. Belongie. Word spotting in the wild. In Proc. ECCV, 2010.
[44] T. Wang, D. J. Wu, A. Coates, and A. Y. Ng. End-to-end text recognition with convolutional neural networks. In Proc. ICPR, pages 3304–3308, 2012.
[45] C. Wolf and J. M. Jolion. Object count/area graphs for the evaluation of object detection and segmentation algorithms. International Journal on Document Analysis and Recognition, 8(4):280–296, 2006.
[46] I. Yildirim, T. D. Kulkarni, W. A. Freiwald, and J. B. Tenenbaum. Efficient and robust analysis-by-synthesis in vision: A computational framework, behavioral tests, and modeling neuronal representations. In Annual Conference of the Cognitive Science Society, 2015.
[47] Z. Zhang, W. Shen, C. Yao, and X. Bai. Symmetry-based text line detection in natural scenes. In Proc. CVPR, 2015.
[48] C. L. Zitnick and P. Dollar. Edge boxes: Locating object proposals from edges. In Proc. ECCV, pages 391–405, 2014.

