
Video Text Detection and Recognition: Dataset and Benchmark

Phuc Xuan Nguyen Kai Wang Serge Belongie


Department of Computer Science and Engineering, University of California, San Diego; Google Inc.; Cornell NYC Tech, Cornell University

Abstract

This paper focuses on the problem of text detection and recognition in videos. Even though text detection and recognition in images has seen much progress in recent years, relatively little work has been done to extend these solutions to the video domain. In this work, we extend an existing end-to-end solution for text recognition in natural images to video. We explore a variety of methods for training local character models and explore methods to capitalize on the temporal redundancy of text in video. We present detection performance using the Video Analysis and Content Extraction (VACE) benchmarking framework on the ICDAR 2013 Robust Reading Challenge 3 video dataset and on a new video text dataset. We also propose a new performance metric based on precision-recall curves to measure the performance of text recognition in videos. Using this metric, we provide early video text recognition results on the above mentioned datasets.

Figure 1: Example frames in the YouTube Video Text dataset. The green bounding boxes indicate the annotated locations of text regions. Images in the top row are examples of scene text while the bottom rows show examples of overlay text.

1. Introduction

Text detection and recognition in unconstrained environments is a challenging computer vision problem. Such functionality can play a valuable role in numerous real-world applications, ranging from video indexing and assistive technology for the visually impaired to automatic localization of businesses and robotic navigation. In recent years, the problem of scene text detection and recognition in natural images has received increasing attention from the computer vision community [1, 2, 21, 20, 18, 17, 5]. As a result, the domain has enjoyed significant advances on an increasing number of public scene text benchmarks [12, 4, 22, 21, 13, 10].

Even though the amount of video data available is rapidly increasing due to the extensive use of camera phones, wearable cameras (e.g., Google Glass and GoPro), and video-sharing websites such as YouTube, relatively little work has been done on extending text reading solutions to the video domain. The ICDAR 2013 Robust Reading Challenge 3 [7] introduced the first public video dataset for the purpose of providing a common benchmark for video text detection. We refer to this dataset as ICDAR-VIDEO for the remainder of the paper. Apart from the ABBYY baseline provided by the organizers, TextSpotter [15, 16] was the only participant in this challenge. To the best of our knowledge, both of these systems were originally designed for the task of text detection and recognition in images and were tweaked only slightly to fit the competition format. Furthermore, while the ICDAR-VIDEO dataset provides both position and string annotations for the words within each frame, only detection results were reported.

In this work, we focus on the problems of detection and recognition of text in videos. Figure 1 shows example video frames with text content. Similar to its image-domain counterparts [18, 20], this paper focuses on the problem of text detection and recognition in videos given a list of words (i.e., a lexicon). In many applications, the lexicon availability assumption is reasonable. Consider the problem of assisting a blind person shopping at a grocery store: in this case the lexicon would be a list of products in the store or within a given aisle.
Figure 2: An overview of our system. Starting from the left, raw input frames are fed independently to the character detection
module. This module returns a list of bounding boxes describing locations, character classes, and detection scores. These
character bounding boxes are temporally smoothed to remove false positives. Next, we perform word detection using Pictorial
Structures, rescore each word using global features, and perform non-maximum suppression. Word detections in each frame
are then passed through another round of temporal smoothing to remove false positive words. Finally, we link the per-frame
detections into tracks.

Contributions Our contributions are three-fold. (1) We extend Wang's solution [20] to the video domain by employing a different character detection method and including multiple modules that exploit temporal redundancy to improve the performance of the system. (2) We propose a precision/recall metric suited to the task of text recognition in video. (3) Finally, we introduce a new video text dataset.

2. System Overview

Our approach is based on an extension of Wang's solution [20], as the components of this pipeline are simple yet effective. The source code is publicly available at http://vision.ucsd.edu/~kai/grocr/. Figure 2 shows the complete overview of our video text detection and recognition system. We describe each module in detail in the following sections.

2.1. Character Detection

The first step in our pipeline is to detect characters given an image frame. We perform multi-scale character detection via scanning-window templates trained with a linear SVM and Histogram of Oriented Gradients (HOG) [3] features. For each character, we train a binary classifier with five rounds of hard negative mining. The initial round of training uses the character samples of the target class as positive training data and the other classes as negative training data. In subsequent rounds, we mine hard negative patches by running the previously trained model on images from the Flickr dataset [6] and add top-scoring detections to the negative sets. To ensure the quality of the negative pool, we manually examined each image in the Flickr dataset and removed ones containing text.
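To make the training loop concrete, the sketch below shows one way to implement the rounds of hard negative mining with HOG features (via scikit-image) and a linear SVM (via scikit-learn). The helper `sample_patches` (a dense patch sampler over background images) and the default parameters are our own assumptions, not the released implementation.

```python
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

def hog_feature(patch):
    # `patch` is a grayscale character crop resized to a canonical size.
    return hog(patch, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2))

def mine_hard_negatives(clf, background_images, sample_patches, top_k=1000):
    """Score densely sampled background patches and keep the highest-scoring
    (i.e., hardest) false positives as new negative examples."""
    scored = []
    for img in background_images:
        for patch in sample_patches(img):          # hypothetical dense sampler
            scored.append((clf.decision_function([hog_feature(patch)])[0], patch))
    scored.sort(key=lambda s: s[0], reverse=True)
    return [hog_feature(p) for _, p in scored[:top_k]]

def train_character_model(pos_patches, neg_patches, background_images, sample_patches, rounds=5):
    # Initial negatives: samples of the other character classes (as in the text).
    X_pos = [hog_feature(p) for p in pos_patches]
    X_neg = [hog_feature(p) for p in neg_patches]
    for _ in range(rounds):
        X = np.vstack(X_pos + X_neg)
        y = np.concatenate([np.ones(len(X_pos)), np.zeros(len(X_neg))])
        clf = LinearSVC(C=1.0).fit(X, y)
        X_neg.extend(mine_hard_negatives(clf, background_images, sample_patches))
    return clf
```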
Pre-processing Character Samples Character samples from public text datasets sometimes do not have tight bounding boxes and, in some cases, have the wrong labels. We manually go through each character image, remove samples with wrong labels, and recrop the images for a tight fit with the characters. Enforcing a tight bounding box gives the training data more consistent alignment and, hence, produces better templates.

Mixture Models We observe that there is great intra-class variability within each character class. Different "prototypes" of a letter can have very different appearances. Zhu et al. [24] suggest that we can grow the model complexity by adding mixture models to capture the "sub-category" structure of the data. The first step of training mixture models involves clustering the existing data into K different clusters. A classifier is then trained separately for each cluster. This technique often requires a large amount of training data, as each cluster requires sufficient positive examples to train a good model. Fortunately, in recent years, many text datasets with character-level annotations have been published [4, 12, 13, 21]. Combining the character samples from these datasets gives a large set for training mixture models. For each character, we split the samples into 10 clusters using K-means. Similar to Zhu's [24] experiments, we append the aspect ratio to the HOG features and use PCA to reduce the feature dimensionality prior to clustering.
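The clustering step for one character class might look like the following, using scikit-learn's K-means and PCA; the feature dimensionality after PCA is not specified in the text, so the value below is a placeholder.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def cluster_character_samples(hog_features, aspect_ratios, n_clusters=10, n_components=64):
    """Split one character class into sub-categories for mixture-model training.

    hog_features: (n_samples, d) array of HOG descriptors.
    aspect_ratios: length-n_samples sequence of width/height ratios.
    """
    # Append the aspect ratio to each HOG descriptor, as in Zhu et al. [24].
    feats = np.hstack([hog_features, np.asarray(aspect_ratios).reshape(-1, 1)])
    # Reduce dimensionality before clustering (n_components is a placeholder).
    reduced = PCA(n_components=n_components).fit_transform(feats)
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(reduced)
    # One mixture component (classifier) is then trained per cluster label.
    return labels
```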
2.2. Pictorial Structures and Word Rescoring

We follow Wang's Pictorial Structures [20] approach for constructing the word detections from character bounding boxes. We also use the same word rescoring technique, but construct the training set differently. To construct the training set for the rescoring SVM, we randomly select 100 frames that contain text from each video in the training set. We run the system on this set and label each returned word positive if matched with a ground truth and negative otherwise.

2.3. Exploiting Temporal Redundancy

Lienhart [11] suggests in his survey that one can exploit temporal redundancy in video to remove false positives in individual frames and recover missed detections. In the following sections, we describe three modules that leverage temporal properties to improve the performance further: the temporal smoothing module, the linker, and the post-processing module.

Temporal Smoothing Due to the local nature of scanning-window approaches, false positives are often pervasive, especially in a system that favors recall over precision. This problem was also observed in [14]. Removing character false positives at this step is crucial as it reduces the search space for the word detection step.

Given a bounding box detection at a given frame, we compute the overlap ratio between it and all detections in the preceding and following N frames. If a sufficient number of the neighboring frames contain detections that satisfy the overlap condition with the target frame, we keep the detection in the current frame. Otherwise we discard it as a false positive. More formally, let b_i^f be the i-th bounding box in frame f, |b_i^f| be the number of pixels in this region, D_t be the number of detections at time t, and α be a real number between 0 and 1, and define

I(b_i^f, t) = \begin{cases} 1 & \text{if } \bigvee_{j=1}^{D_t} \left[ \dfrac{|b_i^f \cap b_j^{f-t}|}{|b_i^f \cup b_j^{f-t}|} > \alpha \right] \\ 0 & \text{otherwise.} \end{cases} \quad (1)

Next, we define a function H that takes the i-th bounding box at frame f, b_i^f, and returns 1 for a true positive or 0 for a false positive. The function is defined as

H(b_i^f) = \mathbb{1}\!\left[ \left( \sum_{j=1}^{N} I(b_i^f, j) \ge \beta N \right) \vee \left( \sum_{k=1}^{N} I(b_i^f, -k) \ge \beta N \right) \right] \quad (2)

where β is a real number between 0 and 1.

This module has three hyperparameters: (1) N controls the number of frames to search forward and backward, (2) α specifies the overlap ratio condition, and (3) β controls the fraction of neighboring frames needed to decide whether the current detection is a true positive. We perform a grid search on a validation set to tune these parameters.
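A minimal sketch of Equations (1) and (2) is given below, assuming detections are stored per frame as (x1, y1, x2, y2) boxes in a dictionary keyed by frame index; the default values of N, α, and β are placeholders rather than the settings found by the grid search.

```python
def iou(a, b):
    # Intersection-over-union of two boxes given as (x1, y1, x2, y2).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def I(box, f, t, detections, alpha):
    # Equation (1): 1 if any detection in frame f - t overlaps `box` by more than alpha.
    return int(any(iou(box, other) > alpha for other in detections.get(f - t, [])))

def keep_detection(box, f, detections, N=5, alpha=0.5, beta=0.6):
    # Equation (2): keep `box` if enough of the N preceding or N following frames agree.
    backward = sum(I(box, f, t, detections, alpha) for t in range(1, N + 1))
    forward = sum(I(box, f, -t, detections, alpha) for t in range(1, N + 1))
    return backward >= beta * N or forward >= beta * N
```

Boxes that fail this test are discarded before word detection; N, alpha, and beta correspond to the three hyperparameters tuned by grid search.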
Linker Once the results are temporally smoothed to remove false positives, we proceed to link the per-frame detections into tracks. For each detection in frame t, we search backwards in a buffer of 10 prior frames. We consider the following features:

• the overlap ratio between the current bounding box and the candidate in the prior frame,

• the edit distance between the word and the candidate word,

• the temporal distance between the current detection and the candidate, measured in number of frames.

We train a linear classifier to determine whether a bounding box detection in a previous frame is a match for the current bounding box. We only allow one match per previous frame. If there is more than one, we choose the one with the highest score. We assign the current bounding box the identifier that is present in the majority of matched frames.

To establish a training set, we consider each detection in each frame and the set of candidate bounding boxes in the previous 10 frames. We label the bounding boxes with the same track identifier as positives and the rest as negatives. We feed these labels and the computed features to a standard SVM package (http://www.csie.ntu.edu.tw/~cjlin/liblinear/).
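The matching logic can be sketched as follows, assuming each detection carries a box, a recognized word, and a track identifier, and reusing the `iou` helper from the temporal smoothing sketch above; the `classifier` is any trained linear model exposing a `decision_function` (e.g., LIBLINEAR wrapped by scikit-learn), and the data layout is our assumption rather than the released implementation.

```python
def edit_distance(a, b):
    # Standard Levenshtein distance via a single-row dynamic program.
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[len(b)]

def link_features(det, cand, dt):
    # Features from Section 2.3: spatial overlap, word edit distance, temporal gap.
    return [iou(det["box"], cand["box"]), edit_distance(det["word"], cand["word"]), float(dt)]

def assign_track_id(det, frame, tracks_by_frame, classifier, buffer_size=10):
    votes = []
    for dt in range(1, buffer_size + 1):
        candidates = tracks_by_frame.get(frame - dt, [])
        scored = [(classifier.decision_function([link_features(det, c, dt)])[0], c)
                  for c in candidates]
        matches = [(s, c) for s, c in scored if s > 0]
        if matches:                                   # at most one match per previous frame
            votes.append(max(matches, key=lambda m: m[0])[1]["track_id"])
    if votes:
        return max(set(votes), key=votes.count)       # majority identifier among matched frames
    return None                                       # assumption: no match starts a new track
```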
Post Processing Due to motion blur and other artifacts, detections in some frames within a track might be missing, causing temporal fragmentation. To mitigate this effect, we linearly interpolate both the word scores and the word bounding boxes between the detected frames to recover missing detections. Finally, we remove all tracks with fewer than 10 frames.
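A possible implementation of this gap filling is sketched below, assuming a track is a dictionary mapping frame indices to (box, score) pairs; the 10-frame minimum follows the text, while the rest of the structure is ours.

```python
def fill_track_gaps(track):
    """Linearly interpolate boxes and scores between the detected frames of one track."""
    frames = sorted(track)
    filled = dict(track)
    for f0, f1 in zip(frames, frames[1:]):
        (box0, s0), (box1, s1) = track[f0], track[f1]
        for f in range(f0 + 1, f1):
            w = (f - f0) / (f1 - f0)                      # interpolation weight
            box = tuple((1 - w) * a + w * b for a, b in zip(box0, box1))
            filled[f] = (box, (1 - w) * s0 + w * s1)
    return filled

def postprocess(tracks, min_length=10):
    # Interpolate every track, then drop tracks spanning fewer than min_length frames.
    filled = [fill_track_gaps(t) for t in tracks]
    return [t for t in filled if len(t) >= min_length]
```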
3. Experiments

3.1. Dataset

ICDAR-VIDEO In the ICDAR 2013 Robust Reading Competition Challenge 3 [7], a new video dataset was presented in an effort to address the problem of text detection in videos. The dataset consists of 28 videos in total: 13 videos for the training set and 15 for the test set. The scenarios in the videos include walking outdoors, shopping in grocery stores, driving, and searching for directions within a building. Each video is around 10 seconds to 1 minute long, capturing scenes from real-life situations using different types of cameras.

To construct the lexicon for a video, we extract all ground truth words in the dataset to form a vocabulary. We then assign a lexicon to each video by taking the union of its ground truth words and a random subset of 500 "distractor" words sampled from the vocabulary.
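In code, this lexicon construction amounts to something like the sketch below, where `video_gt_words` and `vocabulary` are assumed to be precomputed from the annotations; whether distractors are drawn with a fixed seed or may coincide with the video's own ground truth words is not specified, so those details are guesses.

```python
import random

def build_lexicon(video_gt_words, vocabulary, num_distractors=500, seed=0):
    """Per-video lexicon: the video's ground truth words plus 500 random distractors."""
    rng = random.Random(seed)                          # fixed seed is an assumption
    distractors = rng.sample(sorted(vocabulary), num_distractors)
    return set(video_gt_words) | set(distractors)
```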
In the ICDAR-VIDEO dataset, the annotators assigned each word a quality label, which can be "LOW", "MEDIUM", or "HIGH". During the competition, the "LOW" quality was not considered. More specifically, misses on the low quality words did not penalize the system and detections of the low quality words did not improve the score. For the sake of simplicity of the evaluation framework, we decided to include words of all qualities in our evaluation.

YouTube Video Text We introduce the YouTube Video Text (YVT) dataset (available at http://vision.ucsd.edu/datasets), harvested from YouTube. The text content in the dataset can be divided into two categories, overlay text (e.g., captions, song titles, logos) and scene text (e.g., street signs, business signs, words on shirts). Figure 1 shows examples of text content in the YVT dataset.

We downloaded a large number of videos by crawling YouTube. We split each downloaded video into 15-second segments and created Amazon Mechanical Turk tasks to filter out segments without text content. For each segment that contained text, we annotated the locations of text regions using VATIC [19]. We instructed the annotators to draw tight bounding boxes only for readable English words. Next, we asked the annotators to go through each frame in the ground truth track and type the word contained in the bounding box.

The dataset contains a total of 30 videos, 15 in the training set and 15 in the testing set. Each video has HD 720p quality, 30 frames per second, and 15-second duration. We constructed the lexicon in the same way as for the ICDAR-VIDEO dataset.

3.2. Performance Metrics

There are many evaluation frameworks proposed for multiple object tracking systems [8, 9]. In this paper, we use the ATA metric from the VACE framework [8] to measure the performance of the detection systems. We also propose a video precision-recall metric to measure the recognition performance.
Once the one-to-one mapping is established, the matched
detections are considered true positives, the unmatched de-
VACE Metrics The Average Tracking Accuracy (ATA) in
tection tracks are false positives and the unmatched ground
the VACE framework provides a spatio-temporal detection
truth tracks are false negatives. We then use the conven-
measure that penalizes fragmentations both in temporal and
tional definitions of precision, and recall to evaluate the per-
spatial dimensions.
formance of the detections.
For every frame t, a text tracking system outputs a
set of bounding box detections {dt1 , . . . , dtn }. We denote 3.3. Character Detection
the ground truth at frame t as {g1t , . . . , gnt }. Using their
unique identifiers, we can group these per-frame detections We begin with the evaluations of 4 different charac-
and ground truth into separate tracks, {D1 , . . . , Dr } and ter detection models. Wang et al. [20] synthesizes char-
{G1 , . . . , Gq }. acters with different fonts, backgrounds, rotation angles,
and noises. (1) We train character models (SYNTH) from
3 http://vision.ucsd.edu/datasetsAll these synthetic data. (2) The second set of models is
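Given the one-to-one track mapping (assumed to be computed following [8]), Equations (3) and (4) can be evaluated as below; tracks are dictionaries mapping frame indices to boxes, and the `iou` helper from the earlier temporal smoothing sketch is reused.

```python
def stda(matched_pairs, overlap_threshold=0.5):
    """Equation (3): sum over matched (ground truth, detection) track pairs."""
    total = 0.0
    for gt, det in matched_pairs:
        frames = set(gt) | set(det)            # frames where either track exists
        hits = sum(1 for f in frames
                   if f in gt and f in det and iou(gt[f], det[f]) > overlap_threshold)
        total += hits / len(frames)
    return total

def ata(matched_pairs, num_gt_tracks, num_det_tracks):
    # Equation (4): STDA normalized by the average number of tracks.
    return stda(matched_pairs) / ((num_gt_tracks + num_det_tracks) / 2.0)
```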
Video Precision and Recall Precision-recall curves are widely used in the literature on text recognition in the image domain. This metric demonstrates the tradeoffs of a system and facilitates the selection of operating points. We propose to extend this to a sequence-level precision and recall performance metric. In particular, ground truths are matched at the sequence level, similar to the computation for ATA. We, however, place additional restrictions. We define m(G_i^t, D_i^t) to be 1 if the overlap ratio is greater than 0.5 and the words in frame t match (ignoring case). A ground truth track G_i and a detection track D_i are considered a match if and only if

\mathrm{overlap} = \frac{\sum_t m(G_i^t, D_i^t)}{N_{G_i \cup D_i \neq \emptyset}} > 0.5 \quad (5)

The threshold at 50% is arbitrary but reasonable. This extra restriction requires that the detected track must sufficiently fit the ground truth track at least half of the time. Once the one-to-one mapping is established, the matched detections are considered true positives, the unmatched detection tracks are false positives, and the unmatched ground truth tracks are false negatives. We then use the conventional definitions of precision and recall to evaluate the performance of the detections.
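A sketch of this sequence-level scoring is shown below, assuming each track stores a box and a recognized word per frame and that the one-to-one mapping has already produced candidate (ground truth, detection) pairs; the data layout is ours, not prescribed by the paper, and the `iou` helper from the earlier sketch is reused.

```python
def track_match(gt, det, overlap_threshold=0.5):
    """Equation (5): tracks match if, over frames where either exists, more than
    half have box IoU > 0.5 and the same word (ignoring case)."""
    frames = set(gt) | set(det)
    hits = sum(1 for f in frames
               if f in gt and f in det
               and iou(gt[f]["box"], det[f]["box"]) > 0.5
               and gt[f]["word"].lower() == det[f]["word"].lower())
    return hits / len(frames) > overlap_threshold

def video_precision_recall(matched_pairs, num_gt_tracks, num_det_tracks):
    # Matched detection tracks are true positives; unmatched detection tracks are
    # false positives; unmatched ground truth tracks are false negatives.
    tp = sum(1 for gt, det in matched_pairs if track_match(gt, det))
    precision = tp / num_det_tracks if num_det_tracks else 0.0
    recall = tp / num_gt_tracks if num_gt_tracks else 0.0
    return precision, recall
```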


3.3. Character Detection

We begin with evaluations of four different character detection models. Wang et al. [20] synthesize characters with different fonts, backgrounds, rotation angles, and noise. (1) We train character models (SYNTH) from these synthetic data. (2) The second set of models is trained with unprocessed real characters obtained from public datasets (REAL). (3) The third set is trained with preprocessed data as described in Section 2.1 (CLEAN). Finally, (4) we train mixture models by clustering preprocessed data (CLEAN+MIX). Real characters are collected from SVT-CHAR [21, 14], Weinman's dataset [23], and the English characters of Chars74K [4]. For all character detection experiments, we benchmark the character results on the ICDAR 2003 Robust Reading [12] test set.

As character detection is an early step in the pipeline, we favor recall over precision, as it is often easier to remove false positives than to recover false negatives. We use the F2-score, a modified version of the F-score, defined as

F_\beta = (1 + \beta^2) \cdot \frac{\mathrm{precision} \times \mathrm{recall}}{\beta^2 \cdot \mathrm{precision} + \mathrm{recall}}, \quad \text{with } \beta = 2.
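Written out as code, the score is a direct transcription of the formula above:

```python
def f_beta(precision, recall, beta=2.0):
    # F-beta score; beta = 2 weights recall more heavily than precision.
    if precision == 0.0 and recall == 0.0:
        return 0.0
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

# Example: a high-recall, low-precision operating point still scores reasonably well.
print(f_beta(0.3, 0.8))   # 0.6
```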

tion methods. From this plot, we observe the following. (1)


Models trained from “dirty” data (REAL) produce the worst Figure 4: Character detection performance (F2-score) com-
results. This could be explained by the misalignments of the paring the bounding box results with (red line) and without
data and the labeling errors. (2) “Clean” real training data (blue line) temporal smoothing. The F2-score is in square
outperforms synthetic data. Even though synthetic data is bracket next to the method name.
different across fonts, angles, noises, and blurriness, this
variation is not enough to describe the entire appearance
space of real-scene characters. (3) In general, mixture mod-
els yield small performance improvements (1-3%).

3.4. Temporal Smoothing


locations for the character “A”. Despite its relatively small
We also evaluate the effectiveness of temporal smooth- size, this set makes possible a preliminary experiment to
ing in the task of removing false positives from the charac- demonstrate the effectiveness of temporal smoothing at the
ter detection step. Currently there is no public video text character level. Figure 4 shows the precision and recall per-
dataset that has character-level annotations. Since produc- formance with and without temporal smoothing. This fig-
ing a video dataset with this level of annotation is expensive, ures shows the effectiveness of temporal smoothing, espe-
we choose to evaluate the method on a smaller scale. We an- cially at lower thresholds. This allows the character detec-
notated a small set of consecutive frames from the ICDAR- tion modules to operate at a lower threshold and achieve
VIDEO dataset at character-level. We selected 30 consecu- high recall without introducing an undue amount of false
tive frames from 6 videos and annotated the bounding box positives.
3.5. Word Detection and Recognition

Our main experiment consists of evaluating end-to-end word detection and recognition on both the ICDAR-VIDEO and YVT datasets.

ICDAR-VIDEO Evaluation The first two pipelines we consider are ABBYY and TextSpotter. Even though the actual implementations of these methods are not published, we obtained the detection results by analyzing the Javascript structure of the ICDAR 2013 Robust Reading Competition website (http://dag.cvc.uab.es/icdar2013competition/?ch=3).

Next, we apply Wang's original implementation (PLEX) on each frame and group per-frame detections into tracks in a manner similar to that of the ABBYY baseline. More specifically, after obtaining the word detections for each frame, each detected word is assigned the identifier of a previously detected word with the best overlap (at least 0.5) in a buffer of 7 previous frames. Words are removed from the detection unless there is a matching word detected in the immediately previous frame.

We also run PLEX with temporal enhancements, PLEX+T. This method uses temporal smoothing after the character detection and word detection steps, links the per-frame detections using a trained linker, and uses the post-processing described in Section 2.3.

Finally, we apply our detection pipeline (DR) on every frame, using the ABBYY heuristic to connect the frames. This pipeline differs from PLEX at the character detection step. Instead of using models trained from synthetic data, DR uses mixture models trained from processed real data. The last pipeline is DR+T, where we use the temporal techniques to improve the performance of DR. Since PLEX and DR output a score for each word in every frame, we can compute a track score by averaging the scores of all words in the track.
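The video precision-recall curves reported below are generated by sweeping a threshold on this track score. A sketch of that sweep follows; it assumes normalized word scores, a `match_tracks` routine that produces the one-to-one track correspondences of Section 3.2, and the `video_precision_recall` helper from the earlier sketch, all of which are our assumptions about the setup rather than the released code.

```python
import numpy as np

def track_score(track):
    # Average the per-frame word scores of a track.
    return sum(entry["score"] for entry in track.values()) / len(track)

def pr_curve(det_tracks, gt_tracks, match_tracks):
    """Sweep a threshold on the track score and collect (recall, precision) points."""
    points = []
    for thresh in np.linspace(0.0, 1.0, 50):
        kept = [t for t in det_tracks if track_score(t) >= thresh]
        matched_pairs = match_tracks(gt_tracks, kept)   # assumed one-to-one mapping routine
        p, r = video_precision_recall(matched_pairs, len(gt_tracks), len(kept))
        points.append((r, p))
    return points
```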
Since the ABBYY baseline and TextSpotter detections are produced without a lexicon, their performance results are provided mainly as references as opposed to strict comparisons.

Figure 6a shows the box plots of the ATA scores for each method. Figure 5a shows the video precision/recall curves. We generate the video precision-recall curves for PLEX, PLEX+T, DR, and DR+T by varying a threshold on the track score. From these figures, we observe the following: (1) better local models lead to better detection and recognition performance; DR only differs from PLEX at the character detection step, yet the final results differ significantly. (2) Temporal smoothing significantly improves the performance. With a relatively large lexicon, false positives are unavoidable. Furthermore, text in frames that are affected by motion blur and video artifacts is very hard to read (even for humans). Temporal smoothing and linear interpolation help to mitigate these effects.

YVT Evaluation For the YVT dataset, we can only compare the performances of PLEX and DR because the original implementations of TextSpotter and the ABBYY baseline are not publicly available. Figures 6b and 5b show the ATA metrics and the video precision/recall curves for the YVT dataset. Even though DR still outperforms PLEX, the gap between PLEX and DR is smaller than on the ICDAR-VIDEO dataset. One possible explanation is the presence of overlay text. Overlay text often appears with standard fonts, no occlusion, and a high level of contrast with the background. These appearance characteristics resemble the nature of synthetic training data. Again, temporal smoothing improves the performance dramatically for both methods. Comparing performance on the YVT dataset and the ICDAR-VIDEO dataset reveals the relative difficulty of the two; the higher scores on YVT are expected, as YVT's videos are of HD 720p quality and contain overlay text.

Figure 7 shows qualitative results of DR+T on example frames from both datasets. This figure offers insights into the successes and failures of the system. Failure cases are often caused by perspective distortions and unseen, challenging fonts. Phan et al. [18] have recently reported promising results on the problem of recognizing scene text with perspective distortion.

4. Conclusion

In this paper, we extend an end-to-end text detection and recognition solution to the video domain. Our work highlights the importance of local character detection models to the system as a whole. We also show the effectiveness of exploiting temporal redundancy to remove false positives. We propose a video precision and recall metric for benchmarking text recognition in video. By performing different detection and recognition experiments, we reveal the current state of text detection and recognition in video. Clearly, there is plenty of room for improvement in performance. In Figure 5, the best observed sequence recalls are around 15% and 23% for the ICDAR-VIDEO and the YVT datasets, respectively. This means that less than one out of every four tracks is recognized correctly. We hope our datasets and performance results will serve as a baseline for future studies in text detection and recognition in the video domain.

5. Acknowledgement

We thank Seung-Hoon Han and Alessandro Bissacco for valuable feedback during the project. Additional support was provided by a Google Focused Research Award.
Figure 5: Video precision-recall performance of the ABBYY baseline, TextSpotter, PLEX, and our detection-from-recognition pipeline (DR) on (a) ICDAR-VIDEO and (b) YVT. Methods with temporal enhancements are denoted with +T.

Figure 6: ATA metrics for the baseline methods and our detection-from-recognition pipeline (DR) on (a) ICDAR-VIDEO and (b) YVT; see text for details.

Figure 7: Selected results for the DR+T method. The detections are shown in green while the ground truth is shown in red.

References

[1] X. Chen and A. L. Yuille. Detecting and reading text in natural scenes. In CVPR, 2004.
[2] A. Coates, B. Carpenter, C. Case, S. Satheesh, B. Suresh, T. Wang, D. J. Wu, and A. Y. Ng. Text detection and character recognition in scene images with unsupervised feature learning. In ICDAR, 2011.
[3] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.
[4] T. E. de Campos, B. R. Babu, and M. Varma. Character recognition in natural images. In ICCVTA, 2009.
[5] W. Huang, Z. Lin, J. Yang, and J. Wang. Text localization in natural images using stroke feature transform and text covariance descriptors. In ICCV, 2013.
[6] M. J. Huiskes and M. S. Lew. The MIR Flickr retrieval evaluation. In MIR, 2008.
[7] D. Karatzas, F. Shafait, S. Uchida, M. Iwamura, S. R. Mestre, J. Mas, D. F. Mota, J. A. Almazan, and L. P. de las Heras. ICDAR 2013 robust reading competition. In ICDAR, 2013.
[8] R. Kasturi, D. Goldgof, P. Soundararajan, V. Manohar, J. Garofolo, R. Bowers, M. Boonstra, V. Korzhova, and J. Zhang. Framework for performance evaluation of face, text, and vehicle detection and tracking in video: Data, metrics, and protocol. PAMI, 2009.
[9] B. Keni and S. Rainer. Evaluating multiple object tracking performance: The CLEAR MOT metrics. EURASIP JIVP, 2008.
[10] S. Lee, M. S. Cho, K. Jung, and J. H. Kim. Scene text extraction with edge constraint and text collinearity. In ICPR, 2010.
[11] R. Lienhart. Video OCR: A survey and practitioner's guide. In Video Mining, 2003.
[12] S. M. Lucas, A. Panaretos, L. Sosa, A. Tang, S. Wong, and R. Young. ICDAR 2003 robust reading competitions. In ICDAR, 2003.
[13] A. Mishra, K. Alahari, and C. Jawahar. Top-down and bottom-up cues for scene text recognition. In CVPR, 2012.
[14] A. Mishra, K. Alahari, and C. Jawahar. Scene text recognition using higher order language priors. In BMVC, 2012.
[15] L. Neumann and J. Matas. A method for text localization and recognition in real-world images. In ACCV, 2010.
[16] L. Neumann and J. Matas. Real-time scene text localization and recognition. In CVPR, 2012.
[17] L. Neumann and J. Matas. Scene text localization and recognition with oriented stroke detection. In ICCV, 2013.
[18] T. Q. Phan, P. Shivakumara, S. Tian, and C. L. Tan. Recognizing text with perspective distortion in natural scenes. In ICCV, 2013.
[19] C. Vondrick, D. Patterson, and D. Ramanan. Efficiently scaling up crowdsourced video annotation. IJCV, 2012.
[20] K. Wang, B. Babenko, and S. Belongie. End-to-end scene text recognition. In ICCV, 2011.
[21] K. Wang and S. Belongie. Word spotting in the wild. In ECCV, 2010.
[22] J. J. Weinman and E. Learned-Miller. Improving recognition of novel input with similarity. In CVPR, 2006.
[23] J. J. Weinman, E. Learned-Miller, and A. R. Hanson. Scene text recognition using similarity and a lexicon with sparse belief propagation. PAMI, 2009.
[24] X. Zhu, C. Vondrick, D. Ramanan, and C. Fowlkes. Do we need more training data or better models for object detection? In BMVC, 2012.
