Contributions Our contributions are three-fold. (1) We extend Wang's solution [20] to the video domain by employing a different character detection method and including multiple modules that exploit temporal redundancy to improve the performance of the system. (2) We propose a precision/recall metric suited to the task of text recognition in video. (3) Finally, we introduce a new video text dataset.

2. System Overview

Our approach is based on an extension of Wang's solution [20] as the components of this pipeline are simple yet effective. The source code is publicly available1. Figure 2 shows the complete overview of our video text detection and recognition system. We describe each module in detail in the following sections.

1 http://vision.ucsd.edu/~kai/grocr/
2.1. Character Detection

The first step in our pipeline is to detect characters given an image frame. We perform multi-scale character detection via scanning-window templates trained with linear SVM and Histogram of Oriented Gradients (HOG) [3] features. For each character, we train a binary classifier with five rounds of hard negative mining. The initial round of training uses the character samples of the target class as positive training data and the other classes as negative training data. In subsequent rounds, we mine hard negative patches by running the previously trained model on images from the Flickr dataset [6] and add top-scoring detections to the negative sets. To ensure the quality of the negative pool, we manually examined each image in the Flickr dataset and removed ones containing text.
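As an illustration of this training loop (a minimal sketch, not the released implementation), the following code assumes a hypothetical helper scan_image that slides the current model over an image at multiple scales and yields (score, HOG feature) pairs, with the positive and negative feature matrices precomputed.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Minimal sketch of per-character training with hard negative mining.
# scan_image() is a hypothetical helper, not part of the paper's code.
def train_character_detector(pos_feats, neg_feats, flickr_images, scan_image,
                             rounds=5, top_k=1000):
    X = np.vstack([pos_feats, neg_feats])
    y = np.concatenate([np.ones(len(pos_feats)), np.zeros(len(neg_feats))])
    clf = LinearSVC(C=1.0).fit(X, y)

    for _ in range(rounds - 1):
        mined = []
        for img in flickr_images:                 # images verified to contain no text
            mined.extend(scan_image(clf, img))    # yields (score, hog_feature) pairs
        mined.sort(key=lambda pair: pair[0], reverse=True)
        hard_negs = [feat for _, feat in mined[:top_k]]   # top-scoring false alarms
        if not hard_negs:
            break
        X = np.vstack([X, np.array(hard_negs)])
        y = np.concatenate([y, np.zeros(len(hard_negs))])
        clf = LinearSVC(C=1.0).fit(X, y)          # retrain with the enlarged negative set
    return clf
```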
Pre-processing Character Samples Character samples from public text datasets sometimes do not have tight bounding boxes and, in some cases, have the wrong labels. We manually go through each character image, remove samples with wrong labels, and recrop the images for a tight fit with the characters. Enforcing a tight bounding box gives the training data more consistent alignment and, hence, produces better templates.

Mixture Models We observe that there is great intra-class variability within each character class. Different "prototypes" of a letter can have very different appearances. Zhu et al. [24] suggest that we can grow the model complexity by adding mixture models to capture the "sub-category" structure of the data. The first step of training mixture models involves clustering the existing data into K different clusters. A classifier is then trained separately for each cluster. This technique often requires a large amount of training data as each cluster requires sufficient positive examples to train a good model. Fortunately, in recent years, many text datasets with character-level annotations have been published [4, 12, 13, 21]. Combining the character samples from these datasets gives a large set for training mixture models. For each character, we split the samples into 10 clusters using K-means. Similar to Zhu's [24] experiments, we append the aspect ratio to the HOG features, and use PCA to reduce the feature dimensionality prior to clustering.
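A rough sketch of this clustering step is shown below; the 64 PCA dimensions and the helper name are illustrative assumptions rather than the authors' settings.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def cluster_character_samples(hog_feats, aspect_ratios, k=10, pca_dims=64):
    """Split one character class into k sub-categories (returns a cluster label per sample)."""
    feats = np.hstack([hog_feats, np.asarray(aspect_ratios).reshape(-1, 1)])
    reduced = PCA(n_components=pca_dims).fit_transform(feats)
    return KMeans(n_clusters=k, n_init=10).fit_predict(reduced)
```

A separate binary detector, as in the previous section, would then be trained for each cluster.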
2.2. Pictorial Structures and Word Rescoring

We follow Wang's Pictorial Structures [20] approach for constructing the word detections from character bounding boxes. We also use the same word rescoring technique, but construct the training set differently. To construct the training set for the rescoring SVM, we randomly select 100 frames that contain text from each video in the training set. We run the system on this set and label each returned word positive if matched with a ground truth and negative otherwise.
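One possible way to assemble this training set is sketched below; run_pipeline, matches_ground_truth, and the frame attributes are hypothetical stand-ins for the corresponding pieces of the system.

```python
import random

def build_rescoring_training_set(videos, run_pipeline, matches_ground_truth,
                                 frames_per_video=100):
    """Collect (word detection, label) pairs for the rescoring SVM."""
    examples = []
    for video in videos:
        text_frames = [f for f in video.frames if f.has_text]
        sampled = random.sample(text_frames, min(frames_per_video, len(text_frames)))
        for frame in sampled:
            for word in run_pipeline(frame):
                examples.append((word, 1 if matches_ground_truth(word, frame) else 0))
    return examples
```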
2.3. Exploiting Temporal Redundancy

Lienhart [11] suggests in his survey that one can exploit temporal redundancy in video to remove false positives in individual frames and recover missed detections. In the following sections, we describe three modules that leverage temporal properties to improve the performance further: the temporal smoothing module, the linker, and the post-processing module.

Temporal Smoothing Due to the local nature of scanning-window approaches, false positives are often pervasive, especially in a system that favors recall over precision. This problem was also observed in [14]. Removing character false positives at this step is crucial as it reduces the search space for the word detection step.

Given a bounding box detection at a given frame, we compute the overlap ratio between it and all detections in the preceding and following N frames. If a sufficient number of the neighboring frames contain detections that satisfy the overlap condition with the target frame, we keep the detection in the current frame. Otherwise, we discard it as a false positive. More formally, let $b_i^f$ be the $i$-th bounding box in frame $f$, $|b_i^f|$ be the number of pixels in this region, $D_t$ be the number of detections in time $t$, and $\alpha$ be a real number between 0 and 1, and define

$$
I(b_i^f, t) =
\begin{cases}
1 & \text{if } \bigvee_{j=1}^{D_t} \dfrac{|b_i^f \cap b_j^{f-t}|}{|b_i^f \cup b_j^{f-t}|} > \alpha \\
0 & \text{otherwise}
\end{cases}
\quad (1)
$$

Next, we define a function $H$ that takes the $i$-th bounding box at frame $f$, $b_i^f$, and returns 1 for a true positive or 0 for a false positive. The function is defined as

$$
H(b_i^f) = \left( \sum_{j=1}^{N} I(b_i^f, j) \ge \beta N \right) \vee \left( \sum_{k=1}^{N} I(b_i^f, -k) \ge \beta N \right)
\quad (2)
$$

where $\beta$ is a real number between 0 and 1.

This module has three hyperparameters: (1) $N$ controls the number of frames to search forward and backward, (2) $\alpha$ specifies the overlap ratio condition, and (3) $\beta$ controls the fraction of neighboring frames needed to decide whether the current detection is a true positive. We perform a grid search on a validation set to tune these parameters.
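The rule in Eqs. (1) and (2) can be sketched as follows. The box representation, the detections_by_frame mapping, and the helper names are illustrative assumptions rather than the released code.

```python
def iou(a, b):
    """Overlap ratio (intersection over union) of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    if inter == 0:
        return 0.0
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / float(union)

def temporal_overlap(box, f, t, detections_by_frame, alpha):
    """Eq. (1): 1 if some detection in frame f - t overlaps box by more than alpha."""
    return int(any(iou(box, other) > alpha
                   for other in detections_by_frame.get(f - t, [])))

def keep_detection(box, f, detections_by_frame, N, alpha, beta):
    """Eq. (2): keep box if enough of the preceding or following N frames agree."""
    preceding = sum(temporal_overlap(box, f, j, detections_by_frame, alpha)
                    for j in range(1, N + 1))
    following = sum(temporal_overlap(box, f, -k, detections_by_frame, alpha)
                    for k in range(1, N + 1))
    return int(preceding >= beta * N or following >= beta * N)
```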
Linker When the results are temporally smoothed to remove false positives, we proceed to link the per-frame detections into tracks. For each detection in frame t, we search backwards in a buffer of 10 prior frames. We consider the following features:

• the overlap ratio between the current bounding box and the candidate in the prior frame,

• the edit distance between the word and the candidate word,

• the temporal distance between the current detection and the candidate, measured in number of frames.

We train a linear classifier to determine whether the bounding box detection in previous frames is a match for the current bounding box. We only allow one match per previous frame. If there is more than one, we choose the one with the highest score. We assign the current bounding box the identifier that is present in the majority of matched frames.

To establish a training set, we consider each detection in each frame and the set of candidate bounding boxes in the previous 10 frames. We label the bounding boxes with the same track identifier as positives and the rest as negatives. We feed these labels and the computed features to a standard SVM package2.

2 http://www.csie.ntu.edu.tw/~cjlin/liblinear/
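The pairwise features and the matching rule can be sketched as follows; detections are assumed to carry box, word, and frame fields, iou is the overlap function from the temporal-smoothing sketch, and the classifier is assumed to expose a decision_function as in scikit-learn's liblinear wrappers.

```python
def edit_distance(a, b):
    """Levenshtein distance between two words."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def link_features(current, candidate):
    """Features for a (current detection, prior-frame candidate) pair."""
    return [iou(current.box, candidate.box),              # spatial overlap
            edit_distance(current.word, candidate.word),  # word similarity
            current.frame - candidate.frame]              # temporal distance in frames

def best_match(current, candidates, classifier):
    """Keep at most one match per prior frame: the highest-scoring positive pair."""
    best, best_score = None, 0.0
    for cand in candidates:
        score = classifier.decision_function([link_features(current, cand)])[0]
        if score > best_score:
            best, best_score = cand, score
    return best
```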
Post Processing Due to motion blur and other artifacts, detections in some frames within a track might be missing, causing temporal fragmentation. To mitigate this effect, we linearly interpolate both the word scores and the word bounding boxes between the detected frames to recover missing detections. Finally, we remove all tracks with fewer than 10 frames.
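A minimal sketch of this gap filling, assuming a track is a mapping from frame index to a (box, score) pair:

```python
def interpolate_track(track):
    """Linearly interpolate boxes and scores for frames missing inside a track."""
    frames = sorted(track)
    filled = dict(track)
    for f0, f1 in zip(frames, frames[1:]):
        (box0, s0), (box1, s1) = track[f0], track[f1]
        for f in range(f0 + 1, f1):
            w = (f - f0) / float(f1 - f0)
            box = tuple((1 - w) * a + w * b for a, b in zip(box0, box1))
            filled[f] = (box, (1 - w) * s0 + w * s1)
    return filled

def prune_tracks(tracks, min_frames=10):
    """Discard tracks that cover fewer than min_frames frames."""
    return [t for t in tracks if len(t) >= min_frames]
```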
3. Experiments

3.1. Dataset

ICDAR-VIDEO In the ICDAR 2013 Robust Reading Competition Challenge 3 [7], a new video dataset was presented in an effort to address the problem of text detection in videos. The dataset consists of 28 videos in total: 13 videos for the training set and 15 for the test set. The scenarios in the videos include walking outdoors, shopping in grocery stores, driving, and searching for directions within a building. Each video is around 10 seconds to 1 minute long, capturing scenes from real-life situations using different types of cameras.

To construct the lexicon for a video, we extract all ground truth words in the dataset to form a vocabulary. We then assign a lexicon for each video by taking a union of its ground truth words and a random subset of 500 "distractor" words sampled from the vocabulary.
In the ICDAR-VIDEO dataset, the annotators assigned each word a quality, which can be "LOW", "MEDIUM", or "HIGH". During the competition, the "LOW" quality was not considered. More specifically, misses on the low quality words did not penalize the system and detections of the low quality words did not improve the score. For the sake of simplicity of the evaluation framework, we decided to include words with all qualities in our evaluation.
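A small sketch of the lexicon construction described above (the fixed seed is only for reproducibility of the illustration):

```python
import random

def build_lexicon(video_gt_words, vocabulary, num_distractors=500, seed=0):
    """Per-video lexicon: ground-truth words plus randomly sampled distractors."""
    rng = random.Random(seed)
    pool = sorted(vocabulary)
    distractors = rng.sample(pool, min(num_distractors, len(pool)))
    return set(video_gt_words) | set(distractors)
```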
YouTube Video Text We introduce the YouTube Video Text3 (YVT) dataset harvested from YouTube. The text content in the dataset can be divided into two categories, overlay text (e.g., captions, song titles, logos) and scene text (e.g., street signs, business signs, words on shirts). Figure 1 shows examples of text content in the YVT dataset.

3 http://vision.ucsd.edu/datasetsAll

We downloaded a large number of videos by crawling YouTube. We split each downloaded video into 15-second segments and created Amazon Mechanical Turk tasks to filter out segments without text content. For each segment that contained text, we annotated the locations of text regions using VATIC [19]. We instructed the annotators to draw tight bounding boxes only for readable English words. Next, we asked the annotators to go through each frame in the ground truth track and type the word contained in this bounding box.

The dataset contains a total of 30 videos, 15 in the training set and 15 in the testing set. Each video has HD 720p quality, 30 frames per second, and 15-second duration. We constructed the lexicon in the same way as for the ICDAR-VIDEO dataset.
3.2. Performance Metrics

There are many evaluation frameworks proposed for multiple object tracking systems [8, 9]. In this paper, we use the ATA metric from the VACE framework [8] to measure the performance of the detection systems. We also propose a video precision-recall metric to measure the recognition performance.

VACE Metrics The Average Tracking Accuracy (ATA) in the VACE framework provides a spatio-temporal detection measure that penalizes fragmentation in both the temporal and spatial dimensions.

For every frame $t$, a text tracking system outputs a set of bounding box detections $\{d_1^t, \ldots, d_n^t\}$. We denote the ground truth at frame $t$ as $\{g_1^t, \ldots, g_n^t\}$. Using their unique identifiers, we can group these per-frame detections and ground truth into separate tracks, $\{D_1, \ldots, D_r\}$ and $\{G_1, \ldots, G_q\}$.

The first step is to establish a one-to-one mapping between the ground truth and the system output tracks. We follow the mapping process described in [8]. Once the mapping between detection tracks and ground truth is established, the Sequence Track Detection Accuracy (STDA) is defined as

$$
STDA = \sum_{i=1}^{N_M} \frac{\sum_t m(G_i^t, D_i^t)}{N_{G_i \cup D_i \neq \varnothing}}
\quad (3)
$$

where $N_M$ is the number of correspondences in the mapping $M$, $N_{G_i \cup D_i \neq \varnothing}$ is the number of frames where either $G_i$ or $D_i$ exist, and $m(G_i^t, D_i^t)$ takes the value 1 if $overlap(G_i^t, D_i^t) > 0.5$ and 0 otherwise.

The Average Tracking Accuracy is then calculated as

$$
ATA = \frac{STDA}{\frac{N_G + N_D}{2}}
\quad (4)
$$

where $N_G$ and $N_D$ are the number of ground truth tracks and the number of detection tracks.
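Eqs. (3) and (4) can be sketched as below, where each track is a mapping from frame index to a box, mapping is the list of matched (ground truth, detection) track pairs, and iou_fn is an overlap function such as the one sketched earlier; the one-to-one matching step itself is not shown.

```python
def frame_match(gt_box, det_box, iou_fn, thresh=0.5):
    """m(G_i^t, D_i^t): 1 when both boxes exist and overlap by more than thresh."""
    return int(gt_box is not None and det_box is not None
               and iou_fn(gt_box, det_box) > thresh)

def stda(mapping, iou_fn):
    """Eq. (3): sum of per-pair matched-frame ratios over the matched track pairs."""
    total = 0.0
    for gt, det in mapping:
        frames = set(gt) | set(det)          # frames where either track exists
        matched = sum(frame_match(gt.get(f), det.get(f), iou_fn) for f in frames)
        total += matched / float(len(frames))
    return total

def ata(mapping, num_gt_tracks, num_det_tracks, iou_fn):
    """Eq. (4): STDA normalized by the average number of GT and detected tracks."""
    return stda(mapping, iou_fn) / ((num_gt_tracks + num_det_tracks) / 2.0)
```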
Video Precision and Recall Precision-recall curves are used widely in the literature of text recognition in the image domain. This metric demonstrates the tradeoff of the system and facilitates the selection of operating points. We propose to extend this to a sequence-level precision and recall performance metric. In particular, ground truths are matched at the sequence level similar to the computation for ATA. We, however, place additional restrictions. We define $m(G_i^t, D_i^t)$ to be 1 if the overlap ratio is greater than 0.5 and the words in frame $t$ match (ignoring case).

A ground truth track $G_i$ and a detection track $D_i$ are considered a match if and only if

$$
overlap = \frac{\sum_t m(G_i^t, D_i^t)}{N_{G_i \cup D_i \neq \varnothing}} > 0.5
\quad (5)
$$

The threshold at 50% is arbitrary but reasonable. This extra restriction requires that the detected track must sufficiently fit the ground truth track at least half of the time. Once the one-to-one mapping is established, the matched detections are considered true positives, the unmatched detection tracks are false positives, and the unmatched ground truth tracks are false negatives. We then use the conventional definitions of precision and recall to evaluate the performance of the detections.
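The sequence-level precision and recall can be sketched in the same style; here each per-frame track entry is assumed to be a dict with "box" and "word" keys, iou_fn is assumed as before, and the track-level matching threshold follows Eq. (5).

```python
def word_frame_match(gt, det, iou_fn, thresh=0.5):
    """1 if the boxes overlap enough and the words agree (ignoring case)."""
    if gt is None or det is None:
        return 0
    return int(iou_fn(gt["box"], det["box"]) > thresh
               and gt["word"].lower() == det["word"].lower())

def track_overlap(gt_track, det_track, iou_fn):
    """Eq. (5): fraction of frames (where either track exists) that match."""
    frames = set(gt_track) | set(det_track)
    matched = sum(word_frame_match(gt_track.get(f), det_track.get(f), iou_fn)
                  for f in frames)
    return matched / float(len(frames))

def video_precision_recall(mapping, num_gt_tracks, num_det_tracks, iou_fn):
    """Matched pairs above 0.5 overlap are true positives; the rest are FPs/FNs."""
    tp = sum(1 for gt, det in mapping if track_overlap(gt, det, iou_fn) > 0.5)
    precision = tp / float(num_det_tracks) if num_det_tracks else 0.0
    recall = tp / float(num_gt_tracks) if num_gt_tracks else 0.0
    return precision, recall
```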
3.3. Character Detection

We begin with the evaluations of 4 different character detection models. Wang et al. [20] synthesize characters with different fonts, backgrounds, rotation angles, and noises. (1) We train character models (SYNTH) from these synthetic data. (2) The second set of models is
Figure 3: Character detection performance (F2-score) comparing different methods.

Figure 5: Video precision-recall performance of the ABBYY baseline, TextSpotter, PLEX, and our detection-from-recognition pipeline (DR). Methods with temporal enhancements are denoted with +T.

Figure 6: ATA metrics for the baseline methods and our detection-from-recognition pipeline (DR), see text for details.
Figure 7: Selected results for the DR+T method. The detections are shown in green while the ground truth is shown in red.

References

[1] X. Chen and A. L. Yuille. Detecting and reading text in natural scenes. In CVPR, 2004.
[2] A. Coates, B. Carpenter, C. Case, S. Satheesh, B. Suresh, T. Wang, D. J. Wu, and A. Y. Ng. Text detection and character recognition in scene images with unsupervised feature learning. In ICDAR, 2011.
[3] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.
[4] T. E. de Campos, B. R. Babu, and M. Varma. Character recognition in natural images. In ICCVTA, 2009.
[5] W. Huang, Z. Lin, J. Yang, and J. Wang. Text localization in natural images using stroke feature transform and text covariance descriptors. In ICCV, 2013.
[6] M. J. Huiskes and M. S. Lew. The MIR Flickr retrieval evaluation. In MIR, 2008.
[7] D. Karatzas, F. Shafait, S. Uchida, M. Iwamura, S. R. Mestre, J. Mas, D. F. Mota, J. A. Almazan, and L. P. de las Heras. ICDAR 2013 robust reading competition. In ICDAR, 2013.
[8] R. Kasturi, D. Goldgof, P. Soundararajan, V. Manohar, J. Garofolo, R. Bowers, M. Boonstra, V. Korzhova, and J. Zhang. Framework for performance evaluation of face, text, and vehicle detection and tracking in video: Data, metrics, and protocol. PAMI, 2009.
[9] B. Keni and S. Rainer. Evaluating multiple object tracking performance: the CLEAR MOT metrics. EURASIP JIVP, 2008.
[10] S. Lee, M. S. Cho, K. Jung, and J. H. Kim. Scene text extraction with edge constraint and text collinearity. In ICPR, 2010.
[11] R. Lienhart. Video OCR: a survey and practitioner's guide. In Video Mining, 2003.
[12] S. M. Lucas, A. Panaretos, L. Sosa, A. Tang, S. Wong, and R. Young. ICDAR 2003 robust reading competitions. In ICDAR, 2003.
[13] A. Mishra, K. Alahari, and C. Jawahar. Top-down and bottom-up cues for scene text recognition. In CVPR, 2012.
[14] A. Mishra, K. Alahari, and C. Jawahar. Scene text recognition using higher order language priors. In BMVC, 2012.
[15] L. Neumann and J. Matas. A method for text localization and recognition in real-world images. In ACCV, 2010.
[16] L. Neumann and J. Matas. Real-time scene text localization and recognition. In CVPR, 2012.
[17] L. Neumann and J. Matas. Scene text localization and recognition with oriented stroke detection. In ICCV, 2013.
[18] T. Q. Phan, P. Shivakumara, S. Tian, and C. L. Tan. Recognizing text with perspective distortion in natural scenes. In ICCV, 2013.
[19] C. Vondrick, D. Patterson, and D. Ramanan. Efficiently scaling up crowdsourced video annotation. IJCV, 2012.
[20] K. Wang, B. Babenko, and S. Belongie. End-to-end scene text recognition. In ICCV, 2011.
[21] K. Wang and S. Belongie. Word spotting in the wild. In ECCV, 2010.
[22] J. J. Weinman and E. Learned-Miller. Improving recognition of novel input with similarity. In CVPR, 2006.
[23] J. J. Weinman, E. Learned-Miller, and A. R. Hanson. Scene text recognition using similarity and a lexicon with sparse belief propagation. PAMI, 2009.
[24] X. Zhu, C. Vondrick, D. Ramanan, and C. Fowlkes. Do we need more training data or better models for object detection? In BMVC, 2012.