Contributions Our contributions are three-fold. (1) We extend Wang's solution [20] to the video domain by employing a different character detection method and including multiple modules that exploit temporal redundancy to improve the performance of the system. (2) We propose a precision/recall metric suited to the task of text recognition in video. (3) Finally, we introduce a new video text dataset.

2. System Overview

Our approach is based on an extension of Wang's solution [20] as the components of this pipeline are simple yet effective. The source code is publicly available1. Figure 2 shows the complete overview of our video text detection and recognition system. We describe each module in detail in the following sections.

1 http://vision.ucsd.edu/~kai/grocr/
2.1. Character Detection

The first step in our pipeline is to detect characters given an image frame. We perform multi-scale character detection via scanning-window templates trained with linear SVM and Histogram of Oriented Gradients (HOG) [3] features. For each character, we train a binary classifier with five rounds of hard negative mining. The initial round of training uses the character samples of the target class as positive training data and the other classes as negative training data. In subsequent rounds, we mine hard negative patches by running the previously trained model on images from the Flickr dataset [6] and add top-scoring detections to the negative sets. To ensure the quality of the negative pool, we manually examined each image in the Flickr dataset and removed ones containing text.
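As an illustration of this training loop (a minimal sketch, not the released implementation), the following code assumes a hypothetical helper scan_image that slides the current model over an image at multiple scales and yields (score, HOG feature) pairs, with the positive and negative feature matrices precomputed.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Minimal sketch of per-character training with hard negative mining.
# scan_image() is a hypothetical helper, not part of the paper's code.
def train_character_detector(pos_feats, neg_feats, flickr_images, scan_image,
                             rounds=5, top_k=1000):
    X = np.vstack([pos_feats, neg_feats])
    y = np.concatenate([np.ones(len(pos_feats)), np.zeros(len(neg_feats))])
    clf = LinearSVC(C=1.0).fit(X, y)

    for _ in range(rounds - 1):
        mined = []
        for img in flickr_images:                 # images verified to contain no text
            mined.extend(scan_image(clf, img))    # yields (score, hog_feature) pairs
        mined.sort(key=lambda pair: pair[0], reverse=True)
        hard_negs = [feat for _, feat in mined[:top_k]]   # top-scoring false alarms
        if not hard_negs:
            break
        X = np.vstack([X, np.array(hard_negs)])
        y = np.concatenate([y, np.zeros(len(hard_negs))])
        clf = LinearSVC(C=1.0).fit(X, y)          # retrain with the enlarged negative set
    return clf
```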
Pre-processing Character Samples Character samples from public text datasets sometimes do not have tight bounding boxes and, in some cases, have the wrong labels. We manually go through each character image, remove samples with wrong labels, and recrop the images for a tight fit with the characters. Enforcing a tight bounding box gives the training data more consistent alignment and, hence, produces better templates.

Mixture Models We observe that there is great intra-class variability within each character class. Different "prototypes" of a letter can have very different appearances. Zhu et al. [24] suggest that we can grow the model complexity by adding mixture models to capture the "sub-category" structure of the data. The first step of training mixture models involves clustering the existing data into K different clusters. A classifier is then trained separately for each cluster. This technique often requires a large amount of training data as each cluster requires sufficient positive examples to train a good model. Fortunately, in recent years, many text datasets with character-level annotations have been published [4, 12, 13, 21]. Combining the character samples from these datasets gives a large set for training mixture models. For each character, we split the samples into 10 clusters using K-means. Similar to Zhu's [24] experiments, we append the aspect ratio to the HOG features, and use PCA to reduce the feature dimensionality prior to clustering.
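A rough sketch of this clustering step is shown below; the 64 PCA dimensions and the helper name are illustrative assumptions rather than the authors' settings.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def cluster_character_samples(hog_feats, aspect_ratios, k=10, pca_dims=64):
    """Split one character class into k sub-categories (returns a cluster label per sample)."""
    feats = np.hstack([hog_feats, np.asarray(aspect_ratios).reshape(-1, 1)])
    reduced = PCA(n_components=pca_dims).fit_transform(feats)
    return KMeans(n_clusters=k, n_init=10).fit_predict(reduced)
```

A separate binary detector, as in the previous section, would then be trained for each cluster.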
2.2. Pictorial Structures and Word Rescoring

We follow Wang's Pictorial Structures [20] approach for constructing the word detections from character bounding boxes. We also use the same word rescoring technique, but construct the training set differently. To construct the training set for the rescoring SVM, we randomly select 100 frames that contain text from each video in the training set. We run the system on this set and label each returned word positive if matched with a ground truth and negative otherwise.
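One possible way to assemble this training set is sketched below; run_pipeline, matches_ground_truth, and the frame attributes are hypothetical stand-ins for the corresponding pieces of the system.

```python
import random

def build_rescoring_training_set(videos, run_pipeline, matches_ground_truth,
                                 frames_per_video=100):
    """Collect (word detection, label) pairs for the rescoring SVM."""
    examples = []
    for video in videos:
        text_frames = [f for f in video.frames if f.has_text]
        sampled = random.sample(text_frames, min(frames_per_video, len(text_frames)))
        for frame in sampled:
            for word in run_pipeline(frame):
                examples.append((word, 1 if matches_ground_truth(word, frame) else 0))
    return examples
```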
2.3. Exploiting Temporal Redundancy

Lienhart [11] suggests in his survey that one can exploit temporal redundancy in video to remove false positives in individual frames and recover missed detections. In the following sections, we describe three modules that leverage temporal properties to improve the performance further: the temporal smoothing module, the linker, and the post-processing module.

Temporal Smoothing Due to the local nature of scanning-window approaches, false positives are often pervasive, especially in a system that favors recall over precision. This problem was also observed in [14]. Removing character false positives at this step is crucial as it reduces the search space for the word detection step.

Given a bounding box detection at a given frame, we compute the overlap ratio between it and all detections in the preceding and following N frames. If a sufficient number of the neighboring frames contain detections that satisfy the overlap condition with the target frame, we keep the detection in the current frame. Otherwise, we discard it as a false positive. More formally, let $b_i^f$ be the $i$-th bounding box in frame $f$, $|b_i^f|$ be the number of pixels in this region, $D_t$ be the number of detections in time $t$, and $\alpha$ be a real number between 0 and 1, and define

$$
I(b_i^f, t) =
\begin{cases}
1 & \text{if } \bigvee_{j=1}^{D_t} \dfrac{|b_i^f \cap b_j^{f-t}|}{|b_i^f \cup b_j^{f-t}|} > \alpha \\
0 & \text{otherwise}
\end{cases}
\quad (1)
$$

Next, we define a function $H$ that takes the $i$-th bounding box at frame $f$, $b_i^f$, and returns 1 for a true positive or 0 for a false positive. The function is defined as

$$
H(b_i^f) = \left( \sum_{j=1}^{N} I(b_i^f, j) \ge \beta N \right) \vee \left( \sum_{k=1}^{N} I(b_i^f, -k) \ge \beta N \right)
\quad (2)
$$

where $\beta$ is a real number between 0 and 1.

This module has three hyperparameters: (1) $N$ controls the number of frames to search forward and backward, (2) $\alpha$ specifies the overlap ratio condition, and (3) $\beta$ controls the fraction of neighboring frames needed to decide whether the current detection is a true positive. We perform a grid search on a validation set to tune these parameters.
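The rule in Eqs. (1) and (2) can be sketched as follows. The box representation, the detections_by_frame mapping, and the helper names are illustrative assumptions rather than the released code.

```python
def iou(a, b):
    """Overlap ratio (intersection over union) of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    if inter == 0:
        return 0.0
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / float(union)

def temporal_overlap(box, f, t, detections_by_frame, alpha):
    """Eq. (1): 1 if some detection in frame f - t overlaps box by more than alpha."""
    return int(any(iou(box, other) > alpha
                   for other in detections_by_frame.get(f - t, [])))

def keep_detection(box, f, detections_by_frame, N, alpha, beta):
    """Eq. (2): keep box if enough of the preceding or following N frames agree."""
    preceding = sum(temporal_overlap(box, f, j, detections_by_frame, alpha)
                    for j in range(1, N + 1))
    following = sum(temporal_overlap(box, f, -k, detections_by_frame, alpha)
                    for k in range(1, N + 1))
    return int(preceding >= beta * N or following >= beta * N)
```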
Linker When the results are temporally smoothed to remove false positives, we proceed to link the per-frame detections into tracks. For each detection in frame t, we search backwards in a buffer of 10 prior frames. We consider the following features:

• the overlap ratio between the current bounding box and the candidate in the prior frame,

• the edit distance between the word and the candidate word,

• the temporal distance between the current detection and the candidate, measured in number of frames.

We train a linear classifier to determine whether the bounding box detection in previous frames is a match for the current bounding box. We only allow one match per previous frame. If there is more than one, we choose the one with the highest score. We assign the current bounding box the identifier that is present in the majority of matched frames.

To establish a training set, we consider each detection in each frame and the set of candidate bounding boxes in the previous 10 frames. We label the bounding boxes with the same track identifier as positives and the rest as negatives. We feed these labels and the computed features to a standard SVM package2.

2 http://www.csie.ntu.edu.tw/~cjlin/liblinear/
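The pairwise features and the matching rule can be sketched as follows; detections are assumed to carry box, word, and frame fields, iou is the overlap function from the temporal-smoothing sketch, and the classifier is assumed to expose a decision_function as in scikit-learn's liblinear wrappers.

```python
def edit_distance(a, b):
    """Levenshtein distance between two words."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def link_features(current, candidate):
    """Features for a (current detection, prior-frame candidate) pair."""
    return [iou(current.box, candidate.box),              # spatial overlap
            edit_distance(current.word, candidate.word),  # word similarity
            current.frame - candidate.frame]              # temporal distance in frames

def best_match(current, candidates, classifier):
    """Keep at most one match per prior frame: the highest-scoring positive pair."""
    best, best_score = None, 0.0
    for cand in candidates:
        score = classifier.decision_function([link_features(current, cand)])[0]
        if score > best_score:
            best, best_score = cand, score
    return best
```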
Post Processing Due to motion blur and other artifacts, detections in some frames within a track might be missing, causing temporal fragmentation. To mitigate this effect, we linearly interpolate both the word scores and the word bounding boxes between the detected frames to recover missing detections. Finally, we remove all tracks with fewer than 10 frames.
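A minimal sketch of this gap filling, assuming a track is a mapping from frame index to a (box, score) pair:

```python
def interpolate_track(track):
    """Linearly interpolate boxes and scores for frames missing inside a track."""
    frames = sorted(track)
    filled = dict(track)
    for f0, f1 in zip(frames, frames[1:]):
        (box0, s0), (box1, s1) = track[f0], track[f1]
        for f in range(f0 + 1, f1):
            w = (f - f0) / float(f1 - f0)
            box = tuple((1 - w) * a + w * b for a, b in zip(box0, box1))
            filled[f] = (box, (1 - w) * s0 + w * s1)
    return filled

def prune_tracks(tracks, min_frames=10):
    """Discard tracks that cover fewer than min_frames frames."""
    return [t for t in tracks if len(t) >= min_frames]
```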
3. Experiments

3.1. Dataset

ICDAR-VIDEO In the ICDAR 2013 Robust Reading Competition Challenge 3 [7], a new video dataset was presented in an effort to address the problem of text detection in videos. The dataset consists of 28 videos in total: 13 videos for the training set and 15 for the test set. The scenarios in the videos include walking outdoors, shopping in grocery stores, driving, and searching for directions within a building. Each video is around 10 seconds to 1 minute long, capturing scenes from real-life situations using different types of cameras.

To construct the lexicon for a video, we extract all ground truth words in the dataset to form a vocabulary. We then assign a lexicon for each video by taking a union of its ground truth words and a random subset of 500 "distractor" words sampled from the vocabulary.
In the ICDAR-VIDEO dataset, the annotators assigned each word a quality, which can be "LOW", "MEDIUM", or "HIGH". During the competition, the "LOW" quality was not considered. More specifically, misses on the low quality words did not penalize the system and detections of the low quality words did not improve the score. For the sake of simplicity of the evaluation framework, we decided to include words with all qualities in our evaluation.
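A small sketch of the lexicon construction described above (the fixed seed is only for reproducibility of the illustration):

```python
import random

def build_lexicon(video_gt_words, vocabulary, num_distractors=500, seed=0):
    """Per-video lexicon: ground-truth words plus randomly sampled distractors."""
    rng = random.Random(seed)
    pool = sorted(vocabulary)
    distractors = rng.sample(pool, min(num_distractors, len(pool)))
    return set(video_gt_words) | set(distractors)
```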
YouTube Video Text We introduce the YouTube Video Text3 (YVT) dataset harvested from YouTube. The text content in the dataset can be divided into two categories, overlay text (e.g., captions, song titles, logos) and scene text (e.g., street signs, business signs, words on shirts). Figure 1 shows examples of text content in the YVT dataset.

3 http://vision.ucsd.edu/datasetsAll

We downloaded a large number of videos by crawling YouTube. We split each downloaded video into 15-second segments and created Amazon Mechanical Turk tasks to filter out segments without text content. For each segment that contained text, we annotated the locations of text regions using VATIC [19]. We instructed the annotators to draw tight bounding boxes only for readable English words. Next, we asked the annotators to go through each frame in the ground truth track and type the word contained in this bounding box.

The dataset contains a total of 30 videos, 15 in the training set and 15 in the testing set. Each video has HD 720p quality, 30 frames per second, and 15-second duration. We constructed the lexicon in the same way as for the ICDAR-VIDEO dataset.
3.2. Performance Metrics

There are many evaluation frameworks proposed for multiple object tracking systems [8, 9]. In this paper, we use the ATA metric from the VACE framework [8] to measure the performance of the detection systems. We also propose a video precision-recall metric to measure the recognition performance.

VACE Metrics The Average Tracking Accuracy (ATA) in the VACE framework provides a spatio-temporal detection measure that penalizes fragmentation in both the temporal and spatial dimensions.

For every frame $t$, a text tracking system outputs a set of bounding box detections $\{d_1^t, \ldots, d_n^t\}$. We denote the ground truth at frame $t$ as $\{g_1^t, \ldots, g_n^t\}$. Using their unique identifiers, we can group these per-frame detections and ground truth into separate tracks, $\{D_1, \ldots, D_r\}$ and $\{G_1, \ldots, G_q\}$.

The first step is to establish a one-to-one mapping between the ground truth and the system output tracks. We follow the mapping process described in [8]. Once the mapping between detection tracks and ground truth is established, the Sequence Track Detection Accuracy (STDA) is defined as

$$
STDA = \sum_{i=1}^{N_M} \frac{\sum_t m(G_i^t, D_i^t)}{N_{G_i \cup D_i \neq \varnothing}}
\quad (3)
$$

where $N_M$ is the number of correspondences in the mapping $M$, $N_{G_i \cup D_i \neq \varnothing}$ is the number of frames where either $G_i$ or $D_i$ exist, and $m(G_i^t, D_i^t)$ takes the value 1 if $overlap(G_i^t, D_i^t) > 0.5$ and 0 otherwise.

The Average Tracking Accuracy is then calculated as

$$
ATA = \frac{STDA}{\frac{N_G + N_D}{2}}
\quad (4)
$$

where $N_G$ and $N_D$ are the number of ground truth tracks and the number of detection tracks.
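Eqs. (3) and (4) can be sketched as below, where each track is a mapping from frame index to a box, mapping is the list of matched (ground truth, detection) track pairs, and iou_fn is an overlap function such as the one sketched earlier; the one-to-one matching step itself is not shown.

```python
def frame_match(gt_box, det_box, iou_fn, thresh=0.5):
    """m(G_i^t, D_i^t): 1 when both boxes exist and overlap by more than thresh."""
    return int(gt_box is not None and det_box is not None
               and iou_fn(gt_box, det_box) > thresh)

def stda(mapping, iou_fn):
    """Eq. (3): sum of per-pair matched-frame ratios over the matched track pairs."""
    total = 0.0
    for gt, det in mapping:
        frames = set(gt) | set(det)          # frames where either track exists
        matched = sum(frame_match(gt.get(f), det.get(f), iou_fn) for f in frames)
        total += matched / float(len(frames))
    return total

def ata(mapping, num_gt_tracks, num_det_tracks, iou_fn):
    """Eq. (4): STDA normalized by the average number of GT and detected tracks."""
    return stda(mapping, iou_fn) / ((num_gt_tracks + num_det_tracks) / 2.0)
```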
Video Precision and Recall Precision-recall curves are used widely in the literature of text recognition in the image domain. This metric demonstrates the tradeoff of the system and facilitates the selection of operating points. We propose to extend this to a sequence-level precision and recall performance metric. In particular, ground truths are matched at the sequence level similar to the computation for ATA. We, however, place additional restrictions. We define $m(G_i^t, D_i^t)$ to be 1 if the overlap ratio is greater than 0.5 and the words in frame $t$ match (ignoring case).

A ground truth track $G_i$ and a detection track $D_i$ are considered a match if and only if

$$
overlap = \frac{\sum_t m(G_i^t, D_i^t)}{N_{G_i \cup D_i \neq \varnothing}} > 0.5
\quad (5)
$$

The threshold at 50% is arbitrary but reasonable. This extra restriction requires that the detected track must sufficiently fit the ground truth track at least half of the time. Once the one-to-one mapping is established, the matched detections are considered true positives, the unmatched detection tracks are false positives, and the unmatched ground truth tracks are false negatives. We then use the conventional definitions of precision and recall to evaluate the performance of the detections.
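The sequence-level precision and recall can be sketched in the same style; here each per-frame track entry is assumed to be a dict with "box" and "word" keys, iou_fn is assumed as before, and the track-level matching threshold follows Eq. (5).

```python
def word_frame_match(gt, det, iou_fn, thresh=0.5):
    """1 if the boxes overlap enough and the words agree (ignoring case)."""
    if gt is None or det is None:
        return 0
    return int(iou_fn(gt["box"], det["box"]) > thresh
               and gt["word"].lower() == det["word"].lower())

def track_overlap(gt_track, det_track, iou_fn):
    """Eq. (5): fraction of frames (where either track exists) that match."""
    frames = set(gt_track) | set(det_track)
    matched = sum(word_frame_match(gt_track.get(f), det_track.get(f), iou_fn)
                  for f in frames)
    return matched / float(len(frames))

def video_precision_recall(mapping, num_gt_tracks, num_det_tracks, iou_fn):
    """Matched pairs above 0.5 overlap are true positives; the rest are FPs/FNs."""
    tp = sum(1 for gt, det in mapping if track_overlap(gt, det, iou_fn) > 0.5)
    precision = tp / float(num_det_tracks) if num_det_tracks else 0.0
    recall = tp / float(num_gt_tracks) if num_gt_tracks else 0.0
    return precision, recall
```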
3.3. Character Detection

We begin with the evaluations of 4 different character detection models. Wang et al. [20] synthesize characters with different fonts, backgrounds, rotation angles, and noises. (1) We train character models (SYNTH) from these synthetic data. (2) The second set of models is
Figure 3: Character detection performance (F2-score) comparing different methods.

Figure 5: Video precision-recall performance of the ABBYY baseline, TextSpotter, PLEX, and our detection-from-recognition pipeline (DR). Methods with temporal enhancements are denoted with +T.

Figure 6: ATA metrics for the baseline methods and our detection-from-recognition pipeline (DR), see text for details.
Figure 7: Selected results for the DR+T method. The detections are shown in green while the ground truth is shown in red.

References

[1] X. Chen and A. L. Yuille. Detecting and reading text in natural scenes. In CVPR, 2004.
[2] A. Coates, B. Carpenter, C. Case, S. Satheesh, B. Suresh, T. Wang, D. J. Wu, and A. Y. Ng. Text detection and character recognition in scene images with unsupervised feature learning. In ICDAR, 2011.
[3] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.
[4] T. E. de Campos, B. R. Babu, and M. Varma. Character recognition in natural images. In ICCVTA, 2009.
[5] W. Huang, Z. Lin, J. Yang, and J. Wang. Text localization in natural images using stroke feature transform and text covariance descriptors. In ICCV, 2013.
[6] M. J. Huiskes and M. S. Lew. The MIR Flickr retrieval evaluation. In MIR, 2008.
[7] D. Karatzas, F. Shafait, S. Uchida, M. Iwamura, S. R. Mestre, J. Mas, D. F. Mota, J. A. Almazan, and L. P. de las Heras. ICDAR 2013 robust reading competition. In ICDAR, 2013.
[8] R. Kasturi, D. Goldgof, P. Soundararajan, V. Manohar, J. Garofolo, R. Bowers, M. Boonstra, V. Korzhova, and J. Zhang. Framework for performance evaluation of face, text, and vehicle detection and tracking in video: Data, metrics, and protocol. PAMI, 2009.
[9] B. Keni and S. Rainer. Evaluating multiple object tracking performance: the CLEAR MOT metrics. EURASIP JIVP, 2008.
[10] S. Lee, M. S. Cho, K. Jung, and J. H. Kim. Scene text extraction with edge constraint and text collinearity. In ICPR, 2010.
[11] R. Lienhart. Video OCR: a survey and practitioner's guide. In Video Mining, 2003.
[12] S. M. Lucas, A. Panaretos, L. Sosa, A. Tang, S. Wong, and R. Young. ICDAR 2003 robust reading competitions. In ICDAR, 2003.
[13] A. Mishra, K. Alahari, and C. Jawahar. Top-down and bottom-up cues for scene text recognition. In CVPR, 2012.
[14] A. Mishra, K. Alahari, and C. Jawahar. Scene text recognition using higher order language priors. In BMVC, 2012.
[15] L. Neumann and J. Matas. A method for text localization and recognition in real-world images. In ACCV, 2010.
[16] L. Neumann and J. Matas. Real-time scene text localization and recognition. In CVPR, 2012.
[17] L. Neumann and J. Matas. Scene text localization and recognition with oriented stroke detection. In ICCV, 2013.
[18] T. Q. Phan, P. Shivakumara, S. Tian, and C. L. Tan. Recognizing text with perspective distortion in natural scenes. In ICCV, 2013.
[19] C. Vondrick, D. Patterson, and D. Ramanan. Efficiently scaling up crowdsourced video annotation. IJCV, 2012.
[20] K. Wang, B. Babenko, and S. Belongie. End-to-end scene text recognition. In ICCV, 2011.
[21] K. Wang and S. Belongie. Word spotting in the wild. In ECCV, 2010.
[22] J. J. Weinman and E. Learned-Miller. Improving recognition of novel input with similarity. In CVPR, 2006.
[23] J. J. Weinman, E. Learned-Miller, and A. R. Hanson. Scene text recognition using similarity and a lexicon with sparse belief propagation. PAMI, 2009.
[24] X. Zhu, C. Vondrick, D. Ramanan, and C. Fowlkes. Do we need more training data or better models for object detection? In BMVC, 2012.