
Actions in the Eye: Dynamic Gaze Datasets and Learnt Saliency Models for Visual Recognition

Stefan Mathe, Member, IEEE, Cristian Sminchisescu, Member, IEEE

arXiv:1312.7570v1 [cs.CV] 29 Dec 2013

Stefan Mathe is with the Institute of Mathematics at the Romanian Academy and the Computer Science Department at the University of Toronto. Email: [email protected].
Cristian Sminchisescu is with the Department of Mathematics, Faculty of Engineering, Lund University and the Institute of Mathematics at the Romanian Academy. Email: [email protected] (c.a.)

Abstract—Systems based on bag-of-words models from image features collected at maxima of sparse interest point operators have been used successfully for both computer visual object and action recognition tasks. While the sparse, interest-point based approach to recognition is not inconsistent with visual processing in biological systems that operate in 'saccade and fixate' regimes, the methodology and emphasis in the human and the computer vision communities remain sharply distinct. Here, we make three contributions aiming to bridge this gap. First, we complement existing state-of-the-art large scale dynamic computer vision annotated datasets like Hollywood-2 [1] and UCF Sports [2] with human eye movements collected under the ecological constraints of the visual action recognition task. To our knowledge these are the first large human eye tracking datasets to be collected and made publicly available for video, at vision.imar.ro/eyetracking (497,107 frames, each viewed by 16 subjects), unique in terms of their (a) large scale and computer vision relevance, (b) dynamic, video stimuli, and (c) task control, as opposed to free viewing. Second, we introduce novel sequential consistency and alignment measures, which underline the remarkable stability of patterns of visual search among subjects. Third, we leverage the significant amount of collected data in order to pursue studies and build automatic, end-to-end trainable computer vision systems based on human eye movements. Our studies not only shed light on the differences between computer vision spatio-temporal interest point image sampling strategies and human fixations, as well as their impact on visual recognition performance, but also demonstrate that human fixations can be accurately predicted, and, when used in an end-to-end automatic system leveraging some of the advanced computer vision practice, can lead to state of the art results.

Index Terms—visual action recognition, human eye-movements, consistency analysis, saliency prediction, large scale learning

1 INTRODUCTION

RECENT progress in computer visual recognition, in particular image classification, object detection and segmentation or action recognition, heavily relies on machine learning methods trained on large scale human annotated datasets. The level of annotation varies, spanning a degree of detail from global image or video labels to bounding boxes or precise segmentations of objects [3]. However, the annotations are often subjectively defined, primarily by the high-level visual recognition tasks generally agreed upon by the computer vision community. While such data has made advances in system design and evaluation possible, it does not necessarily provide insights or constraints into those intermediate levels of computation, or deep structure, that are perceived as ultimately necessary in order to design highly reliable computer vision systems. This is noticeable in the accuracy of state of the art systems trained with such annotations, which still lags significantly behind human performance on similar tasks. Nor does existing data make it immediately possible to exploit insights from an existing working system–the human eye–to potentially derive better features, models or algorithms.

The divide is well epitomized by the lack of matching large scale datasets that would provide recordings of the workings of the human visual system, in the context of a visual recognition task, at different levels of interpretation, including neural systems or eye movements. The human eye movement level, defined by image fixations and saccades, is potentially the least controversial to measure and analyze. It is sufficiently 'high-level' or 'behavioral' for the computer vision community to rule out, to some degree at least, open-ended debates on where and what one should record, as could be the case, for instance, with neural systems in different brain areas [4]. Besides, our goals in this context are pragmatic: fixations provide a sufficiently high-level signal that can be precisely registered with the image stimuli, for testing hypotheses and for training visual feature extractors and recognition models quantitatively. It can potentially foster links with the human vision community, in particular researchers developing biologically plausible models of visual attention, who would be able to test and quantitatively analyze their models on shared large scale datasets [4], [5], [6].

Some of the most successful approaches to action recognition employ bag-of-words representations based on descriptors computed at spatio-temporal video locations, obtained at the maxima of an interest point operator biased to fire over non-trivial local structure (space-time 'corners' or spatio-temporal interest points [7]). More sophisticated image representations based on objects and their relations, as well as multiple kernels, have been employed with a degree of success [8], although it appears still difficult to detect a large variety of useful objects reliably in challenging video footage. Although human pose estimation could greatly disambiguate the interactions between actors and manipulated objects, it is a difficult problem even
in a controlled setting, due to the large number of local minima in the search space [9]. The dominant role of sparse spatio-temporal interest point operators as the front end of computer vision systems raises the question whether computational insights from a working system like the human visual system can be used to improve performance. The sparse approach to computer visual recognition is not inconsistent with that of biological systems, but the degree of repeatability and the effect of using human fixations with computer vision algorithms in the context of action recognition have not yet been explored.

In this paper we make the following contributions:

1) We undertake a significant effort of recording and analyzing human eye movements in the context of dynamic visual action recognition tasks for two existing computer vision datasets, Hollywood-2 [1] and UCF Sports [2]. This dynamic data is made publicly available to the research community at vision.imar.ro/eyetracking.

2) We introduce novel consistency models and algorithms, as well as relevance evaluation measures adapted for video. Our findings (see §4) suggest a remarkable degree of consistency – both spatial and sequential – in the fixation patterns of human subjects, but also underline a less extensive influence of task on dynamic fixations than previously believed, at least within the class of datasets and actions we studied.

3) By using our large scale training set of human fixations and by leveraging static and dynamic image features based on color, texture, edge distributions (HoG) or motion boundary histograms (MBH), we introduce novel saliency detectors and show that these can be trained effectively to predict human fixations, as measured under both average precision (AP) and Kullback-Leibler spatial comparison measures. See §8 and table 6 for results.

4) We show that training an end-to-end automatic visual action recognition system based on our learned saliency interest operator (point 3), and using advanced computer vision descriptors and fusion methods, leads to state of the art results on the Hollywood-2 and UCF Sports action datasets. This is, we argue, one of the first demonstrations of a successful symbiosis of computer vision and human vision technology, within the context of a very challenging dynamic visual recognition task. It shows the potential of interest point operators learnt from human fixations for computer vision. Models and experiments appear in §9, results in table 4. This paper extends our prior work in [10].

The paper is organized as follows. In §2 we briefly review existing studies on human visual attention and saliency, as well as the state-of-the-art computational models for automatic action recognition from videos. Our dataset and data collection methodology are introduced in §3. In §4 we analyze inter-subject agreement and introduce two novel metrics for measuring spatial and sequential visual consistency in the video domain. In addition to showing remarkable visual consistency, human subjects also tend to fixate image structures that are semantically meaningful, which we illustrate in §5. This naturally suggests that human fixations could provide useful information to support automatic action recognition systems. In §6 we introduce our action recognition pipeline, which we shall use through the remainder of the paper. Section §7 explores the action recognition potential of several interest point operators derived from ground truth human fixations and visual saliency maps. In §8 we turn our attention to the problem of human visual saliency prediction, and introduce a novel spatio-temporal human fixation detector trained using our human gaze dataset. Section §9 illustrates how predicted saliency maps can be integrated into a modern state-of-the-art end-to-end action recognition system. We draw our final conclusions in §10.

Fig. 1: Heat maps generated from the fixations of 16 human subjects viewing 6 videos selected from the Hollywood-2 and UCF Sports datasets. Fixated locations are generally tightly clustered. This suggests a significant degree of consistency among human subjects in terms of the spatial distribution of their visual attention. See Fig. 3 for quantitative studies.

2 RELATED WORK

The study of gaze patterns in humans has long received significant interest from the human vision community [11]. Whether visual attention is driven by purely bottom-up cues [12] or by a combination of top-down and bottom-up influences [13] is still open to debate (see [13], [14] for comprehensive reviews). One of the oldest theories of visual attention has been that bottom-up features guide vision towards locations of high saliency [13], [14]. Early computational models of attention [12], [15] assume that human fixations are driven by maxima
inside a topographical map that encodes the saliency of each point in the visual field.

Models of saliency can be either pre-specified or learned from eye tracking data. In the former category falls the basic saliency model [15], which combines information from multiple channels into a single saliency map. Information maximization [16] provides an alternative criterion for building saliency maps. These can be learned from low-level features [17] or from a combination of low, mid and high-level ones [5], [18], [19]. Saliency maps have been used for scene classification [20], object localization and recognition [21], [22], [23], [24], [25] and action recognition [26], [27]. Comparatively little attention has been devoted to computational models of saliency maps for the dynamic domain. Bottom-up saliency models for static images have been extended to videos by incorporating motion and flicker channels [28], [29]. All these models are pre-specified. One exception is the work of Kienzle et al. [27], who train an interest point detector using fixation data collected from human subjects in a free-viewing (rather than task-specific) setting.

Datasets containing human gaze pattern annotations of images have emerged from studies carried out by the human vision community, some of which are publicly available [5], [15], [19], [22], [30] and some that are not [27] (see [31] for an overview). Most of these datasets have been designed for small quantitative studies, consisting of at most a few hundred images or videos, usually recorded under free viewing, in sharp contrast with the data we provide, which is large scale, dynamic, and task controlled. These studies [6], [13], [15], [19], [22], [27], [32], [33] could however benefit from larger scale natural datasets, and from studies that emphasize the task, as we pursue.

The problem of visual attention and the prediction of visual saliency have long been of interest in the human vision community [13], [14], [15]. Recently there has been a growing trend of training visual saliency models based on human fixations, mostly in static images (with the notable exception of [27]), and under subject free-viewing conditions [5], [18]. While visual saliency models can be evaluated in isolation under a variety of measures against human fixations, for computer vision their ultimate test remains the demonstration of relevance within an end-to-end automatic visual recognition pipeline. While such integrated systems are still in their infancy, promising demonstrations have recently emerged for computer vision tasks like scene classification [20] and verifying correlations with object (pedestrian) detection responses [21], [22]. An interesting early biologically inspired recognition system was presented by Kienzle et al. [27], who learn a fixation operator from human eye movements collected under video free-viewing, then learn action classification models for the KTH dataset with promising results. Recently, under the constraint of a first person perspective, Fathi et al. [30] have shown that human fixations can be predicted and used to enhance action recognition performance.

In contrast, in computer vision, interest point detectors have been successfully used in the bag-of-visual-words framework for action classification and event detection [1], [7], [8], [34], but a variety of other methods exists, including random field models [35], [36] and structured output SVMs [37]. Currently the most successful systems remain the ones dominated by complex features extracted at interesting locations, bagged and fused using advanced kernel combination techniques [1], [8]. This study is driven, primarily, by our computer vision interests, yet leverages data collection and insights from human vision. While in this paper we focus on bag-of-words spatio-temporal computer action recognition pipelines, the scope for study and the structure in the data are broader. We do not see this investigation as a terminus, but rather as a first step in exploring some of the most advanced data and models that human vision and computer vision can offer at the moment.

3 LARGE SCALE HUMAN EYE MOVEMENT DATA COLLECTION IN VIDEO

An objective of this work is to introduce additional annotations, in the form of eye recordings, for two large scale video data sets for action recognition.

The Hollywood-2 Movie Dataset: Introduced in [1], it is one of the largest and most challenging available datasets for real world actions. It contains 12 classes: answering phone, driving a car, eating, fighting, getting out of a car, shaking hands, hugging, kissing, running, sitting down, sitting up and standing up. These actions are collected from a set of 69 Hollywood movies. The data set is split into a training set of 823 sequences and a test set of 884 sequences. There is no overlap between the 33 movies in the training set and the 36 movies in the test set. The data set consists of about 487k frames, totaling about 20 hours of video.

The UCF Sports Action Dataset: This high resolution dataset [2] was collected mostly from broadcast television channels. It contains 150 videos covering 9 sports action classes: diving, golf swinging, kicking, lifting, horseback riding, running, skateboarding, swinging and walking.

Human subjects: We have collected data from 16 human volunteers (9 male and 7 female) aged between 21 and 41. We split them into an active group, which had to solve an action recognition task, and a free viewing group, which was not required to solve any specific task while being presented with the videos in the two datasets. There were 12 active subjects (7 male and 5 female) and 4 free viewing subjects (2 male and 2 female). None of the free viewers was aware of the task of the active group and none was a cognitive scientist. We chose the two groups such that no pair of subjects from different groups were acquainted with each other, in order to limit biases on the free viewers.

Recording environment: Eye movements were recorded using an SMI iView X HiSpeed 1250 tower-mounted eye tracker, with a sampling frequency of 500Hz. The head of the subject was placed on a chin-rest located at 60 cm from the display. Viewing conditions were binocular and gaze data was collected from the right eye of the participant.1 The LCD display had a resolution of 1280 × 1024 pixels, with a physical screen size of 47.5 x 29.5 cm.

1. None of our subjects exhibited a strongly dominant left eye, as determined by the two index finger method.
The calibration procedure was carried out at the beginning of each block. The subject had to follow a target that was placed sequentially at 13 locations evenly distributed across the screen. Accuracy of the calibration was then validated at 4 of these calibrated locations. If the error in the estimated position was greater than 0.75° of visual angle, the experiment was stopped and calibration restarted. At the end of each block, validation was carried out again, to account for fluctuations in the recording environment. If the validation error exceeded 0.75° of visual angle, the data acquired during the block was deemed noisy and discarded from further analysis. Because the resolution varies across the datasets, each video was rescaled to fit the screen, preserving the original aspect ratio. The visual angles subtended by the stimuli were 38.4° in the horizontal plane and ranged from 13.81° to 26.18° in the vertical plane.

Recording protocol: Before each video sequence was shown, participants in the active group were required to fixate the center of the screen. Display would proceed automatically using the trigger area-of-interest feature provided by the iView X software. Participants had to identify the actions in each video sequence. Their multiple choice answers were recorded through a set of check-boxes displayed at the end of each video, which the subject manipulated using a mouse.2 Participants in the free viewing group underwent a similar protocol, the only difference being that the questionnaire answering step was skipped. To avoid fatigue, we split the set of stimuli into 20 sessions (for the active group) and 16 sessions (for the free viewing group), each participant undergoing no more than one session per day. Each session consisted of 4 blocks, each designed to take approximately 8 minutes to complete (calibration excluded), with 5-minute breaks between blocks. Overall, it took approximately 1 hour for a participant to complete one session. The video sequences were shown to each subject in a different random order.

2. While the representation of actions is an open problem, we relied on the datasets and labeling of the computer vision community, as a first study, and to maximize impact on current research. In the long run, weakly supervised learning could be better suited to map persistent structure to higher level semantic action labels.

Fig. 2: Action recognition performance by humans on the Hollywood-2 database. The confusion matrix includes the 12 action labels plus the 4 most frequent combinations of labels in the ground truth.

4 SPATIAL AND SEQUENTIAL CONSISTENCY

4.1 Action Recognition by Humans

Our goal is to create a data set that captures the gaze patterns of humans solving a recognition task. Therefore, it is important to ensure that our subjects are successful at this task. Fig. 2 shows the confusion matrix between the answers of human subjects and the ground truth. For Hollywood-2, there can be multiple labels associated with the same video. We show, apart from the 12 action labels, the 4 most common combinations of labels occurring in the ground truth and omit less common ones. The analysis reveals, apart from near-perfect performance, the types of errors humans are prone to make. The most frequent human errors are omissions of one of the actions co-occurring in a video. False positives are much less frequent. The third type of error, mislabeling a video entirely, almost never happens, and when it does it usually involves semantically related actions, e.g. DriveCar and GetOutOfCar, or Kiss and Hug.

4.2 Spatial Consistency Among Subjects

In this section, we investigate how well the regions fixated by human subjects agree on a frame by frame basis, by generalizing to video data the procedure used by Ehinger et al. [22] in the case of static stimuli.

Evaluation Protocol: For the task of locating people in a static image, [22] have evaluated how well one can predict the regions fixated by a particular subject from the regions fixated by the other subjects on the same image. Taken in isolation, this measure is however not meaningful, as part of the inter-subject agreement is due to bias in the stimulus itself (e.g. photographer's bias) or to the tendency of humans to fixate more often at the center of the screen [14]. Therefore, one can address this issue with a cross-stimulus control, checking how well the fixation of a subject on one stimulus can be predicted from those of the other subjects on a different, unrelated stimulus. Normally, the average precision when predicting fixations on the same stimulus is much greater than on different stimuli. We generalize this protocol for video by randomly choosing frames from our videos and checking inter-subject correlation on them. We test each subject in turn with respect to the other (training) subjects. An empirical saliency map is generated by adding a Dirac impulse at each pixel fixated by the training subjects in that frame, followed by the application of a Gaussian blur filter. We then consider this map as a confidence map for predicting the test subject's fixation. There is a label of 1 at the test subject's fixation and 0 elsewhere. The area under the curve (AUC) score for this classification problem is then computed for the test subject. The average score over all test subjects defines the final consistency metric. This value ranges from 0.5 – when no consistency or bias effects are present in the data – to 1 – when all subjects fixate the same pixel and there is no measurement noise. For cross-stimulus control, we repeat this process for pairs of frames chosen from different videos and attempt to predict the fixation of each test subject on the first frame from the fixations of the other subjects on the other frame.
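To make the protocol concrete, the per-frame computation reduces to blurring the training subjects' fixations into an empirical saliency map and scoring the held-out subject's fixation against it. The following Python sketch illustrates this (it is not the original implementation): fixations are assumed to be given as pixel coordinates, and the blur width sigma_px is an illustrative stand-in for a kernel matched to a visual angle, whose pixel value depends on the viewing geometry.

import numpy as np
from scipy.ndimage import gaussian_filter

def agreement_auc(train_fixations, test_fixation, frame_shape, sigma_px=30):
    """Leave-one-subject-out agreement score for a single frame.

    train_fixations: list of (row, col) pixels fixated by the training subjects
    test_fixation:   (row, col) pixel fixated by the held-out subject
    Returns the AUC of the blurred fixation map used as a detector of the
    held-out fixation (0.5 = chance, 1.0 = perfect agreement).
    """
    # Empirical saliency map: Dirac impulses at training fixations, then Gaussian blur.
    saliency = np.zeros(frame_shape, dtype=np.float64)
    for r, c in train_fixations:
        saliency[r, c] += 1.0
    saliency = gaussian_filter(saliency, sigma=sigma_px)

    # One positive pixel (the held-out fixation), all other pixels negative.
    # Rank-based AUC: fraction of negatives scored below the positive (ties count half).
    pos_score = saliency[test_fixation]
    neg = np.delete(saliency.ravel(), np.ravel_multi_index(test_fixation, frame_shape))
    return (np.sum(neg < pos_score) + 0.5 * np.sum(neg == pos_score)) / neg.size

Averaging this score over randomly sampled frames and held-out subjects gives the inter-subject agreement; drawing the training fixations from a frame of a different, unrelated video yields the cross-stimulus control.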
Unlike in the procedure followed in [22], who considered several fixations per subject for each exposed image, we only consider the single fixation, if any, that a subject made on that frame. The reason is that our stimulus is dynamic and the spatial positions of future fixations are bound to be altered by changes in the stimulus itself. In our experiment, we set the width of the Gaussian blur kernel to match a visual angle span of 1.5°. We draw 1000 samples for both the same-stimulus and different-stimulus predictions. We disregard the first 200ms from the beginning of each video to remove bias due to the initial central fixation.

Fig. 3: Spatial inter-subject agreement for Hollywood-2 (left) and UCF Sports Actions (right). The ROC curves (detection rate versus false alarm rate) correspond to predicting the fixations of one subject from the fixations of the other subjects on the same video frame (inter-subject agreement, blue) or on a different video (cross-stimulus control, green) randomly selected from the dataset.

Findings: The ROC curves for inter-subject agreement and cross-stimulus control are shown in Fig. 3. For the Hollywood-2 dataset, the AUC score is 94.8% for inter-subject agreement and 72.3% for cross-stimulus control. For UCF Sports, we obtain values of 93.2% and 69.2%. These values are consistent with the results reported for static stimuli by [22], with slightly higher cross-stimulus control. This suggests that shooter's bias is stronger in artistic datasets (movies) than in natural scenes, a trend that has been observed by [38] for human observers free viewing Hollywood movie trailers as opposed to video shoots of outdoor scenes.

We also analyze inter-subject agreement on subsets of videos corresponding to each action class and for the 4 significant label combinations considered in Fig. 2. On each of these subsets, inter-subject consistency remains strong, as illustrated in Table 1a,b. Interestingly, there is significant variation in the cross-stimulus control across these classes, especially in the UCF Sports dataset. The setup being identical, we conjecture that part of this variation is due to the different ways in which various categories of scenes are filmed and the way the director aims to present the actions to the viewer. For example, in TV footage for sports events, the actions are typically shot from a frontal pose, the performer is centered almost perfectly and there is limited background clutter. These factors lead to a high degree of similarity among the stimuli within the class and make it easier to extrapolate subject fixations across videos, explaining the unusually high values of this metric for actions like Dive, Lift and Swing.

4.3 The Influence of Task on Eye Movements

We next evaluate the impact of the task on eye movements for this data set. For a given video frame, we compute the fixation probability distribution using data from all active subjects. Then, for each free viewer, we compute the p-statistic of the fixated location with respect to this distribution. We repeat this process for 1000 randomly sampled frames and compute the average p-value for each subject. Somewhat surprisingly, we find that none of our free viewers exhibits a fixation pattern that deviates significantly from that of our active group (see Table 2). Since in the Hollywood-2 dataset several actions can be present in a video, either simultaneously or sequentially, this rules out initial habituation effects and further neglect (free viewing) to some degree.3

3. Our findings do not assume or imply that free-viewing subjects may not be recognizing actions. However, we did not ask them to perform a task, nor were they aware of the purpose of the experiment, or of the interface presented to subjects given a task. While this is one approach to analyze task influence, it is not the only one possible. For instance, subjects may be asked to focus on different tasks (e.g. actions versus general scene recognition), although this setting may induce biases due to habituation with stimuli presented at least twice.

A similar lack of discriminability between tasks has been remarked by Greene et al. [39] for static stimuli. Their study shows that classifiers trained on human scanpaths can successfully discriminate observers or stimuli, but cannot classify the tasks of the observers. Although our results may seem to support the generalization of such observations to the video domain, we mention here two reasons why we believe no definite conclusion can be drawn at this point. First of all, we compare the action recognition task to a no-task condition, rather than discriminate between several specific task conditions. Second, in [39] the free viewers have a relatively large amount of time (10s) to solve the task. Although the authors analyze whether the first 2 seconds of viewing are discriminative for the task – and find it not to be the case – it is still unclear to what extent the somewhat lax time constraints will direct the observer to exhibit task-specific eye movements or to focus such behavior early on during exposure. On the other hand, the video domain has an intrinsic and variable exposure time induced by the changing scene, which may, in principle, evoke task-specific behavior during episodes of intense dynamic activity.

4.4 Sequential Consistency Among Subjects

Our spatial inter-subject agreement analysis shows that the spatial distribution of fixations in video is highly consistent across subjects. It does not however reveal whether there is significant consistency in the order in which subjects fixate among these locations. To our knowledge, there are no agreed upon sequential consistency measures in the community at the moment [40]. In this section, we propose two metrics that are sensitive to the temporal ordering among fixations and evaluate consistency under them. We first model the scanpath made by each subject as a sequence of discrete symbols and show how this representation can be produced automatically. We then define two metrics, AOI Markov dynamics and temporal AOI alignment, and show how they can be computed for this representation. After defining a baseline for our evaluation, we conclude with a discussion of the results.
TABLE 1: Spatial and Sequential Consistency Analysis

Label | spatial consistency: (a) inter-subject agreement, (b) cross-stimulus control | temporal AOI alignment: (c) inter-subject agreement, (d) random control | AOI Markov dynamics: (e) inter-subject agreement, (f) random control
AnswerPhone 94.9% 69.5% 71.1% 51.2% 65.7% 15.3%
DriveCar 94.3% 69.9% 70.4% 53.6% 78.5% 8.7%
Eat 93.0% 69.7% 73.2% 50.2% 79.9% 7.9%
FightPerson 96.1% 74.9% 66.2% 48.3% 78.5% 9.8%
GetOutCar 94.5% 69.3% 72.5% 52.0% 68.1% 13.8%
HandShake 93.7% 72.6% 69.3% 50.6% 68.4% 14.3%
HugPerson 95.7% 75.4% 71.7% 53.2% 71.7% 12.9%
Kiss 95.6% 72.1% 68.8% 49.9% 71.5% 12.7%
Run 95.9% 72.3% 76.9% 54.9% 69.6% 13.8%
SitDown 93.5% 68.4% 68.3% 49.7% 69.7% 12.7%
SitUp 95.4% 72.0% 71.9% 54.1% 64.0% 16.1%
StandUp 94.3% 69.2% 67.3% 53.9% 65.1% 15.1%
HugPerson + Kiss 92.8% 70.6% 68.2% 52.1% 71.2% 13.1%
HandShake + StandUp 91.3% 60.9% 70.5% 51.0% 65.7% 15.0%
FightPerson + Run 96.1% 66.6% 73.4% 52.3% 72.1% 12.5%
Run + StandUp 92.8% 66.4% 68.3% 52.4% 67.4% 14.4%
Any 94.8% 72.3% 70.8% 51.8% 70.2% 12.7%
Hollywood-2

Spatial and sequential consistency analysis results measured both globally and for each action class. Columns (a-b) represent the areas under the ROC curves
for spatial inter-subject consistencies and the corresponding cross-stimulus control. The classes marked by ∗ show significant shooter’s bias due to the very
similar filming conditions of all the videos within them (same shooting angle and position, no clutter). Good sequential consistency is revealed by the match
scores obtained by temporal alignment (c-d) and Markov dynamics (e-f).

TABLE 2: Spatial Consistency Between Active and Free-Viewing Subjects

(a) Per free-viewing subject:
free-viewing subject | p-value Hollywood-2 | p-value UCF Sports
1 0.67 0.55
2 0.67 0.55
3 0.60 0.60
4 0.65 0.64
Mean 0.65 0.58

(b) Per action class (Action Class, p-value):
Hollywood-2:
AnswerPhone 0.66
DriveCar 0.65
Eat 0.65
FightPerson 0.65
GetOutCar 0.65
HandShake 0.68
HugPerson 0.63
Kiss 0.63
Run 0.65
SitDown 0.61
SitUp 0.65
StandUp 0.64
Mean 0.65

UCF Sports Actions:
Dive 0.68
GolfSwing 0.67
Kick 0.58
Lift 0.57
RideHorse 0.51
Run 0.54
Skateboard 0.60
Swing 0.59
Walk 0.51
Mean 0.58

We measure consistency as the p-value associated with predicting each free-viewer from the saliency map derived from active subjects, averaged over 1000 randomly chosen video frames. We find that none of the scanpaths belonging to our free-viewing subjects (a) deviates significantly from ground truth data. Likewise, no significant differences are found when restricting the analysis to videos belonging to specific action classes (b).

Scanpath Representation: Human fixations tend to be tightly clustered spatially at one or more locations in the image. Assuming that such regions, called areas of interest (AOIs), can be identified, the sequence of fixations belonging to a subject can be represented discretely by assigning each fixation to the closest AOI. For example, from the video depicted in Fig. 4 (left), we identify six AOIs: the bumper of the car, its windshield, the passenger and the handbag he carries, the driver and the side mirror. We then trace the scan path of each subject through the AOIs based on spatial proximity, as shown in Fig. 4 (right). Each fixation gets assigned a label. For subject 2 shown in the example, this results in the sequence [bumper, windshield, driver, mirror, driver, handbag]. Notice that AOIs are semantically meaningful and tend to correspond to physical objects. Interestingly, this supports recent computer vision strategies based on object detectors for action recognition [8], [41], [42], [43].

Automatically Finding AOIs: Defining areas of interest manually is labour intensive, especially in the video domain. Therefore, we introduce an automatic method for determining their locations based on clustering the fixations of all subjects in a frame. We start by running k-means with 1 cluster and successively increase the number of clusters until the sum of squared errors drops below a threshold. We then link centroids from successive frames into tracks, as long as they are closely located spatially. For robustness, we allow for a temporal gap during the track building process. Each resulting track becomes an AOI, and each fixation is assigned to the closest AOI at the time of its creation.
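A rough sketch of the per-frame clustering step is given below (an illustration, not the original code); the SSE threshold, the cap on the number of clusters and the use of scikit-learn's KMeans are assumptions made for the example, and the linking of centroids into AOI tracks is only outlined in the trailing comment.

import numpy as np
from sklearn.cluster import KMeans

def frame_aoi_centroids(fixations, sse_threshold=400.0, max_clusters=8):
    """Cluster all subjects' fixations in one frame into candidate AOI centroids.

    fixations: (N, 2) array of fixated pixel coordinates in this frame.
    The number of clusters is grown until the sum of squared errors (inertia)
    drops below the threshold, mirroring the procedure described above.
    """
    km = None
    for k in range(1, min(max_clusters, len(fixations)) + 1):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(fixations)
        if km.inertia_ <= sse_threshold:
            break
    return km.cluster_centers_

# Linking into AOI tracks (sketch): for consecutive frames, match each centroid
# to the nearest centroid of the previous frame; start a new track when the
# distance exceeds a spatial threshold, and allow a small temporal gap before
# a track is terminated. Each fixation is then assigned to the closest track.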
Fig. 4: Areas of interest are obtained automatically by clustering the fixations of subjects. Left: Heat maps illustrating the assignments of fixations to AOIs over several frames of a Hollywood-2 video. The colored blobs have been generated by pooling together all fixations belonging to the same AOI. Right: Scan paths through the automatically generated AOIs (bumper, windshield, passenger, handbag, driver, mirror) for three subjects (2, 4 and 10), plotted against time (frames). Arrows illustrate saccades. Semantic labels have been manually assigned and illustrate the existence of cognitive routines centered at semantically meaningful objects.

AOI Markov Dynamics: In order to capture the dynamics of eye movements, and due to data sparsity, we represent the transitions of human visual attention between AOIs by means of a first-order Markov process. Given a set of human fixation strings f_i, where the j-th fixation of subject i is encoded by the index f_i^j ∈ {1, ..., A} of the corresponding AOI, we estimate the probability p(s_t = b | s_{t-1} = a) of transitioning to AOI b at time t, given that AOI a was fixated at time t − 1, by counting transition frequencies. We regularize the model using Laplace smoothing to account for data sparsity. The probability of a novel fixation sequence g under this model is ∏_j p(s_t = g^j | s_{t-1} = g^{j-1}), assuming the first state in the model, the central fixation, has probability 1. We measure the consistency among a set of subjects by considering each subject in turn, computing the probability of his scanpath with respect to the model trained from the fixations of the other subjects, and normalizing by the number of fixations in his scanpath. The final consistency score is the average probability over all subjects.
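As an illustration, the transition model above can be estimated by simple frequency counting with Laplace smoothing. The sketch below assumes scanpaths have already been converted to integer AOI indices; scoring the held-out scanpath in log space is a numerical convenience of the example rather than a detail taken from the text.

import numpy as np

def fit_transition_matrix(fixation_strings, num_aois, alpha=1.0):
    """First-order Markov model over AOI indices with Laplace smoothing."""
    counts = np.full((num_aois, num_aois), alpha)   # Laplace prior on every transition
    for s in fixation_strings:                      # one string per training subject
        for a, b in zip(s[:-1], s[1:]):
            counts[a, b] += 1
    return counts / counts.sum(axis=1, keepdims=True)

def scanpath_score(P, g):
    """Average log-probability per transition of a held-out scanpath g under P."""
    logp = sum(np.log(P[a, b]) for a, b in zip(g[:-1], g[1:]))
    return logp / max(len(g) - 1, 1)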
Temporal AOI Alignment: Another approach to evaluating sequential consistency is to measure how well pairs of AOI strings corresponding to different subjects can be globally aligned, based on their content. Although not modeling transitions explicitly, a sequence alignment has the advantage of being able to handle gaps and missing elements. An efficient algorithm having these properties, due to Needleman and Wunsch [44], uses dynamic programming to find the optimal match between two sequences f^{1:n} and g^{1:m}, by allowing for the insertion of gaps in either sequence. It recursively computes the alignment score h_{i,j} between subsequences f^{1:i} and g^{1:j} by considering the alternative costs of a match between f^i and g^j versus the insertion of a gap into either sequence. The final consistency metric is the average alignment score over all pairs of distinct subjects, normalized by the length of the longest sequence in each pair. We set the similarity metric to 1 for matching AOI symbols and to 0 otherwise, and assume no penalty is incurred for inserting gaps in either sequence. This setting gives the score a semantic meaning: it is the average percentage of symbols that can be matched when determining the longest common subsequence of fixations among pairs of subjects.
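With unit reward for matched AOI symbols and zero gap penalty, the Needleman-Wunsch recursion specializes to the classic dynamic program below; under these costs the optimal score equals the length of the longest common subsequence, and dividing by the longer string gives the matched-symbol fraction described above (a sketch, not the original implementation).

def aoi_alignment_score(f, g):
    """Global alignment of two AOI strings: match = 1, mismatch = 0, gap penalty = 0."""
    n, m = len(f), len(g)
    h = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = h[i - 1][j - 1] + (1 if f[i - 1] == g[j - 1] else 0)
            h[i][j] = max(match, h[i - 1][j], h[i][j - 1])  # match, gap in g, gap in f
    return h[n][m] / max(n, m, 1)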
Baselines: In order to provide a reference for our consistency evaluation, we generate 10 random AOI strings per video and compute the consistency on these strings under our metrics. We note however that the dynamics of the stimulus places constraints on the sampling process. First, a random string must obey the time ordering relations among AOIs (e.g. the passenger is not visible until the second half of the video in Fig. 4). Second, our automatic AOIs are derived from subject fixations and are biased by their gaze preference. The lifespan of an AOI will not be initiated until at least one subject has fixated it, even if the corresponding object is already visible. To remove some of the resulting bias from our evaluation, we extend each AOI both forward and backward in time, until the image patch at its center has undergone significant appearance changes, and use these extended AOIs when generating our random baselines.

Findings: For the Hollywood-2 dataset, we find that the average transition probability of each subject's fixations under AOI Markov dynamics is 70%, compared to 13% for the random baseline (Table 1). We also find that, across all videos, 71% of the AOI symbols are successfully aligned, compared to only 51% for the random baseline. We notice similar gaps in the UCF Sports dataset. These results indicate a high degree of consistency in human eye movement dynamics across the two datasets. Alignment scores vary to some extent across classes.

5 SEMANTICS OF FIXATED STIMULUS PATTERNS

Having shown that visual attention is consistently drawn to particular locations in the stimulus, a natural question is whether the patterns at these locations are repeatable and have semantic structure. In this section, we investigate this by building vocabularies over image patches collected from the locations fixated by our subjects when viewing videos of the various actions in the Hollywood-2 data set.

Protocol: Each patch spans 1° away from the center of fixation, to simulate the extent of high foveal acuity. We then represent each patch using HoG descriptors with a spatial grid resolution of 3 × 3 cells and 9 orientation bins. We cluster the resulting descriptors using k-means into 500 clusters.4

4. We have found that using 500 clusters provides a good tradeoff between the semantic under- and over-segmentation of the image patch space.
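A compact sketch of the vocabulary construction follows, using scikit-image's HoG as a stand-in descriptor; the pixel size of the patch is an assumption of the example, since the protocol above specifies the patch extent in degrees of visual angle.

import numpy as np
from skimage.feature import hog
from sklearn.cluster import KMeans

def patch_descriptor(gray_patch):
    # 3 x 3 spatial cells, 9 orientation bins, as in the protocol above.
    side = gray_patch.shape[0]
    return hog(gray_patch, orientations=9,
               pixels_per_cell=(side // 3, side // 3),
               cells_per_block=(1, 1), feature_vector=True)

def build_vocabulary(patches, num_words=500):
    """Cluster fixated-patch descriptors into a visual vocabulary."""
    X = np.stack([patch_descriptor(p) for p in patches])
    return KMeans(n_clusters=num_words, n_init=10, random_state=0).fit(X)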
Fig. 5: Sampled entries from visual vocabularies obtained by clustering fixated image regions in the space of HoG descriptors, for several action classes from the Hollywood-2 dataset (AnswerPhone, DriveCar, Eat, GetOutCar, Run, Kiss).

Findings: In Fig. 5 we illustrate image patches that have been assigned high probability by the mixture of Gaussians model underlying k-means. Each row contains the top 5 most probable patches from a cluster (in decreasing order of their probability). The patches in a row are restricted to come from different videos, i.e. we remove any patches for which there is a higher probability patch coming from the same video. Note that clusters are not entirely semantically homogeneous, in part due to the limited descriptive power of HoG features (e.g. a driving wheel is clustered together with a plate, see row 4 of the vocabulary for class Eat). Nevertheless, fixated patches do include semantically meaningful image regions of the scene, including actors in various poses (people eating or running or getting out of vehicles), objects being manipulated or related to the action (dishes, telephones, car doors) and, to a lesser extent, context of the surroundings (vegetation, street signs). Fixations fall almost always on objects or parts of objects and almost never on unstructured parts of the image. Our analysis also suggests that subjects generally avoid focusing on object boundaries unless an interaction is being shown (e.g. kissing), otherwise preferring to center the object, or one of its features, onto their fovea. Overall, the vocabularies seem to capture semantically relevant aspects of the action classes. This suggests that human fixations provide a degree of object and person repeatability that could be used to boost the performance of computer-based action recognition methods, a problem we address in the following sections.

6 EVALUATION PIPELINE

For all computer action recognition models, we use the same processing pipeline, consisting of an interest point operator (computer vision or biologically derived), descriptor extraction, bag of visual words quantization and an action classifier.

Interest Point Operator: We experiment with various interest point operators, both computer vision based (see §7.1) and biologically derived (see §7.1, §9). Each interest point operator takes as input a video and generates a set of spatio-temporal coordinates, with associated spatial and, with the exception of the 2D fixation operator presented in §7.1, temporal scales.

Descriptors: We obtain features by extracting descriptors at the spatio-temporal locations returned by our interest point operator. For this purpose, we use the space-time generalization of the HoG descriptor as described in [45], as well as the MBH descriptor [46] computed from optical flow. We consider 7 grid configurations and extract both types of descriptors for each configuration, ending up with 14 features for classification. We use the same grid configurations in all our experiments. We have selected them from a wider candidate set based on their individual performance for action recognition. Each HoG and MBH block is normalized using an L2-Hys normalization scheme with the recommended threshold of 0.7 [47].

Visual Dictionaries/Second Order Pooling: We cluster the resulting descriptors using k-means into visual vocabularies of 4000 visual words. For computational reasons, we only use 500,000 randomly sampled descriptors as input to the clustering step, and represent each video by the L1-normalized histogram of its visual words. Alternatively, in §9 we also experiment with a different encoding scheme based on second order pooling [59], which provides a tradeoff between computational cost and classification accuracy.

Classifiers: From histograms we compute kernel matrices using the RBF-χ2 kernel and combine the 14 resulting kernel matrices by means of a Multiple Kernel Learning (MKL) framework [48]. We train one classifier for each action label in a one-vs-all fashion. We determine the kernel scaling parameter, the SVM loss parameter C and the σ regularization parameter of the MKL framework by cross-validation. We perform a grid search over the range [10^-4, 10^4] × [10^-4, 10^4] × [10^-2, 10^-4], with a multiplicative step of 10^0.5. We draw 20 folds and run 3 cross-validation trials at each grid point, and select the parameter setting giving the best cross-validation average precision to train our final action classifier. We report the average precision on the test set for each action class, which has become the standard evaluation metric for action recognition on the Hollywood-2 data set [1], [34].
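For illustration, the sketch below computes the RBF-χ2 kernel from L1-normalized bag-of-words histograms and fuses the per-channel kernels by a uniform average before training a precomputed-kernel SVM. The uniform average and the mean-distance heuristic for the kernel scale are simplifications assumed for the example; the actual pipeline learns the kernel weights and the scaling with MKL [48] and cross-validation.

import numpy as np
from sklearn.svm import SVC

def chi2_rbf_kernel(A, B, gamma=None):
    """RBF-chi^2 kernel between row-wise L1-normalized histograms A and B."""
    # chi^2 distance: 0.5 * sum_k (a_k - b_k)^2 / (a_k + b_k); dense broadcast for clarity.
    d = 0.5 * np.sum((A[:, None, :] - B[None, :, :]) ** 2 /
                     (A[:, None, :] + B[None, :, :] + 1e-10), axis=2)
    if gamma is None:
        gamma = 1.0 / np.mean(d)   # heuristic scale; reuse the training gamma at test time
    return np.exp(-gamma * d)

def train_one_vs_all(histogram_sets, binary_labels, C=1.0):
    """histogram_sets: list of (num_videos, vocab_size) arrays, one per feature channel.
    binary_labels: +1/-1 indicator for a single action (one-vs-all)."""
    K = np.mean([chi2_rbf_kernel(H, H) for H in histogram_sets], axis=0)  # uniform fusion
    clf = SVC(kernel="precomputed", C=C)
    return clf.fit(K, binary_labels)
    # At test time, pass the averaged chi2_rbf_kernel(H_test, H_train) to clf.decision_function.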
7 HUMAN FIXATION STUDIES

In this section we explore the action recognition potential of computer vision operators derived from ground truth human fixations, using the evaluation pipeline introduced in §6.

7.1 Human vs. Computer Vision Operators

The visual information found in the fixated regions could potentially aid automated recognition of human actions. One way to capture this information is to extract descriptors from these regions, which is equivalent to using them as interest points. Following this line of thought, we evaluate the degree to which human fixations are correlated with the widely used Harris spatio-temporal cornerness operator [45]. Then, under the assumption that fixations are available at testing time, we define two interest point operators that fire at the centers of fixation, one spatially, on a frame by frame basis, and one at a spatio-temporal scale. We compare the performance of these two operators to that of the Harris operator for computer based action classification.

Experimental Protocol: We start our experiment by running the spatio-temporal Harris corner detector over each video in the dataset. Assuming an angular radius of 1.5° for the human fovea (mapped to a foveated image disc, by considering the geometry of the eye-capturing setting, e.g. distance from screen, its size, average human eye structure), we estimate the probability that a corner will be fixated by the fraction of interest points that fall onto the fovea of at least one human observer. We then define two operators based on ground truth human fixations. The first operator generates, for each human fixation, one 2D interest point at the foveated position during the lifetime of the fixation. The second operator generates one 3D interest point for each fixation, with the temporal scale proportional to the length of the fixation. We run the Harris operator and the two fixation operators through our classification pipeline (see §6). We determine the optimal spatial scale for the human fixation operators – which can be interpreted as an assumed fovea radius – by optimizing the cross-validation average precision over the range [0.75, 4]° with a step of 1.5°.

Findings: We find low correlation between the locations at which classical interest point detectors fire and the human fixations. The probability that a spatio-temporal Harris corner will be fixated by a subject is approximately 6%, with little variability across actions (Table 3a). This result is in agreement with our finding that human subjects do not generally fixate on object boundaries (§5). In addition, our results show that neither of our fixation-derived interest point operators improves recognition performance compared to the Harris-based interest point operator.

TABLE 3: Harris Spacetime Corners vs. Fixations

action | (a) percent of spacetime Harris corners fixated by humans | recognition average precision: (b) spacetime Harris, (c) spacetime fixations, (d) spatial fixations
AnswerPhone 6.2% 16.4% 16.0% 14.6%
DriveCar 5.8% 85.4% 79.4% 85.9%
Eat 6.4% 59.1% 54.1% 55.8%
FightPerson 4.6% 71.1% 66.5% 73.9%
GetOutCar 6.1% 36.1% 31.7% 35.4%
HandShake 6.3% 18.2% 14.9% 17.5%
HugPerson 4.6% 33.8% 35.1% 28.2%
Kiss 4.8% 58.3% 61.3% 64.3%
Run 6.0% 73.2% 78.5% 78.0%
SitDown 6.2% 54.0% 41.9% 51.8%
SitUp 6.3% 26.1% 16.3% 22.7%
StandUp 6.0% 57.0% 50.4% 46.3%
Mean 5.8% 49.1% 45.5% 47.9%

Only a small percentage of spatio-temporal Harris corners are fixated by subjects across the videos in each action class (a). Classification average precision with various interest point operators: (b) spacetime Harris corner detectors, (c) human fixations, one interest point per fixated frame, and (d) human fixations, one interest point per fixation. Using fixations as interest point operators does not lead to improvements in recognition performance, compared to the spatio-temporal Harris operator.

7.2 Impact of Human Saliency Maps for Computer Visual Action Recognition

Although our findings suggest that the entire foveated area is not informative, this does not rule out the hypothesis that relevant information for action recognition might lie in a subregion of this area. Research on human attention suggests that humans are capable of attending to only a subregion of the fovea at a time, to which the mental processing resources are directed, the so called covert attention [14]. Along these lines, we design an experiment in which we generate finer-scaled interest points in the area fixated by the subjects. Given enough samples, we expect to also represent the area to which the covert attention of a human subject was directed at any particular moment through the fixation. To drive this sampling process, we derive a saliency map from human fixations. The map estimates the probability for each spatio-temporal location in the video to be foveated by a human subject. We then define an interest point operator that randomly samples spatio-temporal locations from this probability distribution and compare its performance for action recognition with two baselines: the spatio-temporal Harris operator [45] and an interest point operator that fires randomly with equal probability over the entire spatio-temporal volume of the video. If subsets of the foveated regions are indeed informative, we expect our saliency-based interest point operator to have superior performance to both baselines.

Experimental Protocol: Let I_f denote a frame belonging to some video v. We label each pixel x ∈ I_f with the number of human fixations centered at that pixel. We then blur the resulting label image by convolution with an isotropic Gaussian filter having a standard deviation σ equal to the assumed radius of the human fovea. We obtain a map m_f over the frame f. The probability for each spatio-temporal pixel to be fixated is p_fix(x, f) = (1/T) · m_f(x) / Σ_{x∈I_f} m_f(x), where T is the total number of frames in the video. To obtain the saliency distribution, we regularize this probability to account for observation sparsity, by adding a uniform
distribution over the video frame, weighted by a parameter α: p_sal = (1 − α) p_fix + α p_unif. We now define an interest point operator that randomly samples spatio-temporal locations from the ground truth probability distribution p_sal, both at training and testing time, and associates them with random spatio-temporal scales uniformly distributed in the range [2, 8]. We train classifiers for each action by feeding the output of this operator through our pipeline (see §6). By doing so, we build vocabularies from descriptors sampled from saliency maps derived from ground truth human fixations. We determine the optimal values for the α regularization parameter and the fovea radius σ by cross-validation. We also run two baselines: the spatio-temporal Harris operator (see §7.1) and the operator that samples locations from the uniform probability distribution, which can be obtained by setting α = 1 in p_sal. In order to make the comparison meaningful, we set the number of interest points sampled by our saliency-based and the uniform random operators in each frame to match the firing rate of the Harris corner detector.
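A sketch of the saliency-based sampling operator follows, under stated assumptions: fixation counts are given as a per-frame pixel grid, the blur width sigma plays the role of the fovea radius, and the number of sampled points would in practice be matched to the Harris firing rate as described above.

import numpy as np
from scipy.ndimage import gaussian_filter

def saliency_distribution(fixation_counts, sigma, alpha):
    """fixation_counts: (T, H, W) array of per-pixel fixation counts for one video."""
    T, H, W = fixation_counts.shape
    blurred = np.stack([gaussian_filter(f.astype(np.float64), sigma) for f in fixation_counts])
    per_frame = blurred / (blurred.sum(axis=(1, 2), keepdims=True) + 1e-12)
    p_fix = per_frame / T                            # p_fix(x, f) = (1/T) * m_f(x) / sum_x m_f(x)
    p_unif = np.full_like(p_fix, 1.0 / (T * H * W))  # uniform regularizer
    return (1.0 - alpha) * p_fix + alpha * p_unif    # p_sal

def sample_interest_points(p_sal, num_points, seed=0):
    rng = np.random.default_rng(seed)
    flat = p_sal.ravel() / p_sal.sum()
    idx = rng.choice(flat.size, size=num_points, p=flat)
    t, y, x = np.unravel_index(idx, p_sal.shape)
    scales = rng.uniform(2.0, 8.0, size=num_points)  # random spatio-temporal scales in [2, 8]
    return np.stack([t, y, x], axis=1), scales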
Findings: We find that ground truth saliency sampling (Table Saliency Predictors: Having established evaluation criteria,
4e) outperforms both the Harris and the uniform sampling we now run several saliency map predictors on our dataset,
operators significantly, at equal interest point sparsity rates. which we describe below.
Our results indicate that saliency maps encoding only the Baselines: We also provide three baselines for saliency map
weak surface structure of fixations (no time ordering is used), comparison. The first one is the uniform saliency map,
can be used to boost the accuracy of contemporary methods that assigns the same fixation probability to each pixel of
and descriptors used for computer action recognition. Up to the video frame. Second, we consider the center bias (CB)
this point, we have relied on the availability of ground truth feature, which assigns each pixel with the distance to the
saliency maps at test time. A natual question is whether it center of the frame, regardless of its visual contents. This
is possible to reliably predict saliency maps, to a degree that feature can capture both the tendency of human subjects to
still preserves the benefits of action classification accuracy. fixate near the center of the screen and the preference of the
This will be the focus of the next section. photographer to center objects into the field of view. At the
other end of the spectrum lies the human saliency baseline,
8 S ALIENCY M AP P REDICTION which derives a saliency map from half of our human subjects
Motivated by the findings presented in the previous section, and is evaluated with respect to fixations of the remaining
we now show that we can effectively predict saliency maps. ones.
We start by introducing two evaluation measures for saliency Static Features (SF): We also include features used by the
prediction. The first is the area-under-the-curve (AUC), human vision community for saliency prediction in the image
which is widely used in the human vision community. The domain [5], which can be classified into three categories:
second measure is inspired by our application of saliency low, mid and high-level. The four low level features used are
maps to action recognition. In the pipeline we proposed color information, steerable pyramid subbands, the feature
in §7.2, ground truth saliency maps drive the random maps used as input by Itti&Koch’s model [15] and the output
sampling process of our interest point operator. We will of the saliency model described by Oliva and Torralba [50]
use the spatial Kullback-Leibler (KL) divergence measure and Rosenholtz [51]. We run a Horizon detector [50] as our
to compare the predicted and the ground truth saliencies. mid-level feature. Object detectors are used as high level
We also propose and study several features for saliency features, which comprise faces [52], persons and cars [53].
map prediction, both static and motion based. Our analysis
Motion Features (MF): We augment our set of predictors
includes features derived directly from low, mid and high
with five novel (in the context of saliency prediction) feature
level image information. In addition, we train a HoG-MBH
maps, derived from motion or space-time information.
detector that fires preferentially at fixated locations, using the
vast amount of eye movement data available in the dataset. Flow: We extract optical flow from each frame using a
We evaluate all these features and their combinations on our state of the art method [54] and compute the magnitude of
dataset, and find that our detector gives the best performance the flow at each location. Using this feature, we wish to
under the KL divergence measure. investigate whether regions with significant optical changes
Saliency Map Comparison Measures: The most commonly attract human gaze.
used measure for evaluating saliency maps in the image Pb with flow: We run the PB edge detector [55] with both
domain [5], [22], [49], the AUC measure, interprets saliency image intensity and the flow field as inputs. This detector
maps as predictors for separating fixated pixels from the rest. fires both at intensity and motion boundaries.
TABLE 4: Action Recognition Performance on the Hollywood-2 Data Set

Interest points: (a) Harris corners, (b) uniform sampling, (c) central bias sampling, (d) predicted saliency sampling, (e) ground truth saliency sampling.
Trajectories + interest points: (f) trajectories only, (g) uniform sampling, (h) predicted saliency sampling, (i) ground truth saliency sampling.

action         (a)     (b)     (c)     (d)     (e)     (f)     (g)     (h)     (i)
AnswerPhone    16.4%   21.3%   23.3%   23.7%   28.1%   32.6%   24.5%   25.0%   32.5%
DriveCar       85.4%   92.2%   92.4%   92.8%   57.9%   88.0%   93.6%   93.6%   96.2%
Eat            59.1%   59.8%   58.6%   70.0%   67.3%   65.2%   69.8%   75.0%   73.6%
FightPerson    71.1%   74.3%   76.3%   76.1%   80.6%   81.4%   79.2%   78.7%   83.0%
GetOutCar      36.1%   47.4%   49.6%   54.9%   55.1%   52.7%   55.2%   60.7%   59.3%
HandShake      18.2%   25.7%   26.5%   27.9%   27.6%   29.6%   29.3%   28.3%   26.6%
HugPerson      33.8%   33.3%   34.6%   39.5%   37.8%   54.2%   44.7%   45.3%   46.1%
Kiss           58.3%   61.2%   62.1%   61.3%   66.4%   65.8%   66.2%   66.4%   69.5%
Run            73.2%   76.0%   77.8%   82.2%   85.7%   82.1%   82.1%   84.2%   87.2%
SitDown        54.0%   59.3%   62.1%   69.0%   62.5%   62.5%   67.2%   70.4%   68.1%
SitUp          26.1%   20.7%   20.9%   29.7%   30.7%   20.0%   23.8%   34.1%   32.9%
StandUp        57.0%   59.8%   61.3%   63.9%   58.2%   65.2%   64.9%   69.5%   66.0%
Mean           49.1%   52.6%   53.7%   57.6%   57.9%   58.3%   58.4%   61.0%   61.7%

Columns a-e: Action recognition performance on the Hollywood-2 data set when interest points are sampled randomly across the spatio-temporal volumes of the videos from various distributions (b-e), with the Harris corner detector as baseline (a). Average precision is shown for the uniform (b), central bias (c) and ground truth (e) distributions, and for the output (d) of our HoG-MBH detector. All pipelines use the same number of interest points per frame as generated by the Harris spatio-temporal corner detector (a). Columns f-i: Significant improvement over the state of the art [34] (f) can be achieved by augmenting the method with the channels derived from interest points sampled from the predicted saliency map (h) and ground truth saliency (i), but not when using the classical uniform sampling scheme (g).

TABLE 5: Action Recognition Performance on the UCF Sports Actions Data Set

method / distribution                                             accuracy
interest points: Harris corners (a)                               86.6%
interest points: uniform sampling (b)                             84.6%
interest points: central bias sampling (c)                        84.8%
interest points: predicted saliency sampling (d)                  91.3%
interest points: ground truth saliency sampling (e)               90.9%
trajectories [34] only (f)                                        88.2%
trajectories + interest points: predicted saliency sampling (h)   91.5%

[Right panel: confusion matrices (rows: ground truth, columns: prediction) over the classes dive, golf-swing, kick, lift, ride horse, run, skateboard, swing and walk, for dense trajectories and for predicted saliency map sampling.]

Left: Performance comparison among several classification methods (see Table 4 for description) on the UCF Sports Dataset. Right: Confusion matrices obtained using dense trajectories [34] and interest points sampled sparsely from saliency maps predicted by our HoG-MBH detector.

Flow bimodality: We wish to investigate how often people fixate on motion edges, where the flow field typically has a bimodal distribution. To do this, for a neighbourhood centered at a given pixel x, we run K-Means, first with 1 and then with 2 modes, obtaining sum-of-squared-error values of s1(x) and s2(x), respectively. We weight the distance between the centers of the two modes by a factor inversely proportional to exp(1 − s2(x)/s1(x)), to enforce a high response at positions where the optical flow distribution is strongly bimodal and its mode centers are far apart from each other.
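As a rough sketch (not the paper's implementation), the response at one location could be computed as below, assuming the optical flow vectors of the surrounding neighbourhood are collected into an (n, 2) array; the use of scikit-learn's KMeans and the neighbourhood handling are illustrative, and the weighting term follows the expression quoted above.

```python
import numpy as np
from sklearn.cluster import KMeans

def flow_bimodality(flow_patch):
    """Bimodality response for one pixel.

    flow_patch : (n, 2) array of (u, v) optical flow vectors taken from the
                 neighbourhood centred at the pixel.
    Fits K-Means with 1 and 2 modes, takes their sum-of-squared-error values
    s1 and s2, and weights the distance between the two mode centres by the
    factor 1 / exp(1 - s2 / s1) given in the text.
    """
    km1 = KMeans(n_clusters=1, n_init=1, random_state=0).fit(flow_patch)
    km2 = KMeans(n_clusters=2, n_init=5, random_state=0).fit(flow_patch)
    s1, s2 = km1.inertia_, km2.inertia_            # sum-of-squared errors
    dist = np.linalg.norm(km2.cluster_centers_[0] - km2.cluster_centers_[1])
    weight = 1.0 / np.exp(1.0 - s2 / max(s1, 1e-12))
    return dist * weight
```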
Harris: This feature encodes the spatio-temporal Harris cornerness measure as defined in [7].

HoG-MBH detector: The saliency models we have considered so far access higher level image structure by means of pre-trained object detectors. This approach does not prove effective on our dataset, due to the high variability in pose and illumination. On the other hand, our dataset provides a rich set of human fixations. We observe that fixated image regions are often semantically meaningful, sometimes corresponding to objects or object parts. Inspired by this insight, we aim to exploit the structure present at these locations and train a detector for human fixations. Our detector uses both static (HoG) and motion (MBH) descriptors centered at fixations. We run our detector in a sliding window fashion across the entire video and obtain a saliency map.

Feature combinations: We linearly combine various subsets of our feature maps for better saliency prediction. We investigate the predictive power of static features and motion features alone and in combination, with and without central bias.

Experimental Protocol: We use 10^6 examples to train our detector, half of which are positive and half of which are negative. At each of these locations, we extract spatio-temporal HoG and MBH descriptors. We opt for 3 grid configurations, namely 1x1x1, 2x2x1 and 3x3x1 cells. We have experimented with higher temporal grid resolutions, but found only modest improvements in detector performance at a high increase in computational cost. We concatenate all 6 descriptors and lift the resulting vector into a higher dimensional space by employing an order 3 χ2 kernel approximation using the approach of [56]. We train an SVM using the LibLinear
package [57] to obtain our HoG-MBH detector.
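For illustration, a minimal sketch of this training stage is given below; it uses scikit-learn's additive χ2 feature map, which is in the spirit of the explicit feature map construction of [56], and a LIBLINEAR-backed linear SVM [57]. The descriptor extraction is assumed to happen elsewhere, and the function name and parameter values are illustrative rather than those of the actual implementation.

```python
import numpy as np
from sklearn.kernel_approximation import AdditiveChi2Sampler
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def build_fixation_detector(descriptors, labels):
    """Train a fixation detector from concatenated HoG/MBH descriptors.

    descriptors : (n_samples, dim) non-negative array; each row concatenates
                  the HoG and MBH histograms extracted at one (fixated or
                  non-fixated) spatio-temporal location.
    labels      : (n_samples,) array of {0, 1}, 1 for fixated locations.
    """
    detector = make_pipeline(
        AdditiveChi2Sampler(sample_steps=3),  # explicit chi-square feature map
        LinearSVC(C=1.0),                     # linear SVM solved with LIBLINEAR;
    )                                         # C=1.0 is an illustrative choice
    detector.fit(descriptors, labels)
    return detector

# At test time, detector.decision_function evaluated over a sliding window
# can be normalized per frame to obtain a saliency map.
```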
For combining feature maps, we train a linear predictor on 500 randomly selected frames from the Hollywood-2 training set, using our fixation annotations. We exclude the first 8 frames of each video from the sampling process, in order to avoid the effects of the initial central fixation in our data collection setup. We also randomly select 250 and 500 frames for validation and testing, respectively. To avoid correlations, the video sets used to sample training, validation and testing frames are disjoint.

Findings: When evaluated at 10^6 random locations, half of which were fixated by the subjects and half not, the average precision of our detector is 76.7% when both MBH and HoG descriptors are used. HoG descriptors used in isolation perform better (73.4% average precision) than MBH descriptors alone (70.5%), indicating that motion structure contributes less to detector performance than does static image structure. There is, however, significant advantage in combining both sources of information.

When evaluated under the AUC metric, combining predictors always improves performance (Table 6). As a general trend, low-level features are better predictors than high level ones. The low level motion features (flow, pb edges with flow, flow bimodality, Harris cornerness) provide similar performance to static low-level features. Our HoG-MBH detector is comparable to the best static feature, the Horizon detector, under the AUC metric.

Interestingly, when evaluated according to KL divergence, the ranking of the saliency maps changes: the HoG-MBH detector performs best and the only other predictor that significantly outperforms central bias is the Horizon detector. Under this metric, combining features does not always improve performance, as the linear combination method of [5] optimizes pixel-level classification accuracy, and as such is not able to account for the inherent competition that takes place among these predictions due to image-level normalization. We conclude by noticing that fusing our predicted maps as well as our static and dynamic features gives the highest results under the AUC metric. Moreover, the HoG-MBH detector, trained using our eye movement data, is the best predictor of visual saliency from our candidate set, under the probabilistic measure of matching the spatial distribution of human fixations.

TABLE 6: Evaluation of Individual Feature Maps and Combinations for Human Saliency Prediction

feature                       AUC (a)   KL (b)
baselines
  uniform baseline            0.500     18.63
  central bias (CB)           0.840     15.93
  human                       0.936     10.12
static features (SF)
  color features [5]          0.644     17.90
  subbands [58]               0.634     17.75
  Itti&Koch channels [15]     0.598     16.98
  saliency map [50]           0.702     17.17
  horizon detector [50]       0.741     15.45
  face detector [52]          0.579     16.43
  car detector [53]           0.500     18.40
  person detector [53]        0.566     17.13
our motion features (MF)
  flow magnitude              0.626     18.57
  pb edges with flow          0.582     17.74
  flow bimodality             0.637     17.63
  Harris cornerness           0.619     17.21
  HoG-MBH detector            0.743     14.95
feature combinations
  SF [5]                      0.789     16.16
  SF + CB [5]                 0.861     15.96
  MF                          0.762     15.62
  MF + CB                     0.830     15.97
  SF + MF                     0.812     15.94
  SF + MF + CB                0.871     15.89

We show area under the curve (AUC) and KL divergence. AUC and KL induce different saliency map rankings, but for visual recognition, measures that emphasize spatial localization are essential (see also Table 4 for action recognition results and Fig. 6 for illustration).

9 AUTOMATIC VISUAL ACTION RECOGNITION

We next investigate action recognition performance when interest points are sampled from the saliency maps predicted by our HoG-MBH detector, which we choose because it best approximates the ground truth saliency map spatially, under the KL divergence metric. Apart from sampling from the uniform and ground truth distributions, as a second baseline, we also sample interest points using the central bias saliency map, which was also shown to approximate human fixations to a reasonable extent under the less intuitive AOI measure (Table 6). We also investigate whether our end-to-end recognition system can be combined with the state-of-the-art approach of [34] to obtain better performance.

Experimental Protocol: We first run our HoG-MBH detector over the entire Hollywood-2 data set and obtain our automatic saliency maps. We then configure our recognition pipeline with an interest point operator that samples locations using these saliency maps as probability distributions. We also run the pipeline of [34] and combine the four kernel matrices produced in the final stage of their classifier with the ones we obtain for our 14 descriptors, sampled from the saliency maps, using MKL.

We also test our recognition pipeline on the UCF Sports dataset, which is substantially different in terms of action classes, scene clutter, shooting conditions and the evaluation procedure. Unlike Hollywood-2, this database provides no training and test sets, and classifiers are generally evaluated by cross-validation. We follow the standard procedure by first extending the dataset with horizontally flipped versions of each video. For each cross-validation fold, we leave out one original video and its flipped version and train a multi-class classifier. We test on the original video, but not its flipped version. We compute the confusion matrix and report the average accuracy over all classes.

Our experimental procedure for the UCF Sports dataset closely follows the one we use for Hollywood-2. We re-train our HoG-MBH detector on a subset of 50 video pairs (original and flipped), while we use the rest of 100 pairs for testing. The average precision of our detector is 92.5% for the training set and 93.1% on our test set, which confirms that our detector does not overfit the data. We use the re-trained detector to run the same pipelines and baselines as for Hollywood-2.

Results: On both datasets, our saliency map based pipelines – both predicted and ground truth – perform markedly better than the pipeline sampling interest points uniformly (Tables 4c and 5). Although central bias is a relatively close approximation of human visual saliency on our datasets, it does not lead to performance that closely matches the one produced by these maps. Our automatic pipeline based on predicted saliency maps and our ground-truth based pipeline have similar performance, with a slight advantage being observed for predicted saliency in the case of Hollywood-2 and for ground truth saliency for UCF Sports (Tables 4d,e and 5).
[Fig. 6 panels: (a) original image, (b) ground truth saliency, (c) CB, (d) flow magnitude, (e) pb edges with flow, (f) flow bimodality, (g) Harris cornerness, (h) HoG-MBH detector, (i) MF, (j) SF, (k) SF + MF, (l) SF + MF + CB.]

Fig. 6: Saliency predictions for a video frame (a), both motion-based features in isolation (d-h) and combinations (i-l). HoG-MBH detector maps are closest to the ground truth (b), consistent with Table 6b.

[Fig. 7 panels, repeated for each example: image, ground truth/CB, HoG-MBH detector/CB.]

Fig. 7: Note that ground truth saliency maps (cyan) and output of our HoG-MBH detector (yellow) are similar to each other, but qualitatively different from the central bias map (gray). This gives visual intuition for the significantly higher performance of the HoG-MBH detector, over the central bias saliency sampling, when used in an end-to-end computer visual action recognition system (Table 4c,d,e).

TABLE 7: Recognition results using second order pooling [59] on the UCF Sports Actions Data Set

method / distribution                                             accuracy (mean)   accuracy (stdev)
interest points: Harris corners (a)                               84.3%             0.00%
interest points: uniform sampling (b)                             83.9%             0.57%
interest points: central bias sampling (c)                        83.7%             0.80%
interest points: predicted saliency sampling (d)                  86.3%             0.66%
interest points: ground truth saliency sampling (e)               86.1%             0.63%
trajectories [34] only (f)                                        85.4%             0.00%
trajectories + interest points: predicted saliency sampling (h)   87.5%             0.59%

Performance comparison among several classification methods (see Table 4 for description) on the UCF Sports Dataset using second order pooling. Statistics are computed using 10 random seeds.

Our results confirm that approximations produced by the HoG-MBH detector are qualitatively different from a central bias distribution, focusing on local image structure that frequently co-occurs in its training set of fixations, regardless of whether it is close to the image center or not (see Fig. 7). These structures are highly likely to be informative for predicting the actions in the dataset (e.g. the frames containing the eat and hug actions in Fig. 7). Therefore, the detector will tend to emphasize these locations as opposed to less relevant ones. This also explains why our predicted saliency maps can be as informative for action recognition as ground truth maps, even exceeding their performance on certain action classes: while humans will also fixate on structures not relevant for action recognition, fixated structures that are relevant to this task will occur at a higher frequency in our datasets. Hence, they will be well approximated by a detector trained in a bottom-up manner. This can explain why the performance balance is even more inclined towards predicted saliency maps on the UCF Sports Actions dataset, where motion and image patterns are more stable and easier to predict compared to the Hollywood-2 dataset.

Finally, we note that even though our pipeline is sparse, it achieves near state of the art performance when compared to a pipeline that uses dense trajectories. When the sparse descriptors obtained from our automatic pipeline were combined with the kernels associated to a dense trajectory representation [34] using an MKL framework with 18 kernels (14 kernels associated with sparse features sampled from predicted saliency maps + 4 kernels associated to dense trajectories from [34]), we were able to go beyond the state-of-the-art (Tables 4f,i and 5). This demonstrates that an end-to-end automatic system incorporating both human and computer vision technology can deliver high performance on a challenging problem such as action recognition in unconstrained video.

Action recognition using second order pooling: Our saliency based interest point operators are defined by randomly sampling specific spatio-temporal probability distributions. One way to estimate the variability induced in the recognition performance is to run the pipelines many times for different random seeds. Unfortunately, this is not practical, due to the high cost of training an end-to-end recognition pipeline in the bag-of-visual-words framework. Consequently, results reported in Tables 4 and 5 are obtained for a single random seed.

However, due to the large number of random samples used to train the system (e.g. 28 million interest points for Hollywood-2), we expect little variance in performance for
these datasets, with somewhat higher variability for UCF Sports due to its smaller size. To verify this intuition, we experiment with a faster version of the pipeline, in which we replace the vocabulary building and binning steps (the most computationally expensive) with second order pooling as described in [59]. To encode a particular video, we compute the covariance matrix of its descriptors and apply the matrix logarithm and power scaling operators, with an exponent of 0.5, as used in [59]. These operators map our feature sets into a high dimensional descriptor space and make the application of additional non-linear kernels unnecessary. For additional speed, we concatenate descriptors obtained for various HoG and MBH configurations into a single representation (as opposed to applying MKL) and use a fixed value of 10 for the SVM C penalty parameter.
SVM C penalty parameter. and Processing Systems, 2010.
We re-run each pipeline on the UCF Sports dataset with 10 [7] I. Laptev, “On space-time interest points,” in International Journal on
Computer Vision, 2005.
different random seeds and compute the mean and standard [8] D. Han, L. Bo, and C. Sminchisescu, “Selection and context for action
deviation of the leave-one-out classification accuracy measure. recognition,” in International Conference on Computer Vision, 2009.
Our results, shown in Table 7, indicate that the standard [9] C. Sminchisescu and B. Triggs, “Building roadmaps of minima and
transitions in visual models,” International Journal on Computer Vision,
deviation of the accuracy for the UCF Sports Actions dataset vol. 61, no. 1, 2005.
is less than 0.8% for all pipelines. All the trends we found [10] S. Mathe and C. Sminchisescu, “Dynamic eye movement datasets
for the bag-of-visual words pipeline (Table 5) are confirmed, and learnt saliency models for visual action recognition,” in European
Conference on Computer Vision, 2012.
with recognition results being somewhat lower, mainly due [11] A. Yarbus, Eye Movements and Vision. New York Plenum Press, 1967.
to the removal of the expensive MKL step. We conclude [12] L. Itti, C. Koch, and E. Niebur, “A model of saliency-based visual at-
that randomness has little impact on the performance of the tention for rapid scene analysis,” IEEE Transactions on Pattern Analysis
and Machine Intelligence, vol. 20, 1998.
pipelines presented in this paper. [13] L. Itti, G. Rees, and J. K. Tsotsos, Eds., Neurobiology of Attention.
Academic Press, 2005.
[14] M. F. Land and B. W. Tatler, Looking and Acting. Oxford University
10 C ONCLUSIONS Press, 2009.
[15] L. Itti and C. Koch, “A saliency-based search mechanism for overt and
We have presented experimental and computational modelling covert shifts of visual attention,” Vision Research, vol. 40, 2000.
work at the incidence of human visual attention and com- [16] N. Bruce and J. Tsotsos, “Saliency based on information maximization,”
puter vision, with emphasis on action recognition in video. in Advances in Neural Information and Processing Systems, 2005.
[17] W. Kienzle, F. Wichmann, B. Scholkopf, and M. Franz, “A nonpara-
Inspired by earlier psychophysics and visual attention findings, metric approach to bottom-up visual saliency,” in Advances in Neural
not validated quantitatively at large scale until now and not Information and Processing Systems, 2006.
pursued for video, we have collected, and made available to [18] A. Torralba, A. Oliva, M. Castelhano, and J. Henderson, “Contextual
guidance of eye movements and attention in real-world scenes: The role
the research community, a set of comprehensive human eye- of global features in object search,” Psychological Review, vol. 113,
tracking annotations for Hollywood-2 and UCF Sports, some 2006.
of the most challenging, recently created action recognition [19] T. Judd, F. Durand, and A. Torralba, “Fixations on low resolution
images,” in International Conference on Computer Vision, 2009.
datasets in the computer vision community. Besides the collec- [20] A. Borji and L. Itti, “Scene classification with a sparse set of salient
tion of large datasets, we have performed quantitative analysis regions,” in IEEE International Conference on Robotics and Automation,
and introduced novel models for evaluating the spatial and 2011.
[21] L. Elazary and L. Itti, “A Bayesian model for efficient visual search and
the sequential consistency of human fixations across different recognition,” Vision Research, vol. 50, 2010.
subjects, videos and actions. [22] K. A. Ehinger, B. Sotelo, A. Torralba, and A. Oliva, “Modeling search
We have also performed a large scale analysis of automatic for people in 900 scenes: A combined source model of eye guidance,”
Visual Cognition, vol. 17, 2009.
visual saliency models and end-to-end automatic visual action [23] T. Serre, L. Wolf, S. Bileschi, M. Riesenhuber, and T. Poggio, “Robust
recognition systems. Our studies are performed with particular object recognition with cortex-like mechanisms,” IEEE Transactions on
focus on computer vision techniques and interest point opera- Pattern Analysis and Machine Intelligence, vol. 29, 2007.
[24] D. Walther, U. Rutishauser, C. Koch, and P. Perona, “Selective visual
tors and descriptors. In particular, we propose new accurate attention enables learning and recognition of multiple objects in cluttered
saliency operators that can be effectively trained based on scenes,” Computer Vision and Image Understanding, vol. 100, 2005.
human fixations. Finally, we show that such automatic saliency [25] S. Han and N. Vasconcelos, “Biologically plausible saliency mechanisms
improve feedforward object recognition,” Vision Research, 2010.
predictors can be used within end-to-end computer visual [26] H. Jhuang, T. Serre, L. Wolf, and T. Poggio, “A biologically inspired
action recognition systems to achieve state of the art results system for action recognition,” in International Conference on Computer
in some of the hardest benchmarks in the field. Vision, 2007.
[27] W. Kienzle, B. Scholkopf, F. Wichmann, and M. Franz, “How to find
interesting locations in video: a spatiotemporal interest point detector
learned from human eye movements,” in Lecture Notes in Computer
ACKNOWLEDGMENTS Science (DAGM), 2007.
[28] S. Marat, M. Guironnet, and D. Pellerin, “Video summarization using
This work was supported by CNCS-UEFICSDI, under PNII a visual attention model,” in European Signal Processing Conference,
RU-RC-2/2009 and PCE-2011-3-0438. 2007.
[29] O. Le Meur, P. Le Callet, and D. Barba, "Predicting visual fixations on video based on low-level visual features," Vision Research, vol. 47, pp. 2483–2498, 2007.
[30] A. Fathi, Y. Li, and J. M. Rehg, "Learning to recognize daily actions using gaze," in European Conference on Computer Vision, 2012.
[31] S. Winkler and R. Subramanian, "Overview of eye tracking datasets," in International Workshop on Quality of Multimedia Experience (QoMEX), 2013.
[32] L. Fei-Fei, A. Iyer, C. Koch, and P. Perona, "What do we perceive in a glance of a real-world scene?" Journal of Vision, 2007.
[33] H. Jhuang, T. Serre, L. Wolf, and T. Poggio, "A biologically inspired system for action recognition," in International Conference on Computer Vision, 2007.
[34] H. Wang, A. Klaser, C. Schmid, and C. Liu, "Action recognition by dense trajectories," in IEEE International Conference on Computer Vision and Pattern Recognition, 2011.
[35] J. Yamato, J. Ohya, and K. Ishii, "Recognizing human action in time-sequential images using hidden Markov model," in IEEE International Conference on Computer Vision and Pattern Recognition, 1992.
[36] W. Li, Z. Zhang, and Z. Liu, "Expandable data-driven graphical modeling of human actions based on salient postures," IEEE TCSVT, vol. 18, 2008.
[37] M. Hoai and F. De la Torre, "Max-margin early event detectors," in IEEE International Conference on Computer Vision and Pattern Recognition, 2012.
[38] M. Dorr, T. Martinetz, K. Gegenfurtner, and E. Barth, "Variability of eye movements when viewing dynamic natural scenes," Journal of Vision, vol. 10, 2010.
[39] M. R. Greene, T. Liu, and J. M. Wolfe, "Reconsidering Yarbus: A failure to predict observers' task from eye movement patterns," Vision Research, vol. 62, 2012.
[40] O. Le Meur and T. Baccino, "Methods for comparing scanpaths and saliency maps: strengths and weaknesses," Behavior Research Methods, 2012.
[41] B. Yao and L. Fei-Fei, "Modeling mutual context of object and human pose in human-object interaction activities," in IEEE International Conference on Computer Vision and Pattern Recognition, 2010.
[42] A. Prest, C. Schmid, and V. Ferrari, "Weakly supervised learning of interactions between humans and objects," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2011.
[43] A. Hwang, H. Wang, and M. Pomplun, "Semantic guidance of eye movements in real-world scenes," Vision Research, 2011.
[44] B. Needleman and C. Wunsch, "A general method applicable to the search for similarities in the amino acid sequence of two proteins," Journal of Molecular Biology, 1970.
[45] I. Laptev and T. Lindeberg, "Space-time interest points," in International Conference on Computer Vision, 2003.
[46] N. Dalal, B. Triggs, and C. Schmid, "Human detection using oriented histograms of flow and appearance," in European Conference on Computer Vision, 2006.
[47] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in IEEE International Conference on Computer Vision and Pattern Recognition, 2005.
[48] B. Hariharan, L. Zelnik-Manor, S. V. N. Vishwanathan, and M. Varma, "Large scale max-margin multi-label classification with priors," in International Conference on Machine Learning, 2010.
[49] T. Judd, F. Durand, and A. Torralba, "A benchmark of computational models of saliency to predict human fixations," MIT, Tech. Rep. 1, 2012.
[50] A. Oliva and A. Torralba, "Modeling the shape of the scene: A holistic representation of the spatial envelope," International Journal on Computer Vision, no. 42, 2001.
[51] R. Rosenholtz, "A simple saliency model predicts a number of motion popout phenomena," Vision Research, no. 39, 1999.
[52] P. Viola and M. Jones, "Robust real-time object detection," International Journal on Computer Vision, 2001.
[53] P. Felzenszwalb, D. McAllester, and D. Ramanan, "A discriminatively trained, multiscale, deformable part model," in IEEE International Conference on Computer Vision and Pattern Recognition, 2008.
[54] D. Sun, S. Roth, and M. J. Black, "Secrets of optical flow estimation and their principles," in IEEE International Conference on Computer Vision and Pattern Recognition, 2010.
[55] P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik, "Occlusion boundary detection and figure/ground assignment from optical flow," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2011.
[56] A. Vedaldi and A. Zisserman, "Efficient additive kernels via explicit feature maps," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2011.
[57] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin, "LIBLINEAR: A library for large linear classification," JMLR, vol. 9, 2008.
[58] E. Simoncelli and W. Freeman, "The steerable pyramid: A flexible architecture for multi-scale derivative computation," in IEEE International Conference on Image Processing, 1995.
[59] J. Carreira, R. Caseiro, J. Batista, and C. Sminchisescu, "Semantic segmentation with second-order pooling," in European Conference on Computer Vision, 2012.

Stefan Mathe has obtained his MSc. degree in Computer Science from the Technical University of Cluj Napoca, Romania. He is currently a Research Assistant at the Institute of Mathematics of the Romanian Academy and a PhD student in the Artificial Intelligence Laboratory at the University of Toronto. His research interests span the areas of both low level and high level computer vision, machine learning, natural language understanding and computer graphics. His PhD research is focused on the problems of action recognition, with emphasis on developing higher level invariant features and exploiting contextual information.

Cristian Sminchisescu has obtained a doctorate in Computer Science and Applied Mathematics with an emphasis on imaging, vision and robotics at INRIA, France, under an Eiffel excellence doctoral fellowship, and has done postdoctoral research in the Artificial Intelligence Laboratory at the University of Toronto. He is a member in the program committees of the main conferences in computer vision and machine learning (CVPR, ICCV, ECCV, NIPS, AISTATS), area chair for ICCV07-13, and a member of the Editorial Board (Associate Editor) of IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI). He has given more than 50 invited talks and presentations and has offered tutorials on 3d tracking, recognition and optimization at ICCV and CVPR, the Chicago Machine Learning Summer School, the AERFAI Vision School in Barcelona and the Computer Vision Summer School (VSS) in Zurich. Sminchisescu's research goal is to train computers to see. His research interests are in the area of computer vision (articulated objects, 3d reconstruction, segmentation, and object and action recognition) and machine learning (optimization and sampling algorithms, structured prediction, sparse approximations and kernel methods).
