Actions in the Eye: Dynamic Gaze Datasets and Learnt Saliency Models for Visual Recognition
arXiv:1312.7570v1 [cs.CV] 29 Dec 2013
Abstract—Systems based on bag-of-words models, built from image features collected at the maxima of sparse interest point operators, have been used successfully for both visual object and action recognition tasks in computer vision. While the sparse, interest-point based approach to recognition is not inconsistent with visual processing in biological systems that operate in ‘saccade and fixate’ regimes, the methodology and emphasis in the human and computer vision communities remain sharply distinct. Here, we make three contributions aiming to bridge this gap. First, we complement existing state-of-the-art large scale dynamic computer vision annotated datasets like Hollywood-2 [1] and UCF Sports [2] with human eye movements collected under the ecological constraints of the visual action
recognition task. To our knowledge these are the first large human eye tracking datasets to be collected and made publicly available for video (vision.imar.ro/eyetracking; 497,107 frames, each viewed by 16 subjects), and they are unique in terms of (a) their large scale and computer vision relevance, (b) their dynamic, video stimuli, and (c) task control, as opposed to free-viewing. Second, we introduce novel
sequential consistency and alignment measures, which underline the remarkable stability of patterns of visual search among subjects.
Third, we leverage the significant amount of collected data in order to pursue studies and build automatic, end-to-end trainable computer
vision systems based on human eye movements. Our studies not only shed light on the differences between computer vision spatio-temporal interest point image sampling strategies and human fixations, and on their impact on visual recognition performance, but also demonstrate that human fixations can be accurately predicted and, when used in an end-to-end automatic system that leverages some of the most advanced computer vision practice, can lead to state-of-the-art results.
Index Terms—visual action recognition, human eye-movements, consistency analysis, saliency prediction, large scale learning
inside a topographical map that encodes the saliency of each point in the visual field.

Models of saliency can be either pre-specified or learned from eye tracking data. In the former category falls the basic saliency model [15], which combines information from multiple channels into a single saliency map. Information maximization [16] provides an alternative criterion for building saliency maps. These can be learned from low-level features [17] or from a combination of low, mid and high-level ones [5], [18], [19]. Saliency maps have been used for scene classification [20], object localization and recognition [21], [22], [23], [24], [25] and action recognition [26], [27]. Comparatively little attention has been devoted to computational models of saliency maps for the dynamic domain. Bottom-up saliency models for static images have been extended to videos by incorporating motion and flicker channels [28], [29]. All these models are pre-specified. One exception is the work of Kienzle et al. [27], who train an interest point detector using fixation data collected from human subjects in a free-viewing (rather than a specific) task.

Datasets containing human gaze pattern annotations of images have emerged from studies carried out by the human vision community, some of which are publicly available [5], [15], [19], [22], [30] and some of which are not [27] (see [31] for an overview). Most of these datasets have been designed for small quantitative studies, consisting of at most a few hundred images or videos, usually recorded under free-viewing, in sharp contrast with the data we provide, which is large scale, dynamic, and task controlled. These studies [6], [13], [15], [19], [22], [27], [32], [33] could however benefit from larger scale natural datasets, and from studies that emphasize the task, as we pursue.

The problem of visual attention and the prediction of visual saliency have long been of interest in the human vision community [13], [14], [15]. Recently there has been a growing trend of training visual saliency models based on human fixations, mostly in static images (with the notable exception of [27]), and under subject free-viewing conditions [5], [18]. While visual saliency models can be evaluated in isolation under a variety of measures against human fixations, for computer vision their ultimate test remains the demonstration of relevance within an end-to-end automatic visual recognition pipeline. While such integrated systems are still in their infancy, promising demonstrations have recently emerged for computer vision tasks like scene classification [20] and verifying correlations with object (pedestrian) detection responses [21], [22]. An interesting early biologically inspired recognition system was presented by Kienzle et al. [27], who learn a fixation operator from human eye movements collected under video free-viewing, then learn action classification models for the KTH dataset with promising results. Recently, under the constraint of a first person perspective, Fathi et al. [30] have shown that human fixations can be predicted and used to enhance action recognition performance.

In contrast, in computer vision, interest point detectors have been successfully used in the bag-of-visual-words framework for action classification and event detection [1], [7], [8], [34], but a variety of other methods exists, including random field models [35], [36] and structured output SVMs [37]. Currently the most successful systems remain the ones dominated by complex features extracted at interesting locations, bagged and fused using advanced kernel combination techniques [1], [8].

This study is driven, primarily, by our computer vision interests, yet leverages data collection and insights from human vision. While in this paper we focus on bag-of-words spatio-temporal computer action recognition pipelines, the scope for study and the structure in the data are broader. We do not see this investigation as a terminus, but rather as a first step in exploring some of the most advanced data and models that human vision and computer vision can offer at the moment.

3 LARGE SCALE HUMAN EYE MOVEMENT DATA COLLECTION IN VIDEO

An objective of this work is to introduce additional annotations, in the form of eye movement recordings, for two large scale video data sets for action recognition.

The Hollywood-2 Movie Dataset: Introduced in [1], it is one of the largest and most challenging available datasets for real world actions. It contains 12 classes: answering phone, driving a car, eating, fighting, getting out of a car, shaking hands, hugging, kissing, running, sitting down, sitting up and standing up. These actions are collected from a set of 69 Hollywood movies. The data set is split into a training set of 823 sequences and a test set of 884 sequences. There is no overlap between the 33 movies in the training set and the 36 movies in the test set. The data set consists of about 487k frames, totaling about 20 hours of video.

The UCF Sports Action Dataset: This high resolution dataset [2] was collected mostly from broadcast television channels. It contains 150 videos covering 9 sports action classes: diving, golf swinging, kicking, lifting, horseback riding, running, skateboarding, swinging and walking.

Human subjects: We have collected data from 16 human volunteers (9 male and 7 female) aged between 21 and 41. We split them into an active group, which had to solve an action recognition task, and a free viewing group, which was not required to solve any specific task while being presented with the videos in the two datasets. There were 12 active subjects (7 male and 5 female) and 4 free viewing subjects (2 male and 2 female). None of the free viewers was aware of the task of the active group and none was a cognitive scientist. We chose the two groups such that no pair of subjects from different groups were acquainted with each other, in order to limit biases on the free viewers.

Recording environment: Eye movements were recorded using an SMI iView X HiSpeed 1250 tower-mounted eye tracker, with a sampling frequency of 500Hz. The head of the subject was placed on a chin-rest located at 60 cm from the display. Viewing conditions were binocular and gaze data was collected from the right eye of the participant.1
1. None of our subjects exhibited a strongly dominant left eye, as determined by the two index finger method.
The LCD display had a resolution of 1280 × 1024 pixels, with a physical
screen size of 47.5 x 29.5 cm. The calibration procedure was carried out at the beginning of each block. The subject had to follow a target that was placed sequentially at 13 locations evenly distributed across the screen. Accuracy of the calibration was then validated at 4 of these calibrated locations. If the error in the estimated position was greater than 0.75° of visual angle, the experiment was stopped and calibration restarted. At the end of each block, validation was repeated. Since the video resolution varies across the datasets, each video was rescaled to fit the screen, preserving the original aspect ratio. The visual angles subtended by the stimuli were 38.4° in the horizontal plane and ranged from 13.81° to 26.18° in the vertical plane.

Fig. 2: Action recognition performance by humans on the Hollywood-2 database. The confusion matrix includes the 12 action labels plus the 4 most frequent combinations of labels in the ground truth.

Recording protocol: Before each video sequence was shown, participants in the active group were required to fixate the center of the screen. Display would proceed automatically using the trigger area-of-interest feature provided by the iView X software. Participants had to identify the actions in each video sequence. Their multiple choice answers were recorded through a set of check-boxes displayed at the end of each video, which the subject manipulated using a mouse.2 Participants in the free viewing group underwent a similar protocol, the only difference being that the questionnaire answering step was skipped. To avoid fatigue, we split the set of stimuli into 20 sessions (for the active group) and 16 sessions (for the free viewing group), each participant undergoing no more than one session per day. Each session consisted of 4 blocks, each designed to take approximately 8 minutes to complete (calibration excluded), with 5-minute breaks between blocks. Overall, it took approximately 1 hour for a participant to complete one session. The video sequences were shown to each subject in a different random order.
2. While the representation of actions is an open problem, we relied on the datasets and labeling of the computer vision community, as a first study, and to maximize impact on current research. In the long run, weakly supervised learning could be better suited to map persistent structure to higher level semantic action labels.

4 SPATIAL AND SEQUENTIAL CONSISTENCY

4.1 Action Recognition by Humans

Our goal is to create a data set that captures the gaze patterns of humans solving a recognition task. Therefore, it is important to ensure that our subjects are successful at this task. Fig. 2 shows the confusion matrix between the answers of human subjects and the ground truth. For Hollywood-2, there can be multiple labels associated with the same video. We show, apart from the 12 action labels, the 4 most common combinations of labels occurring in the ground truth and omit less common ones. The analysis reveals, apart from near-perfect performance, the types of errors humans are prone to make. The most frequent human errors are omissions of one of the actions co-occurring in a video. False positives are much less frequent. The third type of error, mislabeling a video entirely, almost never happens, and when it does it usually involves semantically related actions, e.g. DriveCar and GetOutOfCar or Kiss and Hug.

4.2 Spatial Consistency Among Subjects

In this section, we investigate how well the regions fixated by human subjects agree on a frame by frame basis, by generalizing to video data the procedure used by Ehinger et al. [22] in the case of static stimuli.

Evaluation Protocol: For the task of locating people in a static image, [22] have evaluated how well one can predict the regions fixated by a particular subject from the regions fixated by the other subjects on the same image. In itself, this measure is however not fully meaningful, as part of the inter-subject agreement is due to bias in the stimulus itself (e.g. photographer's bias) or due to the tendency of humans to fixate more often at the center of the screen [14]. Therefore, one can address this issue by checking, as a cross-stimulus control, how well the fixation of a subject on one stimulus can be predicted from those of the other subjects on a different, unrelated, stimulus. Normally, the average precision when predicting fixations on the same stimulus is much greater than on different stimuli. We generalize this protocol to video by randomly choosing frames from our videos and checking inter-subject correlation on them. We test each subject in turn with respect to the other (training) subjects. An empirical saliency map is generated by adding a Dirac impulse at each pixel fixated by the training subjects in that frame, followed by the application of a Gaussian blur filter. We then consider this map as a confidence map for predicting the test subject's fixation. There is a label of 1 at the test subject's fixation and 0 elsewhere. The area under the curve (AUC) score for this classification problem is then computed for the test subject. The average score over all test subjects defines the final consistency metric. This value ranges from 0.5, when no consistency or bias effects are present in the data, to 1, when all subjects fixate the same pixel and there is no measurement noise. For cross-stimulus control, we repeat this process for pairs of frames chosen from different videos and attempt to predict the fixation of each test subject from the fixations of the other subjects on the unrelated frame.
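To make the evaluation protocol above concrete, the following sketch computes the per-frame inter-subject consistency score. It is an illustrative re-implementation under our own assumptions (array-based fixation lists, an arbitrary blur width, and helper names of our choosing), not the authors' code.

```python
import numpy as np
from scipy.ndimage import gaussian_filter
from sklearn.metrics import roc_auc_score

def frame_consistency_auc(train_fixations, test_fixations, frame_shape, sigma=25.0):
    """Inter-subject spatial consistency for a single frame.

    train_fixations: list of (row, col) pixels fixated by the 'training' subjects.
    test_fixations:  list of (row, col) pixels fixated by the held-out subject.
    sigma:           width of the Gaussian blur (a free parameter here).
    """
    # Empirical saliency map: Dirac impulses at training fixations, then Gaussian blur.
    saliency = np.zeros(frame_shape, dtype=np.float64)
    for r, c in train_fixations:
        saliency[r, c] += 1.0
    saliency = gaussian_filter(saliency, sigma)

    # Binary ground truth: 1 at the test subject's fixated pixels, 0 elsewhere.
    labels = np.zeros(frame_shape, dtype=np.uint8)
    for r, c in test_fixations:
        labels[r, c] = 1

    # AUC of the saliency values as a predictor of the test subject's fixations.
    return roc_auc_score(labels.ravel(), saliency.ravel())

# The final consistency metric averages this score over all held-out subjects and
# frames; the cross-stimulus control pairs fixations from unrelated videos instead.
```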
5
TABLE 1: Spatial and sequential consistency analysis results, measured both globally and for each action class. Columns (a-b) represent the areas under the ROC curves for spatial inter-subject consistencies and the corresponding cross-stimulus control. The classes marked by * show significant shooter's bias due to the very similar filming conditions of all the videos within them (same shooting angle and position, no clutter). Good sequential consistency is revealed by the match scores obtained by temporal alignment (c-d) and Markov dynamics (e-f).
TABLE 2: Spatial Consistency Between Active and Free-Viewing Subjects

(a) Per free-viewing subject:
  free-viewing subject   p-value (Hollywood-2)   p-value (UCF Sports)
  1                      0.67                    0.55
  2                      0.67                    0.55
  3                      0.60                    0.60
  4                      0.65                    0.64
  Mean                   0.65                    0.58

(b) Per action class:
  Hollywood-2                          UCF Sports Actions
  Action Class     p-value             Action Class     p-value
  AnswerPhone      0.66                Dive             0.68
  DriveCar         0.65                GolfSwing        0.67
  Eat              0.65                Kick             0.58
  FightPerson      0.65                Lift             0.57
  GetOutCar        0.65                RideHorse        0.51
  HandShake        0.68                Run              0.54
  HugPerson        0.63                Skateboard       0.60
  Kiss             0.63                Swing            0.59
  Run              0.65                Walk             0.51
  SitDown          0.61                Mean             0.58
  SitUp            0.65
  StandUp          0.64
  Mean             0.65

We measure consistency as the p-value associated with predicting each free-viewer from the saliency map derived from active subjects, averaged over 1000 randomly chosen video frames. We find that none of the scanpaths belonging to our free-viewing subjects (a) deviates significantly from ground truth data. Likewise, no significant differences are found when restricting the analysis to videos belonging to specific action classes (b).
We then define two metrics, AOI Markov dynamics and temporal AOI alignment, and show how they can be computed for this representation. After we define a baseline for our evaluation, we conclude with a discussion of the results.

Scanpath Representation: Human fixations tend to be tightly clustered spatially at one or more locations in the image. Assuming that such regions, called areas of interest (AOIs), can be identified, the sequence of fixations belonging to a subject can be represented discretely by assigning each fixation to the closest AOI. For example, from the video depicted in fig.4-left, we identify six AOIs: the bumper of the car, its windshield, the passenger and the handbag he carries, the driver and the side mirror. We then trace the scan path of each subject through the AOIs based on spatial proximity, as shown in fig.4-right. Each fixation gets assigned a label. For subject 2 shown in the example, this results in the sequence [bumper, windshield, driver, mirror, driver, handbag]. Notice that AOIs are semantically meaningful and tend to correspond to physical objects. Interestingly, this supports recent computer vision strategies based on object detectors for action recognition [8], [41], [42], [43].

Automatically Finding AOIs: Defining areas of interest manually is labour intensive, especially in the video domain. Therefore, we introduce an automatic method for determining their locations based on clustering the fixations of all subjects in a frame. We start by running k-means with 1 cluster and successively increase the number of clusters until the sum of squared errors drops below a threshold. We then link centroids from successive frames into tracks, as long as they are closely located spatially. For robustness, we allow for a temporal gap during the track building process. Each resulting track becomes an AOI, and each fixation is assigned to the closest AOI at the time of its creation.
Fig. 4: Areas of interest are obtained automatically by clustering the fixations of subjects. Left: Heat maps illustrating the assignments of fixations to AOIs (example frames 145, 215 and 230); the colored blobs have been generated by pooling together all fixations belonging to the same AOI. Right: Scan paths through automatically generated AOIs (colored boxes, e.g. bumper, passenger, driver, mirror, handbag) for three subjects (2, 4 and 10), plotted against time (frames); arrows illustrate saccades. Semantic labels have been manually assigned and illustrate the existence of cognitive routines centered at semantically meaningful objects.
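The automatic AOI construction described above can be sketched as follows. This is a simplified illustration, assuming fixations are given per frame as pixel coordinates; the thresholds and the greedy track-linking strategy are our own choices rather than the authors' exact procedure.

```python
import numpy as np
from sklearn.cluster import KMeans

def frame_aoi_centroids(fixations, sse_threshold, max_k=10):
    """Cluster the fixations of all subjects in one frame into AOI centroids.

    fixations: (N, 2) array of fixated pixel coordinates in this frame. The
    number of clusters grows until the k-means sum of squared errors (inertia)
    drops below sse_threshold, as described in the text."""
    fixations = np.asarray(fixations, dtype=np.float64)
    if len(fixations) == 0:
        return np.empty((0, 2))
    for k in range(1, min(max_k, len(fixations)) + 1):
        km = KMeans(n_clusters=k, n_init=10).fit(fixations)
        if km.inertia_ <= sse_threshold:
            break
    return km.cluster_centers_

def link_centroids(frames_centroids, dist_threshold=40.0, max_gap=5):
    """Greedily link per-frame centroids into AOI tracks.

    frames_centroids: list over frames; each entry is a (K_t, 2) array.
    A centroid extends the closest live track if it lies within dist_threshold;
    tracks survive up to max_gap frames without support (the temporal gap
    tolerance mentioned in the text). Returns a list of tracks, each a list of
    (frame_index, centroid) pairs."""
    tracks = []   # all tracks created so far
    live = []     # indices into `tracks` of currently live tracks
    for t, cents in enumerate(frames_centroids):
        # retire tracks that have been unsupported for too long
        live = [i for i in live if t - tracks[i][-1][0] <= max_gap]
        for c in cents:
            dists = [np.linalg.norm(c - tracks[i][-1][1]) for i in live]
            if dists and min(dists) < dist_threshold:
                tracks[live[int(np.argmin(dists))]].append((t, c))
            else:
                tracks.append([(t, c)])
                live.append(len(tracks) - 1)
    return tracks
```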
AOI Markov Dynamics: In order to capture the dynamics of eye movements, and due to data sparsity, we represent the transitions of human visual attention between AOIs by means of a first-order Markov process. Given a set of human fixation strings f_i, where the j-th fixation of subject i is encoded by the index f_i^j ∈ {1, ..., A} of the corresponding AOI, we estimate the probability p(s_t = b | s_{t-1} = a) of transitioning to AOI b at time t, given that AOI a was fixated at time t − 1, by counting transition frequencies. We regularize the model using Laplace smoothing to account for data sparsity. The probability of a novel fixation sequence g under this model is \prod_j p(s_t = g^j | s_{t-1} = g^{j-1}), assuming the first state in the model, the central fixation, has probability 1. We measure the consistency among a set of subjects by considering each subject in turn, computing the probability of his scanpath with respect to the model trained from the fixations of the other subjects, and normalizing by the number of fixations in his scanpath. The final consistency score is the average probability over all subjects.
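A minimal sketch of the AOI Markov dynamics consistency measure is given below, assuming AOI strings are lists of integer AOI indices. The per-transition (geometric mean) normalization is our reading of "normalizing by the number of fixations"; the authors' exact normalization may differ.

```python
import numpy as np

def transition_matrix(train_strings, num_aois, alpha=1.0):
    """First-order Markov transition model over AOI indices (0..num_aois-1),
    estimated by counting transitions and applying Laplace smoothing (alpha)."""
    counts = np.full((num_aois, num_aois), alpha, dtype=np.float64)
    for s in train_strings:
        for a, b in zip(s[:-1], s[1:]):
            counts[a, b] += 1.0
    return counts / counts.sum(axis=1, keepdims=True)

def scanpath_consistency(strings, num_aois):
    """Leave-one-subject-out consistency: score each subject's AOI string under
    the model trained on the others, as a per-transition probability."""
    scores = []
    for i, s in enumerate(strings):
        P = transition_matrix(strings[:i] + strings[i + 1:], num_aois)
        probs = [P[a, b] for a, b in zip(s[:-1], s[1:])]
        scores.append(np.exp(np.mean(np.log(probs))) if probs else 1.0)
    return float(np.mean(scores))

# Example: three subjects' scanpaths over 4 AOIs.
print(scanpath_consistency([[0, 1, 2, 3], [0, 1, 2, 2], [0, 2, 2, 3]], num_aois=4))
```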
Temporal AOI Alignment: Another approach to evaluating sequential consistency is to measure how well pairs of AOI strings corresponding to different subjects can be globally aligned, based on their content. Although it does not model transitions explicitly, a sequence alignment has the advantage of being able to handle gaps and missing elements. An efficient algorithm with these properties, due to Needleman and Wunsch [44], uses dynamic programming to find the optimal match between two sequences f^{1:n} and g^{1:m}, by allowing for the insertion of gaps in either sequence. It recursively computes the alignment score h_{i,j} between subsequences f^{1:i} and g^{1:j} by considering the alternative costs of a match between f^i and g^j versus the insertion of a gap into either sequence. The final consistency metric is the average alignment score over all pairs of distinct subjects, normalized by the length of the longest sequence in each pair. We set the similarity metric to 1 for matching AOI symbols and to 0 otherwise, and assume no penalty is incurred for inserting gaps in either sequence. This setting gives the score a semantic meaning: it is the average percentage of symbols that can be matched when determining the longest common subsequence of fixations among pairs of subjects.
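Under this scoring (match 1, mismatch 0, no gap penalty), the Needleman-Wunsch recursion reduces to a longest-common-subsequence computation, which the short sketch below implements; function and variable names are illustrative.

```python
def aoi_alignment_score(f, g):
    """Needleman-Wunsch alignment of two AOI strings with similarity 1 for
    matching symbols, 0 for mismatches, and no gap penalty. Under this scoring
    the optimal alignment score equals the longest-common-subsequence length;
    we normalize by the longer sequence, as in the text."""
    n, m = len(f), len(g)
    h = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = h[i - 1][j - 1] + (1 if f[i - 1] == g[j - 1] else 0)
            h[i][j] = max(match, h[i - 1][j], h[i][j - 1])
    return h[n][m] / float(max(n, m)) if max(n, m) > 0 else 1.0

# Example: two subjects' AOI strings from the scanpath representation above.
print(aoi_alignment_score(["bumper", "windshield", "driver", "mirror"],
                          ["bumper", "driver", "mirror", "handbag"]))
```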
Baselines: In order to provide a reference for our consistency evaluation, we generate 10 random AOI strings per video and compute the consistency on these strings under our metrics. We note however that the dynamics of the stimulus places constraints on the sampling process. First, a random string must obey the time ordering relations among AOIs (e.g. the passenger is not visible until the second half of the video in fig.4). Second, our automatic AOIs are derived from subject fixations and are biased by their gaze preference. The lifespan of an AOI will not be initiated until at least one subject has fixated it, even if the corresponding object is already visible. To remove some of the resulting bias from our evaluation, we extend each AOI both forward and backwards in time, until the image patch at its center has undergone significant appearance changes, and use these extended AOIs when generating our random baselines.

Findings: For the Hollywood-2 dataset, we find that the average transition probability of each subject's fixations under AOI Markov dynamics is 70%, compared to 13% for the random baseline (table 1). We also find that, across all videos, 71% of the AOI symbols are successfully aligned, compared to only 51% for the random baseline. We notice similar gaps in the UCF Sports dataset. These results indicate a high degree of consistency in human eye movement dynamics across the two datasets. Alignment scores vary to some extent across classes.

5 SEMANTICS OF FIXATED STIMULUS PATTERNS

Having shown that visual attention is consistently drawn to particular locations in the stimulus, a natural question is whether the patterns at these locations are repeatable and have semantic structure. In this section, we investigate this by building vocabularies over image patches collected from the locations fixated by our subjects when viewing videos of the various actions in the Hollywood-2 data set.

Protocol: Each patch spans 1° away from the center of fixation, to simulate the extent of high foveal acuity. We then represent each patch using HoG descriptors with a spatial grid resolution of 3 × 3 cells and 9 orientation bins. We cluster the resulting descriptors using k-means into 500 clusters.4
4. We have found that using 500 clusters provides a good tradeoff between the semantic under- and over-segmentation of the image patch space.
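A rough sketch of this vocabulary-building protocol, assuming grayscale frames and using off-the-shelf HoG and k-means implementations (the patch radius standing in for the 1° foveal extent is a placeholder value):

```python
import numpy as np
from skimage.feature import hog
from sklearn.cluster import MiniBatchKMeans

def fixation_patch_descriptors(frames, fixations, patch_radius=32):
    """Extract HoG descriptors (3x3 spatial cells, 9 orientation bins) from
    square patches of grayscale frames centered at fixated locations."""
    descriptors = []
    for frame, (r, c) in zip(frames, fixations):
        patch = frame[max(r - patch_radius, 0):r + patch_radius,
                      max(c - patch_radius, 0):c + patch_radius]
        if patch.shape[0] < 2 * patch_radius or patch.shape[1] < 2 * patch_radius:
            continue  # skip fixations too close to the image border
        cell = (2 * patch_radius) // 3
        descriptors.append(hog(patch, orientations=9,
                               pixels_per_cell=(cell, cell),
                               cells_per_block=(1, 1)))
    return np.array(descriptors)

def build_patch_vocabulary(descriptors, num_words=500):
    """Cluster fixated-patch descriptors into a 500-word visual vocabulary."""
    return MiniBatchKMeans(n_clusters=num_words, n_init=3).fit(descriptors)
```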
Fig. 5: Sampled entries from visual vocabularies obtained by clustering fixated image regions in the space of HoG descriptors, for several action classes from the Hollywood-2 dataset (Answerphone, DriveCar, Eat, GetOutCar, Run, Kiss).

Findings: In Fig. 5 we illustrate image patches that have been assigned high probability by the mixture of Gaussians model underlying k-means. Each row contains the top 5 most probable patches from a cluster (in decreasing order of their probability). The patches in a row are restricted to come from different videos, i.e. we remove any patches for which there is a higher probability patch coming from the same video. Note that clusters are not entirely semantically homogeneous, in part due to the limited descriptive power of HoG features (e.g. a driving wheel is clustered together with a plate, see row 4 of the vocabulary for class Eat). Nevertheless, fixated patches do include semantically meaningful image regions of the scene, including actors in various poses (people eating or running or getting out of vehicles), objects being manipulated or related to the action (dishes, telephones, car doors) and, to a lesser extent, context of the surroundings (vegetation, street signs). Fixations fall almost always on objects or parts of objects and almost never on unstructured parts of the image. Our analysis also suggests that subjects generally avoid focusing on object boundaries unless an interaction is being shown (e.g. kissing), otherwise preferring to center the object, or

6 EVALUATION PIPELINE

For all computer action recognition models, we use the same processing pipeline, consisting of an interest point operator (computer vision or biologically derived), descriptor extraction, bag of visual words quantization and an action classifier.

Interest Point Operator: We experiment with various interest point operators, both computer vision based (see §7.1) and biologically derived (see §7.1, §9). Each interest point operator takes as input a video and generates a set of spatio-temporal coordinates, with associated spatial and, with the exception of the 2D fixation operator presented in §7.1, temporal scales.

Descriptors: We obtain features by extracting descriptors at the spatio-temporal locations returned by our interest point operator. For this purpose, we use the space-time generalization of the HoG descriptor as described in [45], as well as the MBH descriptor [46] computed from optical flow. We consider 7 grid configurations and extract both types of descriptors for each configuration, ending up with 14 features for classification. We use the same grid configurations in all our experiments. We have selected them from a wider candidate set based on their individual performance for action recognition. Each HoG and MBH block is normalized using an L2-Hys normalization scheme with the recommended threshold of 0.7 [47].

Visual Dictionaries/Second Order Pooling: We cluster the resulting descriptors using k-means into visual vocabularies of 4000 visual words. For computational reasons, we only use 500,000 randomly sampled descriptors as input to the clustering step, and represent each video by the L1 normalized histogram of its visual words. Alternatively, in §9 we also experiment with a different encoding scheme based on second order pooling [59], which provides a tradeoff between computational cost and classification accuracy.
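The dictionary construction and histogram encoding step can be summarized as follows; the clustering backend and function names are our own choices, shown only to make the encoding explicit.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def build_visual_dictionary(sampled_descriptors, num_words=4000):
    """k-means vocabulary over a random subsample of descriptors
    (500,000 in the paper; the clustering backend here is our choice)."""
    return MiniBatchKMeans(n_clusters=num_words, n_init=3).fit(sampled_descriptors)

def encode_video(video_descriptors, dictionary):
    """Quantize a video's descriptors against the vocabulary and return the
    L1-normalized bag-of-visual-words histogram."""
    words = dictionary.predict(video_descriptors)
    hist = np.bincount(words, minlength=dictionary.n_clusters).astype(np.float64)
    return hist / max(hist.sum(), 1.0)
```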
Classifiers: From the histograms we compute kernel matrices using the RBF-χ2 kernel and combine the 14 resulting kernel matrices by means of a Multiple Kernel Learning (MKL) framework [48]. We train one classifier for each action label in a one-vs-all fashion. We determine the kernel scaling parameter, the SVM loss parameter C and the σ regularization parameter of the MKL framework by cross-validation. We perform grid search over the range [10^-4, 10^4] × [10^-4, 10^4] × [10^-2, 10^-4], with a multiplicative step of 10^0.5. We draw 20 folds and run 3 cross-validation trials at each grid point, and select the parameter setting giving the best cross-validation average precision to train our final action classifier. We report the average precision on the test set for each action class.
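For illustration, the sketch below computes an RBF-χ2 kernel between bag-of-words histograms and trains a one-vs-all SVM on a precomputed kernel. Note that it replaces the MKL combination of the 14 channel kernels described above with a single given kernel, so it is a simplification rather than the authors' full classifier.

```python
import numpy as np
from sklearn.metrics.pairwise import chi2_kernel
from sklearn.svm import SVC

def rbf_chi2_kernel(histograms_a, histograms_b, gamma=1.0):
    """RBF-chi-squared kernel between L1-normalized bag-of-words histograms.
    sklearn's chi2_kernel computes exp(-gamma * chi2_distance), i.e. the
    exponentiated (RBF) form commonly used with histogram features."""
    return chi2_kernel(histograms_a, histograms_b, gamma=gamma)

def train_one_vs_all(kernel_train, labels, action, C=1.0):
    """One-vs-all SVM for a single action label on a precomputed kernel.
    A full MKL combination of the per-channel kernels, as in the paper, would
    additionally learn per-kernel weights; here the combined kernel is given."""
    y = (np.asarray(labels) == action).astype(int)
    clf = SVC(kernel="precomputed", C=C)
    clf.fit(kernel_train, y)
    return clf
```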
distribution over the video frame, weighted by a parameter α: p_sal = (1 − α) p_fix + α p_unif. We now define an interest point operator that randomly samples spatio-temporal locations from the ground truth probability distribution p_sal, both at training and testing time, and associates them with random spatio-temporal scales uniformly distributed in the range [2, 8]. We train classifiers for each action by feeding the output of this operator through our pipeline (see §6). By doing so, we build vocabularies from descriptors sampled from saliency maps derived from ground truth human fixations. We determine the optimal values for the α regularization parameter and the fovea radius σ by cross-validation. We also run two baselines: the spatio-temporal Harris operator (see §7.1) and the operator that samples locations from the uniform probability distribution, which can be obtained by setting α = 1 in p_sal. In order to make the comparison meaningful, we set the number of interest points sampled by our saliency-based and uniform random operators in each frame to match the firing rate of the Harris corner detector.
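A compact sketch of this saliency-based sampling operator for a single frame is shown below; the handling of empty fixation maps and all names are our own additions.

```python
import numpy as np

def sample_interest_points(fixation_density, alpha, num_points, rng,
                           scale_range=(2, 8)):
    """Sample spatio-temporal interest point locations for one frame from
    p_sal = (1 - alpha) * p_fix + alpha * p_unif, with random scales drawn
    uniformly from scale_range. fixation_density is a non-negative map
    (e.g. blurred fixations) for this frame."""
    total = fixation_density.sum()
    if total > 0:
        p_fix = fixation_density / total
    else:
        p_fix = np.full_like(fixation_density, 1.0 / fixation_density.size)
    p_unif = np.full_like(p_fix, 1.0 / p_fix.size)
    p_sal = (1.0 - alpha) * p_fix + alpha * p_unif

    flat_idx = rng.choice(p_sal.size, size=num_points, p=p_sal.ravel())
    rows, cols = np.unravel_index(flat_idx, p_sal.shape)
    scales = rng.uniform(scale_range[0], scale_range[1], size=num_points)
    return list(zip(rows.tolist(), cols.tolist(), scales.tolist()))

# Usage sketch: num_points per frame would be matched to the Harris detector's
# firing rate, as described above.
# rng = np.random.default_rng(0)
# pts = sample_interest_points(saliency_map, alpha=0.1, num_points=50, rng=rng)
```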
Findings: We find that ground truth saliency sampling (Table 4e) outperforms both the Harris and the uniform sampling operators significantly, at equal interest point sparsity rates. Our results indicate that saliency maps encoding only the weak surface structure of fixations (no time ordering is used) can be used to boost the accuracy of contemporary methods and descriptors used for computer action recognition. Up to this point, we have relied on the availability of ground truth saliency maps at test time. A natural question is whether it is possible to reliably predict saliency maps, to a degree that still preserves the benefits in action classification accuracy. This will be the focus of the next section.

8 SALIENCY MAP PREDICTION

Motivated by the findings presented in the previous section, we now show that we can effectively predict saliency maps. We start by introducing two evaluation measures for saliency prediction. The first is the area-under-the-curve (AUC), which is widely used in the human vision community. The second measure is inspired by our application of saliency maps to action recognition. In the pipeline we proposed in §7.2, ground truth saliency maps drive the random sampling process of our interest point operator. We will use the spatial Kullback-Leibler (KL) divergence measure to compare the predicted and the ground truth saliencies. We also propose and study several features for saliency map prediction, both static and motion based. Our analysis includes features derived directly from low, mid and high level image information. In addition, we train a HoG-MBH detector that fires preferentially at fixated locations, using the vast amount of eye movement data available in the dataset. We evaluate all these features and their combinations on our dataset, and find that our detector gives the best performance under the KL divergence measure.

Saliency Map Comparison Measures: The most commonly used measure for evaluating saliency maps in the image domain [5], [22], [49], the AUC measure, interprets saliency maps as predictors for separating fixated pixels from the rest. The ROC curve is computed for each image, and the average area under the curve over the whole set of testing images gives the final score. This measure emphasizes the capacity of a saliency map to rank pixels higher when they are fixated than when they are not. It does not imply, however, that the normalized probability distribution associated with the saliency map is close to the ground truth saliency map for the image. A better suited way to compare probability distributions is the spatial Kullback-Leibler (KL) divergence, which we propose as our second evaluation measure, defined as:

D_KL(p, s) = \sum_{x \in I} p(x) \log ( p(x) / s(x) )     (1)

where p(x) is the value of the normalized saliency prediction at pixel x, and s is the ground truth saliency map. A small value of this metric implies that random samples drawn from the predicted saliency map p are likely to approximate well the ground truth saliency s.
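Both evaluation measures are straightforward to compute; the sketch below shows one possible implementation (the small epsilon added for numerical stability is ours, not part of eq. (1)).

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def saliency_kl(predicted, ground_truth, eps=1e-12):
    """Spatial KL divergence D_KL(p, s) between a predicted saliency map and
    the ground truth map, both normalized to probability distributions over
    pixels (eq. 1)."""
    p = predicted.astype(np.float64) + eps
    s = ground_truth.astype(np.float64) + eps
    p /= p.sum()
    s /= s.sum()
    return float(np.sum(p * np.log(p / s)))

def saliency_auc(predicted, fixation_mask):
    """AUC measure: the saliency value as a score for separating fixated pixels
    (fixation_mask == 1) from the rest; the caller averages over test frames."""
    return roc_auc_score(fixation_mask.ravel().astype(int), predicted.ravel())
```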
Saliency Predictors: Having established evaluation criteria, we now run several saliency map predictors on our dataset, which we describe below.

Baselines: We also provide three baselines for saliency map comparison. The first one is the uniform saliency map, which assigns the same fixation probability to each pixel of the video frame. Second, we consider the center bias (CB) feature, which assigns to each pixel its distance to the center of the frame, regardless of its visual contents. This feature can capture both the tendency of human subjects to fixate near the center of the screen and the preference of the photographer to center objects in the field of view. At the other end of the spectrum lies the human saliency baseline, which derives a saliency map from half of our human subjects and is evaluated with respect to the fixations of the remaining ones.

Static Features (SF): We also include features used by the human vision community for saliency prediction in the image domain [5], which can be classified into three categories: low, mid and high-level. The four low level features used are color information, steerable pyramid subbands, the feature maps used as input by Itti&Koch's model [15] and the output of the saliency model described by Oliva and Torralba [50] and Rosenholtz [51]. We run a horizon detector [50] as our mid-level feature. Object detectors are used as high level features, which comprise faces [52], persons and cars [53].

Motion Features (MF): We augment our set of predictors with five novel (in the context of saliency prediction) feature maps, derived from motion or space-time information.

Flow: We extract optical flow from each frame using a state of the art method [54] and compute the magnitude of the flow at each location. Using this feature, we wish to investigate whether regions with significant optical changes attract human gaze.

Pb with flow: We run the Pb edge detector [55] with both image intensity and the flow field as inputs. This detector fires both at intensity and motion boundaries.
TABLE 5: Action Recognition Performance on the UCF Sports Actions Data Set. Left: Performance comparison among several classification methods (see table 4 for description) on the UCF Sports Dataset, listing each method, its sampling distribution and its accuracy. Right: Confusion matrices over the 9 action classes obtained using dense trajectories [34] and interest points sampled sparsely from saliency maps predicted by our HoG-MBH detector.
Flow bimodality: We wish to investigate how often people fixate on motion edges, where the flow field typically has a bimodal distribution. To do this, for a neighbourhood centered at a given pixel x, we run k-means, first with 1 and then with 2 modes, obtaining sum-of-squared-error values of s_1(x) and s_2(x), respectively. We weight the distance between the centers of the two modes by a factor inversely proportional to exp(1 − s_1(x)/s_2(x)), to enforce a high response at positions where the optical flow distribution is strongly bimodal and its mode centers are far apart from each other.
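One possible implementation of this bimodality response for a single neighbourhood of flow vectors, following our reconstruction of the weighting factor above:

```python
import numpy as np
from sklearn.cluster import KMeans

def flow_bimodality(flow_patch):
    """Bimodality response for the optical flow vectors in a neighbourhood.

    flow_patch: (N, 2) array of (u, v) flow vectors around a pixel. We fit
    k-means with 1 and 2 modes, take the sum-of-squared-error values s1 and s2,
    and weight the distance between the two mode centers by a factor inversely
    proportional to exp(1 - s1/s2), so strongly bimodal neighbourhoods with
    well-separated modes respond strongly."""
    flow_patch = np.asarray(flow_patch, dtype=np.float64)
    km1 = KMeans(n_clusters=1, n_init=5).fit(flow_patch)
    km2 = KMeans(n_clusters=2, n_init=5).fit(flow_patch)
    s1, s2 = km1.inertia_, max(km2.inertia_, 1e-12)
    center_dist = np.linalg.norm(km2.cluster_centers_[0] - km2.cluster_centers_[1])
    return center_dist / np.exp(1.0 - s1 / s2)
```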
Harris: This feature encodes the spatio-temporal Harris cornerness measure as defined in [7].

HoG-MBH detector: The saliency models we have considered so far access higher level image structure by means of pre-trained object detectors. This approach does not prove effective on our dataset, due to the high variability in pose and illumination. On the other hand, our dataset provides a rich set of human fixations. We observe that fixated image regions are often semantically meaningful, sometimes corresponding to objects or object parts. Inspired by this insight, we aim to exploit the structure present at these locations and train a detector for human fixations. Our detector uses both static (HoG) and motion (MBH) descriptors centered at fixations. We run our detector in a sliding window fashion across the entire video and obtain a saliency map.

Feature combinations: We linearly combine various subsets of our feature maps for better saliency prediction. We investigate the predictive power of static features and motion features alone and in combination, with and without central bias.

Experimental Protocol: We use 10^6 examples to train our detector, half of which are positive and half of which are negative. At each of these locations, we extract spatio-temporal HoG and MBH descriptors. We opt for 3 grid configurations, namely 1x1x1, 2x2x1 and 3x3x1 cells. We have experimented with higher temporal grid resolutions, but found only modest improvements in detector performance at a large increase in computational cost. We concatenate all 6 descriptors and lift the resulting vector into a higher dimensional space by employing an order 3 χ2 kernel approximation using the approach of [56]. We train an SVM using the LibLinear
package [57] to obtain our HoG-MBH detector.
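The detector training can be approximated with off-the-shelf components as below. The explicit additive χ2 feature map stands in for the kernel approximation of [56], sample_steps=3 is only our guess at the "order 3" setting, and LinearSVC is scikit-learn's LIBLINEAR wrapper; treat this as a sketch rather than the authors' implementation.

```python
import numpy as np
from sklearn.kernel_approximation import AdditiveChi2Sampler
from sklearn.svm import LinearSVC

def train_fixation_detector(descriptors, labels):
    """Train a fixation detector from concatenated HoG/MBH descriptors.

    descriptors: (N, D) non-negative descriptors extracted at an equal number
    of fixated (label 1) and non-fixated (label 0) spatio-temporal locations."""
    lift = AdditiveChi2Sampler(sample_steps=3)   # explicit additive chi2 map
    lifted = lift.fit_transform(descriptors)
    clf = LinearSVC(C=1.0)                       # LIBLINEAR-backed linear SVM
    clf.fit(lifted, labels)
    return lift, clf

def detector_score_map(clf, lift, window_descriptors, grid_shape):
    """Slide the detector over a frame: score the descriptor of each window
    position and reshape the scores into a (coarse) saliency map."""
    scores = clf.decision_function(lift.transform(window_descriptors))
    return scores.reshape(grid_shape)
```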
For combining feature maps, we train a linear predictor on 500 randomly selected frames from the Hollywood-2 training set, using our fixation annotations. We exclude the first 8 frames of each video from the sampling process, in order to avoid the effects of the initial central fixation in our data collection setup. We also randomly select 250 and 500 frames for validation and testing, respectively. To avoid correlations, the video sets used to sample training, validation and testing frames are disjoint.

Findings: When evaluated at 10^6 random locations, half of which were fixated by the subjects and half not, the average precision of our detector is 76.7% when both MBH and HoG descriptors are used. HoG descriptors used in isolation perform better (73.4% average precision) than MBH descriptors alone (70.5%), indicating that motion structure contributes less to detector performance than static image structure does. There is, however, a significant advantage in combining both sources of information.

When evaluated under the AUC metric, combining predictors always improves performance (Table 6). As a general trend, low-level features are better predictors than high level ones. The low level motion features (flow, pb edges with flow, flow bimodality, Harris cornerness) provide similar performance to the static low-level features. Our HoG-MBH detector is comparable to the best static feature, the horizon detector, under the AUC metric.

Interestingly, when evaluated according to KL divergence, the ranking of the saliency maps changes: the HoG-MBH detector performs best, and the only other predictor that significantly outperforms central bias is the horizon detector. Under this metric, combining features does not always improve performance, as the linear combination method of [5] optimizes pixel-level classification accuracy and as such is not able to account for the inherent competition that takes place among these predictions due to image-level normalization. We conclude by noticing that fusing our predicted maps as well as our static and dynamic features gives the highest results under the AUC metric. Moreover, the HoG-MBH detector, trained using our eye movement data, is the best predictor of visual saliency from our candidate set under the probabilistic measure of matching the spatial distribution of human fixations.

TABLE 6: Evaluation of Individual Feature Maps and Combinations for Human Saliency Prediction.

  baselines                   AUC (a)  KL (b)     our motion features (MF)   AUC (a)  KL (b)
  uniform baseline            0.500    18.63      flow magnitude             0.626    18.57
  central bias (CB)           0.840    15.93      pb edges with flow         0.582    17.74
  human                       0.936    10.12      flow bimodality            0.637    17.63
                                                  Harris cornerness          0.619    17.21
  static features (SF)                            HoG-MBH detector           0.743    14.95
  color features [5]          0.644    17.90
  subbands [58]               0.634    17.75      feature combinations
  Itti&Koch channels [15]     0.598    16.98      SF [5]                     0.789    16.16
  saliency map [50]           0.702    17.17      SF + CB [5]                0.861    15.96
  horizon detector [50]       0.741    15.45      MF                         0.762    15.62
  face detector [52]          0.579    16.43      MF + CB                    0.830    15.97
  car detector [53]           0.500    18.40      SF + MF                    0.812    15.94
  person detector [53]        0.566    17.13      SF + MF + CB               0.871    15.89

We show area under the curve (AUC) and KL divergence. AUC and KL induce different saliency map rankings, but for visual recognition, measures that emphasize spatial localization are essential (see also table 4 for action recognition results and fig.6 for illustration).
9 AUTOMATIC VISUAL ACTION RECOGNITION

We next investigate action recognition performance when interest points are sampled from the saliency maps predicted by our HoG-MBH detector, which we choose because it best approximates the ground truth saliency map spatially, under the KL divergence metric. Apart from sampling from the uniform and ground truth distributions, as a second baseline we also sample interest points using the central bias saliency map, which was also shown to approximate human fixations to a reasonable extent under the less intuitive AUC measure (Table 6). We also investigate whether our end-to-end recognition system can be combined with the state-of-the-art approach of [34] to obtain better performance.

Experimental Protocol: We first run our HoG-MBH detector over the entire Hollywood-2 data set and obtain our automatic saliency maps. We then configure our recognition pipeline with an interest point operator that samples locations using these saliency maps as probability distributions. We also run the pipeline of [34] and combine the four kernel matrices produced in the final stage of their classifier with the ones we obtain for our 14 descriptors, sampled from the saliency maps, using MKL.

We also test our recognition pipeline on the UCF Sports dataset, which is substantially different in terms of action classes, scene clutter, shooting conditions and the evaluation procedure. Unlike Hollywood-2, this database provides no training and test sets, and classifiers are generally evaluated by cross-validation. We follow the standard procedure by first extending the dataset with horizontally flipped versions of each video. For each cross-validation fold, we leave out one original video and its flipped version and train a multi-class classifier. We test on the original video, but not its flipped version. We compute the confusion matrix and report the average accuracy over all classes.

Our experimental procedure for the UCF Sports dataset closely follows the one we use for Hollywood-2. We re-train our HoG-MBH detector on a subset of 50 video pairs (original and flipped), while we use the remaining 100 pairs for testing. The average precision of our detector is 92.5% for the training set and 93.1% on our test set, which confirms that our detector does not overfit the data. We use the re-trained detector to run the same pipelines and baselines as for Hollywood-2.

Results: On both datasets, our saliency map based pipelines, both predicted and ground truth, perform markedly better than the pipeline sampling interest points uniformly (Tables 4c and 5). Although central bias is a relatively close approximation of human visual saliency on our datasets, it does not lead to performance that closely matches the one produced by these maps. Our automatic pipeline based on predicted saliency maps and our ground-truth based pipeline have similar performance, with a slight advantage being observed for predicted saliency in the case of Hollywood-2 and for ground truth saliency for UCF Sports (Tables 4d,e and 5).
Fig. 6: Saliency predictions for a video frame. Panels: (a) original image, (b) ground truth saliency, (c) CB, (d) flow magnitude, (e) pb edges with flow, (f) flow bimodality, (g) Harris cornerness, (h) HoG-MBH detector, (i) MF, (j) SF, (k) SF + MF, (l) SF + MF + CB. The figure shows the motion-based features in isolation (d-h) and in combinations (i-l); HoG-MBH detector maps are closest to the ground truth (b), consistent with Table 6b.
Fig. 7: Each panel group shows an image, its ground truth saliency together with central bias (CB), and the HoG-MBH detector output together with CB. Note that ground truth saliency maps (cyan) and the output of our HoG-MBH detector (yellow) are similar to each other, but qualitatively different from the central bias map (gray). This gives visual intuition for the significantly higher performance of the HoG-MBH detector over central bias saliency sampling, when used in an end-to-end computer visual action recognition system (Table 4c,d,e).
Our results confirm that the approximations produced by the HoG-MBH detector are qualitatively different from a central bias distribution, focusing on local image structure that frequently co-occurs in its training set of fixations, regardless of whether it is close to the image center or not (see Fig. 7). These structures are highly likely to be informative for predicting the actions in the dataset (e.g. the frames containing the eat and hug actions in Fig. 7). Therefore, the detector will tend to emphasize these locations as opposed to less relevant ones. This also explains why our predicted saliency maps can be as informative for action recognition as ground truth maps, even exceeding their performance on certain action classes: while humans will also fixate on structures not relevant for action recognition, fixated structures that are relevant to this task will occur at a higher frequency in our datasets. Hence, they will be well approximated by a detector trained in a bottom-up manner. This can explain why the performance balance is even more inclined towards predicted saliency maps on the UCF Sports Actions dataset, where motion and image patterns are more stable and easier to predict compared to the Hollywood-2 dataset.

Finally, we note that even though our pipeline is sparse, it achieves near state of the art performance when compared to a pipeline that uses dense trajectories. When the sparse descriptors obtained from our automatic pipeline were combined with the kernels associated to a dense trajectory representation [34] using an MKL framework with 18 kernels (14 kernels associated with sparse features sampled from predicted saliency maps + 4 kernels associated to dense trajectories from [34]), we were able to go beyond the state-of-the-art (Tables 4f,i and 5). This demonstrates that an end-to-end automatic system incorporating both human and computer vision technology can deliver high performance on a challenging problem such as action recognition in unconstrained video.

TABLE 7: Recognition results using second order pooling [59] on the UCF Sports Actions Data Set

  method                               distribution                      accuracy mean   accuracy stdev
  interest points (a)                  Harris corners                    84.3%           0.00%
  interest points (b)                  uniform sampling                  83.9%           0.57%
  interest points (c)                  central bias sampling             83.7%           0.80%
  interest points (d)                  predicted saliency sampling       86.3%           0.66%
  interest points (e)                  ground truth saliency sampling    86.1%           0.63%
  trajectories [34] only (f)                                             85.4%           0.00%
  trajectories + interest points (h)   predicted saliency sampling       87.5%           0.59%

Performance comparison among several classification methods (see table 4 for description) on the UCF Sports Dataset using second order pooling. Statistics are computed using 10 random seeds.

Action recognition using second order pooling: Our saliency based interest point operators are defined by randomly sampling specific spatio-temporal probability distributions. One way to estimate the variability induced in the recognition performance is to run the pipelines many times for different random seeds. Unfortunately, this is not practical, due to the high cost of training an end-to-end recognition pipeline in the bag-of-visual-words framework. Consequently, the results reported in tables 4 and 5 are obtained for a single random seed.

However, due to the large number of random samples used to train the system (e.g. 28 million interest points for Hollywood-2), we expect little variance in performance for
[29] O. Le Meur, P. Le Callet, and D. Barba, "Predicting visual fixations on video based on low-level visual features," Vision Research, vol. 47, pp. 2483–2498, 2007.
[30] A. Fathi, Y. Li, and J. M. Rehg, "Learning to recognize daily actions using gaze," in European Conference on Computer Vision, 2012.
[31] S. Winkler and R. Subramanian, "Overview of eye tracking datasets," in International Workshop on Quality of Multimedia Experience (QoMEX), 2013.
[32] L. Fei-Fei, A. Iyer, C. Koch, and P. Perona, "What do we perceive in a glance of a real-world scene?" Journal of Vision, 2007.
[33] H. Jhuang, T. Serre, L. Wolf, and T. Poggio, "A biologically inspired system for action recognition," in International Conference on Computer Vision, 2007.
[34] H. Wang, A. Klaser, C. Schmid, and C. Liu, "Action recognition by dense trajectories," in IEEE International Conference on Computer Vision and Pattern Recognition, 2011.
[35] J. Yamato, J. Ohya, and K. Ishii, "Recognizing human action in time-sequential images using hidden Markov model," in IEEE International Conference on Computer Vision and Pattern Recognition, 1992.
[36] W. Li, Z. Zhang, and Z. Liu, "Expandable data-driven graphical modeling of human actions based on salient postures," IEEE TCSVT, vol. 18, 2008.
[37] M. Hoai and F. De la Torre, "Max-margin early event detectors," in IEEE International Conference on Computer Vision and Pattern Recognition, 2012.
[38] M. Dorr, T. Martinetz, K. Gegenfurtner, and E. Barth, "Variability of eye movements when viewing dynamic natural scenes," Journal of Vision, vol. 10, 2010.
[39] M. R. Greene, T. Liu, and J. M. Wolfe, "Reconsidering Yarbus: A failure to predict observers' task from eye movement patterns," Vision Research, vol. 62, 2012.
[40] O. Le Meur and T. Baccino, "Methods for comparing scanpaths and saliency maps: strengths and weaknesses," Behavior Research Methods, 2012.
[41] B. Yao and L. Fei-Fei, "Modeling mutual context of object and human pose in human-object interaction activities," in IEEE International Conference on Computer Vision and Pattern Recognition, 2010.
[42] A. Prest, C. Schmid, and V. Ferrari, "Weakly supervised learning of interactions between humans and objects," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2011.
[43] A. Hwang, H. Wang, and M. Pomplun, "Semantic guidance of eye movements in real-world scenes," Vision Research, 2011.
[44] S. B. Needleman and C. D. Wunsch, "A general method applicable to the search for similarities in the amino acid sequence of two proteins," Journal of Molecular Biology, 1970.
[45] I. Laptev and T. Lindeberg, "Space-time interest points," in International Conference on Computer Vision, 2003.
[46] N. Dalal, B. Triggs, and C. Schmid, "Human detection using oriented histograms of flow and appearance," in European Conference on Computer Vision, 2006.
[47] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in IEEE International Conference on Computer Vision and Pattern Recognition, 2005.
[48] B. Hariharan, L. Zelnik-Manor, S. V. N. Vishwanathan, and M. Varma, "Large scale max-margin multi-label classification with priors," in International Conference on Machine Learning, 2010.
[49] T. Judd, F. Durand, and A. Torralba, "A benchmark of computational models of saliency to predict human fixations," MIT, Tech. Rep. 1, 2012.
[50] A. Oliva and A. Torralba, "Modeling the shape of the scene: A holistic representation of the spatial envelope," International Journal of Computer Vision, vol. 42, 2001.
[51] R. Rosenholtz, "A simple saliency model predicts a number of motion popout phenomena," Vision Research, vol. 39, 1999.
[52] P. Viola and M. Jones, "Robust real-time object detection," International Journal of Computer Vision, 2001.
[53] P. Felzenszwalb, D. McAllester, and D. Ramanan, "A discriminatively trained, multiscale, deformable part model," in IEEE International Conference on Computer Vision and Pattern Recognition, 2008.
[54] D. Sun, S. Roth, and M. J. Black, "Secrets of optical flow estimation and their principles," in IEEE International Conference on Computer Vision and Pattern Recognition, 2010.
[55] P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik, "Occlusion boundary detection and figure/ground assignment from optical flow," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2011.
[56] A. Vedaldi and A. Zisserman, "Efficient additive kernels via explicit feature maps," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2011.
[57] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin, "LIBLINEAR: A library for large linear classification," JMLR, vol. 9, 2008.
[58] E. Simoncelli and W. Freeman, "The steerable pyramid: A flexible architecture for multi-scale derivative computation," in IEEE International Conference on Image Processing, 1995.
[59] J. Carreira, R. Caseiro, J. Batista, and C. Sminchisescu, "Semantic segmentation with second-order pooling," in European Conference on Computer Vision, 2012.

Stefan Mathe has obtained his MSc. degree in Computer Science from the Technical University of Cluj-Napoca, Romania. He is currently a Research Assistant at the Institute of Mathematics of the Romanian Academy and a PhD student in the Artificial Intelligence Laboratory at the University of Toronto. His research interests span the areas of both low level and high level computer vision, machine learning, natural language understanding and computer graphics. His PhD research is focused on the problems of action recognition, with emphasis on developing higher level invariant features and exploiting contextual information.

Cristian Sminchisescu has obtained a doctorate in Computer Science and Applied Mathematics with an emphasis on imaging, vision and robotics at INRIA, France, under an Eiffel excellence doctoral fellowship, and has done postdoctoral research in the Artificial Intelligence Laboratory at the University of Toronto. He is a member of the program committees of the main conferences in computer vision and machine learning (CVPR, ICCV, ECCV, NIPS, AISTATS), an area chair for ICCV 2007-2013, and a member of the Editorial Board (Associate Editor) of IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI). He has given more than 50 invited talks and presentations and has offered tutorials on 3d tracking, recognition and optimization at ICCV and CVPR, the Chicago Machine Learning Summer School, the AERFAI Vision School in Barcelona and the Computer Vision Summer School (VSS) in Zurich. Sminchisescu's research goal is to train computers to see. His research interests are in the area of computer vision (articulated objects, 3d reconstruction, segmentation, and object and action recognition) and machine learning (optimization and sampling algorithms, structured prediction, sparse approximations and kernel methods).