A Hierarchical Deep Temporal Model For Group Activity Recognition
Mostafa S. Ibrahim∗, Srikanth Muralidharan∗, Zhiwei Deng, Arash Vahdat, Greg Mori
School of Computing Science, Simon Fraser University, Burnaby, Canada
{msibrahi, smuralid, zhiweid, avahdat}@sfu.ca, [email protected]
classes of activities. Hence, we develop a novel hierarchical deep temporal model that reasons over individual people. Given a set of detected and tracked people, we run temporal deep networks (LSTMs) to analyze each individual person. These LSTMs are aggregated over the people in a scene into a higher-level deep temporal model. This allows the deep model to learn the relations between the people (and their appearances) that contribute to recognizing a particular group activity.

The main contribution of this paper is the proposal of a novel deep architecture that models group activities in a principled structured temporal framework. Our 2-stage approach models individual person activities in its first stage, and then combines person-level information to represent group activities. The model's temporal representation is based on the long short-term memory (LSTM): recurrent neural networks such as these have recently demonstrated successful results in sequential tasks such as image captioning [9] and speech recognition [10]. Through the model structure, we aim at constructing a representation that leverages the discriminative information in the hierarchical structure between individual person actions and group activities. The model can be used in general group activity applications such as video surveillance, sport analytics, and video search and retrieval.

To cater to the needs of our problem, we also propose a new volleyball dataset that offers person detections together with both person action labels and group activity labels. The camera view of the selected sports videos allows us to track the players in the scene. Experimentally, the model is effective in recognizing the overall team activity based on recognizing and integrating player actions.

This paper is organized as follows. In Section 2, we provide a brief overview of the literature related to activity recognition. In Section 3, we elaborate details of the proposed group activity recognition model. In Section 4, we tabulate the performance of our approach, and end in Section 5 with a conclusion of this work.

2. Related Work

Human activity recognition is an active area of research, with many existing algorithms. Surveys by Weinland et al. [40] and Poppe [26] explore the vast literature in activity recognition. Here, we will focus on the group activity recognition problem and recent related advances in deep learning.

Group Activity Recognition: Group activity recognition has attracted a large body of work recently. Most previous work has used hand-crafted features fed to structured models that represent information between individuals in the space and/or time domains. Lan et al. [23] proposed an adaptive latent structure learning that represents hierarchical relationships ranging from lower person-level information to higher group-level interactions. Lan et al. [22] and Ramanathan et al. [27] explore the idea of social roles, the expected behaviour of an individual person in the context of a group, in fully supervised and weakly supervised frameworks respectively. Choi and Savarese [3] have unified tracking multiple people, recognizing individual actions, interactions, and collective activities in a joint framework. In other work [5], a random forest structure is used to sample discriminative spatio-temporal regions from the input video, which are fed to a 3D Markov random field to localize collective activities in a scene. Shu et al. [30] detect group activities from aerial video using an AND-OR graph formalism. The above-mentioned methods use shallow hand-crafted features, and typically adopt a linear model that suffers from representational limitations.

Sport Video Analysis: Previous work has extended group activity recognition to team activity recognition in sport footage. Seminal work in this vein includes Intille and Bobick [13], who examined stochastic representations of American football plays. Siddiquie et al. [31] proposed sparse multiple kernel learning to select features incorporated in a spatio-temporal pyramid. Morariu et al. [24] track players, infer part locations, and reason about temporal structure in 1-on-1 basketball games. Swears et al. [35] used the Granger Causality statistic to automatically constrain the temporal links of a Dynamic Bayesian Network (DBN) for handball videos. Direkoglu and O'Connor [8] solved a particular Poisson equation to generate a holistic player location representation. Kwak et al. [20] optimize based on a rule-based depiction of interactions between people.

Deep Learning: Deep Convolutional Neural Networks (CNNs) have shown impressive performance by unifying feature and classifier learning, aided by the availability of large labeled datasets. Successes have been demonstrated on a variety of computer vision tasks including image classification [18, 33] and action recognition [32, 16]. More flexible recurrent neural network (RNN) based models are used for handling variable-length space-time inputs. Specifically, LSTM [12] models are popular among RNN models due to the tractable learning framework that they offer when it comes to deep representations. These LSTM models have been applied to a variety of tasks [9, 10, 25, 38]. For instance, in Donahue et al. [9], the so-called Long-term Recurrent Convolutional Network, formed by stacking an LSTM on top of pre-trained CNNs, is proposed for handling sequential tasks such as activity recognition, image description, and video description. In Karpathy et al. [15], structured objectives are used to align CNNs over image regions and bi-directional RNNs over sentences. A deep multimodal RNN architecture is used for generating image descriptions using the deduced alignments.

In this work, we aim at building a hierarchical
structured model that incorporates a deep LSTM framework to recognize individual actions and group activities. Previous work in the area of deep structured learning includes Tompson et al. [37] for pose estimation, and Zheng et al. [42] and Schwing et al. [29] for semantic image segmentation. In Deng et al. [7], a similar framework is used for group activity recognition, where a neural network-based hierarchical graphical model refines person action labels and learns to predict the group activity simultaneously. While these methods use neural network-based graphical representations, in our current approach we leverage LSTM-based temporal modelling to learn discriminative information from time-varying sports activity data. In [41], a new dataset is introduced that contains dense multiple labels per frame for the underlying actions, and a novel Multi-LSTM is used to model the temporal relations between the labels present in the dataset.

Datasets: Popular datasets for activity recognition include the Sports-1M dataset [15], the UCF 101 database [34], and the HMDB movie database [19]. These datasets shifted the focus to unconstrained Internet videos that contain more intra-class variation than constrained datasets. While these datasets continue to focus on individual human actions, in our work we focus on recognizing more complex group activities in sport videos. Choi et al. [4] introduced the Collective Activity Dataset, consisting of real-world pedestrian sequences where the task is to find the high-level group activity. In this paper, we experiment with this dataset, but also introduce a new dataset for group activity recognition in sport footage which is annotated with player pose, location, and group activities to encourage similar research in the sport domain.

3. Proposed Approach

Our goal in this paper is to recognize activities performed by a group of people in a video sequence. The input to our method is a set of tracklets of the people in a scene. The group of people in the scene could range from players in a sports video to pedestrians in a surveillance video. In this paper we consider three cues that can aid in determining what a group of people is doing:

• Person-level actions collectively define a group activity. Person action recognition is a first step toward recognizing group activities.

• Temporal dynamics of a person's action is higher-order information that can serve as a strong signal for group activity. Knowing how each person's action is changing over time can be used to infer the group's activity.

• Temporal evolution of group activity represents how a group's activity is evolving over time. For example, in a volleyball game a team may move from defence phase to pass and then attack.

Many classic approaches to the group activity recognition problem have modeled these elements in a form of structured prediction based on hand-crafted features [39, 28, 23, 22, 27]. Inspired by the success of deep learning based solutions, in this paper a novel hierarchical deep learning based model is proposed that is potentially capable of learning low-level image features, person-level actions, their temporal relations, and temporal group dynamics in a unified end-to-end framework.

Given the sequential nature of group activity analysis, our proposed model is based on a Recurrent Neural Network (RNN) architecture. RNNs consist of non-linear units with internal states that can learn dynamic temporal behavior from a sequential input with arbitrary length. Therefore, they overcome the limitation of CNNs that expect constant-length input. This makes them widely applicable to video analysis tasks such as activity recognition.

Our model is inspired by the success of hierarchical models. Here, we aim to mimic a similar intuition using recurrent networks. We propose a deep model by stacking several layers of RNN-type structures to model a large range of low-level to high-level dynamics defined on top of people and entire groups. We describe the use of these RNN structures for individual and group activity recognition next.

3.1. Temporal Model of Individual Action

Given tracklets of each person in a scene, we use long short-term memory (LSTM) models to represent temporally the action of each individual person. Such temporal information is complementary to spatial features and is critical for performance. LSTMs, originally proposed by Hochreiter and Schmidhuber [12], have been used successfully for many sequential problems in computer vision. Each LSTM unit consists of several cells with memory that stores information for a short temporal interval. The memory content of an LSTM makes it suitable for modeling complex temporal relationships that may span a long range.

The content of the memory cell is regulated by several gating units that control the flow of information into and out of the cells. The control they offer also helps in avoiding spurious gradient updates that can typically happen in training RNNs when the length of a temporal input is large. This property enables us to stack a large number of such layers in order to learn the complex dynamics present in the input at different ranges.

We use a deep Convolutional Neural Network (CNN) to extract features from the bounding box around the person at each time step on a person trajectory. The output of the CNN, represented by x_t, can be considered a complex image-based feature describing the spatial region around a person.
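To make this step concrete, the following is a minimal sketch of extracting a 4096-dimensional fc7-style feature x_t from one person crop using a pre-trained AlexNet from torchvision. This is an illustrative stand-in only, not the Caffe pipeline used in this paper; the preprocessing values and the helper name are our own assumptions.

```python
# Hedged sketch: fc7-like feature x_t for one tracked person crop (torchvision AlexNet).
import torch
import torch.nn as nn
from torchvision import models, transforms

alexnet = models.alexnet(pretrained=True).eval()
# Keep the classifier up to (and including) the ReLU after the second FC layer: a 4096-d "fc7".
fc7_head = nn.Sequential(*list(alexnet.classifier.children())[:6])

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def fc7_feature(person_crop):
    """person_crop: a PIL image cropped to the tracked bounding box at time t."""
    x = preprocess(person_crop).unsqueeze(0)           # (1, 3, 224, 224)
    with torch.no_grad():
        conv = alexnet.avgpool(alexnet.features(x))    # (1, 256, 6, 6)
        return fc7_head(torch.flatten(conv, 1))[0]     # x_t, shape (4096,)
```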
Assuming x_t is the input of an LSTM cell at time t, the cell activation can be formulated as:

i_t = σ(W_{xi} x_t + W_{hi} h_{t−1} + b_i)   (1)
f_t = σ(W_{xf} x_t + W_{hf} h_{t−1} + b_f)   (2)
o_t = σ(W_{xo} x_t + W_{ho} h_{t−1} + b_o)   (3)
g_t = φ(W_{xc} x_t + W_{hc} h_{t−1} + b_c)   (4)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ g_t   (5)
h_t = o_t ⊙ φ(c_t)   (6)

Here, σ stands for the sigmoid function and φ stands for the tanh function. x_t is the input, h_t ∈ R^N is the hidden state with N hidden units, c_t ∈ R^N is the memory cell, and i_t ∈ R^N, f_t ∈ R^N, o_t ∈ R^N, and g_t ∈ R^N are the input gate, forget gate, output gate, and input modulation gate at time t, respectively. ⊙ represents element-wise multiplication.

When modeling individual actions, the hidden state h_t could be used to model the action a person is performing at time t. Note that the cell output evolves over time based on the past memory content. Due to the deployment of gates on the information flow, the hidden state is formed based on a short-range memory of the person's past behaviour. Therefore, we can simply pass the output of the LSTM cell at each time step to a softmax classification layer¹ to predict the individual person-level action for each tracklet.

¹ More precisely, a fully connected layer fed to a softmax loss layer.
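For readers who prefer code, the following NumPy sketch spells out one LSTM step exactly as written in Eqs. (1)–(6), together with the softmax read-out used for person-level action prediction. The stacked-gate weight layout and all variable names are our own illustrative choices, not the authors' implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_x, W_h, b):
    """One LSTM step following Eqs. (1)-(6).

    x_t: CNN feature for the person crop at time t (e.g. fc7, 4096-d).
    h_prev, c_prev: hidden state h_{t-1} and memory cell c_{t-1}, both N-d.
    W_x: (4N, D) input weights, W_h: (4N, N) recurrent weights, b: (4N,),
    with the four gate blocks stacked in the order i, f, o, g.
    """
    N = h_prev.shape[0]
    z = W_x @ x_t + W_h @ h_prev + b
    i = sigmoid(z[0 * N:1 * N])      # input gate,            Eq. (1)
    f = sigmoid(z[1 * N:2 * N])      # forget gate,           Eq. (2)
    o = sigmoid(z[2 * N:3 * N])      # output gate,           Eq. (3)
    g = np.tanh(z[3 * N:4 * N])      # input modulation gate, Eq. (4)
    c_t = f * c_prev + i * g         # memory update,         Eq. (5)
    h_t = o * np.tanh(c_t)           # hidden state,          Eq. (6)
    return h_t, c_t

def action_probs(h_t, W_cls, b_cls):
    """Fully connected layer + softmax over person-level action classes."""
    s = W_cls @ h_t + b_cls
    e = np.exp(s - s.max())
    return e / e.sum()
```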
The LSTM layer on top of person trajectories forms the first stage of our hierarchical model. This stage is designed to model person-level actions and their temporal evolution. Our training proceeds in a stage-wise fashion, first training to predict person-level actions, and then passing the hidden states of the LSTM layer to the second stage for group activity recognition, as discussed in the next section.

3.2. Hierarchical Model for Group Activity Recognition

At each time step, the memory content of the first LSTM layer contains discriminative information describing the subject's action as well as past changes in that action. If the memory content is correctly collected over all people in the scene, it can be used to describe the group activity in the whole scene.

Moreover, it can also be observed that direct image-based features extracted from the spatial domain around a person carry a discriminative signal for the ongoing activity. Therefore, a deep CNN model is used to extract complex features for each person, in addition to the temporal features captured by the first LSTM layer.

At this point, the concatenation of the CNN features and the LSTM hidden state represents the temporal features for a person. Various pooling strategies can be used to aggregate these features over all people in the scene at each time step. The output of the pooling layer forms our representation for the group activity. The second LSTM network, working on top of this temporal representation, is used to directly model the temporal dynamics of the group activity. The LSTM layer of the second network is directly connected to a classification layer in order to detect group activity classes in a video sequence.

Mathematically, the pooling layer can be expressed as follows:

P_{tk} = x_{tk} ⊕ h_{tk}   (7)
Z_t = P_{t1} ⋄ P_{t2} ⋄ ... ⋄ P_{tk}   (8)

In these equations, h_{tk} corresponds to the first-stage LSTM output and x_{tk} corresponds to the AlexNet fc7 feature, both obtained for the kth person at time t. We concatenate these two features (represented by ⊕) to obtain the temporal feature representation P_{tk} for the kth person. We then construct the frame-level feature representation Z_t at time t by applying a max pooling operation (represented by ⋄) over the features of all the people. Finally, we feed the frame-level representation to our second LSTM stage, which operates similarly to the person-level LSTMs described in the previous subsection and learns the group-level dynamics. Z_t, passed through a fully connected layer, is given to the input of the second-stage LSTM layer. The hidden state of that LSTM layer, represented by h_t^group, carries temporal information for the whole group dynamics. h_t^group is fed to a softmax classification layer to predict group activities.
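A short sketch of the pooling step in Eqs. (7)–(8), assuming the per-person fc7 features and first-stage LSTM states for one frame are stored as arrays; the function and its argument names are illustrative only.

```python
import numpy as np

def frame_representation(x_t, h_t):
    """Pooling layer of Eqs. (7)-(8) for one time step t.

    x_t: (K, 4096) AlexNet fc7 features of the K people in the scene.
    h_t: (K, N) first-stage LSTM hidden states of the same K people.
    """
    P_t = np.concatenate([x_t, h_t], axis=1)   # P_tk = x_tk ⊕ h_tk,           Eq. (7)
    Z_t = P_t.max(axis=0)                      # element-wise max over people,  Eq. (8)
    return Z_t                                 # (4096 + N,) frame-level feature
```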
3.3. Implementation Details

We trained our model in two steps. In the first step, the person-level CNN and the first LSTM layer are trained in an end-to-end fashion using a set of training data consisting of person tracklets annotated with action labels. We implement our model using Caffe [14]. Similar to other approaches [9, 7, 38], we initialize our CNN model with the pre-trained AlexNet network and we fine-tune the whole network for the first LSTM layer. 9 timesteps and 3000 hidden nodes are used for the first LSTM layer, and a softmax layer is deployed as the classification layer in this stage.

After training the first LSTM layer, we concatenate the fc7 layer of AlexNet and the LSTM layer for every person and pool over all people in a scene. The pooled features, which correspond to frame-level features, are fed to the second LSTM network. This network consists of a 3000-node fully connected layer followed by a 9-timestep 500-node LSTM layer, which is passed to a softmax layer trained to recognize group activity labels.

For training all our models (including both the baseline models and both stages of the two-stage model), we follow the same training protocol. We use a fixed learning rate of 0.00001 and a momentum of 0.9.
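To make the reported dimensions concrete, here is a hedged PyTorch re-sketch of the two-stage pipeline (fc7 = 4096-d, 3000-unit person LSTM, pooling, 3000-node fully connected layer, 500-unit group LSTM over 9 timesteps). It is a reconstruction for illustration only; the actual implementation is the stage-wise Caffe setup described above, and the single forward pass shown here glosses over the two-step training.

```python
import torch
import torch.nn as nn

class TwoStageModel(nn.Module):
    """Illustrative reconstruction of the two-stage model (not the Caffe original)."""

    def __init__(self, n_actions, n_activities, fc7_dim=4096):
        super().__init__()
        self.person_lstm = nn.LSTM(fc7_dim, 3000, batch_first=True)   # stage 1: 3000 hidden units
        self.action_cls = nn.Linear(3000, n_actions)                  # person-level softmax head
        self.frame_fc = nn.Linear(fc7_dim + 3000, 3000)               # 3000-node FC after pooling
        self.group_lstm = nn.LSTM(3000, 500, batch_first=True)        # stage 2: 500 hidden units
        self.activity_cls = nn.Linear(500, n_activities)              # group-level softmax head

    def forward(self, fc7):
        # fc7: (K, T, 4096) AlexNet features for K tracked players over T (= 9) timesteps.
        h, _ = self.person_lstm(fc7)                    # (K, T, 3000) person-level temporal features
        action_scores = self.action_cls(h)              # per-person, per-timestep action scores
        P = torch.cat([fc7, h], dim=2)                  # Eq. (7): concatenate fc7 and LSTM state
        Z = P.max(dim=0).values                         # Eq. (8): max-pool over people -> (T, 7096)
        g, _ = self.group_lstm(self.frame_fc(Z).unsqueeze(0))
        activity_scores = self.activity_cls(g[:, -1])   # classify from the last group-LSTM state
        return action_scores, activity_scores
```

In the paper, the two stages are trained separately (the person CNN and LSTM first, then the group network on the pooled features) with the fixed learning rate and momentum given above, rather than jointly in a single pass as this sketch might suggest.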
Figure 2: Our two-stage model for a volleyball match. Given tracklets of K players, we feed each tracklet into a CNN, followed by a person LSTM layer to represent each player's action. We then pool over all people's temporal features in the scene. The output of the pooling layer is fed to the second LSTM network to identify the whole team's activity.
For tracking subjects in a scene, we used the tracker by Danelljan et al. [6], implemented in the Dlib library [17].
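A minimal sketch of producing one person tracklet with Dlib's correlation_tracker (Dlib's implementation of the Danelljan et al. tracker [6]); the frame list and the initial bounding box are placeholders, not the exact tracking script used here.

```python
import dlib

def track_person(frames, init_box):
    """frames: list of RGB images (numpy arrays); init_box: (left, top, right, bottom)."""
    tracker = dlib.correlation_tracker()
    tracker.start_track(frames[0], dlib.rectangle(*init_box))
    tracklet = [init_box]
    for frame in frames[1:]:
        tracker.update(frame)
        pos = tracker.get_position()
        tracklet.append((int(pos.left()), int(pos.top()),
                         int(pos.right()), int(pos.bottom())))
    return tracklet   # one bounding box per frame; each crop is then fed to the CNN
```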
4. Experiments

In this section, we evaluate our model by comparing our results with several baselines and previously published works on the Collective Activity Dataset [4] and our new volleyball dataset. First, we describe our baseline models. Then, we present our results on the Collective Activity Dataset, followed by experiments on the volleyball dataset.

4.1. Baselines

The following baselines are considered in all our experiments:

1. Image Classification: This baseline is the basic AlexNet model fine-tuned for group activity recognition in a single frame.

2. Person Classification: In this baseline, the AlexNet CNN model is deployed on each person, fc7 features are pooled over all people and fed to a softmax classifier to recognize group activities in each single frame.

3. Fine-tuned Person Classification: This baseline is similar to the previous baseline with one distinction. The AlexNet model on each player is fine-tuned to recognize person-level actions. Then, fc7 features are pooled over all players to recognize group activities in a scene without any further fine-tuning of the AlexNet model. The rationale behind this baseline is to examine a scenario where person-level action annotations as well as group activity annotations are used in a deep learning model that does not model the temporal aspect of group activities. This is very similar to our two-stage model without the temporal modeling.

4. Temporal Model with Image Features: This baseline is a temporal extension of the first baseline. It examines the idea of feeding image-level features directly to an LSTM model to recognize group activities. In this baseline, the AlexNet model is deployed on the whole image and the resulting fc7 features are fed to an LSTM model. This baseline can be considered a reimplementation of Donahue et al. [9].

5. Temporal Model with Person Features: This baseline is a temporal extension of the second baseline: fc7 features pooled over all people are fed to an LSTM model to recognize group activities.

6. Two-stage Model without LSTM 1: This baseline is a variant of our model, omitting the person-level temporal model (LSTM 1). Instead, the person-level classification is done only with the fine-tuned person CNN.

7. Two-stage Model without LSTM 2: This baseline is a variant of our model, omitting the group-level temporal model (LSTM 2). In other words, we do the final classification based on the outputs of the temporal models for individual person action labels, but without an additional group-level LSTM.

4.2. Experiments on the Collective Activity Dataset

The Collective Activity Dataset [4] has been widely used for evaluating group activity recognition approaches in the computer vision literature [1, 7, 2].
This dataset consists of 44 videos, eight person-level pose labels (not used in our work), five person-level action labels, and five group-level activities. A scene is assigned a group activity label based on the majority of what people are doing. We follow the train/test split provided by [11]. In this section, we present our results on this dataset.

Method   Accuracy
B1-Image Classification   63.0
B2-Person Classification   61.8
B3-Fine-tuned Person Classification   66.3
B4-Temporal Model with Image Features   64.2
B5-Temporal Model with Person Features   62.2
B6-Two-stage Model without LSTM 1   70.1
B7-Two-stage Model without LSTM 2   76.8
Two-stage Hierarchical Model   81.5

Table 1: Comparison of our method with baseline methods on the Collective Activity Dataset.

Method   Accuracy
Contextual Model [23]   79.1
Deep Structured Model [7]   80.6
Our Two-stage Hierarchical Model   81.5
Cardinality kernel [11]   83.4

... of actions in the scene), which is exactly the way group activities are defined in this dataset.

4.2.1 Discussion

The confusion matrix obtained for the Collective Activity Dataset using our two-stage model is shown in Figure 3. We observe that the model performs almost perfectly for the talking and queuing classes, and gets confused between crossing, waiting, and walking. Such behaviour is perhaps due to a lack of consideration of the spatial relations between people in the group, which has been shown to boost the performance of previous group activity recognition methods: e.g. crossing involves the walking action, but is confined to a path that people follow in an orderly fashion. Therefore, our model, which is designed only to learn the dynamic properties of group activities, often gets confused with the walking action.

It is clear that our two-stage model improves performance compared to the baselines. The temporal information improves performance. Further, finding and describing the elements of a video (i.e. persons) provides benefits over utilizing frame-level features.
Figure 4: Visualizations of the generated scene labels using our model. Green denotes correct classifications, red denotes
incorrect. The incorrect ones correspond to the confusion between different actions in ambiguous cases (h and j examples),
or in the left and right distinction (i example).
From the tables, we observe that the group activity labels are relatively more balanced compared to the player action labels. This follows from the fact that we often have people present in static actions like standing, compared to dynamic actions (setting, spiking, etc.). Therefore, our dataset presents a challenging team activity recognition task, where the interesting actions that can directly determine the group activity occur rarely. The dataset will be made publicly available to facilitate future comparisons².

In Table 5, the classification performance of our proposed model is compared against the baselines. Similar to the performance on the Collective Activity Dataset, our two-stage LSTM model outperforms the baseline models.

² https://github.com/mostafa-saad/deep-activity-rec
Group Activity Class   No. of Instances
Right set   229
Right spike   187
Right pass   267
Left pass   304
Left spike   246
Left set   223

Action Classes   Average No. of Instances per Frame
Waiting   0.30
Setting   0.33
Digging   0.57
Falling   0.21
Spiking   0.28
Blocking   0.58
Others   9.22

In both datasets, an observation from the tables is that while both LSTMs contribute to the overall classification performance, having the first-layer LSTM (B7 baseline) is relatively more critical to the performance of the system, compared to the second-layer LSTM (B6 baseline).

All the reported experiments use max-pooling as mentioned above. However, we also tried both sum and average pooling, but their performance was consistently lower compared to their max-pooling counterpart.

Acknowledgements

This work was supported by grants from NSERC and Disney Research.
References

[1] M. R. Amer, P. Lei, and S. Todorovic. Hirf: Hierarchical random field for collective activity recognition in videos. In Computer Vision–ECCV 2014, pages 572–585. Springer, 2014.
[2] M. R. Amer, D. Xie, M. Zhao, S. Todorovic, and S.-C. Zhu. Cost-sensitive top-down/bottom-up inference for multiscale activity recognition. In Computer Vision–ECCV 2012, pages 187–200. Springer, 2012.
[3] W. Choi and S. Savarese. A unified framework for multi-target tracking and collective activity recognition. In Computer Vision–ECCV 2012, pages 215–230. Springer, 2012.
[4] W. Choi, K. Shahid, and S. Savarese. What are they doing?: Collective activity classification using spatio-temporal relationship among people. In Computer Vision Workshops (ICCV Workshops), 2009 IEEE 12th International Conference on, pages 1282–1289. IEEE, 2009.
[5] W. Choi, K. Shahid, and S. Savarese. Learning context for collective activity recognition. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 3273–3280. IEEE, 2011.
[6] M. Danelljan, G. Häger, F. Shahbaz Khan, and M. Felsberg. Accurate scale estimation for robust visual tracking. In British Machine Vision Conference (BMVC), 2014.
[7] Z. Deng, M. Zhai, L. Chen, Y. Liu, S. Muralidharan, M. Roshtkhari, and G. Mori. Deep structured models for group activity recognition. In British Machine Vision Conference (BMVC), 2015.
[8] C. Direkoglu and N. E. O'Connor. Team activity recognition in sports. In Computer Vision–ECCV 2012, pages 69–83. Springer, 2012.
[9] J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. arXiv preprint arXiv:1411.4389, 2014.
[10] A. Graves and N. Jaitly. Towards end-to-end speech recognition with recurrent neural networks. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1764–1772, 2014.
[11] H. Hajimirsadeghi, W. Yan, A. Vahdat, and G. Mori. Visual recognition by counting instances: A multi-instance cardinality potential kernel. CVPR, 2015.
[12] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
[13] S. S. Intille and A. Bobick. Recognizing planned, multi-person action. Computer Vision and Image Understanding (CVIU), 81:414–445, 2001.
[14] Y. Jia. Caffe: An open source convolutional architecture for fast feature embedding, 2013. http://caffe.berkeleyvision.org/.
[15] A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. CVPR, 2015.
[16] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 1725–1732. IEEE, 2014.
[17] D. E. King. Dlib-ml: A machine learning toolkit. Journal of Machine Learning Research, 10:1755–1758, 2009.
[18] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
[19] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. Hmdb: a large video database for human motion recognition. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 2556–2563. IEEE, 2011.
[20] S. Kwak, B. Han, and J. H. Han. Multi-agent event detection: Localization and role assignment. In CVPR, 2013.
[21] T. Lan, L. Sigal, and G. Mori. Social roles in hierarchical models for human activity recognition. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 1354–1361. IEEE, 2012.
[22] T. Lan, L. Sigal, and G. Mori. Social roles in hierarchical models for human activity recognition. In Computer Vision and Pattern Recognition (CVPR), 2012.
[23] T. Lan, Y. Wang, W. Yang, S. Robinovitch, and G. Mori. Discriminative latent models for recognizing contextual group activities. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(8):1549–1562, 2012.
[24] V. I. Morariu and L. S. Davis. Multi-agent event recognition in structured scenarios. In Computer Vision and Pattern Recognition (CVPR), 2011.
[25] J. Y.-H. Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici. Beyond short snippets: Deep networks for video classification. CVPR, 2015.
[26] R. Poppe. A survey on vision-based human action recognition. Image and Vision Computing, 28(6):976–990, 2010.
[27] V. Ramanathan, B. Yao, and L. Fei-Fei. Social role discovery in human events. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 2475–2482. IEEE, 2013.
[28] C. Schüldt, I. Laptev, and B. Caputo. Recognizing human actions: a local svm approach. In Pattern Recognition, 2004. ICPR 2004. Proceedings of the 17th International Conference on, volume 3, pages 32–36. IEEE, 2004.
[29] A. G. Schwing and R. Urtasun. Fully connected deep structured networks. arXiv preprint arXiv:1503.02351, 2015.
[30] T. Shu, D. Xie, B. Rothrock, S. Todorovic, and S.-C. Zhu. Joint inference of groups, events and human roles in aerial videos. In CVPR, 2015.
[31] B. Siddiquie, Y. Yacoob, and L. Davis. Recognizing plays in american football videos. Technical report, University of Maryland, 2009.
[32] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems, pages 568–576, 2014.
[33] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[34] K. Soomro, A. R. Zamir, and M. Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
[35] E. Swears, A. Hoogs, Q. Ji, and K. Boyer. Complex activity recognition using granger constrained dbn (gcdbn) in sports and surveillance video. In Computer Vision and Pattern Recognition (CVPR), June 2014.
[36] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.
[37] J. J. Tompson, A. Jain, Y. LeCun, and C. Bregler. Joint training of a convolutional network and a graphical model for human pose estimation. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 1799–1807. Curran Associates, Inc., 2014.
[38] S. Venugopalan, H. Xu, J. Donahue, M. Rohrbach, R. Mooney, and K. Saenko. Translating videos to natural language using deep recurrent neural networks. arXiv preprint arXiv:1412.4729, 2014.
[39] H. Wang, A. Kläser, C. Schmid, and C.-L. Liu. Action recognition by dense trajectories. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 3169–3176. IEEE, 2011.
[40] D. Weinland, R. Ronfard, and E. Boyer. A survey of vision-based methods for action representation, segmentation and recognition. Computer Vision and Image Understanding, 115(2):224–241, 2011.
[41] S. Yeung, O. Russakovsky, N. Jin, M. Andriluka, G. Mori, and L. Fei-Fei. Every moment counts: Dense detailed labeling of actions in complex videos. arXiv preprint arXiv:1507.05738, 2015.
[42] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. H. S. Torr. Conditional random fields as recurrent neural networks. In International Conference on Computer Vision (ICCV), 2015.