A Hierarchical Deep Temporal Model For Group Activity Recognition
Mostafa S. Ibrahim∗, Srikanth Muralidharan∗, Zhiwei Deng, Arash Vahdat, Greg Mori
School of Computing Science, Simon Fraser University, Burnaby, Canada
{msibrahi, smuralid, zhiweid, avahdat}@sfu.ca, [email protected]
classes of activities. Hence, we develop a novel hierarchical deep temporal model that reasons over individual people. Given a set of detected and tracked people, we run temporal deep networks (LSTMs) to analyze each individual person. These LSTMs are aggregated over the people in a scene into a higher-level deep temporal model. This allows the deep model to learn the relations between the people (and their appearances) that contribute to recognizing a particular group activity.

The main contribution of this paper is the proposal of a novel deep architecture that models group activities in a principled structured temporal framework. Our 2-stage approach models individual person activities in its first stage, and then combines person-level information to represent group activities. The model's temporal representation is based on the long short-term memory (LSTM): recurrent neural networks such as these have recently demonstrated successful results in sequential tasks such as image captioning [9] and speech recognition [10]. Through the model structure, we aim at constructing a representation that leverages the discriminative information in the hierarchical structure between individual person actions and group activities. The model can be used in general group activity applications such as video surveillance, sport analytics, and video search and retrieval.

To cater to the needs of our problem, we also propose a new volleyball dataset that offers person detections together with both person action labels and group activity labels. The camera view of the selected sports videos allows us to track the players in the scene. Experimentally, the model is effective in recognizing the overall team activity based on recognizing and integrating player actions.

This paper is organized as follows. In Section 2, we provide a brief overview of the literature related to activity recognition. In Section 3, we elaborate details of the proposed group activity recognition model. In Section 4, we tabulate the performance of our approach, and end in Section 5 with a conclusion of this work.

2. Related Work

Human activity recognition is an active area of research, with many existing algorithms. Surveys by Weinland et al. [40] and Poppe [26] explore the vast literature in activity recognition. Here, we will focus on the group activity recognition problem and recent related advances in deep learning.

Group Activity Recognition: Group activity recognition has attracted a large body of work recently. Most previous work has used hand-crafted features fed to structured models that represent information between individuals in the space and/or time domains. Lan et al. [23] proposed an adaptive latent structure learning that represents hierarchical relationships ranging from lower person-level information to higher group-level interactions. Lan et al. [22] and Ramanathan et al. [27] explore the idea of social roles, the expected behaviour of an individual person in the context of a group, in fully supervised and weakly supervised frameworks respectively. Choi and Savarese [3] have unified tracking multiple people, recognizing individual actions, interactions, and collective activities in a joint framework. In other work [5], a random forest structure is used to sample discriminative spatio-temporal regions from the input video, which are fed to a 3D Markov random field to localize collective activities in a scene. Shu et al. [30] detect group activities from aerial video using an AND-OR graph formalism. The above-mentioned methods use shallow hand-crafted features, and typically adopt a linear model that suffers from representational limitations.

Sport Video Analysis: Previous work has extended group activity recognition to team activity recognition in sport footage. Seminal work in this vein includes Intille and Bobick [13], who examined stochastic representations of American football plays. Siddiquie et al. [31] proposed sparse multiple kernel learning to select features incorporated in a spatio-temporal pyramid. Morariu et al. [24] track players, infer part locations, and reason about temporal structure in 1-on-1 basketball games. Swears et al. [35] used the Granger Causality statistic to automatically constrain the temporal links of a Dynamic Bayesian Network (DBN) for handball videos. Direkoglu and O'Connor [8] solved a particular Poisson equation to generate a holistic player location representation. Kwak et al. [20] optimize based on a rule-based depiction of interactions between people.

Deep Learning: Deep Convolutional Neural Networks (CNNs) have shown impressive performance by unifying feature and classifier learning, aided by the availability of large labeled datasets. Successes have been demonstrated on a variety of computer vision tasks including image classification [18, 33] and action recognition [32, 16]. More flexible recurrent neural network (RNN) based models are used for handling variable-length space-time inputs. Specifically, LSTM [12] models are popular among RNN models due to the tractable learning framework that they offer when it comes to deep representations. These LSTM models have been applied to a variety of tasks [9, 10, 25, 38]. For instance, in Donahue et al. [9], the so-called Long-term Recurrent Convolutional Network, formed by stacking an LSTM on top of pre-trained CNNs, is proposed for handling sequential tasks such as activity recognition, image description, and video description. In Karpathy et al. [15], structured objectives are used to align CNNs over image regions and bi-directional RNNs over sentences. A deep multimodal RNN architecture is used for generating image descriptions using the deduced alignments.

In this work, we aim at building a hierarchical
structured model that incorporates a deep LSTM framework to recognize individual actions and group activities. Previous work in the area of deep structured learning includes Tompson et al. [37] for pose estimation, and Zheng et al. [42] and Schwing et al. [29] for semantic image segmentation. In Deng et al. [7], a similar framework is used for group activity recognition, where a neural network-based hierarchical graphical model refines person action labels and learns to predict the group activity simultaneously. While these methods use neural network-based graphical representations, in our current approach we leverage LSTM-based temporal modelling to learn discriminative information from time-varying sports activity data. In [41], a new dataset is introduced that contains dense multiple labels per frame for the underlying actions, and a novel Multi-LSTM is used to model the temporal relations between the labels present in the dataset.

Datasets: Popular datasets for activity recognition include the Sports-1M dataset [15], the UCF 101 database [34], and the HMDB movie database [19]. These datasets shifted the focus to unconstrained Internet videos that contain more intra-class variation than constrained datasets. While these datasets continue to focus on individual human actions, in our work we focus on recognizing more complex group activities in sport videos. Choi et al. [4] introduced the Collective Activity Dataset, consisting of real-world pedestrian sequences where the task is to find the high-level group activity. In this paper, we experiment with this dataset, but also introduce a new dataset for group activity recognition in sport footage which is annotated with player pose, location, and group activities to encourage similar research in the sport domain.

3. Proposed Approach

Our goal in this paper is to recognize activities performed by a group of people in a video sequence. The input to our method is a set of tracklets of the people in a scene. The group of people in the scene could range from players in a sports video to pedestrians in a surveillance video. In this paper we consider three cues that can aid in determining what a group of people is doing:

• Person-level actions collectively define a group activity. Person action recognition is a first step toward recognizing group activities.

• Temporal dynamics of a person's action is higher-order information that can serve as a strong signal for group activity. Knowing how each person's action is changing over time can be used to infer the group's activity.

• Temporal evolution of group activity represents how a group's activity is evolving over time. For example, in a volleyball game a team may move from defence phase to pass and then attack.

Many classic approaches to the group activity recognition problem have modeled these elements in a form of structured prediction based on hand-crafted features [39, 28, 23, 22, 27]. Inspired by the success of deep learning based solutions, in this paper a novel hierarchical deep learning based model is proposed that is potentially capable of learning low-level image features, person-level actions, their temporal relations, and temporal group dynamics in a unified end-to-end framework.

Given the sequential nature of group activity analysis, our proposed model is based on a Recurrent Neural Network (RNN) architecture. RNNs consist of non-linear units with internal states that can learn dynamic temporal behavior from a sequential input with arbitrary length. Therefore, they overcome the limitation of CNNs that expect constant-length input. This makes them widely applicable to video analysis tasks such as activity recognition.

Our model is inspired by the success of hierarchical models. Here, we aim to mimic a similar intuition using recurrent networks. We propose a deep model by stacking several layers of RNN-type structures to model a large range of low-level to high-level dynamics defined on top of people and entire groups. We describe the use of these RNN structures for individual and group activity recognition next.

3.1. Temporal Model of Individual Action

Given tracklets of each person in a scene, we use long short-term memory (LSTM) models to represent temporally the action of each individual person. Such temporal information is complementary to spatial features and is critical for performance. LSTMs, originally proposed by Hochreiter and Schmidhuber [12], have been used successfully for many sequential problems in computer vision. Each LSTM unit consists of several cells with memory that stores information for a short temporal interval. The memory content of an LSTM makes it suitable for modeling complex temporal relationships that may span a long range.

The content of the memory cell is regulated by several gating units that control the flow of information into and out of the cells. The control they offer also helps in avoiding spurious gradient updates that can typically happen in training RNNs when the length of a temporal input is large. This property enables us to stack a large number of such layers in order to learn the complex dynamics present in the input at different ranges.

We use a deep Convolutional Neural Network (CNN) to extract features from the bounding box around the person at each time step on a person trajectory. The output of the CNN, represented by x_t, can be considered a complex image-based feature describing the spatial region around a person.
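To make this step concrete, the following is a minimal sketch of extracting a 4096-dimensional fc7-style feature x_t from one person crop using a pre-trained AlexNet from torchvision. This is an illustrative stand-in only, not the Caffe pipeline used in this paper; the preprocessing values and the helper name are our own assumptions.

```python
# Hedged sketch: fc7-like feature x_t for one tracked person crop (torchvision AlexNet).
import torch
import torch.nn as nn
from torchvision import models, transforms

alexnet = models.alexnet(pretrained=True).eval()
# Keep the classifier up to (and including) the ReLU after the second FC layer: a 4096-d "fc7".
fc7_head = nn.Sequential(*list(alexnet.classifier.children())[:6])

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def fc7_feature(person_crop):
    """person_crop: a PIL image cropped to the tracked bounding box at time t."""
    x = preprocess(person_crop).unsqueeze(0)           # (1, 3, 224, 224)
    with torch.no_grad():
        conv = alexnet.avgpool(alexnet.features(x))    # (1, 256, 6, 6)
        return fc7_head(torch.flatten(conv, 1))[0]     # x_t, shape (4096,)
```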
Assuming x_t is the input of an LSTM cell at time t, the cell activation can be formulated as:

i_t = σ(W_{xi} x_t + W_{hi} h_{t−1} + b_i)   (1)
f_t = σ(W_{xf} x_t + W_{hf} h_{t−1} + b_f)   (2)
o_t = σ(W_{xo} x_t + W_{ho} h_{t−1} + b_o)   (3)
g_t = φ(W_{xc} x_t + W_{hc} h_{t−1} + b_c)   (4)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ g_t   (5)
h_t = o_t ⊙ φ(c_t)   (6)

Here, σ stands for the sigmoid function and φ stands for the tanh function. x_t is the input, h_t ∈ R^N is the hidden state with N hidden units, c_t ∈ R^N is the memory cell, and i_t ∈ R^N, f_t ∈ R^N, o_t ∈ R^N, and g_t ∈ R^N are the input gate, forget gate, output gate, and input modulation gate at time t, respectively. ⊙ represents element-wise multiplication.

When modeling individual actions, the hidden state h_t could be used to model the action a person is performing at time t. Note that the cell output evolves over time based on the past memory content. Due to the deployment of gates on the information flow, the hidden state is formed based on a short-range memory of the person's past behaviour. Therefore, we can simply pass the output of the LSTM cell at each time step to a softmax classification layer¹ to predict the individual person-level action for each tracklet.

¹ More precisely, a fully connected layer fed to a softmax loss layer.
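For readers who prefer code, the following NumPy sketch spells out one LSTM step exactly as written in Eqs. (1)–(6), together with the softmax read-out used for person-level action prediction. The stacked-gate weight layout and all variable names are our own illustrative choices, not the authors' implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_x, W_h, b):
    """One LSTM step following Eqs. (1)-(6).

    x_t: CNN feature for the person crop at time t (e.g. fc7, 4096-d).
    h_prev, c_prev: hidden state h_{t-1} and memory cell c_{t-1}, both N-d.
    W_x: (4N, D) input weights, W_h: (4N, N) recurrent weights, b: (4N,),
    with the four gate blocks stacked in the order i, f, o, g.
    """
    N = h_prev.shape[0]
    z = W_x @ x_t + W_h @ h_prev + b
    i = sigmoid(z[0 * N:1 * N])      # input gate,            Eq. (1)
    f = sigmoid(z[1 * N:2 * N])      # forget gate,           Eq. (2)
    o = sigmoid(z[2 * N:3 * N])      # output gate,           Eq. (3)
    g = np.tanh(z[3 * N:4 * N])      # input modulation gate, Eq. (4)
    c_t = f * c_prev + i * g         # memory update,         Eq. (5)
    h_t = o * np.tanh(c_t)           # hidden state,          Eq. (6)
    return h_t, c_t

def action_probs(h_t, W_cls, b_cls):
    """Fully connected layer + softmax over person-level action classes."""
    s = W_cls @ h_t + b_cls
    e = np.exp(s - s.max())
    return e / e.sum()
```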
The LSTM layer on top of person trajectories forms the first stage of our hierarchical model. This stage is designed to model person-level actions and their temporal evolution. Our training proceeds in a stage-wise fashion, first training to predict person-level actions, and then passing the hidden states of the LSTM layer to the second stage for group activity recognition, as discussed in the next section.

3.2. Hierarchical Model for Group Activity Recognition

At each time step, the memory content of the first LSTM layer contains discriminative information describing the subject's action as well as past changes in that action. If the memory content is correctly collected over all people in the scene, it can be used to describe the group activity in the whole scene.

Moreover, it can also be observed that direct image-based features extracted from the spatial domain around a person carry a discriminative signal for the ongoing activity. Therefore, a deep CNN model is used to extract complex features for each person, in addition to the temporal features captured by the first LSTM layer.

At this point, the concatenation of the CNN features and the LSTM hidden state represents the temporal features for a person. Various pooling strategies can be used to aggregate these features over all people in the scene at each time step. The output of the pooling layer forms our representation for the group activity. The second LSTM network, working on top of this temporal representation, is used to directly model the temporal dynamics of the group activity. The LSTM layer of the second network is directly connected to a classification layer in order to detect group activity classes in a video sequence.

Mathematically, the pooling layer can be expressed as follows:

P_{tk} = x_{tk} ⊕ h_{tk}   (7)
Z_t = P_{t1} ⋄ P_{t2} ⋄ ... ⋄ P_{tk}   (8)

In these equations, h_{tk} corresponds to the first-stage LSTM output and x_{tk} corresponds to the AlexNet fc7 feature, both obtained for the kth person at time t. We concatenate these two features (represented by ⊕) to obtain the temporal feature representation P_{tk} for the kth person. We then construct the frame-level feature representation Z_t at time t by applying a max pooling operation (represented by ⋄) over the features of all the people. Finally, we feed the frame-level representation to our second LSTM stage, which operates similarly to the person-level LSTMs described in the previous subsection and learns the group-level dynamics. Z_t, passed through a fully connected layer, is given to the input of the second-stage LSTM layer. The hidden state of that LSTM layer, represented by h_t^group, carries temporal information for the whole group dynamics. h_t^group is fed to a softmax classification layer to predict group activities.
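A short sketch of the pooling step in Eqs. (7)–(8), assuming the per-person fc7 features and first-stage LSTM states for one frame are stored as arrays; the function and its argument names are illustrative only.

```python
import numpy as np

def frame_representation(x_t, h_t):
    """Pooling layer of Eqs. (7)-(8) for one time step t.

    x_t: (K, 4096) AlexNet fc7 features of the K people in the scene.
    h_t: (K, N) first-stage LSTM hidden states of the same K people.
    """
    P_t = np.concatenate([x_t, h_t], axis=1)   # P_tk = x_tk ⊕ h_tk,           Eq. (7)
    Z_t = P_t.max(axis=0)                      # element-wise max over people,  Eq. (8)
    return Z_t                                 # (4096 + N,) frame-level feature
```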
3.3. Implementation Details

We trained our model in two steps. In the first step, the person-level CNN and the first LSTM layer are trained in an end-to-end fashion using a set of training data consisting of person tracklets annotated with action labels. We implement our model using Caffe [14]. Similar to other approaches [9, 7, 38], we initialize our CNN model with the pre-trained AlexNet network and we fine-tune the whole network for the first LSTM layer. 9 timesteps and 3000 hidden nodes are used for the first LSTM layer, and a softmax layer is deployed as the classification layer in this stage.

After training the first LSTM layer, we concatenate the fc7 layer of AlexNet and the LSTM layer for every person and pool over all people in a scene. The pooled features, which correspond to frame-level features, are fed to the second LSTM network. This network consists of a 3000-node fully connected layer followed by a 9-timestep 500-node LSTM layer, which is passed to a softmax layer trained to recognize group activity labels.

For training all our models (including both the baseline models and both stages of the two-stage model), we follow the same training protocol. We use a fixed learning rate of 0.00001 and a momentum of 0.9.
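To make the reported dimensions concrete, here is a hedged PyTorch re-sketch of the two-stage pipeline (fc7 = 4096-d, 3000-unit person LSTM, pooling, 3000-node fully connected layer, 500-unit group LSTM over 9 timesteps). It is a reconstruction for illustration only; the actual implementation is the stage-wise Caffe setup described above, and the single forward pass shown here glosses over the two-step training.

```python
import torch
import torch.nn as nn

class TwoStageModel(nn.Module):
    """Illustrative reconstruction of the two-stage model (not the Caffe original)."""

    def __init__(self, n_actions, n_activities, fc7_dim=4096):
        super().__init__()
        self.person_lstm = nn.LSTM(fc7_dim, 3000, batch_first=True)   # stage 1: 3000 hidden units
        self.action_cls = nn.Linear(3000, n_actions)                  # person-level softmax head
        self.frame_fc = nn.Linear(fc7_dim + 3000, 3000)               # 3000-node FC after pooling
        self.group_lstm = nn.LSTM(3000, 500, batch_first=True)        # stage 2: 500 hidden units
        self.activity_cls = nn.Linear(500, n_activities)              # group-level softmax head

    def forward(self, fc7):
        # fc7: (K, T, 4096) AlexNet features for K tracked players over T (= 9) timesteps.
        h, _ = self.person_lstm(fc7)                    # (K, T, 3000) person-level temporal features
        action_scores = self.action_cls(h)              # per-person, per-timestep action scores
        P = torch.cat([fc7, h], dim=2)                  # Eq. (7): concatenate fc7 and LSTM state
        Z = P.max(dim=0).values                         # Eq. (8): max-pool over people -> (T, 7096)
        g, _ = self.group_lstm(self.frame_fc(Z).unsqueeze(0))
        activity_scores = self.activity_cls(g[:, -1])   # classify from the last group-LSTM state
        return action_scores, activity_scores
```

In the paper, the two stages are trained separately (the person CNN and LSTM first, then the group network on the pooled features) with the fixed learning rate and momentum given above, rather than jointly in a single pass as this sketch might suggest.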
Figure 2: Our two-stage model for a volleyball match. Given tracklets of K players, we feed each tracklet into a CNN, followed by a person LSTM layer to represent each player's action. We then pool over all people's temporal features in the scene. The output of the pooling layer is fed to the second LSTM network to identify the whole team's activity.
For tracking subjects in a scene, we used the tracker by Danelljan et al. [6], implemented in the Dlib library [17].
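A minimal sketch of producing one person tracklet with Dlib's correlation_tracker (Dlib's implementation of the Danelljan et al. tracker [6]); the frame list and the initial bounding box are placeholders, not the exact tracking script used here.

```python
import dlib

def track_person(frames, init_box):
    """frames: list of RGB images (numpy arrays); init_box: (left, top, right, bottom)."""
    tracker = dlib.correlation_tracker()
    tracker.start_track(frames[0], dlib.rectangle(*init_box))
    tracklet = [init_box]
    for frame in frames[1:]:
        tracker.update(frame)
        pos = tracker.get_position()
        tracklet.append((int(pos.left()), int(pos.top()),
                         int(pos.right()), int(pos.bottom())))
    return tracklet   # one bounding box per frame; each crop is then fed to the CNN
```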
4. Experiments

In this section, we evaluate our model by comparing our results with several baselines and previously published works on the Collective Activity Dataset [4] and our new volleyball dataset. First, we describe our baseline models. Then, we present our results on the Collective Activity Dataset, followed by experiments on the volleyball dataset.

4.1. Baselines

The following baselines are considered in all our experiments:

1. Image Classification: This baseline is the basic AlexNet model fine-tuned for group activity recognition in a single frame.

2. Person Classification: In this baseline, the AlexNet CNN model is deployed on each person, fc7 features are pooled over all people and fed to a softmax classifier to recognize group activities in each single frame.

3. Fine-tuned Person Classification: This baseline is similar to the previous baseline with one distinction. The AlexNet model on each player is fine-tuned to recognize person-level actions. Then, fc7 features are pooled over all players to recognize group activities in a scene without any further fine-tuning of the AlexNet model. The rationale behind this baseline is to examine a scenario where person-level action annotations as well as group activity annotations are used in a deep learning model that does not model the temporal aspect of group activities. This is very similar to our two-stage model without the temporal modeling.

4. Temporal Model with Image Features: This baseline is a temporal extension of the first baseline. It examines the idea of feeding image-level features directly to an LSTM model to recognize group activities. In this baseline, the AlexNet model is deployed on the whole image and the resulting fc7 features are fed to an LSTM model. This baseline can be considered a reimplementation of Donahue et al. [9].

5. Temporal Model with Person Features: This baseline is a temporal extension of the second baseline: fc7 features pooled over all people are fed to an LSTM model to recognize group activities.

6. Two-stage Model without LSTM 1: This baseline is a variant of our model, omitting the person-level temporal model (LSTM 1). Instead, the person-level classification is done only with the fine-tuned person CNN.

7. Two-stage Model without LSTM 2: This baseline is a variant of our model, omitting the group-level temporal model (LSTM 2). In other words, we do the final classification based on the outputs of the temporal models for individual person action labels, but without an additional group-level LSTM.

4.2. Experiments on the Collective Activity Dataset

The Collective Activity Dataset [4] has been widely used for evaluating group activity recognition approaches in the computer vision literature [1, 7, 2].
This dataset consists of 44 videos, eight person-level pose labels (not used in our work), five person-level action labels, and five group-level activities. A scene is assigned a group activity label based on the majority of what people are doing. We follow the train/test split provided by [11]. In this section, we present our results on this dataset.

Method   Accuracy
B1-Image Classification   63.0
B2-Person Classification   61.8
B3-Fine-tuned Person Classification   66.3
B4-Temporal Model with Image Features   64.2
B5-Temporal Model with Person Features   62.2
B6-Two-stage Model without LSTM 1   70.1
B7-Two-stage Model without LSTM 2   76.8
Two-stage Hierarchical Model   81.5

Table 1: Comparison of our method with baseline methods on the Collective Activity Dataset.

Method   Accuracy
Contextual Model [23]   79.1
Deep Structured Model [7]   80.6
Our Two-stage Hierarchical Model   81.5
Cardinality kernel [11]   83.4

... of actions in the scene), which is exactly the way group activities are defined in this dataset.

4.2.1 Discussion

The confusion matrix obtained for the Collective Activity Dataset using our two-stage model is shown in Figure 3. We observe that the model performs almost perfectly for the talking and queuing classes, and gets confused between crossing, waiting, and walking. Such behaviour is perhaps due to a lack of consideration of the spatial relations between people in the group, which has been shown to boost the performance of previous group activity recognition methods: e.g. crossing involves the walking action, but is confined to a path that people follow in an orderly fashion. Therefore, our model, which is designed only to learn the dynamic properties of group activities, often gets confused with the walking action.

It is clear that our two-stage model improves performance compared to the baselines. The temporal information improves performance. Further, finding and describing the elements of a video (i.e. persons) provides benefits over utilizing frame-level features.
Figure 4: Visualizations of the generated scene labels using our model. Green denotes correct classifications, red denotes
incorrect. The incorrect ones correspond to the confusion between different actions in ambiguous cases (h and j examples),
or in the left and right distinction (i example).
From the tables, we observe that the group activity labels are relatively more balanced compared to the player action labels. This follows from the fact that we often have people present in static actions like standing, compared to dynamic actions (setting, spiking, etc.). Therefore, our dataset presents a challenging team activity recognition task, where the interesting actions that can directly determine the group activity occur rarely. The dataset will be made publicly available to facilitate future comparisons².

In Table 5, the classification performance of our proposed model is compared against the baselines. Similar to the performance on the Collective Activity Dataset, our two-stage LSTM model outperforms the baseline models.

² https://github.com/mostafa-saad/deep-activity-rec
Group Activity Class   No. of Instances
Right set   229
Right spike   187
Right pass   267
Left pass   304
Left spike   246
Left set   223

Action Classes   Average No. of Instances per Frame
Waiting   0.30
Setting   0.33
Digging   0.57
Falling   0.21
Spiking   0.28
Blocking   0.58
Others   9.22

In both datasets, an observation from the tables is that while both LSTMs contribute to the overall classification performance, having the first-layer LSTM (B7 baseline) is relatively more critical to the performance of the system, compared to the second-layer LSTM (B6 baseline).

All the reported experiments use max-pooling as mentioned above. However, we also tried both sum and average pooling, but their performance was consistently lower compared to their max-pooling counterpart.

Acknowledgements

This work was supported by grants from NSERC and Disney Research.
References

[1] M. R. Amer, P. Lei, and S. Todorovic. Hirf: Hierarchical random field for collective activity recognition in videos. In Computer Vision–ECCV 2014, pages 572–585. Springer, 2014.
[2] M. R. Amer, D. Xie, M. Zhao, S. Todorovic, and S.-C. Zhu. Cost-sensitive top-down/bottom-up inference for multiscale activity recognition. In Computer Vision–ECCV 2012, pages 187–200. Springer, 2012.
[3] W. Choi and S. Savarese. A unified framework for multi-target tracking and collective activity recognition. In Computer Vision–ECCV 2012, pages 215–230. Springer, 2012.
[4] W. Choi, K. Shahid, and S. Savarese. What are they doing?: Collective activity classification using spatio-temporal relationship among people. In Computer Vision Workshops (ICCV Workshops), 2009 IEEE 12th International Conference on, pages 1282–1289. IEEE, 2009.
[5] W. Choi, K. Shahid, and S. Savarese. Learning context for collective activity recognition. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 3273–3280. IEEE, 2011.
[6] M. Danelljan, G. Häger, F. Shahbaz Khan, and M. Felsberg. Accurate scale estimation for robust visual tracking. In British Machine Vision Conference (BMVC), 2014.
[7] Z. Deng, M. Zhai, L. Chen, Y. Liu, S. Muralidharan, M. Roshtkhari, and G. Mori. Deep structured models for group activity recognition. In British Machine Vision Conference (BMVC), 2015.
[8] C. Direkoglu and N. E. O'Connor. Team activity recognition in sports. In Computer Vision–ECCV 2012, pages 69–83. Springer, 2012.
[9] J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. arXiv preprint arXiv:1411.4389, 2014.
[10] A. Graves and N. Jaitly. Towards end-to-end speech recognition with recurrent neural networks. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1764–1772, 2014.
[11] H. Hajimirsadeghi, W. Yan, A. Vahdat, and G. Mori. Visual recognition by counting instances: A multi-instance cardinality potential kernel. CVPR, 2015.
[12] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
[13] S. S. Intille and A. Bobick. Recognizing planned, multi-person action. Computer Vision and Image Understanding (CVIU), 81:414–445, 2001.
[14] Y. Jia. Caffe: An open source convolutional architecture for fast feature embedding, 2013. http://caffe.berkeleyvision.org/.
[15] A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. CVPR, 2015.
[16] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 1725–1732. IEEE, 2014.
[17] D. E. King. Dlib-ml: A machine learning toolkit. Journal of Machine Learning Research, 10:1755–1758, 2009.
[18] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
[19] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. Hmdb: a large video database for human motion recognition. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 2556–2563. IEEE, 2011.
[20] S. Kwak, B. Han, and J. H. Han. Multi-agent event detection: Localization and role assignment. In CVPR, 2013.
[21] T. Lan, L. Sigal, and G. Mori. Social roles in hierarchical models for human activity recognition. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 1354–1361. IEEE, 2012.
[22] T. Lan, L. Sigal, and G. Mori. Social roles in hierarchical models for human activity recognition. In Computer Vision and Pattern Recognition (CVPR), 2012.
[23] T. Lan, Y. Wang, W. Yang, S. Robinovitch, and G. Mori. Discriminative latent models for recognizing contextual group activities. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(8):1549–1562, 2012.
[24] V. I. Morariu and L. S. Davis. Multi-agent event recognition in structured scenarios. In Computer Vision and Pattern Recognition (CVPR), 2011.
[25] J. Y.-H. Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici. Beyond short snippets: Deep networks for video classification. CVPR, 2015.
[26] R. Poppe. A survey on vision-based human action recognition. Image and Vision Computing, 28(6):976–990, 2010.
[27] V. Ramanathan, B. Yao, and L. Fei-Fei. Social role discovery in human events. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 2475–2482. IEEE, 2013.
[28] C. Schüldt, I. Laptev, and B. Caputo. Recognizing human actions: a local svm approach. In Pattern Recognition, 2004. ICPR 2004. Proceedings of the 17th International Conference on, volume 3, pages 32–36. IEEE, 2004.
[29] A. G. Schwing and R. Urtasun. Fully connected deep structured networks. arXiv preprint arXiv:1503.02351, 2015.
[30] T. Shu, D. Xie, B. Rothrock, S. Todorovic, and S.-C. Zhu. Joint inference of groups, events and human roles in aerial videos. In CVPR, 2015.
[31] B. Siddiquie, Y. Yacoob, and L. Davis. Recognizing plays in american football videos. Technical report, University of Maryland, 2009.
[32] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems, pages 568–576, 2014.
[33] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[34] K. Soomro, A. R. Zamir, and M. Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
[35] E. Swears, A. Hoogs, Q. Ji, and K. Boyer. Complex activity recognition using granger constrained dbn (gcdbn) in sports and surveillance video. In Computer Vision and Pattern Recognition (CVPR), June 2014.
[36] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.
[37] J. J. Tompson, A. Jain, Y. LeCun, and C. Bregler. Joint training of a convolutional network and a graphical model for human pose estimation. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 1799–1807. Curran Associates, Inc., 2014.
[38] S. Venugopalan, H. Xu, J. Donahue, M. Rohrbach, R. Mooney, and K. Saenko. Translating videos to natural language using deep recurrent neural networks. arXiv preprint arXiv:1412.4729, 2014.
[39] H. Wang, A. Kläser, C. Schmid, and C.-L. Liu. Action recognition by dense trajectories. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 3169–3176. IEEE, 2011.
[40] D. Weinland, R. Ronfard, and E. Boyer. A survey of vision-based methods for action representation, segmentation and recognition. Computer Vision and Image Understanding, 115(2):224–241, 2011.
[41] S. Yeung, O. Russakovsky, N. Jin, M. Andriluka, G. Mori, and L. Fei-Fei. Every moment counts: Dense detailed labeling of actions in complex videos. arXiv preprint arXiv:1507.05738, 2015.
[42] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. H. S. Torr. Conditional random fields as recurrent neural networks. In International Conference on Computer Vision (ICCV), 2015.