Action Recognition With Trajectory-Pooled Deep-Convolutional Descriptors
1. Introduction
HMDB51 [15], UCF101 [26]) and contests (e.g. THUMOS [11]). Improved trajectories include several important ingredients in their extraction process. Firstly, the extracted trajectories are mainly located at regions of high motion salience, which contain rich and discriminative information for action recognition. Secondly, the local descriptors of the corresponding regions in several successive frames are aligned and pooled along the trajectories. This trajectory-constrained sampling strategy also takes account of the temporal continuity of human action and is effective in dealing with variations of motion speed. However, these hand-crafted descriptors are not optimized for visual representation and may lack discriminative capacity for action recognition.

The second type of representation is deep-learned features; typical methods include Convolutional RBMs [29], 3D ConvNets [9], Deep ConvNets [12], and Two-Stream ConvNets [24]. These deep learning methods aim to automatically learn a semantic representation from raw video using a deep neural network discriminatively trained on a large number of labeled videos. Two-Stream ConvNets [24] are probably the most successful architecture at present, and they match the state-of-the-art performance of improved trajectories [31, 32] on UCF101 and HMDB51. They are composed of two neural networks, namely spatial nets and temporal nets. Spatial nets mainly capture discriminative appearance features for action understanding, while temporal nets aim to learn effective motion features. However, unlike in image classification tasks [14], these deep learning based methods fail to outperform previous hand-crafted features. One problem is that deep learning methods require a large number of labeled videos for training, while most available action datasets are relatively small. Meanwhile, most current deep learning based action recognition methods largely ignore the intrinsic difference between the temporal and spatial domains, and simply treat the temporal dimension as feature channels when adapting ConvNet architectures to model videos.

Motivated by the above analysis, this paper proposes a new kind of video feature, called trajectory-pooled deep-convolutional descriptor (TDD). The design of TDD aims to combine the benefits of both hand-crafted and deep-learned features. To achieve this goal, our approach integrates the key factors of two successful video representations, namely improved trajectories [31] and two-stream ConvNets [24]. We utilize deep architectures to learn multi-scale convolutional feature maps, and introduce the strategies of trajectory-constrained sampling and pooling to encode deep features into effective descriptors.

Specifically, we first train two-stream ConvNets on a relatively large dataset, since more labeled action videos make ConvNet training more stable and robust. Then, we treat the learned two-stream ConvNets as generic feature extractors and use them to obtain multi-scale convolutional feature maps for each video. Meanwhile, we detect a set of point trajectories with the method of improved trajectories. Based on the convolutional feature maps and improved trajectories, we pool the local ConvNet responses over the spatiotemporal tubes centered at the trajectories; the resulting descriptor is called TDD. Finally, we choose the Fisher vector representation to aggregate these local TDDs over the whole video into a global super vector, and use a linear SVM as the classifier to perform action recognition. We conduct experiments on two public action datasets: the HMDB51 dataset [15] and the UCF101 dataset [26]. We show that our TDDs obtain state-of-the-art performance for action recognition on these challenging datasets. Meanwhile, our results demonstrate that TDDs are complementary to hand-crafted features (HOG, HOF, and MBH), and fusing them further boosts recognition performance.

2. Related Works

Hand-crafted features. Local features [7, 16, 33, 39] have become popular and effective representations in action recognition, as they do not require algorithms to detect the human body and are robust to background clutter, illumination changes, and video noise. Space Time Interest Points [16] used the Harris3D detector to extract informative regions, while the Cuboid detector [7] relied on temporal Gabor filters. Willems et al. [39] proposed a Hessian detector, a spatio-temporal extension of the Hessian saliency measure used for blob detection in images. Meanwhile, several local descriptors have been proposed to represent the 3D volumes extracted around these interest points, such as Histogram of Gradient (HOG), Histogram of Optical Flow (HOF) [17], 3D Histogram of Gradient (HOG3D) [13], and Extended SURF (ESURF) [39]. Recent works made use of point trajectories [30, 31] to extract and align 3D volumes, and resorted to richer low-level descriptors, including HOG, HOF, and Motion Boundary Histogram (MBH), for constructing effective video representations.

One limitation of these local features is that they lack semantics and discriminative capacity. To overcome this issue, several mid-level and high-level video representations have been proposed, such as Action Bank [22], Dynamic-Poselets [37], Motionlets [35], Motion Atoms and Phrases [34], and Actons [42]. They usually resorted to heuristic mining methods to select discriminative visual elements as feature units. Instead, this paper takes a different view of the problem and replaces these local hand-crafted descriptors with deep-learned representations. Our deep representations deliver high-level semantic information and are learned automatically from training data without relying on such heuristic rules.
[Figure 2 diagram — left: extracting trajectories (input video → tracking in a single scale → trajectories); right: extracting feature maps (frame & flow pyramid → spatial & temporal nets → feature pyramid); the two parts are combined by trajectory-constrained pooling into TDD.]
Figure 2. Pipeline of TDD. The whole process of extracting TDD is composed of three steps: (i) extracting trajectories, (ii) extracting multi-scale convolutional feature maps, and (iii) calculating TDD. We exploit two state-of-the-art video representations, namely improved trajectories and two-stream ConvNets. Building on them, we conduct trajectory-constrained sampling and pooling over the convolutional feature maps to obtain trajectory-pooled deep-convolutional descriptors.
Deep-learned features. Deep learning techniques have achieved great success in image based tasks [14, 25, 28, 41], and there have been a number of attempts to develop deep architectures for video action recognition [9, 12, 24, 29]. Taylor et al. [29] used Gated Restricted Boltzmann Machines (GRBMs) to learn motion features in an unsupervised manner and then resorted to convolutional learning to fine-tune the parameters. Ji et al. [9] extended 2D ConvNets to the video domain for action recognition on relatively small datasets, and recently Karpathy et al. [12] tested ConvNets with deep structures on a large dataset called Sports-1M. However, these deep models achieved lower performance than shallow hand-crafted representations [31], which might be ascribed to two facts: firstly, the available action datasets are relatively small for deep learning; secondly, learning complex motion patterns is more challenging. Simonyan et al. [24] designed two-stream ConvNets containing spatial and temporal nets by exploiting the large ImageNet dataset for pre-training and explicitly computing optical flow to capture motion information, and finally matched the state-of-the-art performance.

However, these deep models lacked consideration of the temporal characteristics of video data and relied on large training datasets. We incorporate video temporal characteristics into deep architectures through the strategy of trajectory-constrained sampling and pooling, and propose a new descriptor. Meanwhile, our descriptors can be easily adapted to datasets of smaller size.

3. Improved Trajectories Revisited

As shown in Figure 2, our proposed representation (TDD) is based on low-level trajectory extraction, and we choose improved trajectories [31]. In this section, we briefly review the extraction process of improved trajectories. It is worth noting that our TDD is independent of the method used to extract trajectories; we use improved trajectories because of their good performance.

Improved trajectories extend dense trajectories [30]. To compute dense trajectories, the first step is to densely sample a set of points on 8 spatial scales on a grid with a step size of 5 pixels. Points in homogeneous areas are eliminated by thresholding the smaller eigenvalue of their autocorrelation matrices. Then these sampled points are tracked by median filtering of the dense flow field:

$$P_{t+1} = (x_{t+1}, y_{t+1}) = (x_t, y_t) + (M * \omega_t)|_{(\bar{x}_t, \bar{y}_t)}, \qquad (1)$$

where $M$ is the median filter kernel, $*$ is the convolution operation, $\omega_t = (u_t, v_t)$ is the dense optical flow field of the $t$-th frame, and $(\bar{x}_t, \bar{y}_t)$ is the rounded position of $(x_t, y_t)$. To avoid the drifting problem of tracking, the maximum length of a trajectory is set to 15 frames. Finally, static trajectories are removed as they lack motion information, and trajectories with suddenly large displacement are also discarded, since they are obviously incorrect due to inaccurate optical flow.
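To make this tracking step concrete, below is a minimal NumPy sketch of Equation (1): each point is displaced by the median-filtered flow sampled at its rounded position. The function names and the 3 × 3 median window are illustrative assumptions, not the authors' released implementation.

```python
import numpy as np

def track_point(x, y, flow, k=1):
    """One step of Eq. (1): P_{t+1} = (x_t, y_t) + (M * omega_t) at the rounded position.

    flow: dense optical flow of frame t, shape (H, W, 2) holding (u, v) per pixel.
    k:    half-width of the median-filter kernel M (k=1 gives a 3x3 window, an assumption).
    """
    H, W, _ = flow.shape
    xb = int(np.clip(round(x), 0, W - 1))   # rounded position (x_bar_t, y_bar_t)
    yb = int(np.clip(round(y), 0, H - 1))
    patch = flow[max(yb - k, 0):yb + k + 1, max(xb - k, 0):xb + k + 1]
    u = np.median(patch[..., 0])            # median filtering of the flow field
    v = np.median(patch[..., 1])
    return x + u, y + v

def track_trajectory(x0, y0, flows, P=15):
    """Track a sampled point for at most P points (the maximum trajectory length)."""
    traj = [(x0, y0)]
    for flow in flows[:P - 1]:
        traj.append(track_point(traj[-1][0], traj[-1][1], flow))
    return np.array(traj)

flows = [np.random.randn(240, 320, 2).astype(np.float32) for _ in range(20)]
print(track_trajectory(160.0, 120.0, flows).shape)   # (15, 2)
```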
Improved trajectories boost the recognition performance of dense trajectories by taking camera motion into account. The method assumes that the background motion between two consecutive frames can be characterized by a homography matrix. To estimate the homography, the first step is to find correspondences between the two frames.
They resort to SURF [2] feature matching and optical-flow-based matching, as these two kinds of matching schemes are complementary to each other. Then, the RANSAC [8] algorithm is used to estimate the homography matrix. Based on the homography, the frame image is rectified to remove the camera motion and the optical flow is re-computed, giving the so-called warped flow. Warped flow benefits the descriptors calculated from optical flow, in particular HOF, and trajectories corresponding to camera motion can be removed as well.

We adopt improved trajectories for the task of TDD extraction, but with one modification. Unlike dense trajectories or improved trajectories, we only track points on the original spatial scale, and extract multi-scale TDDs around the extracted trajectories (see Section 4). We observe that tracking on a single scale is fast in practice. In summary, given a video $V$, we obtain a set of trajectories

$$\mathbb{T}(V) = \{T_1, T_2, \cdots, T_K\}, \qquad (2)$$

where $K$ is the number of trajectories and $T_k$ denotes the $k$-th trajectory in the original spatial scale:

$$T_k = \{(x_1^k, y_1^k, z_1^k), (x_2^k, y_2^k, z_2^k), \cdots, (x_P^k, y_P^k, z_P^k)\}, \qquad (3)$$

where $(x_p^k, y_p^k, z_p^k)$ is the pixel position of the $p$-th point in trajectory $T_k$, and $P$ is the length of the trajectory ($P = 15$). These trajectories are used for trajectory-constrained sampling and pooling in the process of TDD extraction, as described in the next section.
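As a small illustration (a hedged sketch of one possible data layout, not the authors' code), the trajectory set of Equations (2) and (3) can be stored as a single array with one row of (x, y, frame) triplets per trajectory point:

```python
import numpy as np

K, P = 4, 15  # number of trajectories and points per trajectory (P = 15 in the paper)

# trajectories[k, p] = (x_p^k, y_p^k, z_p^k): pixel position and frame index
trajectories = np.zeros((K, P, 3), dtype=np.float32)

# example: fill trajectory k=0 with a point drifting slowly from pixel (40, 60)
trajectories[0, :, 0] = 40 + np.arange(P)        # x coordinates
trajectories[0, :, 1] = 60 + 0.5 * np.arange(P)  # y coordinates
trajectories[0, :, 2] = np.arange(P)             # frame indices z_p^k

print(trajectories.shape)  # (K, P, 3)
```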
4. Deep Convolutional Descriptors

In this section, we describe a new video representation, called trajectory-pooled deep-convolutional descriptor (TDD), which shares the benefits of both hand-crafted and deep-learned features. We first introduce the architectures of the convolutional networks (ConvNets) we use. Then, we show how to adapt ConvNets trained on large datasets to extract multi-scale convolutional feature maps. Finally, based on improved trajectories and convolutional feature maps, we describe the details of how to calculate TDDs.

4.1. Convolutional networks

Our TDD starts with designing deep ConvNets for extracting convolutional feature maps. In principle, any kind of ConvNet architecture can be adopted for TDD extraction. In our implementation, we choose the two-stream ConvNets [24] due to their good performance on the UCF101 and HMDB51 datasets.

The two-stream ConvNets contain two separate ConvNets, namely spatial nets and temporal nets. Spatial nets are designed for capturing static appearance cues and are trained on single frame images (224 × 224 × 3), while temporal nets aim to describe the dynamic motion information; their input is a volume of stacked optical flow fields (224 × 224 × 2F, where F is the number of stacked flows). Meanwhile, decoupling the spatial and temporal nets also allows us to exploit the available images by pre-training the spatial nets on the ImageNet challenge dataset [6], and to explicitly handle motion information with optical flow algorithms for the temporal nets.

The details of the ConvNets are shown in Table 1. This ConvNet architecture originates from the Clarifai networks [41], adapted to the task of action recognition with fewer filters in the conv4 layer and a lower-dimensional full7 layer. We make one further small modification: we use the same network architecture for both the spatial and temporal nets apart from the input data layer, while the original two-stream ConvNets [24] omit the second local response normalization (LRN) layer in the temporal net due to memory consumption. The implementation and training details can be found in Section 5.

4.2. Convolutional feature maps

Once the training of the two-stream ConvNets is complete, we treat them as generic feature extractors to obtain the convolutional feature maps of videos. In general, for each video, we obtain the feature maps of the spatial and temporal nets in a frame-by-frame and volume-by-volume manner, respectively. In order to make the feature maps have the same temporal duration as the input video, we pad the optical flow fields at the beginning with F − 1 copies of the optical flow field of the first frame, where F is the number of stacked optical flows.

Each frame or volume is taken as the input to the spatial or temporal net. We make two modifications to the spatial and temporal nets. The first is that we remove the layers after the target layer for feature extraction. For example, to extract the feature maps of conv4, we remove the layers from conv5 to full8. Therefore, the output of the spatial and temporal nets is the convolutional feature maps, which are used for extracting TDDs in the next subsection.

The second modification is that before each convolutional or pooling layer with kernel size k, we zero-pad the layer's input with size ⌊k/2⌋. This padding allows the input and output maps of these layers to have the same spatial extent. With this padding, it is straightforward to map the positions of trajectory points in the video to the coordinates of the convolutional feature maps: a trajectory point with video coordinates (x_p, y_p, z_p) in Equation (3) will be centered at (r × x_p, r × y_p, z_p) in the convolutional map, where r is the map size ratio with respect to the input size, as listed in Table 1.
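The coordinate mapping described above can be sketched as follows, assuming the ⌊k/2⌋ zero padding so that each layer's map size ratio r is exactly the value listed in Table 1; the helper below is illustrative rather than part of the released code.

```python
import numpy as np

MAP_RATIO = {"conv3": 1.0 / 16, "conv4": 1.0 / 16, "conv5": 1.0 / 16, "pool5": 1.0 / 32}  # from Table 1

def to_feature_map_coords(points, layer):
    """Map trajectory points (x_p, y_p, z_p) in video coordinates onto a layer's
    feature map, i.e. (r * x_p, r * y_p, z_p) with r the map size ratio."""
    r = MAP_RATIO[layer]
    points = np.asarray(points, dtype=np.float64)
    xy = np.rint(points[:, :2] * r).astype(int)   # spatial coordinates are scaled and rounded
    z = points[:, 2:].astype(int)                 # the frame index is left unchanged
    return np.concatenate([xy, z], axis=1)

# a trajectory point at pixel (208, 128) of frame 7, centered on the conv4 map
print(to_feature_map_coords([[208, 128, 7]], "conv4"))   # -> [[13  8  7]]
```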
Layer conv1 pool1 conv2 pool2 conv3 conv4 conv5 pool5 full6 full7 full8
size 7×7 3×3 5×5 3×3 3×3 3×3 3×3 3×3 - - -
stride 2 2 2 2 1 1 1 2 - - -
channel 96 96 256 256 512 512 512 512 4096 2048 101
map size ratio 1/2 1/4 1/8 1/16 1/16 1/16 1/16 1/32 - - -
receptive field 7×7 11 × 11 27 × 27 43 × 43 75 × 75 107 × 107 139 × 139 171 × 171 - - -
Table 1. ConvNet architectures. We use architectures similar to the two-stream ConvNets [24], adapted to the task of action recognition from the Clarifai networks [41], with fewer filters in the conv4 layer (512 vs. 1024) and a lower-dimensional full7 layer (2048 vs. 4096). For the conv1 and conv2 layers, local response normalization (LRN) is applied with parameter settings n = 5, α = 5 × 10^{-4}, β = 0.75. The full6 and full7 layers are regularized using dropout, and the full8 layer acts as a soft-max classifier. The activation function for all weight layers is the rectified linear unit (ReLU). The size ratios of the feature maps with respect to the input data range from 1/2 to 1/32, and the receptive fields vary from 7 × 7 to 171 × 171 for the different convolutional and pooling layers (conv1 to pool5).
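For illustration only, here is a hedged PyTorch sketch of the convolutional tower in Table 1 (the paper itself uses Caffe, and the LRN layers are omitted here for brevity). Kernel sizes, strides, and channel counts follow the table, and every layer receives the ⌊k/2⌋ zero padding of Section 4.2, so the listed map size ratios hold exactly.

```python
import torch
import torch.nn as nn

def conv(cin, cout, k, s):
    # floor(k/2) zero padding keeps the map size ratio equal to 1 / (product of strides)
    return nn.Sequential(nn.Conv2d(cin, cout, k, stride=s, padding=k // 2), nn.ReLU(inplace=True))

def pool(k, s):
    return nn.MaxPool2d(k, stride=s, padding=k // 2)

# conv1 ... pool5 of Table 1 (channels 96-96-256-256-512-512-512-512)
features = nn.Sequential(
    conv(3, 96, 7, 2),     # conv1: ratio 1/2
    pool(3, 2),            # pool1: ratio 1/4
    conv(96, 256, 5, 2),   # conv2: ratio 1/8
    pool(3, 2),            # pool2: ratio 1/16
    conv(256, 512, 3, 1),  # conv3: ratio 1/16
    conv(512, 512, 3, 1),  # conv4: ratio 1/16
    conv(512, 512, 3, 1),  # conv5: ratio 1/16
    pool(3, 2),            # pool5: ratio 1/32
)

x = torch.zeros(1, 3, 224, 224)   # a spatial-net input frame
print(features[:6](x).shape)      # conv4 feature map: (1, 512, 14, 14), i.e. 224/16
```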
ConvNets are bottom-up architectures with a sequence of alternating convolutional and pooling layers. Different layers of a ConvNet have different receptive fields, as shown in Table 1, ranging from 7 × 7 to 171 × 171. As described in [41], these different layers capture patterns ranging from simple visual elements, such as edges, to complex visual concepts, such as parts and objects. The higher layers have larger receptive fields and obtain more invariant and discriminative features. Intuitively, these layers describe the visual content at different levels, each of which is complementary to the others for the task of recognition. We will exploit this complementary property of different layers during the extraction of TDD. Given a video $V$, we obtain a set of convolutional feature maps:

$$\mathbb{C}(V) = \{C_1^s, C_2^s, \cdots, C_M^s, C_1^t, C_2^t, \cdots, C_M^t\}, \qquad (4)$$

where $C_m^s \in \mathbb{R}^{H_m \times W_m \times L \times N_m}$ is the $m$-th feature map of the spatial net, $H_m$ is its height, $W_m$ is its width, $L$ is the video duration, and $N_m$ is the number of channels; $C_m^t \in \mathbb{R}^{H_m \times W_m \times L \times N_m}$ is the $m$-th feature map of the temporal net; and $M$ is the number of layers used for extracting TDD.

4.3. Trajectory-pooled descriptors

We now describe the method for extracting trajectory-pooled deep-convolutional descriptors (TDDs) from a set of improved trajectories $\mathbb{T}(V)$ and convolutional feature maps $\mathbb{C}(V)$ for a given video $V$. In essence, TDD is a kind of local trajectory-aligned descriptor computed in a 3D volume around the trajectory. TDDs from the spatial and temporal nets capture the appearance and motion information of this 3D volume, respectively. The size of the volume is $N \times N$ pixels and $P$ frames, where $N$ is the receptive field size and $P$ is the trajectory length. The extraction of TDD is composed of two steps: feature map normalization and trajectory pooling.

Normalization proves to be an effective strategy in designing features, partially because it can reduce the influence of illumination. It has been widely exploited in local descriptors such as SIFT [19], HOG [5], and HOF [17], and in deep learning in the form of local response normalization [14]. We apply the normalization strategy to the convolutional feature maps of the two-stream ConvNets to suppress the activation burstiness of some neurons. We design two kinds of normalization methods:

• Spatiotemporal Normalization. For spatiotemporal normalization, we normalize the feature map for each channel independently across the video spatiotemporal extent. Given a feature map $C \in \mathbb{R}^{H \times W \times L \times N}$ of Equation (4), we normalize the convolutional feature value as follows:

$$\widetilde{C}_{st}(x, y, z, n) = C(x, y, z, n) / \max V_n^{st}, \qquad (5)$$

where $\max V_n^{st}$ is the maximum value of the $n$-th feature map over the whole video spatiotemporal extent, i.e. $\max V_n^{st} = \max_{x,y,z} C(x, y, z, n)$. The spatiotemporal normalization method ensures that each convolutional feature channel ranges in the same interval, and thus contributes equally to the final TDD recognition performance.

• Channel Normalization. For channel normalization, we normalize the feature map for each pixel independently across the feature channels. We conduct channel normalization for a feature map $C \in \mathbb{R}^{H \times W \times L \times N}$ as follows:

$$\widetilde{C}_{ch}(x, y, z, n) = C(x, y, z, n) / \max V_{x,y,z}^{ch}, \qquad (6)$$

where $\max V_{x,y,z}^{ch}$ is the maximum value over the feature channels at pixel position $(x, y, z)$, i.e. $\max V_{x,y,z}^{ch} = \max_n C(x, y, z, n)$. Channel normalization makes sure that the feature value of each pixel ranges in the same interval, and lets each pixel make an equal contribution to the final representation.

After the step of feature normalization, we extract TDDs from the trajectories and the normalized convolutional feature maps by trajectory pooling. Specifically, given a trajectory $T_k$ and a normalized feature map $\widetilde{C}_m^a$, which is the $m$-th-layer feature map after either spatiotemporal normalization or channel normalization from the spatial or temporal net ($a \in \{s, t\}$), we conduct sum-pooling of the normalized feature map over the 3D volume centered at the trajectory as follows:

$$D(T_k, \widetilde{C}_m^a) = \sum_{p=1}^{P} \widetilde{C}_m^a\big(\overline{(r_m \times x_p^k)}, \overline{(r_m \times y_p^k)}, z_p^k\big), \qquad (7)$$

where $(x_p^k, y_p^k, z_p^k)$ is the position of the $p$-th point of trajectory $T_k$ in video coordinates, $r_m$ is the $m$-th-layer map size ratio with respect to the input size as listed in Table 1, and $\overline{(\cdot)}$ is the rounding operation. $D(T_k, \widetilde{C}_m^a)$ is called the trajectory-pooled deep-convolutional descriptor; it is a new kind of feature combining the merits of both improved dense trajectories and two-stream ConvNets.
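The two normalizations (Equations (5) and (6)) and the trajectory pooling of Equation (7) can be sketched in NumPy as follows. Array shapes follow the H × W × L × N convention above; the small epsilon in the denominators and the toy sizes are assumptions for illustration, not details from the paper.

```python
import numpy as np

def spatiotemporal_normalize(C, eps=1e-12):
    """Eq. (5): divide each channel by its maximum over the whole video (H, W, L)."""
    max_v = C.max(axis=(0, 1, 2), keepdims=True)   # one value per channel n
    return C / (max_v + eps)

def channel_normalize(C, eps=1e-12):
    """Eq. (6): divide each spatiotemporal position by its maximum over the N channels."""
    max_v = C.max(axis=3, keepdims=True)           # one value per position (x, y, z)
    return C / (max_v + eps)

def trajectory_pool(C_norm, trajectory, r):
    """Eq. (7): sum the normalized responses at the mapped trajectory points.

    C_norm:     normalized feature map, shape (H, W, L, N); first axis indexed by y, second by x
    trajectory: array (P, 3) of (x_p, y_p, z_p) in video coordinates
    r:          map size ratio of this layer (Table 1), e.g. 1/16 for conv4
    """
    H, W, L, N = C_norm.shape
    desc = np.zeros(N)
    for x, y, z in trajectory:
        xi = int(np.clip(np.rint(x * r), 0, W - 1))
        yi = int(np.clip(np.rint(y * r), 0, H - 1))
        zi = int(np.clip(int(z), 0, L - 1))
        desc += C_norm[yi, xi, zi, :]              # accumulate one N-dimensional response
    return desc

# toy example: a conv4-like map of a 15-frame snippet and a single trajectory
C = np.random.rand(14, 14, 15, 512)
traj = np.stack([40 + np.arange(15), 60 + np.arange(15), np.arange(15)], axis=1)
tdd = trajectory_pool(spatiotemporal_normalize(C), traj, r=1.0 / 16)
print(tdd.shape)   # (512,): one TDD per trajectory, layer, and normalization
```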
Multi-scale TDD extension. The above description of TDD extraction concerns a single scale; we now present the multi-scale extension of TDD. Improved trajectories sample points and track them on multi-scale versions of the video, while fixing the spatial extent of the HOG, HOF, and MBH descriptors to 32 × 32. The original method therefore needs to conduct point tracking and descriptor calculation at multiple scales. In our implementation, we adopt a more efficient multi-scale strategy. Specifically, we calculate optical flow and track points at a single scale. Then we construct multi-scale pyramid representations of the video frames and optical flow fields. These pyramid representations are fed into the two-stream ConvNets and transformed into multi-scale convolutional feature maps, as shown in Figure 2. Based on the multi-scale convolutional maps and the single-scale improved trajectories, we are able to compute multi-scale TDDs efficiently by applying trajectory pooling to the multi-scale convolutional feature maps as described above. The only modification for different scales is to replace the feature map size ratio $r_m$ in Equation (7) with $r_m \times s$, where $s$ is the scale of the current feature map. In practice, compared with improved trajectories, we use fewer scales, with $s = 1/2, 1/\sqrt{2}, 1, \sqrt{2}, 2$.
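A hedged sketch of the multi-scale variant: frames and flow fields are resized to each scale s and passed through the nets, and the only change at pooling time is to use r_m × s in place of r_m in Equation (7). The use of OpenCV's resize here is an assumption for illustration; any resampling routine would do.

```python
import numpy as np
import cv2

SCALES = [1 / 2, 1 / np.sqrt(2), 1, np.sqrt(2), 2]   # the five scales used for multi-scale TDD

def build_pyramid(frame, scales=SCALES):
    """Resize one frame (or flow field) to every scale of the pyramid."""
    return [cv2.resize(frame, None, fx=s, fy=s, interpolation=cv2.INTER_LINEAR)
            for s in scales]

def effective_ratio(r_m, s):
    """On the feature map computed at scale s, a single-scale trajectory point x_p
    lands at (r_m * s) * x_p, i.e. Eq. (7) with r_m replaced by r_m * s."""
    return r_m * s

frame = np.zeros((240, 320, 3), dtype=np.uint8)
print([im.shape[:2] for im in build_pyramid(frame)])   # from half to double resolution
print(effective_ratio(1.0 / 16, np.sqrt(2)))           # conv4 ratio at scale sqrt(2)
```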
5. Experiments

In this section, we first present the details of the datasets and their evaluation scheme. Then, we describe the details of our method. Finally, we give the experimental results and compare TDD with the state of the art.

5.1. Datasets

In order to verify the effectiveness of TDDs, we conduct experiments on two public large datasets, namely HMDB51 [15] and UCF101 [26]. The HMDB51 dataset is a large collection of realistic videos from various sources, including movies and web videos. The dataset is composed of 6,766 video clips from 51 action categories, with each category containing at least 100 clips. Our experiments follow the original evaluation scheme using three different training/testing splits. In each split, each action class has 70 clips for training and 30 clips for testing. The average accuracy over these three splits is used to measure the final performance.

The UCF101 dataset contains 101 action classes and there are at least 100 video clips for each class. The whole dataset contains 13,320 video clips, which are divided into 25 groups for each action category. We follow the evaluation scheme of the THUMOS13 challenge [11] and adopt the three training/testing splits for evaluation. As UCF101 is larger than HMDB51, we use the UCF101 dataset to train the two-stream ConvNets initially, and transfer this learned model for TDD extraction on the HMDB51 dataset.

5.2. Implementation details

Two-stream ConvNets training. Training deep ConvNets is more challenging for action recognition, as actions are more complex than objects and the available datasets are extremely small compared with the ImageNet dataset [6]. We choose the training set of UCF101 split 1 for learning the two-stream ConvNets, as it is probably the largest publicly available dataset. We use the Caffe toolbox [10] for ConvNet implementation. The network weights are learned using mini-batch (set to 256) stochastic gradient descent with momentum (set to 0.9). For the spatial net, we first resize the frame to make the smaller side 256, and then randomly crop a 224 × 224 region from the frame. It then undergoes random horizontal flipping. We pre-train the network with the publicly available model [4]. Finally, we fine-tune the model parameters on the UCF101 dataset, where the learning rate is set to $10^{-2}$, decreased to $10^{-3}$ after 14K iterations, and training is stopped at 20K iterations.

For the temporal net, the input is a 3D volume of stacked optical flow fields. We choose the TVL1 optical flow algorithm [40] and use the OpenCV implementation, due to its balance between accuracy and efficiency. For fast computation, we discretize the values of the optical flow fields into integers and set their range to 0-255, just like images. Specifically, we choose to stack 10 frames of optical flow fields to keep a balance between performance and efficiency. We train the temporal net on UCF101 from scratch. As the dataset is relatively small, we use a high dropout ratio to improve the generalization capacity of the trained model: we set dropout to 0.9 for the full6 layer and 0.8 for the full7 layer. The training procedure of the temporal net is similar to that of the spatial net; a 224 × 224 × 20 sub-volume is randomly cropped and flipped from the training video. The learning rate is initially set to $10^{-2}$ and decreased to $10^{-3}$ after 50K iterations. It is then reduced to $10^{-4}$ after 70K iterations and training is stopped at 90K iterations.
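The flow preprocessing just described can be sketched as follows. The linear mapping used to discretize the flow into 0-255 (clipping at ±20 pixels, with zero motion at 128) is an assumption for illustration; the paper only states that flow values are discretized into that range.

```python
import numpy as np

def quantize_flow(flow, bound=20.0):
    """Map flow values from [-bound, bound] px to integers in 0..255 (128 = zero motion).

    The exact mapping is an assumption; the paper only says flow is discretized to 0-255.
    """
    scaled = np.clip(flow, -bound, bound) / (2 * bound) + 0.5   # -> [0, 1]
    return np.round(scaled * 255).astype(np.uint8)

def stack_flows(flows, t, F=10):
    """Build the 2F-channel temporal-net input starting at frame t (u and v of F flows)."""
    chunk = [quantize_flow(f) for f in flows[t:t + F]]
    return np.concatenate(chunk, axis=2)                        # shape (H, W, 2F)

flows = [np.random.uniform(-5, 5, size=(224, 224, 2)).astype(np.float32) for _ in range(12)]
volume = stack_flows(flows, t=0, F=10)
print(volume.shape, volume.dtype)    # (224, 224, 20) uint8
```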
Results of two-stream ConvNets. To evaluate the trained model, as in [24], we select 25 frames for each video clip and obtain 10 crops for each frame. The final recognition result is the average over these crops and frames. We obtain 71.2% recognition accuracy with the spatial net and 80.1% with the temporal net. The performance of our implemented two-stream ConvNets is 84.7%, which is similar to that of the original two-stream ConvNets [24] (85.6%). However, obtaining ConvNets with the highest possible performance is not the final goal of this paper; we aim to verify the effectiveness of TDDs.

Figure 3. Exploration of different settings of TDD on the HMDB51 dataset. Left: performance trend (accuracy) with varying PCA-reduced dimension (32, 64, 128, 256). Right: comparison of different normalization methods (no normalization, channel normalization, spatiotemporal normalization, and their combination). "Combine" means the fusion of spatiotemporal normalization and channel normalization.

Algorithm  HMDB51  UCF101
HOG [31, 32]  40.2%  72.4%
MBH [31, 32]  52.1%  80.8%
HOF+MBH [31, 32]  54.7%  82.2%
iDT [31, 32]  57.2%  84.7%
Spatial net [24]  40.5%  73.0%
Temporal net [24]  54.6%  83.7%
Two-stream ConvNets [24]  59.4%  88.0%
Spatial conv4  48.5%  81.9%
Spatial conv5  47.2%  80.9%
Spatial conv4 and conv5  50.0%  82.8%
Temporal conv3  54.5%  81.7%
Temporal conv4  51.2%  80.1%
Temporal conv3 and conv4  54.9%  82.2%
TDD  63.2%  90.3%
TDD and iDT  65.9%  91.5%
Table 3. Performance of TDD on the HMDB51 and UCF101 datasets. We compare our proposed TDD with iDT features [31] and two-stream ConvNets [24]. We also explore the complementary properties of TDD features and iDT features; the combination of them further boosts the performance.
Feature encoding. We choose the Fisher vector [23] to encode the TDDs of a video clip into a high-dimensional representation, as its effectiveness for action recognition has been verified in previous works [38, 27], and then use a linear SVM as the classifier (C = 100). In order to train the GMMs, we first de-correlate TDD with PCA and reduce its dimension to D. Then, we train a GMM with K (K = 256) mixtures, and finally the video is represented by a 2KD-dimensional vector.
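A hedged sketch of this encoding step, using scikit-learn for the PCA and the diagonal-covariance GMM together with the standard improved Fisher vector of [23] (gradients with respect to the means and variances, followed by power and L2 normalization). This is a generic illustration rather than the authors' code, and toy sizes are used so that it runs quickly; the paper itself uses D = 64 and K = 256, i.e. a 2KD = 32,768-dimensional vector per layer and normalization.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

def fisher_vector(X, gmm):
    """Improved Fisher vector of local descriptors X (T x D) under a diagonal GMM."""
    T, D = X.shape
    q = gmm.predict_proba(X)                      # soft assignments, T x K
    w, mu = gmm.weights_, gmm.means_              # (K,), (K, D)
    sigma = np.sqrt(gmm.covariances_)             # (K, D) diagonal standard deviations
    diff = (X[:, None, :] - mu[None, :, :]) / sigma[None, :, :]        # T x K x D
    g_mu = (q[:, :, None] * diff).sum(0) / (T * np.sqrt(w)[:, None])   # gradient w.r.t. means
    g_sig = (q[:, :, None] * (diff ** 2 - 1)).sum(0) / (T * np.sqrt(2 * w)[:, None])
    fv = np.concatenate([g_mu.ravel(), g_sig.ravel()])                 # length 2KD
    fv = np.sign(fv) * np.sqrt(np.abs(fv))        # power normalization
    return fv / (np.linalg.norm(fv) + 1e-12)      # L2 normalization

# toy setting: 500 TDDs of dimension 512, reduced by PCA to D=16, GMM with K=8
tdds = np.random.rand(500, 512)
D, K = 16, 8
pca = PCA(n_components=D).fit(tdds)
gmm = GaussianMixture(n_components=K, covariance_type="diag", max_iter=20,
                      random_state=0).fit(pca.transform(tdds))
video_fv = fisher_vector(pca.transform(tdds), gmm)
print(video_fv.shape)   # (2*K*D,) = (256,); with D=64, K=256 this would be 32,768 dims
```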
5.3. Exploration experiments

Dimension reduction. To specify the PCA dimension of TDD for GMM training and Fisher vector encoding, we first explore different dimensions reduced by PCA on the HMDB51 dataset, with conv4 descriptors from the spatial net. In this exploration experiment, we use the spatiotemporal normalization method for TDD, and the results are shown in the left of Figure 3. We vary the dimension from 32 to 256, and the results show that dimension 64 achieves the highest performance, while higher dimensions may cause performance degradation. Thus, we fix the dimension to 64 for TDDs in the remainder of this section.

Normalization method. Another important component in TDD design is the normalization method, and we have presented two normalization methods in Section 4.3: spatiotemporal normalization (ST. Norm.) and channel normalization (Cha. Norm.). We conduct experiments to investigate the effectiveness of the normalization methods by using conv4 descriptors from the spatial net on the HMDB51 dataset, and the results are shown in the right of Figure 3. We see that normalization is important for improving performance and that spatiotemporal normalization is the best choice. We also explore the complementary property of these two normalization methods by fusing their Fisher vectors, and observe that this further improves the performance. Therefore, in the remainder of this section, we use the combined representation obtained from these two normalization methods for TDDs.

Different layers. Finally, we investigate the performance of TDDs from different layers of the spatial and temporal nets on the HMDB51 dataset; the results are summarized in Table 2. For the conv5, conv4, and conv3 layers, we use the outputs of the ReLU activations, and for the conv2 and conv1 layers, we choose the outputs of the max pooling layers after the convolution operations. We see that the descriptors of layers conv4 and conv5 obtain the highest recognition performance for the spatial net, while those of layers conv3 and conv4 are the top performers for the temporal net. Therefore, in the following evaluation of TDD, we choose the descriptors from the conv4 and conv5 layers for spatial nets, and the conv3 and conv4 layers for temporal nets.

5.4. Evaluation of TDDs

In this section, we evaluate the performance of our proposed TDDs on the HMDB51 and UCF101 datasets; the experimental results are summarized in Table 3. We first compare the performance of TDDs with that of improved trajectories. The convolutional descriptors of the spatial net are much better than HOG descriptors, which indicates that deep-learned features contain more discriminative capacity than hand-crafted features.
Spatial ConvNets Temporal ConvNets
Convolutional layer conv1 conv2 conv3 conv4 conv5 conv1 conv2 conv3 conv4 conv5
Recognition accuracy 24.1% 33.9% 41.9% 48.5% 47.2% 39.2% 50.7% 54.5% 51.2% 46.1%
Table 2. The performance of different layers of spatial nets and temporal nets on the HMDB51 dataset.
Figure 4. Examples of video frames, optical flow fields, and their corresponding feature maps of spatial nets and temporal nets: (a) RGB, (b) Flow-x, (c) Flow-y, (d) S-conv4, (e) S-conv5, (f) T-conv3, (g) T-conv4.
For the convolutional descriptors of the temporal net, they are better than or comparable to the HOF and MBH descriptors, but the improvement is not as evident as for the spatial convolutional descriptors. The reason may be that HOF and MBH are computed from warped optical flow instead of the original optical flow, which has been shown to be very effective for the HOF descriptor [31]. We consider using warped flow for TDD extraction in the future.

We also compare the performance of TDDs with the two-stream ConvNets. Although our trained two-stream ConvNets obtain slightly lower performance than theirs, our spatial TDDs outperform the spatial nets by a large margin, and our temporal TDDs are comparable to their temporal net. These results indicate that trajectory-constrained sampling and pooling is an effective strategy for improving recognition performance, in particular for spatial TDDs. We also notice that the combined TDDs from the spatial and temporal nets outperform the two-stream ConvNets by around 4% and 2% on the two datasets, respectively. We also show some examples of video frames, optical flow fields, and their corresponding feature maps in Figure 4. From these examples, we see that the convolutional feature maps are relatively sparse and exhibit high correlation with the action areas.

Finally, we explore a practical way to improve the recognition performance of an action recognition system by combining TDDs with iDTs, using early fusion of their Fisher vector representations. The recognition results are shown in Table 3; the fusion of the two further boosts the performance. This further improvement indicates that our TDDs are complementary to those low-level local features.

Computational costs. Compared with iDT, we only track points on a single scale and extract the original flow instead of warped flow. The ConvNets are implemented in CUDA, so the computation is very efficient.

HMDB51 | UCF101
STIP+BoVW [15]  23.0% | STIP+BoVW [26]  43.9%
Motionlets [35]  42.1% | Deep Net [12]  63.3%
DT+BoVW [30]  46.6% | DT+VLAD [3]  79.9%
DT+MVSV [3]  55.9% | DT+MVSV [3]  83.5%
iDT+FV [31]  57.2% | iDT+FV [32]  85.9%
iDT+HSV [21]  61.1% | iDT+HSV [21]  87.9%
Two Stream [24]  59.4% | Two Stream [24]  88.0%
TDD+FV  63.2% | TDD+FV  90.3%
Our best result  65.9% | Our best result  91.5%
Table 4. Comparison of TDD to the state of the art. We separately present the results of TDDs and our best results obtained with early fusion of TDDs and iDTs.

5.5. Comparison to the state of the art

Table 4 compares our recognition results with several recently published methods on the HMDB51 and UCF101 datasets. TDDs outperform these previous methods on both datasets. On the HMDB51 dataset, our best result outperforms the other methods by 4.8%, and on the UCF101 dataset, our best result outperforms them by 3.5%. This superior performance of TDDs indicates the effectiveness of introducing trajectory-constrained sampling and pooling into deep-learned features.

6. Conclusions

This paper has proposed an effective video representation, called trajectory-pooled deep-convolutional descriptor (TDD), which integrates the advantages of hand-crafted and deep-learned features. Deep architectures are utilized to learn discriminative convolutional feature maps, and then
the strategies of trajectory-constrained sampling and pooling are adopted to aggregate these convolutional features into TDDs. Our features achieve superior performance on two datasets for action recognition, as evidenced by comparison with the state-of-the-art methods.

Acknowledgement

This work is supported by a donation of Tesla K40 GPU from NVIDIA Corporation. Limin Wang is supported by Hong Kong PhD Fellowship. Yu Qiao is the corresponding author and supported by National Natural Science Foundation of China (91320101, 61472410), Shenzhen Basic Research Program (JCYJ20120903092050890, JCYJ20120617114614438, JCYJ20130402113127496), 100 Talents Program of CAS, and Guangdong Innovative Research Team Program (No.201001D0104648280).

References

[1] J. K. Aggarwal and M. S. Ryoo. Human activity analysis: A review. ACM Comput. Surv., 43(3):16, 2011. 1
[2] H. Bay, T. Tuytelaars, and L. J. V. Gool. SURF: Speeded up robust features. In ECCV, 2006. 4
[3] Z. Cai, L. Wang, X. Peng, and Y. Qiao. Multi-view super vector for action recognition. In CVPR, 2014. 8
[4] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: Delving deep into convolutional nets. In BMVC, 2014. 6
[5] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005. 5
[6] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and F. Li. ImageNet: A large-scale hierarchical image database. In CVPR, 2009. 4, 6
[7] P. Dollár, V. Rabaud, G. Cottrell, and S. Belongie. Behavior recognition via sparse spatio-temporal features. In VS-PETS, 2005. 1, 2
[8] M. A. Fischler and R. C. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM, 24(6), 1981. 4
[9] S. Ji, W. Xu, M. Yang, and K. Yu. 3D convolutional neural networks for human action recognition. TPAMI, 35(1), 2013. 2, 3
[10] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. B. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. CoRR, abs/1408.5093. 6
[11] Y.-G. Jiang, J. Liu, A. Roshan Zamir, I. Laptev, M. Piccardi, M. Shah, and R. Sukthankar. THUMOS challenge: Action recognition with a large number of classes, 2013. 2, 6
[12] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, 2014. 2, 3, 8
[13] A. Kläser, M. Marszalek, and C. Schmid. A spatio-temporal descriptor based on 3D-gradients. In BMVC, 2008. 2
[14] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012. 2, 3, 5
[15] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. HMDB: A large video database for human motion recognition. In ICCV, 2011. 1, 2, 6, 8
[16] I. Laptev. On space-time interest points. IJCV, 64(2-3), 2005. 1, 2
[17] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. In CVPR, 2008. 1, 2, 5
[18] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. In ISP. IEEE Press, 2001. 1
[19] D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2), 2004. 5
[20] A. Patron-Perez, M. Marszalek, I. Reid, and A. Zisserman. Structured learning of human interactions in TV shows. TPAMI, 34(12), 2012. 1
[21] X. Peng, L. Wang, X. Wang, and Y. Qiao. Bag of visual words and fusion methods for action recognition: Comprehensive study and good practice. CoRR, abs/1405.4506, 2014. 8
[22] S. Sadanand and J. J. Corso. Action bank: A high-level representation of activity in video. In CVPR, 2012. 2
[23] J. Sánchez, F. Perronnin, T. Mensink, and J. J. Verbeek. Image classification with the Fisher vector: Theory and practice. IJCV, 105(3), 2013. 7
[24] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, 2014. 1, 2, 3, 4, 5, 6, 7, 8
[25] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014. 3
[26] K. Soomro, A. R. Zamir, and M. Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. CoRR, abs/1212.0402, 2012. 1, 2, 6, 8
[27] C. Sun and R. Nevatia. Large-scale web video event classification by use of Fisher vectors. In WACV, 2013. 7
[28] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. CoRR, abs/1409.4842, 2014. 3
[29] G. W. Taylor, R. Fergus, Y. LeCun, and C. Bregler. Convolutional learning of spatio-temporal features. In ECCV, 2010. 2, 3
[30] H. Wang, A. Kläser, C. Schmid, and C.-L. Liu. Dense trajectories and motion boundary descriptors for action recognition. IJCV, 103(1), 2013. 1, 2, 3, 8
[31] H. Wang and C. Schmid. Action recognition with improved trajectories. In ICCV, 2013. 1, 2, 3, 7, 8
[32] H. Wang and C. Schmid. LEAR-INRIA submission for the THUMOS workshop. In ICCV Workshop on THUMOS Challenge, 2013. 2, 7, 8
[33] H. Wang, M. M. Ullah, A. Kläser, I. Laptev, and C. Schmid. Evaluation of local spatio-temporal features for action recognition. In BMVC, 2009. 2
[34] L. Wang, Y. Qiao, and X. Tang. Mining motion atoms and phrases for complex action recognition. In ICCV, 2013. 2
[35] L. Wang, Y. Qiao, and X. Tang. Motionlets: Mid-level 3D parts for human motion recognition. In CVPR, 2013. 1, 2, 8
[36] L. Wang, Y. Qiao, and X. Tang. Latent hierarchical model of temporal structure for complex activity classification. TIP, 23(2), 2014. 1
[37] L. Wang, Y. Qiao, and X. Tang. Video action detection with relational dynamic-poselets. In ECCV, 2014. 2
[38] X. Wang, L. Wang, and Y. Qiao. A comparative study of encoding, pooling and normalization methods for action recognition. In ACCV, 2012. 7
[39] G. Willems, T. Tuytelaars, and L. J. V. Gool. An efficient dense and scale-invariant spatio-temporal interest point detector. In ECCV, 2008. 2
[40] C. Zach, T. Pock, and H. Bischof. A duality based approach for realtime TV-L1 optical flow. In 29th DAGM Symposium on Pattern Recognition, 2007. 6
[41] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In ECCV, 2014. 3, 4, 5
[42] J. Zhu, B. Wang, X. Yang, W. Zhang, and Z. Tu. Action recognition with actons. In ICCV, 2013. 2