Action Recognition With Trajectory-Pooled Deep-Convolutional Descriptors
1. Introduction
HMDB51 [15], UCF101 [26]) and contests (e.g. THUMOS [11]). Improved trajectories include several important ingredients in their extraction process. Firstly, the extracted trajectories are mainly located at regions of high motion salience, which contain rich and discriminative information for action recognition. Secondly, the local descriptors of the corresponding regions in several successive frames are aligned and pooled along the trajectories. This trajectory-constrained sampling strategy also takes account of the temporal continuity of human action and is effective in dealing with variations of motion speed. However, these hand-crafted descriptors are not optimized for visual representation and may lack discriminative capacity for action recognition.

The second type of representation is deep-learned features; typical methods include Convolutional RBMs [29], 3D ConvNets [9], Deep ConvNets [12], and Two-Stream ConvNets [24]. These deep learning methods aim to automatically learn a semantic representation from raw video using a deep neural network discriminatively trained on a large number of labeled videos. Two-Stream ConvNets [24] are probably the most successful architecture at present, and they match the state-of-the-art performance of improved trajectories [31, 32] on UCF101 and HMDB51. They are composed of two neural networks, namely spatial nets and temporal nets. Spatial nets mainly capture discriminative appearance features for action understanding, while temporal nets aim to learn effective motion features. However, unlike in image classification tasks [14], these deep learning based methods fail to outperform previous hand-crafted features. One problem is that deep learning methods require a large number of labeled videos for training, while most available action datasets are relatively small. Meanwhile, most current deep learning based action recognition methods largely ignore the intrinsic difference between the temporal and spatial domains, and simply treat the temporal dimension as feature channels when adapting ConvNet architectures to model videos.

Motivated by the above analysis, this paper proposes a new kind of video feature, called trajectory-pooled deep-convolutional descriptor (TDD). The design of TDD aims to combine the benefits of both hand-crafted and deep-learned features. To achieve this goal, our approach integrates the key factors of two successful video representations, namely improved trajectories [31] and two-stream ConvNets [24]. We utilize deep architectures to learn multi-scale convolutional feature maps, and introduce the strategies of trajectory-constrained sampling and pooling to encode deep features into effective descriptors.

Specifically, we first train two-stream ConvNets on a relatively large dataset, since more labeled action videos make ConvNet training more stable and robust. Then, we treat the learned two-stream ConvNets as generic feature extractors and use them to obtain multi-scale convolutional feature maps for each video. Meanwhile, we detect a set of point trajectories with the method of improved trajectories. Based on the convolutional feature maps and improved trajectories, we pool the local ConvNet responses over the spatiotemporal tubes centered at the trajectories; the resulting descriptor is called TDD. Finally, we choose the Fisher vector representation to aggregate these local TDDs over the whole video into a global super vector, and use a linear SVM as the classifier to perform action recognition. We conduct experiments on two public action datasets: the HMDB51 dataset [15] and the UCF101 dataset [26]. We show that our TDDs obtain state-of-the-art performance for action recognition on these challenging datasets. Meanwhile, our results demonstrate that TDDs are complementary to hand-crafted features (HOG, HOF, and MBH), and fusing them further boosts recognition performance.

2. Related Works

Hand-crafted features. Local features [7, 16, 33, 39] have become popular and effective representations in action recognition, as they do not require algorithms to detect the human body and are robust to background clutter, illumination changes, and video noise. Space Time Interest Points [16] used the Harris3D detector to extract informative regions, while the Cuboid detector [7] relied on temporal Gabor filters. Willems et al. [39] proposed a Hessian detector, a spatio-temporal extension of the Hessian saliency measure used for blob detection in images. Meanwhile, several local descriptors have been proposed to represent the 3D volumes extracted around these interest points, such as Histogram of Gradient (HOG), Histogram of Optical Flow (HOF) [17], 3D Histogram of Gradient (HOG3D) [13], and Extended SURF (ESURF) [39]. Recent works made use of point trajectories [30, 31] to extract and align 3D volumes, and resorted to richer low-level descriptors, including HOG, HOF, and Motion Boundary Histogram (MBH), for constructing effective video representations.

One limitation of these local features is that they lack semantics and discriminative capacity. To overcome this issue, several mid-level and high-level video representations have been proposed, such as Action Bank [22], Dynamic-Poselets [37], Motionlets [35], Motion Atoms and Phrases [34], and Actons [42]. They usually resorted to heuristic mining methods to select discriminative visual elements as feature units. Instead, this paper takes a different view of the problem and replaces these local hand-crafted descriptors with deep-learned representations. Our deep representations deliver high-level semantic information and are learned automatically from training data without relying on such heuristic rules.
[Figure 2 diagram — left: extracting trajectories (input video → tracking in a single scale → trajectories); right: extracting feature maps (frame & flow pyramid → spatial & temporal nets → feature pyramid); the two parts are combined by trajectory-constrained pooling into TDD.]
Figure 2. Pipeline of TDD. The whole process of extracting TDD is composed of three steps: (i) extracting trajectories, (ii) extracting multi-scale convolutional feature maps, and (iii) calculating TDD. We exploit two state-of-the-art video representations, namely improved trajectories and two-stream ConvNets. Building on them, we conduct trajectory-constrained sampling and pooling over the convolutional feature maps to obtain trajectory-pooled deep-convolutional descriptors.
Deep-learned features. Deep learning techniques have achieved great success in image based tasks [14, 25, 28, 41], and there have been a number of attempts to develop deep architectures for video action recognition [9, 12, 24, 29]. Taylor et al. [29] used Gated Restricted Boltzmann Machines (GRBMs) to learn motion features in an unsupervised manner and then resorted to convolutional learning to fine-tune the parameters. Ji et al. [9] extended 2D ConvNets to the video domain for action recognition on relatively small datasets, and recently Karpathy et al. [12] tested ConvNets with deep structures on a large dataset called Sports-1M. However, these deep models achieved lower performance than shallow hand-crafted representations [31], which might be ascribed to two facts: firstly, the available action datasets are relatively small for deep learning; secondly, learning complex motion patterns is more challenging. Simonyan et al. [24] designed two-stream ConvNets containing spatial and temporal nets by exploiting the large ImageNet dataset for pre-training and explicitly computing optical flow to capture motion information, and finally matched the state-of-the-art performance.

However, these deep models lacked consideration of the temporal characteristics of video data and relied on large training datasets. We incorporate video temporal characteristics into deep architectures through the strategy of trajectory-constrained sampling and pooling, and propose a new descriptor. Meanwhile, our descriptors can be easily adapted to datasets of smaller size.

3. Improved Trajectories Revisited

As shown in Figure 2, our proposed representation (TDD) is based on low-level trajectory extraction, and we choose improved trajectories [31]. In this section, we briefly review the extraction process of improved trajectories. It is worth noting that our TDD is independent of the method used to extract trajectories; we use improved trajectories because of their good performance.

Improved trajectories extend dense trajectories [30]. To compute dense trajectories, the first step is to densely sample a set of points on 8 spatial scales on a grid with a step size of 5 pixels. Points in homogeneous areas are eliminated by thresholding the smaller eigenvalue of their autocorrelation matrices. Then these sampled points are tracked by median filtering of the dense flow field:

$$P_{t+1} = (x_{t+1}, y_{t+1}) = (x_t, y_t) + (M * \omega_t)|_{(\bar{x}_t, \bar{y}_t)}, \qquad (1)$$

where $M$ is the median filter kernel, $*$ is the convolution operation, $\omega_t = (u_t, v_t)$ is the dense optical flow field of the $t$-th frame, and $(\bar{x}_t, \bar{y}_t)$ is the rounded position of $(x_t, y_t)$. To avoid the drifting problem of tracking, the maximum length of a trajectory is set to 15 frames. Finally, static trajectories are removed as they lack motion information, and trajectories with suddenly large displacement are also discarded, since they are obviously incorrect due to inaccurate optical flow.
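To make this tracking step concrete, below is a minimal NumPy sketch of Equation (1): each point is displaced by the median-filtered flow sampled at its rounded position. The function names and the 3 × 3 median window are illustrative assumptions, not the authors' released implementation.

```python
import numpy as np

def track_point(x, y, flow, k=1):
    """One step of Eq. (1): P_{t+1} = (x_t, y_t) + (M * omega_t) at the rounded position.

    flow: dense optical flow of frame t, shape (H, W, 2) holding (u, v) per pixel.
    k:    half-width of the median-filter kernel M (k=1 gives a 3x3 window, an assumption).
    """
    H, W, _ = flow.shape
    xb = int(np.clip(round(x), 0, W - 1))   # rounded position (x_bar_t, y_bar_t)
    yb = int(np.clip(round(y), 0, H - 1))
    patch = flow[max(yb - k, 0):yb + k + 1, max(xb - k, 0):xb + k + 1]
    u = np.median(patch[..., 0])            # median filtering of the flow field
    v = np.median(patch[..., 1])
    return x + u, y + v

def track_trajectory(x0, y0, flows, P=15):
    """Track a sampled point for at most P points (the maximum trajectory length)."""
    traj = [(x0, y0)]
    for flow in flows[:P - 1]:
        traj.append(track_point(traj[-1][0], traj[-1][1], flow))
    return np.array(traj)

flows = [np.random.randn(240, 320, 2).astype(np.float32) for _ in range(20)]
print(track_trajectory(160.0, 120.0, flows).shape)   # (15, 2)
```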
Improved trajectories boost the recognition performance of dense trajectories by taking camera motion into account. The method assumes that the background motion between two consecutive frames can be characterized by a homography matrix. To estimate the homography, the first step is to find correspondences between the two frames.
They resort to SURF [2] feature matching and optical-flow-based matching, as these two kinds of matching schemes are complementary to each other. Then, the RANSAC [8] algorithm is used to estimate the homography matrix. Based on the homography, the frame image is rectified to remove the camera motion and the optical flow is re-computed, giving the so-called warped flow. Warped flow benefits the descriptors calculated from optical flow, in particular HOF, and trajectories corresponding to camera motion can be removed as well.

We adopt improved trajectories for the task of TDD extraction, but with one modification. Unlike dense trajectories or improved trajectories, we only track points on the original spatial scale, and extract multi-scale TDDs around the extracted trajectories (see Section 4). We observe that tracking on a single scale is fast in practice. In summary, given a video $V$, we obtain a set of trajectories

$$\mathbb{T}(V) = \{T_1, T_2, \cdots, T_K\}, \qquad (2)$$

where $K$ is the number of trajectories and $T_k$ denotes the $k$-th trajectory in the original spatial scale:

$$T_k = \{(x_1^k, y_1^k, z_1^k), (x_2^k, y_2^k, z_2^k), \cdots, (x_P^k, y_P^k, z_P^k)\}, \qquad (3)$$

where $(x_p^k, y_p^k, z_p^k)$ is the pixel position of the $p$-th point in trajectory $T_k$, and $P$ is the length of the trajectory ($P = 15$). These trajectories are used for trajectory-constrained sampling and pooling in the process of TDD extraction, as described in the next section.
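As a small illustration (a hedged sketch of one possible data layout, not the authors' code), the trajectory set of Equations (2) and (3) can be stored as a single array with one row of (x, y, frame) triplets per trajectory point:

```python
import numpy as np

K, P = 4, 15  # number of trajectories and points per trajectory (P = 15 in the paper)

# trajectories[k, p] = (x_p^k, y_p^k, z_p^k): pixel position and frame index
trajectories = np.zeros((K, P, 3), dtype=np.float32)

# example: fill trajectory k=0 with a point drifting slowly from pixel (40, 60)
trajectories[0, :, 0] = 40 + np.arange(P)        # x coordinates
trajectories[0, :, 1] = 60 + 0.5 * np.arange(P)  # y coordinates
trajectories[0, :, 2] = np.arange(P)             # frame indices z_p^k

print(trajectories.shape)  # (K, P, 3)
```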
4. Deep Convolutional Descriptors

In this section, we describe a new video representation, called trajectory-pooled deep-convolutional descriptor (TDD), which shares the benefits of both hand-crafted and deep-learned features. We first introduce the architectures of the convolutional networks (ConvNets) we use. Then, we show how to adapt ConvNets trained on large datasets to extract multi-scale convolutional feature maps. Finally, based on improved trajectories and convolutional feature maps, we describe the details of how to calculate TDDs.

4.1. Convolutional networks

Our TDD starts with designing deep ConvNets for extracting convolutional feature maps. In principle, any kind of ConvNet architecture can be adopted for TDD extraction. In our implementation, we choose the two-stream ConvNets [24] due to their good performance on the UCF101 and HMDB51 datasets.

The two-stream ConvNets contain two separate ConvNets, namely spatial nets and temporal nets. Spatial nets are designed for capturing static appearance cues and are trained on single frame images (224 × 224 × 3), while temporal nets aim to describe the dynamic motion information; their input is a volume of stacked optical flow fields (224 × 224 × 2F, where F is the number of stacked flows). Meanwhile, decoupling the spatial and temporal nets also allows us to exploit the available images by pre-training the spatial nets on the ImageNet challenge dataset [6], and to explicitly handle motion information with optical flow algorithms for the temporal nets.

The details of the ConvNets are shown in Table 1. This ConvNet architecture originates from the Clarifai networks [41], adapted to the task of action recognition with fewer filters in the conv4 layer and a lower-dimensional full7 layer. We make one further small modification: we use the same network architecture for both the spatial and temporal nets apart from the input data layer, while the original two-stream ConvNets [24] omit the second local response normalization (LRN) layer in the temporal net due to memory consumption. The implementation and training details can be found in Section 5.

4.2. Convolutional feature maps

Once the training of the two-stream ConvNets is complete, we treat them as generic feature extractors to obtain the convolutional feature maps of videos. In general, for each video, we obtain the feature maps of the spatial and temporal nets in a frame-by-frame and volume-by-volume manner, respectively. In order to make the feature maps have the same temporal duration as the input video, we pad the optical flow fields at the beginning with F − 1 copies of the optical flow field of the first frame, where F is the number of stacked optical flows.

Each frame or volume is taken as the input to the spatial or temporal net. We make two modifications to the spatial and temporal nets. The first is that we remove the layers after the target layer for feature extraction. For example, to extract the feature maps of conv4, we remove the layers from conv5 to full8. Therefore, the output of the spatial and temporal nets is the convolutional feature maps, which are used for extracting TDDs in the next subsection.

The second modification is that before each convolutional or pooling layer with kernel size k, we zero-pad the layer's input with size ⌊k/2⌋. This padding allows the input and output maps of these layers to have the same spatial extent. With this padding, it is straightforward to map the positions of trajectory points in the video to the coordinates of the convolutional feature maps: a trajectory point with video coordinates (x_p, y_p, z_p) in Equation (3) will be centered at (r × x_p, r × y_p, z_p) in the convolutional map, where r is the map size ratio with respect to the input size, as listed in Table 1.
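The coordinate mapping described above can be sketched as follows, assuming the ⌊k/2⌋ zero padding so that each layer's map size ratio r is exactly the value listed in Table 1; the helper below is illustrative rather than part of the released code.

```python
import numpy as np

MAP_RATIO = {"conv3": 1.0 / 16, "conv4": 1.0 / 16, "conv5": 1.0 / 16, "pool5": 1.0 / 32}  # from Table 1

def to_feature_map_coords(points, layer):
    """Map trajectory points (x_p, y_p, z_p) in video coordinates onto a layer's
    feature map, i.e. (r * x_p, r * y_p, z_p) with r the map size ratio."""
    r = MAP_RATIO[layer]
    points = np.asarray(points, dtype=np.float64)
    xy = np.rint(points[:, :2] * r).astype(int)   # spatial coordinates are scaled and rounded
    z = points[:, 2:].astype(int)                 # the frame index is left unchanged
    return np.concatenate([xy, z], axis=1)

# a trajectory point at pixel (208, 128) of frame 7, centered on the conv4 map
print(to_feature_map_coords([[208, 128, 7]], "conv4"))   # -> [[13  8  7]]
```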
Layer conv1 pool1 conv2 pool2 conv3 conv4 conv5 pool5 full6 full7 full8
size 7×7 3×3 5×5 3×3 3×3 3×3 3×3 3×3 - - -
stride 2 2 2 2 1 1 1 2 - - -
channel 96 96 256 256 512 512 512 512 4096 2048 101
map size ratio 1/2 1/4 1/8 1/16 1/16 1/16 1/16 1/32 - - -
receptive field 7×7 11 × 11 27 × 27 43 × 43 75 × 75 107 × 107 139 × 139 171 × 171 - - -
Table 1. ConvNet architectures. We use architectures similar to the two-stream ConvNets [24], adapted to the task of action recognition from the Clarifai networks [41], with fewer filters in the conv4 layer (512 vs. 1024) and a lower-dimensional full7 layer (2048 vs. 4096). For the conv1 and conv2 layers, local response normalization (LRN) is applied with parameter settings n = 5, α = 5 × 10^{-4}, β = 0.75. The full6 and full7 layers are regularized using dropout, and the full8 layer acts as a soft-max classifier. The activation function for all weight layers is the rectified linear unit (ReLU). The size ratios of the feature maps with respect to the input data range from 1/2 to 1/32, and the receptive fields vary from 7 × 7 to 171 × 171 for the different convolutional and pooling layers (conv1 to pool5).
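For illustration only, here is a hedged PyTorch sketch of the convolutional tower in Table 1 (the paper itself uses Caffe, and the LRN layers are omitted here for brevity). Kernel sizes, strides, and channel counts follow the table, and every layer receives the ⌊k/2⌋ zero padding of Section 4.2, so the listed map size ratios hold exactly.

```python
import torch
import torch.nn as nn

def conv(cin, cout, k, s):
    # floor(k/2) zero padding keeps the map size ratio equal to 1 / (product of strides)
    return nn.Sequential(nn.Conv2d(cin, cout, k, stride=s, padding=k // 2), nn.ReLU(inplace=True))

def pool(k, s):
    return nn.MaxPool2d(k, stride=s, padding=k // 2)

# conv1 ... pool5 of Table 1 (channels 96-96-256-256-512-512-512-512)
features = nn.Sequential(
    conv(3, 96, 7, 2),     # conv1: ratio 1/2
    pool(3, 2),            # pool1: ratio 1/4
    conv(96, 256, 5, 2),   # conv2: ratio 1/8
    pool(3, 2),            # pool2: ratio 1/16
    conv(256, 512, 3, 1),  # conv3: ratio 1/16
    conv(512, 512, 3, 1),  # conv4: ratio 1/16
    conv(512, 512, 3, 1),  # conv5: ratio 1/16
    pool(3, 2),            # pool5: ratio 1/32
)

x = torch.zeros(1, 3, 224, 224)   # a spatial-net input frame
print(features[:6](x).shape)      # conv4 feature map: (1, 512, 14, 14), i.e. 224/16
```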
ConvNets are bottom-up architectures with a sequence of alternating convolutional and pooling layers. Different layers of a ConvNet have different receptive fields, as shown in Table 1, ranging from 7 × 7 to 171 × 171. As described in [41], these different layers capture patterns ranging from simple visual elements, such as edges, to complex visual concepts, such as parts and objects. The higher layers have larger receptive fields and obtain more invariant and discriminative features. Intuitively, these layers describe the visual content at different levels, each of which is complementary to the others for the task of recognition. We will exploit this complementary property of different layers during the extraction of TDD. Given a video $V$, we obtain a set of convolutional feature maps:

$$\mathbb{C}(V) = \{C_1^s, C_2^s, \cdots, C_M^s, C_1^t, C_2^t, \cdots, C_M^t\}, \qquad (4)$$

where $C_m^s \in \mathbb{R}^{H_m \times W_m \times L \times N_m}$ is the $m$-th feature map of the spatial net, $H_m$ is its height, $W_m$ is its width, $L$ is the video duration, and $N_m$ is the number of channels; $C_m^t \in \mathbb{R}^{H_m \times W_m \times L \times N_m}$ is the $m$-th feature map of the temporal net; and $M$ is the number of layers used for extracting TDD.

4.3. Trajectory-pooled descriptors

We now describe the method for extracting trajectory-pooled deep-convolutional descriptors (TDDs) from a set of improved trajectories $\mathbb{T}(V)$ and convolutional feature maps $\mathbb{C}(V)$ for a given video $V$. In essence, TDD is a kind of local trajectory-aligned descriptor computed in a 3D volume around the trajectory. TDDs from the spatial and temporal nets capture the appearance and motion information of this 3D volume, respectively. The size of the volume is $N \times N$ pixels and $P$ frames, where $N$ is the receptive field size and $P$ is the trajectory length. The extraction of TDD is composed of two steps: feature map normalization and trajectory pooling.

Normalization proves to be an effective strategy in designing features, partially because it can reduce the influence of illumination. It has been widely exploited in local descriptors such as SIFT [19], HOG [5], and HOF [17], and in deep learning in the form of local response normalization [14]. We apply the normalization strategy to the convolutional feature maps of the two-stream ConvNets to suppress the activation burstiness of some neurons. We design two kinds of normalization methods:

• Spatiotemporal Normalization. For spatiotemporal normalization, we normalize the feature map for each channel independently across the video spatiotemporal extent. Given a feature map $C \in \mathbb{R}^{H \times W \times L \times N}$ of Equation (4), we normalize the convolutional feature value as follows:

$$\widetilde{C}_{st}(x, y, z, n) = C(x, y, z, n) / \max V_n^{st}, \qquad (5)$$

where $\max V_n^{st}$ is the maximum value of the $n$-th feature map over the whole video spatiotemporal extent, i.e. $\max V_n^{st} = \max_{x,y,z} C(x, y, z, n)$. The spatiotemporal normalization method ensures that each convolutional feature channel ranges in the same interval, and thus contributes equally to the final TDD recognition performance.

• Channel Normalization. For channel normalization, we normalize the feature map for each pixel independently across the feature channels. We conduct channel normalization for a feature map $C \in \mathbb{R}^{H \times W \times L \times N}$ as follows:

$$\widetilde{C}_{ch}(x, y, z, n) = C(x, y, z, n) / \max V_{x,y,z}^{ch}, \qquad (6)$$

where $\max V_{x,y,z}^{ch}$ is the maximum value over the feature channels at pixel position $(x, y, z)$, i.e. $\max V_{x,y,z}^{ch} = \max_n C(x, y, z, n)$. Channel normalization makes sure that the feature value of each pixel ranges in the same interval, and lets each pixel make an equal contribution to the final representation.

After the step of feature normalization, we extract TDDs from the trajectories and the normalized convolutional feature maps by trajectory pooling. Specifically, given a trajectory $T_k$ and a normalized feature map $\widetilde{C}_m^a$, which is the $m$-th-layer feature map after either spatiotemporal normalization or channel normalization from the spatial or temporal net ($a \in \{s, t\}$), we conduct sum-pooling of the normalized feature map over the 3D volume centered at the trajectory as follows:

$$D(T_k, \widetilde{C}_m^a) = \sum_{p=1}^{P} \widetilde{C}_m^a\big(\overline{(r_m \times x_p^k)}, \overline{(r_m \times y_p^k)}, z_p^k\big), \qquad (7)$$

where $(x_p^k, y_p^k, z_p^k)$ is the position of the $p$-th point of trajectory $T_k$ in video coordinates, $r_m$ is the $m$-th-layer map size ratio with respect to the input size as listed in Table 1, and $\overline{(\cdot)}$ is the rounding operation. $D(T_k, \widetilde{C}_m^a)$ is called the trajectory-pooled deep-convolutional descriptor; it is a new kind of feature combining the merits of both improved dense trajectories and two-stream ConvNets.
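The two normalizations (Equations (5) and (6)) and the trajectory pooling of Equation (7) can be sketched in NumPy as follows. Array shapes follow the H × W × L × N convention above; the small epsilon in the denominators and the toy sizes are assumptions for illustration, not details from the paper.

```python
import numpy as np

def spatiotemporal_normalize(C, eps=1e-12):
    """Eq. (5): divide each channel by its maximum over the whole video (H, W, L)."""
    max_v = C.max(axis=(0, 1, 2), keepdims=True)   # one value per channel n
    return C / (max_v + eps)

def channel_normalize(C, eps=1e-12):
    """Eq. (6): divide each spatiotemporal position by its maximum over the N channels."""
    max_v = C.max(axis=3, keepdims=True)           # one value per position (x, y, z)
    return C / (max_v + eps)

def trajectory_pool(C_norm, trajectory, r):
    """Eq. (7): sum the normalized responses at the mapped trajectory points.

    C_norm:     normalized feature map, shape (H, W, L, N); first axis indexed by y, second by x
    trajectory: array (P, 3) of (x_p, y_p, z_p) in video coordinates
    r:          map size ratio of this layer (Table 1), e.g. 1/16 for conv4
    """
    H, W, L, N = C_norm.shape
    desc = np.zeros(N)
    for x, y, z in trajectory:
        xi = int(np.clip(np.rint(x * r), 0, W - 1))
        yi = int(np.clip(np.rint(y * r), 0, H - 1))
        zi = int(np.clip(int(z), 0, L - 1))
        desc += C_norm[yi, xi, zi, :]              # accumulate one N-dimensional response
    return desc

# toy example: a conv4-like map of a 15-frame snippet and a single trajectory
C = np.random.rand(14, 14, 15, 512)
traj = np.stack([40 + np.arange(15), 60 + np.arange(15), np.arange(15)], axis=1)
tdd = trajectory_pool(spatiotemporal_normalize(C), traj, r=1.0 / 16)
print(tdd.shape)   # (512,): one TDD per trajectory, layer, and normalization
```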
Multi-scale TDD extension. The above description of TDD extraction concerns a single scale; we now present the multi-scale extension of TDD. Improved trajectories sample points and track them on multi-scale versions of the video, while fixing the spatial extent of the HOG, HOF, and MBH descriptors to 32 × 32. The original method therefore needs to conduct point tracking and descriptor calculation at multiple scales. In our implementation, we adopt a more efficient multi-scale strategy. Specifically, we calculate optical flow and track points at a single scale. Then we construct multi-scale pyramid representations of the video frames and optical flow fields. These pyramid representations are fed into the two-stream ConvNets and transformed into multi-scale convolutional feature maps, as shown in Figure 2. Based on the multi-scale convolutional maps and the single-scale improved trajectories, we are able to compute multi-scale TDDs efficiently by applying trajectory pooling to the multi-scale convolutional feature maps as described above. The only modification for different scales is to replace the feature map size ratio $r_m$ in Equation (7) with $r_m \times s$, where $s$ is the scale of the current feature map. In practice, compared with improved trajectories, we use fewer scales, with $s = 1/2, 1/\sqrt{2}, 1, \sqrt{2}, 2$.
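A hedged sketch of the multi-scale variant: frames and flow fields are resized to each scale s and passed through the nets, and the only change at pooling time is to use r_m × s in place of r_m in Equation (7). The use of OpenCV's resize here is an assumption for illustration; any resampling routine would do.

```python
import numpy as np
import cv2

SCALES = [1 / 2, 1 / np.sqrt(2), 1, np.sqrt(2), 2]   # the five scales used for multi-scale TDD

def build_pyramid(frame, scales=SCALES):
    """Resize one frame (or flow field) to every scale of the pyramid."""
    return [cv2.resize(frame, None, fx=s, fy=s, interpolation=cv2.INTER_LINEAR)
            for s in scales]

def effective_ratio(r_m, s):
    """On the feature map computed at scale s, a single-scale trajectory point x_p
    lands at (r_m * s) * x_p, i.e. Eq. (7) with r_m replaced by r_m * s."""
    return r_m * s

frame = np.zeros((240, 320, 3), dtype=np.uint8)
print([im.shape[:2] for im in build_pyramid(frame)])   # from half to double resolution
print(effective_ratio(1.0 / 16, np.sqrt(2)))           # conv4 ratio at scale sqrt(2)
```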
5. Experiments

In this section, we first present the details of the datasets and their evaluation scheme. Then, we describe the details of our method. Finally, we give the experimental results and compare TDD with the state of the art.

5.1. Datasets

In order to verify the effectiveness of TDDs, we conduct experiments on two public large datasets, namely HMDB51 [15] and UCF101 [26]. The HMDB51 dataset is a large collection of realistic videos from various sources, including movies and web videos. The dataset is composed of 6,766 video clips from 51 action categories, with each category containing at least 100 clips. Our experiments follow the original evaluation scheme using three different training/testing splits. In each split, each action class has 70 clips for training and 30 clips for testing. The average accuracy over these three splits is used to measure the final performance.

The UCF101 dataset contains 101 action classes and there are at least 100 video clips for each class. The whole dataset contains 13,320 video clips, which are divided into 25 groups for each action category. We follow the evaluation scheme of the THUMOS13 challenge [11] and adopt the three training/testing splits for evaluation. As UCF101 is larger than HMDB51, we use the UCF101 dataset to train the two-stream ConvNets initially, and transfer this learned model for TDD extraction on the HMDB51 dataset.

5.2. Implementation details

Two-stream ConvNets training. Training deep ConvNets is more challenging for action recognition, as actions are more complex than objects and the available datasets are extremely small compared with the ImageNet dataset [6]. We choose the training set of UCF101 split 1 for learning the two-stream ConvNets, as it is probably the largest publicly available dataset. We use the Caffe toolbox [10] for ConvNet implementation. The network weights are learned using mini-batch (set to 256) stochastic gradient descent with momentum (set to 0.9). For the spatial net, we first resize the frame to make the smaller side 256, and then randomly crop a 224 × 224 region from the frame. It then undergoes random horizontal flipping. We pre-train the network with the publicly available model [4]. Finally, we fine-tune the model parameters on the UCF101 dataset, where the learning rate is set to $10^{-2}$, decreased to $10^{-3}$ after 14K iterations, and training is stopped at 20K iterations.

For the temporal net, the input is a 3D volume of stacked optical flow fields. We choose the TVL1 optical flow algorithm [40] and use the OpenCV implementation, due to its balance between accuracy and efficiency. For fast computation, we discretize the values of the optical flow fields into integers and set their range to 0-255, just like images. Specifically, we choose to stack 10 frames of optical flow fields to keep a balance between performance and efficiency. We train the temporal net on UCF101 from scratch. As the dataset is relatively small, we use a high dropout ratio to improve the generalization capacity of the trained model: we set dropout to 0.9 for the full6 layer and 0.8 for the full7 layer. The training procedure of the temporal net is similar to that of the spatial net; a 224 × 224 × 20 sub-volume is randomly cropped and flipped from the training video. The learning rate is initially set to $10^{-2}$ and decreased to $10^{-3}$ after 50K iterations. It is then reduced to $10^{-4}$ after 70K iterations and training is stopped at 90K iterations.
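The flow preprocessing just described can be sketched as follows. The linear mapping used to discretize the flow into 0-255 (clipping at ±20 pixels, with zero motion at 128) is an assumption for illustration; the paper only states that flow values are discretized into that range.

```python
import numpy as np

def quantize_flow(flow, bound=20.0):
    """Map flow values from [-bound, bound] px to integers in 0..255 (128 = zero motion).

    The exact mapping is an assumption; the paper only says flow is discretized to 0-255.
    """
    scaled = np.clip(flow, -bound, bound) / (2 * bound) + 0.5   # -> [0, 1]
    return np.round(scaled * 255).astype(np.uint8)

def stack_flows(flows, t, F=10):
    """Build the 2F-channel temporal-net input starting at frame t (u and v of F flows)."""
    chunk = [quantize_flow(f) for f in flows[t:t + F]]
    return np.concatenate(chunk, axis=2)                        # shape (H, W, 2F)

flows = [np.random.uniform(-5, 5, size=(224, 224, 2)).astype(np.float32) for _ in range(12)]
volume = stack_flows(flows, t=0, F=10)
print(volume.shape, volume.dtype)    # (224, 224, 20) uint8
```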
Results of two-stream ConvNets. To evaluate the trained model, as in [24], we select 25 frames for each video clip and obtain 10 crops for each frame. The final recognition result is the average over these crops and frames. We obtain 71.2% recognition accuracy with the spatial net and 80.1% with the temporal net. The performance of our implemented two-stream ConvNets is 84.7%, which is similar to that of the original two-stream ConvNets [24] (85.6%). However, obtaining ConvNets with the highest possible performance is not the final goal of this paper; we aim to verify the effectiveness of TDDs.

Figure 3. Exploration of different settings of TDD on the HMDB51 dataset. Left: performance trend (accuracy) with varying PCA-reduced dimension (32, 64, 128, 256). Right: comparison of different normalization methods (no normalization, channel normalization, spatiotemporal normalization, and their combination). "Combine" means the fusion of spatiotemporal normalization and channel normalization.

Algorithm  HMDB51  UCF101
HOG [31, 32]  40.2%  72.4%
MBH [31, 32]  52.1%  80.8%
HOF+MBH [31, 32]  54.7%  82.2%
iDT [31, 32]  57.2%  84.7%
Spatial net [24]  40.5%  73.0%
Temporal net [24]  54.6%  83.7%
Two-stream ConvNets [24]  59.4%  88.0%
Spatial conv4  48.5%  81.9%
Spatial conv5  47.2%  80.9%
Spatial conv4 and conv5  50.0%  82.8%
Temporal conv3  54.5%  81.7%
Temporal conv4  51.2%  80.1%
Temporal conv3 and conv4  54.9%  82.2%
TDD  63.2%  90.3%
TDD and iDT  65.9%  91.5%
Table 3. Performance of TDD on the HMDB51 and UCF101 datasets. We compare our proposed TDD with iDT features [31] and two-stream ConvNets [24]. We also explore the complementary properties of TDD features and iDT features; the combination of them further boosts the performance.
Feature encoding. We choose the Fisher vector [23] to encode the TDDs of a video clip into a high-dimensional representation, as its effectiveness for action recognition has been verified in previous works [38, 27], and then use a linear SVM as the classifier (C = 100). In order to train the GMMs, we first de-correlate TDD with PCA and reduce its dimension to D. Then, we train a GMM with K (K = 256) mixtures, and finally the video is represented by a 2KD-dimensional vector.
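A hedged sketch of this encoding step, using scikit-learn for the PCA and the diagonal-covariance GMM together with the standard improved Fisher vector of [23] (gradients with respect to the means and variances, followed by power and L2 normalization). This is a generic illustration rather than the authors' code, and toy sizes are used so that it runs quickly; the paper itself uses D = 64 and K = 256, i.e. a 2KD = 32,768-dimensional vector per layer and normalization.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

def fisher_vector(X, gmm):
    """Improved Fisher vector of local descriptors X (T x D) under a diagonal GMM."""
    T, D = X.shape
    q = gmm.predict_proba(X)                      # soft assignments, T x K
    w, mu = gmm.weights_, gmm.means_              # (K,), (K, D)
    sigma = np.sqrt(gmm.covariances_)             # (K, D) diagonal standard deviations
    diff = (X[:, None, :] - mu[None, :, :]) / sigma[None, :, :]        # T x K x D
    g_mu = (q[:, :, None] * diff).sum(0) / (T * np.sqrt(w)[:, None])   # gradient w.r.t. means
    g_sig = (q[:, :, None] * (diff ** 2 - 1)).sum(0) / (T * np.sqrt(2 * w)[:, None])
    fv = np.concatenate([g_mu.ravel(), g_sig.ravel()])                 # length 2KD
    fv = np.sign(fv) * np.sqrt(np.abs(fv))        # power normalization
    return fv / (np.linalg.norm(fv) + 1e-12)      # L2 normalization

# toy setting: 500 TDDs of dimension 512, reduced by PCA to D=16, GMM with K=8
tdds = np.random.rand(500, 512)
D, K = 16, 8
pca = PCA(n_components=D).fit(tdds)
gmm = GaussianMixture(n_components=K, covariance_type="diag", max_iter=20,
                      random_state=0).fit(pca.transform(tdds))
video_fv = fisher_vector(pca.transform(tdds), gmm)
print(video_fv.shape)   # (2*K*D,) = (256,); with D=64, K=256 this would be 32,768 dims
```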
5.3. Exploration experiments

Dimension reduction. To specify the PCA dimension of TDD for GMM training and Fisher vector encoding, we first explore different dimensions reduced by PCA on the HMDB51 dataset, with conv4 descriptors from the spatial net. In this exploration experiment, we use the spatiotemporal normalization method for TDD, and the results are shown in the left of Figure 3. We vary the dimension from 32 to 256, and the results show that dimension 64 achieves the highest performance, while higher dimensions may cause performance degradation. Thus, we fix the dimension to 64 for TDDs in the remainder of this section.

Normalization method. Another important component in TDD design is the normalization method, and we have presented two normalization methods in Section 4.3: spatiotemporal normalization (ST. Norm.) and channel normalization (Cha. Norm.). We conduct experiments to investigate the effectiveness of the normalization methods by using conv4 descriptors from the spatial net on the HMDB51 dataset, and the results are shown in the right of Figure 3. We see that normalization is important for improving performance and that spatiotemporal normalization is the best choice. We also explore the complementary property of these two normalization methods by fusing their Fisher vectors, and observe that this further improves the performance. Therefore, in the remainder of this section, we use the combined representation obtained from these two normalization methods for TDDs.

Different layers. Finally, we investigate the performance of TDDs from different layers of the spatial and temporal nets on the HMDB51 dataset; the results are summarized in Table 2. For the conv5, conv4, and conv3 layers, we use the outputs of the ReLU activations, and for the conv2 and conv1 layers, we choose the outputs of the max pooling layers after the convolution operations. We see that the descriptors of layers conv4 and conv5 obtain the highest recognition performance for the spatial net, while those of layers conv3 and conv4 are the top performers for the temporal net. Therefore, in the following evaluation of TDD, we choose the descriptors from the conv4 and conv5 layers for spatial nets, and the conv3 and conv4 layers for temporal nets.

5.4. Evaluation of TDDs

In this section, we evaluate the performance of our proposed TDDs on the HMDB51 and UCF101 datasets; the experimental results are summarized in Table 3. We first compare the performance of TDDs with that of improved trajectories. The convolutional descriptors of the spatial net are much better than HOG descriptors, which indicates that deep-learned features contain more discriminative capacity than hand-crafted features.
Spatial ConvNets Temporal ConvNets
Convolutional layer conv1 conv2 conv3 conv4 conv5 conv1 conv2 conv3 conv4 conv5
Recognition accuracy 24.1% 33.9% 41.9% 48.5% 47.2% 39.2% 50.7% 54.5% 51.2% 46.1%
Table 2. The performance of different layers of spatial nets and temporal nets on the HMDB51 dataset.
Figure 4. Examples of video frames, optical flow fields, and their corresponding feature maps of spatial nets and temporal nets: (a) RGB, (b) Flow-x, (c) Flow-y, (d) S-conv4, (e) S-conv5, (f) T-conv3, (g) T-conv4.
For the convolutional descriptors of the temporal net, they are better than or comparable to the HOF and MBH descriptors, but the improvement is not as evident as for the spatial convolutional descriptors. The reason may be that HOF and MBH are computed from warped optical flow instead of the original optical flow, which has been shown to be very effective for the HOF descriptor [31]. We consider using warped flow for TDD extraction in the future.

We also compare the performance of TDDs with the two-stream ConvNets. Although our trained two-stream ConvNets obtain slightly lower performance than theirs, our spatial TDDs outperform the spatial nets by a large margin, and our temporal TDDs are comparable to their temporal net. These results indicate that trajectory-constrained sampling and pooling is an effective strategy for improving recognition performance, in particular for spatial TDDs. We also notice that the combined TDDs from the spatial and temporal nets outperform the two-stream ConvNets by around 4% and 2% on the two datasets, respectively. We also show some examples of video frames, optical flow fields, and their corresponding feature maps in Figure 4. From these examples, we see that the convolutional feature maps are relatively sparse and exhibit high correlation with the action areas.

Finally, we explore a practical way to improve the recognition performance of an action recognition system by combining TDDs with iDTs, using early fusion of their Fisher vector representations. The recognition results are shown in Table 3; the fusion of the two further boosts the performance. This further improvement indicates that our TDDs are complementary to those low-level local features.

Computational costs. Compared with iDT, we only track points on a single scale and extract the original flow instead of warped flow. The ConvNets are implemented in CUDA, so the computation is very efficient.

HMDB51 | UCF101
STIP+BoVW [15]  23.0% | STIP+BoVW [26]  43.9%
Motionlets [35]  42.1% | Deep Net [12]  63.3%
DT+BoVW [30]  46.6% | DT+VLAD [3]  79.9%
DT+MVSV [3]  55.9% | DT+MVSV [3]  83.5%
iDT+FV [31]  57.2% | iDT+FV [32]  85.9%
iDT+HSV [21]  61.1% | iDT+HSV [21]  87.9%
Two Stream [24]  59.4% | Two Stream [24]  88.0%
TDD+FV  63.2% | TDD+FV  90.3%
Our best result  65.9% | Our best result  91.5%
Table 4. Comparison of TDD to the state of the art. We separately present the results of TDDs and our best results obtained with early fusion of TDDs and iDTs.

5.5. Comparison to the state of the art

Table 4 compares our recognition results with several recently published methods on the HMDB51 and UCF101 datasets. TDDs outperform these previous methods on both datasets. On the HMDB51 dataset, our best result outperforms the other methods by 4.8%, and on the UCF101 dataset, our best result outperforms them by 3.5%. This superior performance of TDDs indicates the effectiveness of introducing trajectory-constrained sampling and pooling into deep-learned features.

6. Conclusions

This paper has proposed an effective video representation, called trajectory-pooled deep-convolutional descriptor (TDD), which integrates the advantages of hand-crafted and deep-learned features. Deep architectures are utilized to learn discriminative convolutional feature maps, and then
the strategies of trajectory-constrained sampling and pooling are adopted to aggregate these convolutional features into TDDs. Our features achieve superior performance on two datasets for action recognition, as evidenced by comparison with the state-of-the-art methods.

Acknowledgement

This work is supported by a donation of Tesla K40 GPU from NVIDIA Corporation. Limin Wang is supported by Hong Kong PhD Fellowship. Yu Qiao is the corresponding author and supported by National Natural Science Foundation of China (91320101, 61472410), Shenzhen Basic Research Program (JCYJ20120903092050890, JCYJ20120617114614438, JCYJ20130402113127496), 100 Talents Program of CAS, and Guangdong Innovative Research Team Program (No.201001D0104648280).

References

[1] J. K. Aggarwal and M. S. Ryoo. Human activity analysis: A review. ACM Comput. Surv., 43(3):16, 2011. 1
[2] H. Bay, T. Tuytelaars, and L. J. V. Gool. SURF: Speeded up robust features. In ECCV, 2006. 4
[3] Z. Cai, L. Wang, X. Peng, and Y. Qiao. Multi-view super vector for action recognition. In CVPR, 2014. 8
[4] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: Delving deep into convolutional nets. In BMVC, 2014. 6
[5] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005. 5
[6] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and F. Li. ImageNet: A large-scale hierarchical image database. In CVPR, 2009. 4, 6
[7] P. Dollár, V. Rabaud, G. Cottrell, and S. Belongie. Behavior recognition via sparse spatio-temporal features. In VS-PETS, 2005. 1, 2
[8] M. A. Fischler and R. C. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM, 24(6), 1981. 4
[9] S. Ji, W. Xu, M. Yang, and K. Yu. 3D convolutional neural networks for human action recognition. TPAMI, 35(1), 2013. 2, 3
[10] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. B. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. CoRR, abs/1408.5093. 6
[11] Y.-G. Jiang, J. Liu, A. Roshan Zamir, I. Laptev, M. Piccardi, M. Shah, and R. Sukthankar. THUMOS challenge: Action recognition with a large number of classes, 2013. 2, 6
[12] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, 2014. 2, 3, 8
[13] A. Kläser, M. Marszalek, and C. Schmid. A spatio-temporal descriptor based on 3D-gradients. In BMVC, 2008. 2
[14] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012. 2, 3, 5
[15] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. HMDB: A large video database for human motion recognition. In ICCV, 2011. 1, 2, 6, 8
[16] I. Laptev. On space-time interest points. IJCV, 64(2-3), 2005. 1, 2
[17] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. In CVPR, 2008. 1, 2, 5
[18] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. In ISP. IEEE Press, 2001. 1
[19] D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2), 2004. 5
[20] A. Patron-Perez, M. Marszalek, I. Reid, and A. Zisserman. Structured learning of human interactions in TV shows. TPAMI, 34(12), 2012. 1
[21] X. Peng, L. Wang, X. Wang, and Y. Qiao. Bag of visual words and fusion methods for action recognition: Comprehensive study and good practice. CoRR, abs/1405.4506, 2014. 8
[22] S. Sadanand and J. J. Corso. Action bank: A high-level representation of activity in video. In CVPR, 2012. 2
[23] J. Sánchez, F. Perronnin, T. Mensink, and J. J. Verbeek. Image classification with the Fisher vector: Theory and practice. IJCV, 105(3), 2013. 7
[24] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, 2014. 1, 2, 3, 4, 5, 6, 7, 8
[25] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014. 3
[26] K. Soomro, A. R. Zamir, and M. Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. CoRR, abs/1212.0402, 2012. 1, 2, 6, 8
[27] C. Sun and R. Nevatia. Large-scale web video event classification by use of Fisher vectors. In WACV, 2013. 7
[28] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. CoRR, abs/1409.4842, 2014. 3
[29] G. W. Taylor, R. Fergus, Y. LeCun, and C. Bregler. Convolutional learning of spatio-temporal features. In ECCV, 2010. 2, 3
[30] H. Wang, A. Kläser, C. Schmid, and C.-L. Liu. Dense trajectories and motion boundary descriptors for action recognition. IJCV, 103(1), 2013. 1, 2, 3, 8
[31] H. Wang and C. Schmid. Action recognition with improved trajectories. In ICCV, 2013. 1, 2, 3, 7, 8
[32] H. Wang and C. Schmid. LEAR-INRIA submission for the THUMOS workshop. In ICCV Workshop on THUMOS Challenge, 2013. 2, 7, 8
[33] H. Wang, M. M. Ullah, A. Kläser, I. Laptev, and C. Schmid. Evaluation of local spatio-temporal features for action recognition. In BMVC, 2009. 2
[34] L. Wang, Y. Qiao, and X. Tang. Mining motion atoms and phrases for complex action recognition. In ICCV, 2013. 2
[35] L. Wang, Y. Qiao, and X. Tang. Motionlets: Mid-level 3D parts for human motion recognition. In CVPR, 2013. 1, 2, 8
[36] L. Wang, Y. Qiao, and X. Tang. Latent hierarchical model of temporal structure for complex activity classification. TIP, 23(2), 2014. 1
[37] L. Wang, Y. Qiao, and X. Tang. Video action detection with relational dynamic-poselets. In ECCV, 2014. 2
[38] X. Wang, L. Wang, and Y. Qiao. A comparative study of encoding, pooling and normalization methods for action recognition. In ACCV, 2012. 7
[39] G. Willems, T. Tuytelaars, and L. J. V. Gool. An efficient dense and scale-invariant spatio-temporal interest point detector. In ECCV, 2008. 2
[40] C. Zach, T. Pock, and H. Bischof. A duality based approach for realtime TV-L1 optical flow. In 29th DAGM Symposium on Pattern Recognition, 2007. 6
[41] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In ECCV, 2014. 3, 4, 5
[42] J. Zhu, B. Wang, X. Yang, W. Zhang, and Z. Tu. Action recognition with actons. In ICCV, 2013. 2