
Convolutional Neural Networks with Generalized Attentional Pooling for Action Recognition


Yunfeng Wang∗, Wengang Zhou†, Qilin Zhang‡ and Houqiang Li§
∗†§ Department of Electronic Engineering and Information Science, University of Science and Technology of China, Hefei, China
‡ Highly Automated Driving, HERE Technologies, Chicago, Illinois, USA

Email: [email protected], † [email protected], ‡ [email protected], § [email protected]


Abstract—Inspired by the recent advances in attentional pooling techniques for image classification and action recognition tasks, we propose the Generalized Attentional Pooling (GAP) based Convolutional Neural Network (CNN) algorithm for action recognition in still images. The proposed GAP-CNN can be formulated as a new approximation of the second-order/bilinear pooling techniques widely used in fine-grained image classification. Unlike the existing rank-1 approximation, a generalized factoring (with non-linear functions) is introduced to exploit the intrinsic structural information of the sample covariance matrices of convolutional layer outputs. Without requiring preprocessing steps such as object (e.g., human body) bounding box detection, the proposed GAP-CNN automatically focuses on the most informative parts of still images. With the additional guidance of human pose keypoints, the proposed GAP-CNN algorithm achieves state-of-the-art action recognition accuracy on the large-scale MPII still image dataset.

Index Terms—Action Recognition, Generalized Attentional Pooling, Convolutional Neural Network

I. INTRODUCTION

Human action recognition is a fundamental and well explored research area in computer vision, owing to its widespread applications in human-computer interaction, surveillance and game control. Traditional methods are based on handcrafted features, such as dense trajectories [1], object detection [2] or context mining [3] in image content. Recent convolutional neural network (CNN) based approaches have achieved impressive performance in action recognition with both still images and videos. Among them, multi-stream CNN methods such as "Two-Stream" [4] and its derivatives [5], [6] are among the top performers on the UCF101 [7] and HMDB51 [8] video action recognition datasets. Currently, the ResNet-101 based attentional pooling method [9] holds the record for the highest action recognition accuracy in still images on the MPII [10] dataset.

Previously, in still image action recognition, it was the norm to feed entire images to a CNN for classification. Later, the hard attention concept was introduced: fine-grained features are extracted around human bounding boxes or human pose keypoints and subsequently fed to CNNs for action classification [3]. Despite their performance advantages over standard full-image based CNNs, hard attention based CNNs suffer from significantly higher computational complexity due to the extra human bounding box detection step. Worse still, the required manual labeling of such bounding boxes in training data is prohibitively time-consuming and potentially expensive.

The pooling layer is an indispensable component of a modern CNN. Popular pooling algorithms include mean pooling and max pooling, both of which are first-order pooling (pooling operates on the feature map/matrix itself). Alternatively, second-order pooling (pooling operates on the sample covariance matrix of the feature map/matrix) is advocated in [11], especially in applications such as semantic segmentation and fine-grained image classification. In [9], an evolved variant of second-order pooling is proposed, with a low-rank approximation and a reformulation as attentional pooling. However, it assumes a rank-1 approximation of the weight matrix, which is arguably too restrictive and could potentially lead to performance penalties.

Inspired by [9], we propose a generalized factoring scheme (with additional non-linear functions) of the weight matrix, to exploit the intrinsic structural information of the sample covariance matrices of convolutional layer outputs. With the proposed factoring scheme, the weight matrix of a pooling layer is approximated by a top-down vector, a bottom-up vector and multiple bottom-up matrices. Parameters such as the optimal number of bottom-up matrices are empirically determined via cross validation. By incorporating extra supervision in the form of human pose keypoints, our proposed Generalized Attentional Pooling (GAP) based CNN+Pose (GAP-CNN+Pose) method achieves even better results than the original attentional pooling [9] on the large-scale MPII still image action recognition dataset, indicating that GAP is complementary to hard attention.

The primary contribution of this paper is a new, generalized factoring/approximation of the weight matrix in the second-order pooling layer of a CNN, with an action recognition application on the large-scale MPII still image dataset.

This work was supported in part to Dr. Houqiang Li by the 973 Program under contract No. 2015CB351803 and NSFC under contract No. 61390514, and in part to Dr. Wengang Zhou by NSFC under contracts No. 61472378 and No. 61632019, the Fundamental Research Funds for the Central Universities, and the Young Elite Scientists Sponsorship Program by CAST (2016QNRC001).
Fig. 1. Overview of the proposed GAP-CNN+Pose algorithm. Input images are fed into a ResNet-101 CNN (with the last pooling layer removed) to generate the feature map/matrix X. Subsequently, two types of attention are imposed on the feature map, following [9]. The top branch denotes the top-down attention (i.e., class-specific attention), which is constructed by multiplying the feature map/matrix X with a list of class-dependent vectors a_1, a_2, ..., a_K. On the bottom branch of the architecture, a series of T class-agnostic matrices U_1, ..., U_T are multiplied after nonlinear transformations f(·), e.g., the rectified linear unit (ReLU), followed by a class-agnostic vector c, to represent the bottom-up attention, i.e., the saliency-based attention. The additional human pose information is incorporated via pose heatmaps and an ℓ2 regression loss.

II. RELATED WORK

Visual recognition has been widely studied in recent years, with both still image datasets and video datasets [1], [3], [12]–[18]. For large-scale still image action recognition datasets such as MPII [10] and HICO [19], the performance of popular baseline methods is unimpressive, e.g., about 30% mAP on the MPII dataset. Owing to the extremely large number of classes (393 and 600 classes for MPII and HICO, respectively) as well as their high diversity¹, it is highly challenging to achieve high recognition accuracy on such datasets. In contrast, popular video based action recognition datasets like UCF101 [7] and HMDB51 [8] are comparatively much smaller, with only 101 and 51 categories, respectively.

¹ In addition, it can be ambiguous to determine an action class from a still image without temporal cues, e.g., "sit down" versus "stand up".

In this paper we focus on action recognition with still images. R*CNN [3] is a recent work in this field, in which R-CNN is adapted to include one primary region and multiple proposal regions. The proposal region with the highest score is selected to cooperate with the primary region to recognize the action in an image. Assisted with bounding boxes of the subject (e.g., human), R*CNN achieves good results on the MPII dataset [10].

The most related work is [9], in which a rank-1 approximation of the weight matrix is proposed and attentional pooling is reformulated as low-rank second-order pooling. In [9], the attentional pooling serves as a drop-in replacement for the popular mean pooling or max pooling near the end of a CNN. In contrast with [9], the proposed GAP extends the rank-1 approximation to a series of generalized non-linear factorings, and GAP can be incorporated after any layer in a CNN.

III. FORMULATION

The proposed GAP architecture is illustrated in Fig. 1. Let X ∈ R^{n×f} denote the reshaped output feature of a given layer, where n is the total number of spatial elements in the feature map, i.e., the product of the width and height of the feature map, and f is the number of channels. Conventional first-order mean pooling and binary classification score computation can be formulated as

    score^{bin}_{order1}(X) = (1/n) 1^T X w,    (1)

with (1/n) X^T 1 being the mean-pooled feature and w being an f × 1 scoring weight vector.

Correspondingly, let the matrix W ∈ R^{f×f} denote the scoring weight matrix after a second-order pooling layer [11]. Following [9], the binary classification score is obtained by

    score^{bin}_{order2}(X) = Tr(X^T X W),    (2)

where X ∈ R^{n×f} and Σ := X^T X is traditionally termed the sample covariance matrix². Substitution of Σ into Eq. (2) yields

    score^{bin}_{order2}(X) = Tr(ΣW) = ∑_{i,j} Σ_{i,j} W_{i,j},    (3)

where Σ, W ∈ R^{f×f}. From Eq. (3), the matrix W can be interpreted as the element-wise weights of the sample covariance matrix Σ.

² Sometimes the sample mean values are subtracted before computing such a sample covariance matrix. A constant factor of 1/(n−1) can also be included in the definition of Σ.
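For concreteness, the following sketch contrasts the first-order score of Eq. (1) with the second-order score of Eqs. (2)–(3) on a random feature map; the dimensions and variable names are illustrative choices of ours, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, f = 49, 8                      # illustrative sizes: 7x7 spatial grid, 8 channels
X = rng.normal(size=(n, f))       # reshaped feature map, one row per spatial location

# First-order (mean) pooling score, Eq. (1)
w = rng.normal(size=(f, 1))                          # f x 1 scoring weight vector
score_order1 = (np.ones((n, 1)).T @ X @ w / n).item()

# Second-order pooling score, Eqs. (2)-(3)
W = rng.normal(size=(f, f))                          # f x f element-wise weights on Sigma
Sigma = X.T @ X                                      # sample covariance matrix (no centering)
score_order2 = np.trace(Sigma @ W)

# Eq. (3): since Sigma is symmetric, the trace equals the element-wise weighted sum
assert np.isclose(score_order2, np.sum(Sigma * W))
print(score_order1, score_order2)
```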
Unlike the highly restrictive rank-1 approximation of W (W := ab^T) in [9], we propose a gentler regularization by setting

    W := a f(V c)^T,    a ∈ R^{f×1}, c ∈ R^{r×1}, V ∈ R^{f×r},    (4)

where f(·) is an element-wise nonlinear transform function that keeps the output dimension equal to the input dimension, e.g., the rectified linear unit (ReLU). In addition, V can be further factorized into T matrices as

    V = ∏_{t=1}^{T} f(U_t) = f(U_1) f(U_2) · · · f(U_T),    (5)

where U_1 ∈ R^{f×r_1}, U_2 ∈ R^{r_1×r_2}, · · · , U_T ∈ R^{r_{T−1}×r}.
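As an illustration of Eqs. (4)–(5), the sketch below assembles W from a top-down vector a, bottom-up matrices U_1, ..., U_T and a bottom-up vector c, using ReLU as f(·); the sizes (f = 8, T = 3) and variable names are our own assumptions for the example.

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)   # the element-wise nonlinearity f(.)

rng = np.random.default_rng(1)
f_dim, r1, r2, r = 8, 16, 8, 4        # illustrative channel/bottleneck sizes, T = 3

# Bottom-up matrices U_1, ..., U_T and bottom-up vector c (class-agnostic)
U1 = rng.normal(size=(f_dim, r1))
U2 = rng.normal(size=(r1, r2))
U3 = rng.normal(size=(r2, r))
c  = rng.normal(size=(r, 1))

# Top-down (class-specific) vector a
a = rng.normal(size=(f_dim, 1))

# Eq. (5): V is the product of nonlinearly transformed bottom-up matrices
V = relu(U1) @ relu(U2) @ relu(U3)    # shape (f_dim, r)

# Eq. (4): the generalized factoring of the scoring weight matrix
W = a @ relu(V @ c).T                 # shape (f_dim, f_dim)
print(W.shape)                        # (8, 8)
```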
By the introduction of the matrix factorizations and non-linear functions in Eq. (4)–(5), more structural information in the sample covariance matrix Σ could potentially be exploited. Practically, such factorization and nonlinearity are implemented as convolutional and ReLU layers, respectively. The optimal value of T is empirically determined to balance performance and model complexity (more details are presented in Section IV). Substituting Eq. (4)–(5) into Eq. (2) yields a reformulation as the attentional score,

    score^{bin}_{att}(X) = Tr(X^T X f(V c) a^T)    (6)
                         = (X a)^T (X f(V c)).    (7)

Eq. (7) indicates that the score can be seen as the inner product of two attentional heatmaps.
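A quick numerical check of the identity between Eqs. (6) and (7), continuing the illustrative setup above (variable names and sizes are again our own assumptions):

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)
rng = np.random.default_rng(2)

n, f_dim, r = 49, 8, 4
X = rng.normal(size=(n, f_dim))       # reshaped feature map
a = rng.normal(size=(f_dim, 1))       # top-down (class-specific) vector
V = rng.normal(size=(f_dim, r))       # bottom-up matrix (or a product as in Eq. (5))
c = rng.normal(size=(r, 1))           # bottom-up vector

# Eq. (6): trace form with W := a f(Vc)^T
score_trace = np.trace(X.T @ X @ relu(V @ c) @ a.T)

# Eq. (7): inner product of a top-down and a bottom-up attention heatmap
top_down  = X @ a                     # n x 1 class-specific heatmap
bottom_up = X @ relu(V @ c)           # n x 1 saliency-based heatmap
score_attn = (top_down.T @ bottom_up).item()

assert np.isclose(score_trace, score_attn)
```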
Similarly, such derivations can be extended to a K-class (K ≥ 3) classifier. Let W_k be the class-specific weight matrix for class k, k = 1, · · · , K. Eq. (2) can then be rewritten as

    score^{Kclass}_{order2}(X, k) = Tr(X^T X W_k),    (8)

with W_k ∈ R^{f×f}. Parallel to Eq. (6)–(7), letting W_k := a_k f(V c)^T (note that V and c are bottom-up parameters and thus class-agnostic), the class-specific attentional pooling and scoring is obtained as

    score^{Kclass}_{att}(X, k) = (X a_k)^T (X f(V c)).    (9)

In Eq. (9), the former term X a_k represents the class-specific, top-down attentional feature map, while the latter term X f(V c) denotes the saliency-based, class-agnostic bottom-up attentional feature map. As advocated in [20] and [9], the fusion of top-down and bottom-up attention maps is biologically motivated, and it is beneficial to modulate saliency maps with class-specific top-down information.

Following [9], human pose regularization can contribute to the action recognition accuracy. Therefore, we incorporate human body keypoint heatmaps and use them as a regularization term alongside the cross-entropy loss in Fig. 1. Specifically, two additional convolutional layers are added after the last layer of the ResNet-101 CNN, together with a 16-channel regression layer to predict the pose keypoints. An ℓ2 loss is used to calculate the cost between the predicted heatmaps and the ground-truth heatmaps. The overall loss is the weighted sum of this ℓ2 loss and a cross-entropy loss, making it possible to optimize the entire GAP-CNN+Pose network in an end-to-end manner.
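A minimal sketch of how such a combined objective could be wired up in PyTorch; the loss weight of 10^−6 is the value selected later in Section IV, while the tensor shapes, function name and the use of mean squared error as the ℓ2 regression term are our own assumptions rather than the authors' released code.

```python
import torch
import torch.nn.functional as F

def gap_cnn_pose_loss(class_logits, labels, pred_heatmaps, gt_heatmaps,
                      pose_weight=1e-6):
    # class_logits: (B, K) action scores; labels: (B,) ground-truth class indices
    # pred_heatmaps / gt_heatmaps: (B, 16, H, W) keypoint heatmaps
    ce_loss = F.cross_entropy(class_logits, labels)       # action classification loss
    pose_l2 = F.mse_loss(pred_heatmaps, gt_heatmaps)      # L2 regression on pose heatmaps
    return ce_loss + pose_weight * pose_l2                # weighted sum, trained end-to-end

# Toy usage with random tensors (393 MPII action classes, 16 keypoint channels)
B, K, H, W = 4, 393, 14, 14
logits = torch.randn(B, K, requires_grad=True)
loss = gap_cnn_pose_loss(logits, torch.randint(0, K, (B,)),
                         torch.randn(B, 16, H, W), torch.randn(B, 16, H, W))
loss.backward()
```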
IV. EXPERIMENTS

Dataset. In this section, experiments are conducted on a challenging large-scale action recognition dataset, the MPII still image dataset [10]. The MPII human pose dataset contains 15205 images in 393 action classes, grouped into a train split, a validation split and a test split with 8218, 6987 and 5708 images, respectively. The dataset is also annotated with ground-truth human body keypoints. We use mean average precision (mAP) and classification accuracy as the criteria to evaluate the performance of competing methods.

Weight of Pose Regularization. Cross validation experiments are conducted to empirically determine the optimal weight of the ℓ2 regularization loss from pose keypoints. Without loss of generality, the weight of the cross-entropy loss is fixed at the constant value 1, and the weight of the pose regularization loss is varied from 1 to 10^−8, as shown in Fig. 2. From Fig. 2, we observe that the mAP is insensitive to the choice of weight value for the pose regularization loss. The highest mAP is achieved with the weight at approximately 10^−6, thus 10^−6 is chosen and fixed throughout the rest of the paper.

Fig. 2. Illustration of mAP with respect to varying weights for the pose ℓ2 regularization loss, based on the validation split of the MPII dataset. The x-axis is on an inverted logarithmic scale while the y-axis is on a linear scale.

Number of Bottom-up Matrices. In this part we show the experiments designed to determine the optimal number of bottom-up matrices, i.e., T in Eq. (5). Since convolution operations in CNNs are implemented by matrix multiplication, we take advantage of the existing convolution layers to implement the matrix multiplication operations. We set r_1 = 4096 and determine the remaining values by induction as r_{i+1} = r_i/2, i = 1, · · · , T − 2. We use the convolutional layer C_i with input r_{i−1} and output r_i to represent U_i. ReLU layers are added between such convolution layers. Recognition accuracy and mAP are used as criteria in the choice of T, based on the validation split of the MPII dataset, as shown in Fig. 3. We observe that both criteria reach a plateau for T over 3. To keep the number of such convolutional layers as small as possible (for computational efficiency), T is fixed at 3 in the rest of this paper.

Fig. 3. Illustration of recognition accuracies and mAP with different T values, based on the validation split of the MPII dataset. Both mAP and accuracy reach a plateau for T over 3; T = 3 is the choice that maximizes mAP and accuracy.
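The bottom-up branch described above could be realized roughly as follows in PyTorch, using 1×1 convolutions for the U_i with ReLU in between and a final 1×1 convolution playing the role of c. The class name, the input channel count of 2048 (ResNet-101's last feature map) and the exact channel schedule are our assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class BottomUpAttention(nn.Module):
    """Sketch of the bottom-up branch: T 1x1 convolutions (the U_i) with ReLU
    between them, followed by a 1x1 convolution to one channel (the vector c)."""
    def __init__(self, in_channels=2048, r1=4096, T=3):
        super().__init__()
        layers, channels = [], in_channels
        widths = [r1 // (2 ** i) for i in range(T)]   # r1, r1/2, r1/4, ...
        for width in widths:
            layers += [nn.Conv2d(channels, width, kernel_size=1), nn.ReLU(inplace=True)]
            channels = width
        layers += [nn.Conv2d(channels, 1, kernel_size=1)]   # class-agnostic vector c
        self.branch = nn.Sequential(*layers)

    def forward(self, feat):           # feat: (B, in_channels, H, W)
        return self.branch(feat)       # (B, 1, H, W) bottom-up attention map

# Example: a dummy ResNet-101 style feature map of size 14x14
attn = BottomUpAttention()(torch.randn(2, 2048, 14, 14))
print(attn.shape)   # torch.Size([2, 1, 14, 14])
```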
Attention Visualization. Figure 4 shows several typical examples of the GAP-CNN predicted attention heatmaps imposed on input images. We observe that the most informative parts of such input images are mostly highlighted in the corresponding heatmaps.

Fig. 4. Examples of merged attentions on training images. All input images are color images with RGB values, shown here in grayscale to facilitate the visualization of the heatmaps. Our method focuses on the important parts of the images.

Comparison. Because the ground-truth labels for the test split of the MPII dataset are not publicly available, the validation split is used for evaluation. The comparison of the proposed GAP-CNN method with competing algorithms (without pose information) is summarized in the top half of Table I. Our proposed GAP-CNN method achieves both the highest mAP and the highest recognition accuracy.
In addition, our proposed GAP-CNN+Pose algorithm also outperforms the pose-enhanced version of attentional pooling [9], supporting our speculation that the proposed GAP model could be complementary to hard attention.

TABLE I
PERFORMANCE COMPARISON ON THE VALIDATION SET OF MPII.

Method                   mAP     Accuracy
VGG16, R-CNN [3]         16.5%   -
VGG16, R*CNN [3]         21.7%   -
ResNet-101 [9]           26.2%   -
Attn. Pool. [9]          30.3%   35.3%
Proposed GAP-CNN         30.6%   36.0%
Attn. Pool.+Pose [9]     30.6%   35.7%
Proposed GAP-CNN+Pose    31.6%   36.9%

V. CONCLUSION

In this paper, the Generalized Attentional Pooling based Convolutional Neural Network (GAP-CNN) algorithm is proposed for action recognition in still images. Empirical experiments are carried out to determine the practically optimal number of bottom-up pooling matrices. In addition, extra supervision such as human pose keypoints is exploited. With the practically optimal number of bottom-up attentional pooling matrices and a single top-down pooling vector, the proposed GAP-CNN algorithm outperforms four competing algorithms, including the original attentional pooling method [9]. With the further incorporation of human pose keypoint information, the proposed GAP-CNN+Pose algorithm achieves state-of-the-art action recognition performance on the large-scale MPII still image dataset.
REFERENCES
[1] H. Wang, A. Kläser, C. Schmid, and C.-L. Liu, “Action recognition by
dense trajectories,” in CVPR, 2011.
[2] B. Yao and L. Fei-Fei, “Modeling mutual context of object and human
pose in human-object interaction activities,” in CVPR, 2010.
[3] G. Gkioxari, R. Girshick, and J. Malik, "Contextual action recognition with R*CNN," in ICCV, 2015.
[4] K. Simonyan and A. Zisserman, “Two-stream convolutional networks
for action recognition in videos,” in NIPS, 2014.
[5] C. Feichtenhofer, A. Pinz, and A. Zisserman, “Convolutional two-stream
network fusion for video action recognition,” in CVPR, 2016.
[6] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool,
“Temporal segment networks: towards good practices for deep action
recognition,” in ECCV, 2016.
[7] K. Soomro, A. R. Zamir, and M. Shah, "UCF101: A dataset of 101 human actions classes from videos in the wild," arXiv preprint arXiv:1212.0402, 2012.
[8] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre, “HMDB: a
large video database for human motion recognition,” in ICCV, 2011.
[9] R. Girdhar and D. Ramanan, “Attentional pooling for action recogni-
tion,” in NIPS, 2017.
[10] M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele, "2D human pose estimation: New benchmark and state of the art analysis," in CVPR, 2014.
[11] J. Carreira, R. Caseiro, J. Batista, and C. Sminchisescu, “Semantic
segmentation with second-order pooling,” in ECCV, 2012.
[12] H. Wang and C. Schmid, "Action recognition with improved trajectories," in ICCV, 2013.
[13] Q. Zhang and G. Hua, "Multi-view visual recognition of imperfect testing data," in ACM MM, 2015.
[14] Y. Wang, W. Zhou, Q. Zhang, X. Zhu, and H. Li, "Low-latency human action recognition with weighted multi-region convolutional neural network," arXiv preprint arXiv:1805.02877, 2018.
[15] X. Lv, L. Wang, Q. Zhang, Z. Niu, N. Zheng, and G. Hua, "Video object co-segmentation from noisy videos by a multi-level hypergraph model," in ICIP, 2018.
[16] J. Zang, L. Wang, Z. Liu, Q. Zhang, Z. Niu, G. Hua, and N. Zheng, "Attention-based temporal weighted convolutional neural network for action recognition," in AIAI, 2018.
[17] J. Huang, W. Zhou, Q. Zhang, H. Li, and W. Li, "Video-based sign language recognition without temporal segmentation," in AAAI, 2018.
[18] Q. Zhang, G. Hua, W. Liu, Z. Liu, and Z. Zhang, "Auxiliary training information assisted visual recognition," IPSJ Trans. Comput. Vis. and Appl., vol. 7, pp. 138–150, 2015.
[19] Y.-W. Chao, Z. Wang, Y. He, J. Wang, and J. Deng, "HICO: A benchmark for recognizing human-object interactions in images," in ICCV, 2015.
[20] V. Navalpakkam and L. Itti, "An integrated model of top-down and bottom-up attention for optimizing detection speed," in CVPR, 2006.
