
Disentangling and Unifying Graph Convolutions

for Skeleton-Based Action Recognition

Ziyu Liu¹˒³, Hongwen Zhang², Zhenghao Chen¹, Zhiyong Wang¹, Wanli Ouyang¹˒³
¹ The University of Sydney   ² University of Chinese Academy of Sciences & CASIA
³ The University of Sydney, SenseTime Computer Vision Research Group, Australia
{zliu6676@uni.,zhenghao.chen@,zhiyong.wang@,wanli.ouyang@}sydney.edu.au, [email protected]

Abstract

Spatial-temporal graphs have been widely used by skeleton-based action recognition algorithms to model human action dynamics. To capture robust movement patterns from these graphs, long-range and multi-scale context aggregation and spatial-temporal dependency modeling are critical aspects of a powerful feature extractor. However, existing methods have limitations in achieving (1) unbiased long-range joint relationship modeling under multi-scale operators and (2) unobstructed cross-spacetime information flow for capturing complex spatial-temporal dependencies. In this work, we present (1) a simple method to disentangle multi-scale graph convolutions and (2) a unified spatial-temporal graph convolutional operator named G3D. The proposed multi-scale aggregation scheme disentangles the importance of nodes in different neighborhoods for effective long-range modeling. The proposed G3D module leverages dense cross-spacetime edges as skip connections for direct information propagation across the spatial-temporal graph. By coupling these proposals, we develop a powerful feature extractor named MS-G3D based on which our model¹ outperforms previous state-of-the-art methods on three large-scale datasets: NTU RGB+D 60, NTU RGB+D 120, and Kinetics Skeleton 400.

¹ Code is available at github.com/kenziyuliu/ms-g3d

[Figure 1: panels (a) Spatial and Temporal Information Flow, (b) Spatial-Temporal Information Flow, (c) Disentangled Multi-Scale Aggregation.]
Figure 1: (a) Factorized spatial and temporal modeling on skeleton graph sequences causes indirect information flow. (b) In this work, we propose to capture cross-spacetime correlations with unified spatial-temporal graph convolutions. (c) Disentangling node features at separate spatial-temporal neighborhoods (yellow, blue, red at different distances, partially colored for clarity) is pivotal for effective multi-scale learning in the spatial-temporal domain.

1. Introduction

Human action recognition is an important task with many real-world applications. In particular, skeleton-based human action recognition involves predicting actions from skeleton representations of human bodies instead of raw RGB videos, and the significant results seen in recent work [50, 33, 32, 34, 21, 20, 54, 35] have proven its merits. In contrast to RGB representations, skeleton data contain only the 2D [50, 15] or 3D [31, 25] positions of the human key joints, providing highly abstract information that is also free of environmental noises (e.g. background clutter, lighting conditions, clothing), allowing action recognition algorithms to focus on the robust features of the action.

Earlier approaches to skeleton-based action recognition treat human joints as a set of independent features, and they model the spatial and temporal joint correlations through hand-crafted [42, 43] or learned [31, 6, 48, 54] aggregations of these features. However, these methods overlook the inherent relationships between the human joints, which are best captured with human skeleton graphs with joints as nodes and their natural connectivity (i.e. "bones") as edges. For this reason, recent approaches [50, 19, 34, 35, 32] model the joint movement patterns of an action with a skeleton spatial-temporal graph, which is a series of disjoint and isomorphic skeleton graphs at different time steps carrying information in both spatial and temporal dimensions.

For robust action recognition from skeleton graphs, an ideal algorithm should look beyond the local joint connectivity and extract multi-scale structural features and long-range dependencies, since joints that are structurally apart can also have strong correlations.

Many existing approaches achieve this by performing graph convolutions [17] with higher-order polynomials of the skeleton adjacency matrix: intuitively, a powered adjacency matrix captures the number of walks between every pair of nodes, with the length of the walks being the same as the power; the adjacency polynomial thus increases the receptive field of graph convolutions by making distant neighbors reachable. However, this formulation suffers from the biased weighting problem, where the existence of cyclic walks on undirected graphs means that edge weights will be biased towards closer nodes against further nodes. On skeleton graphs, this means that a higher polynomial order is only marginally effective at capturing information from distant joints, since the aggregated features will be dominated by the joints from local body parts. This is a critical drawback limiting the scalability of existing multi-scale aggregators.

Another desirable characteristic of robust algorithms is the ability to leverage the complex cross-spacetime joint relationships for action recognition. However, to this end, most existing approaches [50, 33, 19, 32, 21, 34, 18] deploy interleaving spatial-only and temporal-only modules (Fig. 1(a)), analogous to factorized 3D convolutions [30, 39]. A typical approach is to first use graph convolutions to extract spatial relationships at each time step, and then use recurrent [19, 34, 18] or 1D convolutional [50, 33, 21, 32] layers to model temporal dynamics. While such factorization allows efficient long-range modeling, it hinders the direct information flow across spacetime for capturing complex regional spatial-temporal joint dependencies. For example, the action "standing up" often has co-occurring movements of the upper and lower body across both space and time, where upper body movements (leaning forward) strongly correlate with the lower body's future movements (standing up). These strong cues for making predictions may be ineffectively captured by factorized modeling.

In this work, we address the above limitations from two aspects. First, we propose a new multi-scale aggregation scheme that tackles the biased weighting problem by removing redundant dependencies between further and closer neighborhoods, thus disentangling their features under multi-scale aggregation (illustrated in Fig. 2). This leads to more powerful multi-scale operators that can model relationships of joints irrespective of the distances between them. Second, we propose G3D, a novel unified spatial-temporal graph convolution module that directly models cross-spacetime joint dependencies. G3D does so by introducing graph edges across the "3D" spatial-temporal domain as skip connections for unobstructed information flow (Fig. 1(b)), substantially facilitating spatial-temporal feature learning. Remarkably, our proposed disentangled aggregation scheme augments G3D with multi-scale reasoning in spacetime (Fig. 1(c)) without being affected by the biased weighting problem, despite the extra edges introduced. The resulting powerful feature extractor, named MS-G3D, forms a building block of our final model architecture that outperforms state-of-the-art methods on three large-scale skeleton action datasets: NTU RGB+D 120 [25], NTU RGB+D 60 [31], and Kinetics Skeleton 400 [15]. The main contributions of this work are summarized as follows:

(i) We propose a disentangled multi-scale aggregation scheme that removes redundant dependencies between node features from different neighborhoods, which allows powerful multi-scale aggregators to effectively capture graph-wide joint relationships on human skeletons.

(ii) We propose a unified spatial-temporal graph convolution (G3D) operator which facilitates direct information flow across spacetime for effective feature learning.

(iii) Integrating the disentangled aggregation scheme with G3D gives a powerful feature extractor (MS-G3D) with multi-scale receptive fields across both spatial and temporal dimensions. The direct multi-scale aggregation of features in spacetime further boosts model performance.

2. Related Work

2.1. Neural Nets on Graphs

Architectures. To extract features from arbitrarily structured graphs, Graph Neural Networks (GNNs) have been developed and explored extensively [5, 17, 3, 2, 10, 40, 49, 1, 7, 11, 22]. Recently proposed GNNs can broadly be classified into spectral GNNs [3, 11, 22, 13, 17] and spatial GNNs [17, 49, 10, 51, 41, 1, 45]. Spectral GNNs convolve the input graph signals with a set of learned filters in the graph Fourier domain. They are, however, limited in terms of computational efficiency and generalizability to new graphs due to the requirement of eigendecomposition and the assumption of fixed adjacency. Spatial GNNs, in contrast, generally perform a layer-wise update for each node by (1) selecting neighbors with a neighborhood function (e.g. adjacent nodes); (2) merging the features from the selected neighbors and itself with an aggregation function (e.g. mean pooling); and (3) applying an activated transformation to the merged features (e.g. an MLP [49]). Among different GNN variants, the Graph Convolutional Network (GCN) [17] was first introduced as a first-order approximation for localized spectral convolutions, but its simplicity as a mean neighborhood aggregator [49, 46] has quickly led many subsequent spatial GNN architectures [49, 1, 45, 7] and various applications involving graph structured data [44, 47, 52, 50, 33, 34, 21] to treat it as a spatial GNN baseline. This work adapts the layer-wise update rule in GCN.

Multi-Scale Graph Convolutions. Multi-scale spatial GNNs have also been proposed to capture features from non-local neighbors. [1, 19, 21, 45, 24] use higher-order polynomials of the graph adjacency matrix to aggregate features from long-range neighbor nodes. The Truncated Block Krylov network [29] similarly raises the adjacency matrix to higher powers and obtains multi-scale information through dense feature concatenation from different hidden layers. LanczosNet [24] deploys a low-rank approximation of the adjacency matrix to speed up the exponentiation on large graphs. As mentioned in Section 1, we argue that adjacency powering can have adverse effects on long-range modeling due to weighting bias, and our proposed module aims to address this with disentangled multi-scale aggregators.

2.2. Skeleton-Based Action Recognition

Earlier approaches [42, 6, 31, 36, 43, 48, 54] to skeleton-based action recognition focus on hand-crafting features and joint relationships for downstream classifiers, which ignore the important semantic connectivity of the human body. By constructing spatial-temporal graphs and modeling the spatial relationships with GNNs directly, recent approaches [50, 19, 8, 21, 33, 32, 34, 18] have seen significant performance boosts, indicating the necessity of the semantic human skeleton for action predictions.

An early application of graph convolutions is ST-GCN [50], where spatial graph convolutions along with interleaving temporal convolutions are used for spatial-temporal modeling. A concurrent work by Li et al. [19] presents a similar approach, but it notably introduces a multi-scale module by raising skeleton adjacency to higher powers. AS-GCN [21] also uses adjacency powering for multi-scale modeling, but it additionally generates human poses to augment the spatial graph convolution. The Spatial-Temporal Graph Routing (STGR) network [18] adds extra edges to the skeleton graph using frame-wise attention and global self-attention mechanisms. Similarly, 2s-AGCN [33] introduces graph adaptiveness with self-attention along with a freely learned graph residual mask. It also uses a two-stream ensemble with skeleton bone features to boost performance. DGNN [32] likewise leverages bone features, but it instead simultaneously updates the joint and bone features through an alternating spatial aggregation scheme. Note that these approaches primarily focus on spatial modeling; in contrast, we present a unified approach for capturing complex joint correlations directly across spacetime.

Another relevant work is GR-GCN [8], which merges every three frames over the skeleton graph sequence and adds sparsified edges between adjacent frames. Whereas GR-GCN also deploys cross-spacetime edges, our G3D module has several important distinctions: (1) Cross-spacetime edges in G3D follow the semantic human skeleton, which is naturally a more interpretable and more robust representation than the sparsified, one-size-fits-all graph in GR-GCN; the underlying graph is also much easier to compute. (2) GR-GCN has cross-spacetime edges only between adjacent frames, which prevents it from reasoning beyond a limited temporal context of three frames. (3) G3D can learn from multiple temporal contexts simultaneously by leveraging different window sizes and dilations, which is not addressed in GR-GCN.

3. MS-G3D

3.1. Preliminaries

Notations. A human skeleton graph is denoted as $G = (V, E)$, where $V = \{v_1, ..., v_N\}$ is the set of $N$ nodes representing joints, and $E$ is the edge set representing bones, captured by an adjacency matrix $A \in \mathbb{R}^{N \times N}$ where initially $A_{i,j} = 1$ if an edge directs from $v_i$ to $v_j$ and 0 otherwise. $A$ is symmetric since $G$ is undirected. Actions as graph sequences have a node feature set $X = \{x_{t,n} \in \mathbb{R}^C \mid t, n \in \mathbb{Z}, 1 \leq t \leq T, 1 \leq n \leq N\}$ represented as a feature tensor $X \in \mathbb{R}^{T \times N \times C}$, where $x_{t,n} = X_{t,n,:}$ is the $C$-dimensional feature vector for node $v_n$ at time $t$ over a total of $T$ frames. The input action is thus adequately described by $A$ structurally and by $X$ feature-wise, with $X_t \in \mathbb{R}^{N \times C}$ being the node features at time $t$. $\Theta^{(l)} \in \mathbb{R}^{C_l \times C_{l+1}}$ denotes a learnable weight matrix at layer $l$ of a network.

Graph Convolutional Nets (GCNs). On skeleton inputs defined by features $X$ and graph structure $A$, the layer-wise update rule of GCNs can be applied to features at time $t$ as:

$$X_t^{(l+1)} = \sigma\big(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} X_t^{(l)} \Theta^{(l)}\big), \qquad (1)$$

where $\tilde{A} = A + I$ is the skeleton graph with added self-loops to keep identity features, $\tilde{D}$ is the diagonal degree matrix of $\tilde{A}$, and $\sigma(\cdot)$ is an activation function. The term $\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} X_t^{(l)}$ can be intuitively interpreted as an approximate spatial mean feature aggregation from the direct neighborhood, followed by an activated linear layer.
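
For concreteness, Eq. 1 can be sketched in a few lines. This is a minimal illustration assuming PyTorch; the class name and tensor layout are ours, not taken from the released code:

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """Eq. 1: X_t <- sigma(D~^-1/2 A~ D~^-1/2 X_t Theta), applied per frame."""
    def __init__(self, A: torch.Tensor, c_in: int, c_out: int):
        super().__init__()
        A_tilde = A + torch.eye(A.size(0))               # add self-loops
        d_inv_sqrt = torch.diag(A_tilde.sum(dim=1).pow(-0.5))
        # Fixed, symmetrically normalized adjacency (a buffer, not learned)
        self.register_buffer("A_hat", d_inv_sqrt @ A_tilde @ d_inv_sqrt)
        self.theta = nn.Linear(c_in, c_out, bias=False)  # Theta^(l)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (T, N, C) -- one skeleton graph per frame; the broadcasted matmul
        # averages each node's direct neighborhood before the linear map.
        return torch.relu(self.theta(self.A_hat @ x))
```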

3.2. Disentangled Multi-Scale Aggregation

Biased Weighting Problem. Under the spatial aggregation framework in Eq. 1, existing approaches [21] employ higher-order polynomials of the adjacency matrix to aggregate multi-scale structural information at time $t$, as:

$$X_t^{(l+1)} = \sigma\Big(\sum_{k=0}^{K} \hat{A}^k X_t^{(l)} \Theta_{(k)}^{(l)}\Big), \qquad (2)$$

where $K$ controls the number of scales to aggregate. Here, $\hat{A}$ is a normalized form of $A$; e.g., [19] uses the symmetric normalized graph Laplacian $\hat{A} = L^{\mathrm{norm}} = I - D^{-\frac{1}{2}} A D^{-\frac{1}{2}}$; [21] uses the random-walk normalized adjacency $\hat{A} = D^{-1} A$; more generally, one can use $\hat{A} = \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}}$ from GCNs. It is easy to see that $A^k_{i,j} = A^k_{j,i}$ gives the number of length-$k$ walks between $v_i$ and $v_j$, and thus the term $\hat{A}^k X_t^{(l)}$ is performing a weighted feature average based on the number of such walks.

However, it is clear that there are drastically more possible length-$k$ walks to closer nodes than to the actual $k$-hop neighbors due to cyclic walks. This causes a bias towards the local region as well as towards nodes with higher degrees. The node self-loops in GCNs allow even more possible cycles (as walks can always cycle on self-loops) and thus amplify the bias. See Fig. 2 for illustration. Under multi-scale aggregation on skeleton graphs, the aggregated features will thus be dominated by signals from local body parts, making it ineffective to capture long-range joint dependencies with higher polynomial orders.

[Figure 2: skeleton graphs and the corresponding adjacency matrices for the powered adjacencies $\hat{A}^1, \hat{A}^2, \hat{A}^3$ (top) versus the disentangled $k$-adjacencies $\tilde{A}_{(1)}, \tilde{A}_{(2)}, \tilde{A}_{(3)}$ (bottom).]
Figure 2: Illustration of the biased weighting problem and the proposed disentangled aggregation scheme. Darker color indicates higher weighting to the central node (red). Top left: closer nodes receive higher weighting from adjacency powering, which makes long-range modeling less effective, especially when multiple scales are aggregated. Bottom left: our proposed disentangled aggregation models joint relationships at each neighborhood while keeping identity features. Right: visualizing the corresponding adjacency matrices. Node self-loops are omitted for visual clarity.

Disentangling Neighborhoods. To address the above problem, we first define the $k$-adjacency matrix $\tilde{A}_{(k)}$ as

$$[\tilde{A}_{(k)}]_{i,j} = \begin{cases} 1 & \text{if } d(v_i, v_j) = k, \\ 1 & \text{if } i = j, \\ 0 & \text{otherwise,} \end{cases} \qquad (3)$$

where $d(v_i, v_j)$ gives the shortest distance in number of hops between $v_i$ and $v_j$. $\tilde{A}_{(k)}$ is thus a generalization of $\tilde{A}$ to further neighborhoods, with $\tilde{A}_{(1)} = \tilde{A}$ and $\tilde{A}_{(0)} = I$. Under spatial aggregation in Eq. 1, the inclusion of self-loops in $\tilde{A}_{(k)}$ is critical for learning the relationships between the current joint and its $k$-hop neighbors, as well as for keeping each joint's identity information when no $k$-hop neighbors are available. Given that $N$ is small, $\tilde{A}_{(k)}$ can be easily computed, e.g., using differences of graph powers as $\tilde{A}_{(k)} = I + \mathbb{1}\big(\tilde{A}^k \geq 1\big) - \mathbb{1}\big(\tilde{A}^{k-1} \geq 1\big)$. Substituting $\hat{A}^k$ with $\tilde{A}_{(k)}$ in Eq. 2, we arrive at:

$$X_t^{(l+1)} = \sigma\Big(\sum_{k=0}^{K} \tilde{D}_{(k)}^{-\frac{1}{2}} \tilde{A}_{(k)} \tilde{D}_{(k)}^{-\frac{1}{2}} X_t^{(l)} \Theta_{(k)}^{(l)}\Big), \qquad (4)$$

where $\tilde{D}_{(k)}^{-\frac{1}{2}} \tilde{A}_{(k)} \tilde{D}_{(k)}^{-\frac{1}{2}}$ is the normalized [17] $k$-adjacency. Unlike the previous case, where possible length-$k$ walks are predominantly conditioned on length-$(k-1)$ walks, the proposed disentangled formulation in Eq. 4 addresses the biased weighting problem by removing redundant dependencies of distant neighborhoods' weighting on closer neighborhoods. Additional scales with larger $k$ are therefore aggregated in an additive manner under a multi-scale operator, making long-range modeling with large values of $k$ remain effective. The resulting $k$-adjacency matrices are also more sparse than their exponentiated counterparts (see Fig. 2), allowing more efficient representations.
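
A minimal sketch of Eq. 3 and Eq. 4 (assuming PyTorch; the helper and class names are ours):

```python
import torch
import torch.nn as nn

def k_adjacency(A: torch.Tensor, k: int) -> torch.Tensor:
    """Eq. 3 via differences of graph powers:
    A~_(k) = I + 1[A~^k >= 1] - 1[A~^(k-1) >= 1]."""
    I = torch.eye(A.size(0))
    if k == 0:
        return I
    A_tilde = A + I
    reach_k = (torch.linalg.matrix_power(A_tilde, k) >= 1).float()
    reach_k_1 = (torch.linalg.matrix_power(A_tilde, k - 1) >= 1).float()
    return I + reach_k - reach_k_1   # exactly the distance-k nodes, plus self

def sym_norm(A: torch.Tensor) -> torch.Tensor:
    d = torch.diag(A.sum(dim=1).pow(-0.5))
    return d @ A @ d

class DisentangledMSGCN(nn.Module):
    """Eq. 4: one independent aggregation per neighborhood scale, summed."""
    def __init__(self, A: torch.Tensor, K: int, c_in: int, c_out: int):
        super().__init__()
        adj = torch.stack([sym_norm(k_adjacency(A, k)) for k in range(K + 1)])
        self.register_buffer("adj", adj)   # (K+1, N, N), fixed
        self.theta = nn.ModuleList(
            [nn.Linear(c_in, c_out, bias=False) for _ in range(K + 1)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (T, N, C); no scale's weights depend on walks through closer scales
        return torch.relu(sum(th(Ak @ x) for Ak, th in zip(self.adj, self.theta)))
```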

3.3. G3D: Unified Spatial-Temporal Modeling

Most existing work treats skeleton actions as a sequence of disjoint graphs where features are extracted through spatial-only (e.g. GCN) and temporal-only (e.g. TCN) modules. We argue that such a factorized formulation is less effective for capturing complex spatial-temporal joint relationships. Clearly, if a strong connection exists between a pair of nodes, then during layer-wise propagation the pair should incorporate a significant portion of each other's features to reflect such a connection [50, 33, 34]. However, as signals are propagated across spacetime through a series of local aggregators (GCNs and TCNs alike), they are weakened as redundant information is aggregated from an increasingly larger spatial-temporal receptive field. The problem is more evident if one observes that GCNs do not perform a weighted aggregation to distinguish each neighbor.

Cross-Spacetime Skip Connections. To tackle the above problem, we propose a more reasonable approach to allow cross-spacetime skip connections, which are readily modeled with cross-spacetime edges in a spatial-temporal graph. Let us first consider a sliding temporal window of size $\tau$ over the input graph sequence, which, at each step, obtains a spatial-temporal subgraph $G_{(\tau)} = (V_{(\tau)}, E_{(\tau)})$ where $V_{(\tau)} = V_1 \cup ... \cup V_\tau$ is the union of all node sets across the $\tau$ frames in the window. The initial edge set $E_{(\tau)}$ is defined by tiling $\tilde{A}$ into a block adjacency matrix $\tilde{A}_{(\tau)}$, where

$$\tilde{A}_{(\tau)} = \begin{bmatrix} \tilde{A} & \cdots & \tilde{A} \\ \vdots & \ddots & \vdots \\ \tilde{A} & \cdots & \tilde{A} \end{bmatrix} \in \mathbb{R}^{\tau N \times \tau N}. \qquad (5)$$

Intuitively, each submatrix $[\tilde{A}_{(\tau)}]_{i,j} = \tilde{A}$ means every node in $V_i$ is connected to itself and its 1-hop spatial neighbors at frame $j$, extrapolating the frame-wise spatial connectivity (which is $[\tilde{A}_{(\tau)}]_{i,i}$ for all $i$) to the temporal domain. Thus, each node within $G_{(\tau)}$ is densely connected to itself and its 1-hop spatial neighbors across all $\tau$ frames. We can easily obtain $X_{(\tau)} \in \mathbb{R}^{T \times \tau N \times C}$ using the same sliding window over $X$ with zero padding to construct $T$ windows. Using Eq. 1, we thus arrive at a unified spatial-temporal graph convolutional operator for the $t$-th temporal window:

$$[X_{(\tau)}]_t^{(l+1)} = \sigma\big(\tilde{D}_{(\tau)}^{-\frac{1}{2}} \tilde{A}_{(\tau)} \tilde{D}_{(\tau)}^{-\frac{1}{2}} [X_{(\tau)}]_t^{(l)} \Theta^{(l)}\big). \qquad (6)$$

Dilated Windows. Another significant aspect of the above window construction is that the frames need not be adjacent. A dilated window with $\tau$ frames and a dilation rate $d$ can be constructed by picking a frame every $d$ frames and reusing the same spatial-temporal structure $\tilde{A}_{(\tau)}$. Similarly, we can obtain node features $X_{(\tau,d)} \in \mathbb{R}^{T \times \tau N \times C}$ ($d = 1$ if omitted) and perform the layer-wise update as in Eq. 6. Dilated windows allow larger temporal receptive fields without growing the size of $\tilde{A}_{(\tau)}$, analogous to how dilated convolutions [53] keep constant complexities.
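
Window extraction itself reduces to strided slicing over a padded time axis; a sketch under the assumptions above (symmetric zero padding so that all $T$ windows exist, and odd $\tau$ as in our configurations):

```python
import torch
import torch.nn.functional as F

def extract_windows(x: torch.Tensor, tau: int, d: int = 1) -> torch.Tensor:
    """Slide a dilated temporal window over x: (T, N, C) -> (T, tau*N, C).
    Window t gathers frames t, t+d, ..., t+(tau-1)*d of the padded sequence,
    i.e. a temporal context of (tau-1)*d + 1 frames centered on frame t."""
    T, N, C = x.shape
    pad = (tau - 1) * d // 2
    x = F.pad(x, (0, 0, 0, 0, pad, pad))   # zero-pad only the time axis
    windows = torch.stack(
        [x[t : t + (tau - 1) * d + 1 : d] for t in range(T)])  # (T, tau, N, C)
    return windows.reshape(T, tau * N, C)

# e.g. with T = 300, N = 25, C = 96: extract_windows(x, tau=5, d=2)
# yields (300, 125, 96), ready for the unified update of Eq. 6 / Eq. 7.
```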

Multi-Scale G3D. We can also integrate the proposed disentangled multi-scale aggregation scheme (Eq. 4) into G3D for multi-scale reasoning directly in the spatial-temporal domain. We thus derive the MS-G3D module from Eq. 6 as:

$$[X_{(\tau)}]_t^{(l+1)} = \sigma\Big(\sum_{k=0}^{K} \tilde{D}_{(\tau,k)}^{-\frac{1}{2}} \tilde{A}_{(\tau,k)} \tilde{D}_{(\tau,k)}^{-\frac{1}{2}} [X_{(\tau)}]_t^{(l)} \Theta_{(k)}^{(l)}\Big), \qquad (7)$$

where $\tilde{A}_{(\tau,k)}$ and $\tilde{D}_{(\tau,k)}$ are defined similarly to $\tilde{A}_{(k)}$ and $\tilde{D}_{(k)}$, respectively. Remarkably, our proposed disentangled aggregation scheme complements this unified operator, as G3D's increased node degrees from spatial-temporal connectivity can contribute to the biased weighting problem.

Discussion. We give more in-depth analyses on G3D as follows. (1) It is analogous to classical 3D convolutional blocks [38], with its spatial-temporal receptive field defined by $\tau$, $d$, and $\tilde{A}$. (2) Unlike 3D convolutions, G3D's parameter count from $\Theta^{(\cdot)}$ is independent of $\tau$ or $|E_{(\tau)}|$, making it generally less prone to overfitting with large $\tau$. (3) The dense cross-spacetime connections in G3D entail a trade-off on $\tau$, as larger values of $\tau$ bring larger temporal receptive fields at the cost of more generic features due to larger immediate neighborhoods. Additionally, a larger $\tau$ implies a quadratically larger $\tilde{A}_{(\tau)}$ and thus more operations with multi-scale aggregation. On the other hand, larger dilations $d$ bring larger temporal coverage at the cost of temporal resolution (lower frame rates). $\tau$ and $d$ must therefore be balanced carefully. (4) G3D modules are designed to capture complex regional spatial-temporal dependencies instead of long-range dependencies, which are otherwise more economically captured by factorized modules. We thus observe the best performance when G3D modules are augmented with long-range, factorized modules, which we discuss in the next section.

3.4. Model Architecture

Overall Architecture. The final model architecture is illustrated in Fig. 3. On a high level, it contains a stack of $r$ spatial-temporal graph convolutional (STGC) blocks to extract features from skeleton sequences, followed by a global average pooling layer and a softmax classifier. Each STGC block deploys two types of pathways to simultaneously capture complex regional spatial-temporal joint correlations as well as long-range spatial and temporal dependencies: (1) The G3D pathway first constructs spatial-temporal windows, performs disentangled multi-scale graph convolutions on them, and then collapses them with a fully connected layer for window feature readout. The extra dotted G3D pathway (Fig. 3(b)) indicates that the model can learn from multiple spatial-temporal contexts concurrently with different $\tau$ and $d$. (2) The factorized pathway augments the G3D pathway with long-range, spatial-only, and temporal-only modules: the first layer is a multi-scale graph convolutional layer capable of modeling the entire skeleton graph with the maximum $K$; it is then followed by two multi-scale temporal convolution layers to capture extended temporal contexts (discussed below). The outputs from all pathways are aggregated as the STGC block output, which has 96, 192, and 384 feature channels respectively within a typical $r = 3$ block architecture. Batch normalization [14] and ReLU are added at the end of each layer except for the last layer. All STGC blocks, except the first, downsample the temporal dimension with stride-2 temporal convolutions and sliding windows.

Multi-Scale Temporal Modeling. The spatial-temporal windows $G_{(\tau)}$ used by G3D are a closed structure by themselves, which means G3D must be accompanied by temporal modules for cross-window information exchange. Many existing works [50, 18, 33, 32, 21] perform temporal modeling using temporal convolutions with a fixed kernel size $k_t \times 1$ throughout the architecture. As a natural extension to our multi-scale spatial aggregation, we enhance vanilla temporal convolutional layers with multi-scale learning, as illustrated in Fig. 3(c). To lower the computational costs due to the extra branches, we deploy a bottleneck design [37], fix kernel sizes at $3 \times 1$, and use different dilation rates [53] instead of larger kernels for larger receptive fields. We also use residual connections [12] to facilitate training.
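
A sketch of one such multi-scale temporal layer, following our reading of Fig. 3(c); the branch composition (four dilated branches, no pooling branch) and the even channel split are simplifying assumptions rather than the exact released configuration:

```python
import torch
import torch.nn as nn

class MSTCN(nn.Module):
    """Bottlenecked 3x1 temporal convolutions at several dilation rates,
    concatenated and wrapped in a residual connection."""
    def __init__(self, channels: int, dilations=(1, 2, 3, 4)):
        super().__init__()
        assert channels % len(dilations) == 0
        c_b = channels // len(dilations)   # bottleneck width per branch
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels, c_b, kernel_size=1),   # 1x1 bottleneck
                nn.BatchNorm2d(c_b),
                nn.ReLU(inplace=True),
                # 3x1 temporal kernel; dilation d spans 2d+1 frames
                nn.Conv2d(c_b, c_b, kernel_size=(3, 1),
                          padding=(d, 0), dilation=(d, 1)),
            )
            for d in dilations])
        self.bn = nn.BatchNorm2d(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, C, T, N); branches preserve T, concat restores C channels
        out = torch.cat([branch(x) for branch in self.branches], dim=1)
        return torch.relu(self.bn(out) + x)
```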

Adaptive Graphs. To improve the flexibility of graph convolutional layers, which perform homogeneous neighborhood averaging, we add a simple learnable, unconstrained graph residual mask $A^{\mathrm{res}}$, inspired by [33, 32], to every $\tilde{A}_{(k)}$ and $\tilde{A}_{(\tau,k)}$ to strengthen, weaken, add, or remove edges dynamically. For example, Eq. 4 is updated to

$$X_t^{(l+1)} = \sigma\Big(\sum_{k=0}^{K} \tilde{D}_{(k)}^{-\frac{1}{2}} \big(\tilde{A}_{(k)} + A_{(k)}^{\mathrm{res}}\big) \tilde{D}_{(k)}^{-\frac{1}{2}} X_t^{(l)} \Theta_{(k)}^{(l)}\Big). \qquad (8)$$

$A^{\mathrm{res}}$ is initialized with random values around zero and is different for each $k$ and $\tau$, allowing each multi-scale context (either spatial or spatial-temporal) to select the best suited mask. Note also that since $A^{\mathrm{res}}$ is optimized for all possible actions, which may have different optimal edge sets for feature propagation, it is expected to give minor edge corrections and may be insufficient when the graph structures have major deficiencies. In particular, $A^{\mathrm{res}}$ only partially mitigates the biased weighting problem (see Section 4.3).
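
The mask itself is just an unconstrained learnable parameter per scale. A minimal sketch (ours; we add the mask after normalization rather than inside it, which is equivalent up to a fixed rescaling since $A^{\mathrm{res}}$ is unconstrained anyway):

```python
import torch
import torch.nn as nn

class AdaptiveAdjacency(nn.Module):
    """One fixed normalized (k-)adjacency plus a learnable residual mask."""
    def __init__(self, A_hat: torch.Tensor):
        super().__init__()
        self.register_buffer("A_hat", A_hat)   # fixed graph structure
        # Small random values around zero; one mask per scale k (and per tau)
        self.res = nn.Parameter(1e-6 * torch.randn_like(A_hat))

    def forward(self) -> torch.Tensor:
        # Learned entries can strengthen, weaken, add, or remove edges
        return self.A_hat + self.res
```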

[Figure 3: architecture overview with panels (a) Full Architecture, (b) STGC Block, (c) MS-TCN, (d) MS-G3D, (e) MS-GCN-Disentangled.]
Figure 3: (Match components with colors) Architecture overview. "TCN", "GCN", prefix "MS-", and suffix "-D" denote temporal and graph convolutional blocks, and multi-scale and disentangled aggregation, respectively (Section 3.2). Each of the $r$ STGC blocks (b) deploys a multi-pathway design to capture long-range and regional spatial-temporal dependencies simultaneously. Dotted modules, including the extra G3D pathway, $1 \times 1$ convolutions, and strided temporal convolutions, are situational for the model performance/complexity trade-off.

Joint-Bone Two-Stream Fusion. Inspired by the two-stream methods in [33, 32, 34] and the intuition that visualizing bones along with joints can help humans recognize skeleton actions, we use a two-stream framework where a separate model with identical architecture is trained using the bone features, initialized as vector differences of adjacent joints directed away from the body center. The softmax scores from the joint/bone models are summed to obtain the final prediction scores. Since skeleton graphs are trees, we add a zero bone vector at the body center to obtain $N$ bones from $N$ joints and reuse $A$ for the connectivity definition.
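
A sketch of the bone construction and score-level fusion (ours; the `parents` array, which maps each joint to its neighbor toward the body center, is dataset-specific and assumed given):

```python
import numpy as np

def bone_features(joints: np.ndarray, parents: np.ndarray) -> np.ndarray:
    """joints: (T, N, C) coordinates. parents[n] indexes the adjacent joint of
    n toward the body center; the center maps to itself, giving the zero bone,
    so N joints yield N bones and the adjacency A can be reused."""
    return joints - joints[:, parents, :]

# Two-stream fusion at test time: sum the softmax scores of two identically
# structured models trained on joints and bones (model calls illustrative):
#   scores = softmax(joint_model(joints)) + softmax(bone_model(bones))
```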

4. Experiments

4.1. Datasets

NTU RGB+D 60 and NTU RGB+D 120. NTU RGB+D 60 [31] is a large-scale action recognition dataset containing 56,578 skeleton sequences over 60 action classes, captured from 40 distinct subjects and 3 different camera view angles. Each skeleton graph contains $N = 25$ body joints as nodes, with their 3D locations in space as initial features. Each frame of the action contains 1 to 2 subjects. The authors recommend reporting the classification accuracy under two settings: (1) Cross-Subject (X-Sub), where the 40 subjects are split into training and testing groups, yielding 40,091 and 16,487 training and testing examples respectively; (2) Cross-View (X-View), where all 18,932 samples collected from camera 1 are used for testing and the remaining 37,646 samples are used for training. NTU RGB+D 120 [25] extends NTU RGB+D 60 with an additional 57,367 skeleton sequences over 60 extra action classes, totalling 113,945 samples over 120 classes captured from 106 distinct subjects and 32 different camera setups. The authors now recommend replacing the Cross-View setting with a Cross-Setup (X-Set) setting, where 54,468 samples collected from half of the camera setups are used for training and the remaining 59,477 samples for testing. In Cross-Subject, 63,026 samples from a selected group of 53 subjects are used for training, and the remaining 50,919 samples for testing.

Kinetics Skeleton 400. The Kinetics Skeleton 400 dataset is adapted from the Kinetics 400 video dataset [15] using the OpenPose [4] pose estimation toolbox. It contains 240,436 training and 19,796 testing skeleton sequences over 400 classes, where each skeleton graph contains 18 body joints, along with their 2D spatial coordinates and the prediction confidence score from OpenPose as the initial joint features [50]. At each time step, the number of skeletons is capped at 2, and skeletons with lower overall confidence scores are discarded. Following the convention from [15, 50], Top-1 and Top-5 accuracies are reported.

4.2. Implementation Details

Unless otherwise stated, all models have $r = 3$ and are trained with SGD with momentum 0.9, batch size 32 (16 per worker), and an initial learning rate of 0.05 (which can linearly scale up with batch size [9]) for 50, 60, and 65 epochs, with step LR decay by a factor of 0.1 at epochs {30, 40}, {30, 50}, and {45, 55} for NTU RGB+D 60, NTU RGB+D 120, and Kinetics Skeleton 400, respectively. Weight decay is set to 0.0005 for final models and is adjusted accordingly during component studies. All skeleton sequences are padded to $T = 300$ frames by replaying the actions. Inputs are preprocessed with normalization and translation following [33, 32]. No data augmentation is used, for fair performance comparison.

4.3. Component Studies

We analyze the individual components and their configurations in the final architecture. Unless stated otherwise, performance is reported as classification accuracy on the Cross-Subject setting of NTU RGB+D 60 using only the joint data.

Disentangled Multi-Scale Aggregation. We first justify our proposed disentangled multi-scale aggregation scheme by verifying its effectiveness with different numbers of scales over sparse and dense graphs. In Table 1, we do so using the individual pathways of the STGC blocks (Fig. 3(b)), referred to as "GCN" and "G3D" respectively, with suffixes "-E" and "-D" denoting adjacency powering and disentangled aggregation. Here, the maximum $K = 12$ is the diameter of skeleton graphs from NTU RGB+D 60, and we set $\tau = 5$ for G3D modules. To keep consistent normalization, we set $\hat{A} = \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}}$ in Eq. 2 for GCN-E and G3D-E. We first observe that the disentangled formulation can bring as much as a 1.4% gain over simple adjacency powering at $K = 4$, underpinning the necessity of neighborhood disentanglement. In this case, the residual mask $A^{\mathrm{res}}$ partially corrects the weighting imbalance, narrowing the largest gap to 0.4%. However, the same set of experiments on the G3D pathway, where the window graph $G_{(\tau)}$ is denser than the spatial graph $G$, shows wider accuracy gaps between G3D-E and G3D-D, indicating a more severe biased weighting problem. In particular, we see a 0.8% performance gap at $K = 12$ even when residual masks are added. These results verify the effectiveness of the proposed disentangled aggregation scheme for multi-scale learning; it boosts performance across different numbers of scales not only in the spatial domain, but more so in the spatial-temporal domain, where it complements the proposed G3D module. In general, the spatial GCNs benefit more from large $K$ than do the spatial-temporal G3D modules; for final architectures, we empirically set $K \in \{12, 5\}$ for MS-GCN and MS-G3D blocks respectively.

Methods          K = 1   K = 4   K = 8   K = 12
GCN-E            85.1    85.6    86.5    86.6
GCN-D            85.1    87.0    86.9    86.8
GCN-E + Mask     86.1    87.0    87.5    87.7
GCN-D + Mask     86.1    86.9    87.9    87.8
G3D-E            85.1    85.5    85.4    85.5
G3D-D            85.1    86.4    86.5    86.4
G3D-E + Mask     86.6    87.0    86.5    86.2
G3D-D + Mask     86.6    87.4    87.1    87.0

Table 1: Accuracy (%) with multi-scale aggregation on individual pathways of STGC blocks with different $K$. "Mask" refers to the residual masks $A^{\mathrm{res}}$. If $K > 1$, GCN/G3D is multi-scale (MS-).

Effectiveness of G3D. To validate the efficacy of G3D modules in capturing complex spatial-temporal features, we build up the model incrementally with its individual components and show its performance in Table 2. We use the joint stream from 2s-AGCN [33] as the baseline for controlled experiments, and for fair comparison, we replaced its regular temporal convolutional layers with MS-TCN layers and obtained an improvement with fewer parameters. First, we observe that the factorized pathway alone can outperform the baseline due to the powerful disentangled aggregation in MS-GCN. However, if we simply scale up the factorized pathway to a larger capacity (deeper and wider), or duplicate the factorized pathway to learn from different feature subspaces and mimic the multi-pathway design in STGC blocks, we observe limited gains. In contrast, when the G3D pathway is added, we observe consistently better results with similar or fewer parameters, verifying G3D's ability to pick up complex regional spatial-temporal correlations that are previously overlooked by modeling spatial and temporal dependencies in a factorized fashion.

Model Configurations                                   Params   Acc (%)
Baseline (Js-AGCN [33])                                3.5M     86.0
Baseline + MS-TCN                                      1.6M     86.7
MS-GCN (Factorized Pathway) Only                       1.4M     87.8
  with 2.5× Capacity                                   3.5M     88.5
  with Dual Pathway                                    2.8M     88.6
MS-GCN (Factorized Pathway)
  with MS-G3D (τ = 3, d = 1)                           2.7M     89.0
  with MS-G3D (τ = 3, d = 2)                           2.7M     89.1
  with MS-G3D (τ = 3, d = 3)                           2.7M     89.1
  with MS-G3D (τ = 5, d = 1)                           3.2M     89.2
  with MS-G3D (τ = 5, d = 2)                           3.2M     89.2
  with MS-G3D (τ = 7, d = 1)†                          3.0M     89.0
  with 2 MS-G3D Pathways†, τ = (3, 3), d = (1, 2)      2.8M     89.3
  with 2 MS-G3D Pathways†, τ = (3, 5), d = (1, 1)      3.2M     89.4

Table 2: Model accuracy with various settings. MS-GCN and MS-G3D use $K \in \{12, 5\}$ respectively. †Output channels double at the collapse window layer (Fig. 3(d), $C_{\mathrm{mid}}$ to $C_{\mathrm{out}}$) instead of at the graph convolution ($C_{\mathrm{in}}$ to $C_{\mathrm{mid}}$) to maintain a similar budget.

G3D Graph Connectivity              Params   Acc (%)
(1) Grid-like                       2.7M     88.7
(2) Grid-like + dense self-edges    2.7M     88.6
(Eq. 5) Cross-spacetime edges       2.7M     89.1

Table 3: Comparing graph connectivity settings ($\tau = 3$, $d = 2$).

Exploring G3D Configurations. Table 2 also compares various G3D settings, including different values of $\tau$, $d$, and the number of G3D pathways in STGC blocks.
We first observe that all configurations consistently outperform the baseline, confirming the stability of MS-G3D as a robust feature extractor. We also see that $\tau = 5$ gives slightly better results, but the gain diminishes at $\tau = 7$ as the aggregated features become too generic due to the oversized local spatial-temporal neighborhood, thus counteracting the benefits of larger temporal coverage. The dilation rate $d$ has varying effects: (1) when $\tau = 3$, $d = 1$ underperforms $d \in \{2, 3\}$, justifying the need for larger temporal contexts; (2) larger $d$ has marginal benefits, as its larger temporal coverage comes at a cost of temporal resolution (thus coarsened skeleton motions). We thus observe better results when two G3D pathways with $d = (1, 2)$ are combined, and, as expected, we obtain the best results when the temporal resolution is unaltered by setting $\tau = (3, 5)$.

Cross-Spacetime Connectivity. To demonstrate the need for the cross-spacetime edges in $G_{(\tau)}$ defined in Eq. 5 instead of simple, grid-like temporal self-edges (to which G3D also applies), we contrast different connectivity schemes in Table 3 while fixing other parts of the architecture. The first two settings refer to modifying the block adjacency matrix $\tilde{A}_{(\tau)}$ such that: (1) the blocks $\tilde{A}$ on the main diagonal are kept, the blocks on the superdiagonal/subdiagonal are set to $I$, and the rest are set to 0; and (2) all blocks but the main diagonal of $\tilde{A}$ are set to $I$. Intuitively, the first produces "3D grid" graphs and the second includes extra dense self-edges over $\tau$ frames. Clearly, while all settings allow unified spatial-temporal graph convolutions, cross-spacetime edges as skip connections are essential for efficient information flow.

Joint-Bone Two-Stream Fusion. We verify our method under the joint-bone fusion framework on the NTU RGB+D 60 dataset in Table 5. Similar to [33], we obtain the best performance when joint and bone features are fused, indicating the generalizability of our method to other input modalities.

4.4. Comparison against the State-of-the-Art

We compare our full model (Fig. 3(a)) to the state-of-the-art in Tables 4, 5, and 6. Table 4 compares non-graph [26, 27, 16, 28] and graph-based methods [33]. Table 5 compares non-graph methods [23, 20], graph-based methods with spatial edges [18, 21, 33, 34, 32], and methods with spatial-temporal edges [8]. Table 6 compares single-stream [50, 21] and multi-stream [18, 33, 32] methods. On all three large-scale datasets, our method outperforms all existing methods under all evaluation settings. Notably, our method is the first to apply a multi-pathway design to learn both long-range spatial and temporal dependencies and complex regional spatial-temporal correlations from skeleton sequences, and the results verify the effectiveness of our approach.

Methods                              X-Sub (%)   X-Set (%)
ST-LSTM [26]                         55.7        57.9
GCA-LSTM [27]                        61.2        63.3
RotClips + MTCNN [16]                62.2        61.8
Body Pose Evolution Map [28]         64.6        66.9
2s-AGCN [33]                         82.9        84.9
MS-G3D Net                           86.9        88.4

Table 4: Classification accuracy comparison against state-of-the-art methods on the NTU RGB+D 120 Skeleton dataset.

Methods                        X-Sub (%)   X-View (%)
IndRNN [23]                    81.8        88.0
HCN [20]                       86.5        91.1
ST-GR [18]                     86.9        92.3
AS-GCN [21]                    86.8        94.2
2s-AGCN [33]                   88.5        95.1
AGC-LSTM [34]                  89.2        95.0
DGNN [32]                      89.9        96.1
GR-GCN [8]                     87.5        94.3
MS-G3D Net (Joint Only)        89.4        95.0
MS-G3D Net (Bone Only)         90.1        95.3
MS-G3D Net                     91.5        96.2

Table 5: Classification accuracy comparison against state-of-the-art methods on the NTU RGB+D 60 Skeleton dataset.

Methods           Top-1 (%)   Top-5 (%)
ST-GCN [50]       30.7        52.8
AS-GCN [21]       34.8        56.5
ST-GR [18]        33.6        56.1
2s-AGCN [33]      36.1        58.7
DGNN [32]         36.9        59.6
MS-G3D Net        38.0        60.9

Table 6: Classification accuracy comparison against state-of-the-art methods on the Kinetics Skeleton 400 dataset.

5. Conclusion

In this work, we present two methods for improving skeleton-based action recognition: a disentangled multi-scale aggregation scheme for graph convolutions that removes redundant dependencies between different neighborhoods, and G3D, a unified spatial-temporal graph convolutional operator that directly models spatial-temporal dependencies from skeleton graph sequences. By coupling these methods, we derive MS-G3D, a powerful feature extractor that captures multi-scale spatial-temporal features previously overlooked by factorized modeling. With experiments on three large-scale datasets, we show that our model outperforms existing methods by a sizable margin.

Acknowledgements: This work was supported by the Australian Research Council Grant DP200103223. ZL thanks Weiqing Cao for designing figures.

References

[1] Sami Abu-El-Haija, Bryan Perozzi, Amol Kapoor, Nazanin Alipourfard, Kristina Lerman, Hrayr Harutyunyan, Greg Ver Steeg, and Aram Galstyan. MixHop: Higher-order graph convolutional architectures via sparsified neighborhood mixing. In Proceedings of the 36th International Conference on Machine Learning, pages 21–29. PMLR, 2019.
[2] James Atwood and Don Towsley. Diffusion-convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1993–2001, 2016.
[3] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203, 2013.
[4] Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. Realtime multi-person 2d pose estimation using part affinity fields. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[5] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pages 3844–3852, 2016.
[6] Yong Du, Wei Wang, and Liang Wang. Hierarchical recurrent neural network for skeleton based action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1110–1118, 2015.
[7] Hongyang Gao and Shuiwang Ji. Graph u-nets. In Proceedings of the 36th International Conference on Machine Learning, pages 2083–2092, 2019.
[8] Xiang Gao, Wei Hu, Jiaxiang Tang, Jiaying Liu, and Zongming Guo. Optimized skeleton-based action recognition via sparsified graph regression. In Proceedings of the 27th ACM International Conference on Multimedia, pages 601–610. ACM, 2019.
[9] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch sgd: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
[10] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pages 1024–1034, 2017.
[11] David K Hammond, Pierre Vandergheynst, and Rémi Gribonval. Wavelets on graphs via spectral graph theory. Applied and Computational Harmonic Analysis, 30(2):129–150, 2011.
[12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[13] Mikael Henaff, Joan Bruna, and Yann LeCun. Deep convolutional networks on graph-structured data, 2015.
[14] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[15] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, Mustafa Suleyman, and Andrew Zisserman. The kinetics human action video dataset, 2017.
[16] Qiuhong Ke, Mohammed Bennamoun, Senjian An, Ferdous Sohel, and Farid Boussaid. Learning clip representations for skeleton-based 3d action recognition. IEEE Transactions on Image Processing, 27(6):2842–2855, 2018.
[17] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
[18] Bin Li, Xi Li, Zhongfei Zhang, and Fei Wu. Spatio-temporal graph routing for skeleton-based action recognition. In Thirty-Third AAAI Conference on Artificial Intelligence, 2019.
[19] Chaolong Li, Zhen Cui, Wenming Zheng, Chunyan Xu, and Jian Yang. Spatio-temporal graph convolution for skeleton based action recognition. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[20] Chao Li, Qiaoyong Zhong, Di Xie, and Shiliang Pu. Co-occurrence feature learning from skeleton data for action recognition and detection with hierarchical aggregation. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, 2018.
[21] Maosen Li, Siheng Chen, Xu Chen, Ya Zhang, Yanfeng Wang, and Qi Tian. Actional-structural graph convolutional networks for skeleton-based action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3595–3603, 2019.
[22] Ruoyu Li, Sheng Wang, Feiyun Zhu, and Junzhou Huang. Adaptive graph convolutional neural networks. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[23] Shuai Li, Wanqing Li, Chris Cook, Ce Zhu, and Yanbo Gao. Independently recurrent neural network (indrnn): Building a longer and deeper rnn. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018.
[24] Renjie Liao, Zhizhen Zhao, Raquel Urtasun, and Richard S Zemel. Lanczosnet: Multi-scale deep graph convolutional networks. arXiv preprint arXiv:1901.01484, 2019.
[25] Jun Liu, Amir Shahroudy, Mauricio Lisboa Perez, Gang Wang, Ling-Yu Duan, and Alex Kot Chichung. Ntu rgb+d 120: A large-scale benchmark for 3d human activity understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
[26] Jun Liu, Amir Shahroudy, Dong Xu, and Gang Wang. Spatio-temporal lstm with trust gates for 3d human action recognition. In European Conference on Computer Vision, pages 816–833. Springer, 2016.
[27] Jun Liu, Gang Wang, Ling-Yu Duan, Kamila Abdiyeva, and Alex C Kot. Skeleton-based human action recognition with global context-aware attention lstm networks. IEEE Transactions on Image Processing, 27(4):1586–1599, 2017.
[28] Mengyuan Liu and Junsong Yuan. Recognizing human actions as the evolution of pose estimation maps. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1159–1168, 2018.
[29] Sitao Luan, Mingde Zhao, Xiao-Wen Chang, and Doina Precup. Break the ceiling: Stronger multi-scale deep graph convolutional networks. arXiv preprint arXiv:1906.02174, 2019.
[30] Zhaofan Qiu, Ting Yao, and Tao Mei. Learning spatio-temporal representation with pseudo-3d residual networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 5533–5541, 2017.
[31] Amir Shahroudy, Jun Liu, Tian-Tsong Ng, and Gang Wang. Ntu rgb+d: A large scale dataset for 3d human activity analysis. In IEEE Conference on Computer Vision and Pattern Recognition, June 2016.
[32] Lei Shi, Yifan Zhang, Jian Cheng, and Hanqing Lu. Skeleton-based action recognition with directed graph neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7912–7921, 2019.
[33] Lei Shi, Yifan Zhang, Jian Cheng, and Hanqing Lu. Two-stream adaptive graph convolutional networks for skeleton-based action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 12026–12035, 2019.
[34] Chenyang Si, Wentao Chen, Wei Wang, Liang Wang, and Tieniu Tan. An attention enhanced graph convolutional lstm network for skeleton-based action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1227–1236, 2019.
[35] Chenyang Si, Ya Jing, Wei Wang, Liang Wang, and Tieniu Tan. Skeleton-based action recognition with spatial reasoning and temporal stack learning. In Proceedings of the European Conference on Computer Vision (ECCV), pages 103–118, 2018.
[36] Sijie Song, Cuiling Lan, Junliang Xing, Wenjun Zeng, and Jiaying Liu. An end-to-end spatio-temporal attention model for human action recognition from skeleton data. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
[37] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alex Alemi. Inception-v4, inception-resnet and the impact of residual connections on learning, 2016.
[38] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 4489–4497, 2015.
[39] Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6450–6459, 2018.
[40] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks. In International Conference on Learning Representations (ICLR), 2018.
[41] Petar Veličković, William Fedus, William L Hamilton, Pietro Liò, Yoshua Bengio, and R Devon Hjelm. Deep graph infomax. arXiv preprint arXiv:1809.10341, 2018.
[42] Raviteja Vemulapalli, Felipe Arrate, and Rama Chellappa. Human action recognition by representing 3d skeletons as points in a lie group. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 588–595, 2014.
[43] Jiang Wang, Zicheng Liu, Ying Wu, and Junsong Yuan. Mining actionlet ensemble for action recognition with depth cameras. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 1290–1297. IEEE, 2012.
[44] Xiaolong Wang and Abhinav Gupta. Videos as space-time region graphs. In Proceedings of the European Conference on Computer Vision (ECCV), pages 399–417, 2018.
[45] Felix Wu, Amauri Souza, Tianyi Zhang, Christopher Fifty, Tao Yu, and Kilian Weinberger. Simplifying graph convolutional networks. In Proceedings of the 36th International Conference on Machine Learning, pages 6861–6871. PMLR, 2019.
[46] Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and Philip S. Yu. A comprehensive survey on graph neural networks. CoRR, abs/1901.00596, 2019.
[47] Zonghan Wu, Shirui Pan, Guodong Long, Jing Jiang, and Chengqi Zhang. Graph wavenet for deep spatial-temporal graph modeling. arXiv preprint arXiv:1906.00121, 2019.
[48] Chunyu Xie, Ce Li, Baochang Zhang, Chen Chen, Jungong Han, and Jianzhuang Liu. Memory attention networks for skeleton-based action recognition. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, 2018.
[49] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? In International Conference on Learning Representations (ICLR), 2019.
[50] Sijie Yan, Yuanjun Xiong, and Dahua Lin. Spatial temporal graph convolutional networks for skeleton-based action recognition. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[51] Zhitao Ying, Jiaxuan You, Christopher Morris, Xiang Ren, Will Hamilton, and Jure Leskovec. Hierarchical graph representation learning with differentiable pooling. In Advances in Neural Information Processing Systems, pages 4800–4810, 2018.
[52] Bing Yu, Haoteng Yin, and Zhanxing Zhu. Spatio-temporal graph convolutional networks: A deep learning framework for traffic forecasting. arXiv preprint arXiv:1709.04875, 2017.
[53] Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions, 2015.
[54] Pengfei Zhang, Cuiling Lan, Junliang Xing, Wenjun Zeng, Jianru Xue, and Nanning Zheng. View adaptive recurrent neural networks for high performance human action recognition from skeleton data. In Proceedings of the IEEE International Conference on Computer Vision, pages 2117–2126, 2017.
