
Disentangling and Unifying Graph Convolutions

for Skeleton-Based Action Recognition

Ziyu Liu¹˒³, Hongwen Zhang², Zhenghao Chen¹, Zhiyong Wang¹, Wanli Ouyang¹˒³
¹ The University of Sydney   ² University of Chinese Academy of Sciences & CASIA
³ The University of Sydney, SenseTime Computer Vision Research Group, Australia
{zliu6676@uni.,zhenghao.chen@,zhiyong.wang@,wanli.ouyang@}sydney.edu.au, [email protected]

Abstract

Spatial-temporal graphs have been widely used by skeleton-based action recognition algorithms to model human action dynamics. To capture robust movement patterns from these graphs, long-range and multi-scale context aggregation and spatial-temporal dependency modeling are critical aspects of a powerful feature extractor. However, existing methods have limitations in achieving (1) unbiased long-range joint relationship modeling under multi-scale operators and (2) unobstructed cross-spacetime information flow for capturing complex spatial-temporal dependencies. In this work, we present (1) a simple method to disentangle multi-scale graph convolutions and (2) a unified spatial-temporal graph convolutional operator named G3D. The proposed multi-scale aggregation scheme disentangles the importance of nodes in different neighborhoods for effective long-range modeling. The proposed G3D module leverages dense cross-spacetime edges as skip connections for direct information propagation across the spatial-temporal graph. By coupling these proposals, we develop a powerful feature extractor named MS-G3D based on which our model¹ outperforms previous state-of-the-art methods on three large-scale datasets: NTU RGB+D 60, NTU RGB+D 120, and Kinetics Skeleton 400.

¹ Code is available at github.com/kenziyuliu/ms-g3d

[Figure 1: panels (a) Spatial and Temporal Information Flow, (b) Spatial-Temporal Information Flow, (c) Disentangled Multi-Scale Aggregation.]
Figure 1: (a) Factorized spatial and temporal modeling on skeleton graph sequences causes indirect information flow. (b) In this work, we propose to capture cross-spacetime correlations with unified spatial-temporal graph convolutions. (c) Disentangling node features at separate spatial-temporal neighborhoods (yellow, blue, red at different distances, partially colored for clarity) is pivotal for effective multi-scale learning in the spatial-temporal domain.

1. Introduction

Human action recognition is an important task with many real-world applications. In particular, skeleton-based human action recognition involves predicting actions from skeleton representations of human bodies instead of raw RGB videos, and the significant results seen in recent work [50, 33, 32, 34, 21, 20, 54, 35] have proven its merits. In contrast to RGB representations, skeleton data contain only the 2D [50, 15] or 3D [31, 25] positions of the human key joints, providing highly abstract information that is also free of environmental noises (e.g. background clutter, lighting conditions, clothing), allowing action recognition algorithms to focus on the robust features of the action.

Earlier approaches to skeleton-based action recognition treat human joints as a set of independent features, and they model the spatial and temporal joint correlations through hand-crafted [42, 43] or learned [31, 6, 48, 54] aggregations of these features. However, these methods overlook the inherent relationships between the human joints, which are best captured with human skeleton graphs with joints as nodes and their natural connectivity (i.e. "bones") as edges. For this reason, recent approaches [50, 19, 34, 35, 32] model the joint movement patterns of an action with a skeleton spatial-temporal graph, which is a series of disjoint and isomorphic skeleton graphs at different time steps carrying information in both spatial and temporal dimensions.

For robust action recognition from skeleton graphs, an ideal algorithm should look beyond the local joint connectivity and extract multi-scale structural features and long-range dependencies, since joints that are structurally apart can also have strong correlations.

Many existing approaches achieve this by performing graph convolutions [17] with higher-order polynomials of the skeleton adjacency matrix: intuitively, a powered adjacency matrix captures the number of walks between every pair of nodes, with the length of the walks being the same as the power; the adjacency polynomial thus increases the receptive field of graph convolutions by making distant neighbors reachable. However, this formulation suffers from the biased weighting problem, where the existence of cyclic walks on undirected graphs means that edge weights will be biased towards closer nodes against further nodes. On skeleton graphs, this means that a higher polynomial order is only marginally effective at capturing information from distant joints, since the aggregated features will be dominated by the joints from local body parts. This is a critical drawback limiting the scalability of existing multi-scale aggregators.

Another desirable characteristic of robust algorithms is the ability to leverage the complex cross-spacetime joint relationships for action recognition. However, to this end, most existing approaches [50, 33, 19, 32, 21, 34, 18] deploy interleaving spatial-only and temporal-only modules (Fig. 1(a)), analogous to factorized 3D convolutions [30, 39]. A typical approach is to first use graph convolutions to extract spatial relationships at each time step, and then use recurrent [19, 34, 18] or 1D convolutional [50, 33, 21, 32] layers to model temporal dynamics. While such factorization allows efficient long-range modeling, it hinders the direct information flow across spacetime for capturing complex regional spatial-temporal joint dependencies. For example, the action "standing up" often has co-occurring movements of the upper and lower body across both space and time, where upper body movements (leaning forward) strongly correlate with the lower body's future movements (standing up). These strong cues for making predictions may be ineffectively captured by factorized modeling.

In this work, we address the above limitations from two aspects. First, we propose a new multi-scale aggregation scheme that tackles the biased weighting problem by removing redundant dependencies between further and closer neighborhoods, thus disentangling their features under multi-scale aggregation (illustrated in Fig. 2). This leads to more powerful multi-scale operators that can model relationships of joints irrespective of the distances between them. Second, we propose G3D, a novel unified spatial-temporal graph convolution module that directly models cross-spacetime joint dependencies. G3D does so by introducing graph edges across the "3D" spatial-temporal domain as skip connections for unobstructed information flow (Fig. 1(b)), substantially facilitating spatial-temporal feature learning. Remarkably, our proposed disentangled aggregation scheme augments G3D with multi-scale reasoning in spacetime (Fig. 1(c)) without being affected by the biased weighting problem, despite the extra edges introduced. The resulting powerful feature extractor, named MS-G3D, forms a building block of our final model architecture that outperforms state-of-the-art methods on three large-scale skeleton action datasets: NTU RGB+D 120 [25], NTU RGB+D 60 [31], and Kinetics Skeleton 400 [15]. The main contributions of this work are summarized as follows:

(i) We propose a disentangled multi-scale aggregation scheme that removes redundant dependencies between node features from different neighborhoods, which allows powerful multi-scale aggregators to effectively capture graph-wide joint relationships on human skeletons.

(ii) We propose a unified spatial-temporal graph convolution (G3D) operator which facilitates direct information flow across spacetime for effective feature learning.

(iii) Integrating the disentangled aggregation scheme with G3D gives a powerful feature extractor (MS-G3D) with multi-scale receptive fields across both spatial and temporal dimensions. The direct multi-scale aggregation of features in spacetime further boosts model performance.

2. Related Work

2.1. Neural Nets on Graphs

Architectures. To extract features from arbitrarily structured graphs, Graph Neural Networks (GNNs) have been developed and explored extensively [5, 17, 3, 2, 10, 40, 49, 1, 7, 11, 22]. Recently proposed GNNs can broadly be classified into spectral GNNs [3, 11, 22, 13, 17] and spatial GNNs [17, 49, 10, 51, 41, 1, 45]. Spectral GNNs convolve the input graph signals with a set of learned filters in the graph Fourier domain. They are, however, limited in terms of computational efficiency and generalizability to new graphs due to the requirement of eigendecomposition and the assumption of fixed adjacency. Spatial GNNs, in contrast, generally perform a layer-wise update for each node by (1) selecting neighbors with a neighborhood function (e.g. adjacent nodes); (2) merging the features from the selected neighbors and itself with an aggregation function (e.g. mean pooling); and (3) applying an activated transformation to the merged features (e.g. an MLP [49]). Among different GNN variants, the Graph Convolutional Network (GCN) [17] was first introduced as a first-order approximation for localized spectral convolutions, but its simplicity as a mean neighborhood aggregator [49, 46] has quickly led many subsequent spatial GNN architectures [49, 1, 45, 7] and various applications involving graph structured data [44, 47, 52, 50, 33, 34, 21] to treat it as a spatial GNN baseline. This work adapts the layer-wise update rule in GCN.

Multi-Scale Graph Convolutions. Multi-scale spatial GNNs have also been proposed to capture features from non-local neighbors. [1, 19, 21, 45, 24] use higher-order polynomials of the graph adjacency matrix to aggregate features from long-range neighbor nodes. The Truncated Block Krylov network [29] similarly raises the adjacency matrix to higher powers and obtains multi-scale information through dense feature concatenation from different hidden layers. LanczosNet [24] deploys a low-rank approximation of the adjacency matrix to speed up the exponentiation on large graphs. As mentioned in Section 1, we argue that adjacency powering can have adverse effects on long-range modeling due to weighting bias, and our proposed module aims to address this with disentangled multi-scale aggregators.

2.2. Skeleton-Based Action Recognition

Earlier approaches [42, 6, 31, 36, 43, 48, 54] to skeleton-based action recognition focus on hand-crafting features and joint relationships for downstream classifiers, which ignore the important semantic connectivity of the human body. By constructing spatial-temporal graphs and modeling the spatial relationships with GNNs directly, recent approaches [50, 19, 8, 21, 33, 32, 34, 18] have seen significant performance boosts, indicating the necessity of the semantic human skeleton for action predictions.

An early application of graph convolutions is ST-GCN [50], where spatial graph convolutions along with interleaving temporal convolutions are used for spatial-temporal modeling. A concurrent work by Li et al. [19] presents a similar approach, but it notably introduces a multi-scale module by raising skeleton adjacency to higher powers. AS-GCN [21] also uses adjacency powering for multi-scale modeling, but it additionally generates human poses to augment the spatial graph convolution. The Spatial-Temporal Graph Routing (STGR) network [18] adds extra edges to the skeleton graph using frame-wise attention and global self-attention mechanisms. Similarly, 2s-AGCN [33] introduces graph adaptiveness with self-attention along with a freely learned graph residual mask. It also uses a two-stream ensemble with skeleton bone features to boost performance. DGNN [32] likewise leverages bone features, but it instead simultaneously updates the joint and bone features through an alternating spatial aggregation scheme. Note that these approaches primarily focus on spatial modeling; in contrast, we present a unified approach for capturing complex joint correlations directly across spacetime.

Another relevant work is GR-GCN [8], which merges every three frames over the skeleton graph sequence and adds sparsified edges between adjacent frames. Whereas GR-GCN also deploys cross-spacetime edges, our G3D module has several important distinctions: (1) Cross-spacetime edges in G3D follow the semantic human skeleton, which is naturally a more interpretable and more robust representation than the sparsified, one-size-fits-all graph in GR-GCN; the underlying graph is also much easier to compute. (2) GR-GCN has cross-spacetime edges only between adjacent frames, which prevents it from reasoning beyond a limited temporal context of three frames. (3) G3D can learn from multiple temporal contexts simultaneously by leveraging different window sizes and dilations, which is not addressed in GR-GCN.

3. MS-G3D

3.1. Preliminaries

Notations. A human skeleton graph is denoted as $G = (V, E)$, where $V = \{v_1, ..., v_N\}$ is the set of $N$ nodes representing joints, and $E$ is the edge set representing bones, captured by an adjacency matrix $A \in \mathbb{R}^{N \times N}$ where initially $A_{i,j} = 1$ if an edge directs from $v_i$ to $v_j$ and 0 otherwise. $A$ is symmetric since $G$ is undirected. Actions as graph sequences have a node feature set $X = \{x_{t,n} \in \mathbb{R}^C \mid t, n \in \mathbb{Z}, 1 \leq t \leq T, 1 \leq n \leq N\}$ represented as a feature tensor $X \in \mathbb{R}^{T \times N \times C}$, where $x_{t,n} = X_{t,n,:}$ is the $C$-dimensional feature vector for node $v_n$ at time $t$ over a total of $T$ frames. The input action is thus adequately described by $A$ structurally and by $X$ feature-wise, with $X_t \in \mathbb{R}^{N \times C}$ being the node features at time $t$. $\Theta^{(l)} \in \mathbb{R}^{C_l \times C_{l+1}}$ denotes a learnable weight matrix at layer $l$ of a network.

Graph Convolutional Nets (GCNs). On skeleton inputs defined by features $X$ and graph structure $A$, the layer-wise update rule of GCNs can be applied to features at time $t$ as:

$$X_t^{(l+1)} = \sigma\big(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} X_t^{(l)} \Theta^{(l)}\big), \qquad (1)$$

where $\tilde{A} = A + I$ is the skeleton graph with added self-loops to keep identity features, $\tilde{D}$ is the diagonal degree matrix of $\tilde{A}$, and $\sigma(\cdot)$ is an activation function. The term $\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} X_t^{(l)}$ can be intuitively interpreted as an approximate spatial mean feature aggregation from the direct neighborhood, followed by an activated linear layer.
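
For concreteness, Eq. 1 can be sketched in a few lines. This is a minimal illustration assuming PyTorch; the class name and tensor layout are ours, not taken from the released code:

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """Eq. 1: X_t <- sigma(D~^-1/2 A~ D~^-1/2 X_t Theta), applied per frame."""
    def __init__(self, A: torch.Tensor, c_in: int, c_out: int):
        super().__init__()
        A_tilde = A + torch.eye(A.size(0))               # add self-loops
        d_inv_sqrt = torch.diag(A_tilde.sum(dim=1).pow(-0.5))
        # Fixed, symmetrically normalized adjacency (a buffer, not learned)
        self.register_buffer("A_hat", d_inv_sqrt @ A_tilde @ d_inv_sqrt)
        self.theta = nn.Linear(c_in, c_out, bias=False)  # Theta^(l)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (T, N, C) -- one skeleton graph per frame; the broadcasted matmul
        # averages each node's direct neighborhood before the linear map.
        return torch.relu(self.theta(self.A_hat @ x))
```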

3.2. Disentangled Multi-Scale Aggregation

Biased Weighting Problem. Under the spatial aggregation framework in Eq. 1, existing approaches [21] employ higher-order polynomials of the adjacency matrix to aggregate multi-scale structural information at time $t$, as:

$$X_t^{(l+1)} = \sigma\Big(\sum_{k=0}^{K} \hat{A}^k X_t^{(l)} \Theta_{(k)}^{(l)}\Big), \qquad (2)$$

where $K$ controls the number of scales to aggregate. Here, $\hat{A}$ is a normalized form of $A$; e.g., [19] uses the symmetric normalized graph Laplacian $\hat{A} = L^{\mathrm{norm}} = I - D^{-\frac{1}{2}} A D^{-\frac{1}{2}}$; [21] uses the random-walk normalized adjacency $\hat{A} = D^{-1} A$; more generally, one can use $\hat{A} = \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}}$ from GCNs. It is easy to see that $A^k_{i,j} = A^k_{j,i}$ gives the number of length-$k$ walks between $v_i$ and $v_j$, and thus the term $\hat{A}^k X_t^{(l)}$ is performing a weighted feature average based on the number of such walks.

However, it is clear that there are drastically more possible length-$k$ walks to closer nodes than to the actual $k$-hop neighbors due to cyclic walks. This causes a bias towards the local region as well as towards nodes with higher degrees. The node self-loops in GCNs allow even more possible cycles (as walks can always cycle on self-loops) and thus amplify the bias. See Fig. 2 for illustration. Under multi-scale aggregation on skeleton graphs, the aggregated features will thus be dominated by signals from local body parts, making it ineffective to capture long-range joint dependencies with higher polynomial orders.

[Figure 2: skeleton graphs and the corresponding adjacency matrices for the powered adjacencies $\hat{A}^1, \hat{A}^2, \hat{A}^3$ (top) versus the disentangled $k$-adjacencies $\tilde{A}_{(1)}, \tilde{A}_{(2)}, \tilde{A}_{(3)}$ (bottom).]
Figure 2: Illustration of the biased weighting problem and the proposed disentangled aggregation scheme. Darker color indicates higher weighting to the central node (red). Top left: closer nodes receive higher weighting from adjacency powering, which makes long-range modeling less effective, especially when multiple scales are aggregated. Bottom left: our proposed disentangled aggregation models joint relationships at each neighborhood while keeping identity features. Right: visualizing the corresponding adjacency matrices. Node self-loops are omitted for visual clarity.

Disentangling Neighborhoods. To address the above problem, we first define the $k$-adjacency matrix $\tilde{A}_{(k)}$ as

$$[\tilde{A}_{(k)}]_{i,j} = \begin{cases} 1 & \text{if } d(v_i, v_j) = k, \\ 1 & \text{if } i = j, \\ 0 & \text{otherwise,} \end{cases} \qquad (3)$$

where $d(v_i, v_j)$ gives the shortest distance in number of hops between $v_i$ and $v_j$. $\tilde{A}_{(k)}$ is thus a generalization of $\tilde{A}$ to further neighborhoods, with $\tilde{A}_{(1)} = \tilde{A}$ and $\tilde{A}_{(0)} = I$. Under spatial aggregation in Eq. 1, the inclusion of self-loops in $\tilde{A}_{(k)}$ is critical for learning the relationships between the current joint and its $k$-hop neighbors, as well as for keeping each joint's identity information when no $k$-hop neighbors are available. Given that $N$ is small, $\tilde{A}_{(k)}$ can be easily computed, e.g., using differences of graph powers as $\tilde{A}_{(k)} = I + \mathbb{1}\big(\tilde{A}^k \geq 1\big) - \mathbb{1}\big(\tilde{A}^{k-1} \geq 1\big)$. Substituting $\hat{A}^k$ with $\tilde{A}_{(k)}$ in Eq. 2, we arrive at:

$$X_t^{(l+1)} = \sigma\Big(\sum_{k=0}^{K} \tilde{D}_{(k)}^{-\frac{1}{2}} \tilde{A}_{(k)} \tilde{D}_{(k)}^{-\frac{1}{2}} X_t^{(l)} \Theta_{(k)}^{(l)}\Big), \qquad (4)$$

where $\tilde{D}_{(k)}^{-\frac{1}{2}} \tilde{A}_{(k)} \tilde{D}_{(k)}^{-\frac{1}{2}}$ is the normalized [17] $k$-adjacency. Unlike the previous case, where possible length-$k$ walks are predominantly conditioned on length-$(k-1)$ walks, the proposed disentangled formulation in Eq. 4 addresses the biased weighting problem by removing redundant dependencies of distant neighborhoods' weighting on closer neighborhoods. Additional scales with larger $k$ are therefore aggregated in an additive manner under a multi-scale operator, making long-range modeling with large values of $k$ remain effective. The resulting $k$-adjacency matrices are also more sparse than their exponentiated counterparts (see Fig. 2), allowing more efficient representations.
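
A minimal sketch of Eq. 3 and Eq. 4 (assuming PyTorch; the helper and class names are ours):

```python
import torch
import torch.nn as nn

def k_adjacency(A: torch.Tensor, k: int) -> torch.Tensor:
    """Eq. 3 via differences of graph powers:
    A~_(k) = I + 1[A~^k >= 1] - 1[A~^(k-1) >= 1]."""
    I = torch.eye(A.size(0))
    if k == 0:
        return I
    A_tilde = A + I
    reach_k = (torch.linalg.matrix_power(A_tilde, k) >= 1).float()
    reach_k_1 = (torch.linalg.matrix_power(A_tilde, k - 1) >= 1).float()
    return I + reach_k - reach_k_1   # exactly the distance-k nodes, plus self

def sym_norm(A: torch.Tensor) -> torch.Tensor:
    d = torch.diag(A.sum(dim=1).pow(-0.5))
    return d @ A @ d

class DisentangledMSGCN(nn.Module):
    """Eq. 4: one independent aggregation per neighborhood scale, summed."""
    def __init__(self, A: torch.Tensor, K: int, c_in: int, c_out: int):
        super().__init__()
        adj = torch.stack([sym_norm(k_adjacency(A, k)) for k in range(K + 1)])
        self.register_buffer("adj", adj)   # (K+1, N, N), fixed
        self.theta = nn.ModuleList(
            [nn.Linear(c_in, c_out, bias=False) for _ in range(K + 1)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (T, N, C); no scale's weights depend on walks through closer scales
        return torch.relu(sum(th(Ak @ x) for Ak, th in zip(self.adj, self.theta)))
```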

3.3. G3D: Unified Spatial-Temporal Modeling

Most existing work treats skeleton actions as a sequence of disjoint graphs where features are extracted through spatial-only (e.g. GCN) and temporal-only (e.g. TCN) modules. We argue that such a factorized formulation is less effective for capturing complex spatial-temporal joint relationships. Clearly, if a strong connection exists between a pair of nodes, then during layer-wise propagation the pair should incorporate a significant portion of each other's features to reflect such a connection [50, 33, 34]. However, as signals are propagated across spacetime through a series of local aggregators (GCNs and TCNs alike), they are weakened as redundant information is aggregated from an increasingly larger spatial-temporal receptive field. The problem is more evident if one observes that GCNs do not perform a weighted aggregation to distinguish each neighbor.

Cross-Spacetime Skip Connections. To tackle the above problem, we propose a more reasonable approach to allow cross-spacetime skip connections, which are readily modeled with cross-spacetime edges in a spatial-temporal graph. Let us first consider a sliding temporal window of size $\tau$ over the input graph sequence, which, at each step, obtains a spatial-temporal subgraph $G_{(\tau)} = (V_{(\tau)}, E_{(\tau)})$ where $V_{(\tau)} = V_1 \cup ... \cup V_\tau$ is the union of all node sets across the $\tau$ frames in the window. The initial edge set $E_{(\tau)}$ is defined by tiling $\tilde{A}$ into a block adjacency matrix $\tilde{A}_{(\tau)}$, where

$$\tilde{A}_{(\tau)} = \begin{bmatrix} \tilde{A} & \cdots & \tilde{A} \\ \vdots & \ddots & \vdots \\ \tilde{A} & \cdots & \tilde{A} \end{bmatrix} \in \mathbb{R}^{\tau N \times \tau N}. \qquad (5)$$

Intuitively, each submatrix $[\tilde{A}_{(\tau)}]_{i,j} = \tilde{A}$ means every node in $V_i$ is connected to itself and its 1-hop spatial neighbors at frame $j$, extrapolating the frame-wise spatial connectivity (which is $[\tilde{A}_{(\tau)}]_{i,i}$ for all $i$) to the temporal domain. Thus, each node within $G_{(\tau)}$ is densely connected to itself and its 1-hop spatial neighbors across all $\tau$ frames. We can easily obtain $X_{(\tau)} \in \mathbb{R}^{T \times \tau N \times C}$ using the same sliding window over $X$ with zero padding to construct $T$ windows. Using Eq. 1, we thus arrive at a unified spatial-temporal graph convolutional operator for the $t$-th temporal window:

$$[X_{(\tau)}]_t^{(l+1)} = \sigma\big(\tilde{D}_{(\tau)}^{-\frac{1}{2}} \tilde{A}_{(\tau)} \tilde{D}_{(\tau)}^{-\frac{1}{2}} [X_{(\tau)}]_t^{(l)} \Theta^{(l)}\big). \qquad (6)$$

Dilated Windows. Another significant aspect of the above window construction is that the frames need not be adjacent. A dilated window with $\tau$ frames and a dilation rate $d$ can be constructed by picking a frame every $d$ frames and reusing the same spatial-temporal structure $\tilde{A}_{(\tau)}$. Similarly, we can obtain node features $X_{(\tau,d)} \in \mathbb{R}^{T \times \tau N \times C}$ ($d = 1$ if omitted) and perform the layer-wise update as in Eq. 6. Dilated windows allow larger temporal receptive fields without growing the size of $\tilde{A}_{(\tau)}$, analogous to how dilated convolutions [53] keep constant complexities.
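
Window extraction itself reduces to strided slicing over a padded time axis; a sketch under the assumptions above (symmetric zero padding so that all $T$ windows exist, and odd $\tau$ as in our configurations):

```python
import torch
import torch.nn.functional as F

def extract_windows(x: torch.Tensor, tau: int, d: int = 1) -> torch.Tensor:
    """Slide a dilated temporal window over x: (T, N, C) -> (T, tau*N, C).
    Window t gathers frames t, t+d, ..., t+(tau-1)*d of the padded sequence,
    i.e. a temporal context of (tau-1)*d + 1 frames centered on frame t."""
    T, N, C = x.shape
    pad = (tau - 1) * d // 2
    x = F.pad(x, (0, 0, 0, 0, pad, pad))   # zero-pad only the time axis
    windows = torch.stack(
        [x[t : t + (tau - 1) * d + 1 : d] for t in range(T)])  # (T, tau, N, C)
    return windows.reshape(T, tau * N, C)

# e.g. with T = 300, N = 25, C = 96: extract_windows(x, tau=5, d=2)
# yields (300, 125, 96), ready for the unified update of Eq. 6 / Eq. 7.
```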

Multi-Scale G3D. We can also integrate the proposed disentangled multi-scale aggregation scheme (Eq. 4) into G3D for multi-scale reasoning directly in the spatial-temporal domain. We thus derive the MS-G3D module from Eq. 6 as:

$$[X_{(\tau)}]_t^{(l+1)} = \sigma\Big(\sum_{k=0}^{K} \tilde{D}_{(\tau,k)}^{-\frac{1}{2}} \tilde{A}_{(\tau,k)} \tilde{D}_{(\tau,k)}^{-\frac{1}{2}} [X_{(\tau)}]_t^{(l)} \Theta_{(k)}^{(l)}\Big), \qquad (7)$$

where $\tilde{A}_{(\tau,k)}$ and $\tilde{D}_{(\tau,k)}$ are defined similarly to $\tilde{A}_{(k)}$ and $\tilde{D}_{(k)}$, respectively. Remarkably, our proposed disentangled aggregation scheme complements this unified operator, as G3D's increased node degrees from spatial-temporal connectivity can contribute to the biased weighting problem.

Discussion. We give more in-depth analyses on G3D as follows. (1) It is analogous to classical 3D convolutional blocks [38], with its spatial-temporal receptive field defined by $\tau$, $d$, and $\tilde{A}$. (2) Unlike 3D convolutions, G3D's parameter count from $\Theta^{(\cdot)}$ is independent of $\tau$ or $|E_{(\tau)}|$, making it generally less prone to overfitting with large $\tau$. (3) The dense cross-spacetime connections in G3D entail a trade-off on $\tau$, as larger values of $\tau$ bring larger temporal receptive fields at the cost of more generic features due to larger immediate neighborhoods. Additionally, a larger $\tau$ implies a quadratically larger $\tilde{A}_{(\tau)}$ and thus more operations with multi-scale aggregation. On the other hand, larger dilations $d$ bring larger temporal coverage at the cost of temporal resolution (lower frame rates). $\tau$ and $d$ must therefore be balanced carefully. (4) G3D modules are designed to capture complex regional spatial-temporal dependencies instead of long-range dependencies, which are otherwise more economically captured by factorized modules. We thus observe the best performance when G3D modules are augmented with long-range, factorized modules, which we discuss in the next section.

3.4. Model Architecture

Overall Architecture. The final model architecture is illustrated in Fig. 3. On a high level, it contains a stack of $r$ spatial-temporal graph convolutional (STGC) blocks to extract features from skeleton sequences, followed by a global average pooling layer and a softmax classifier. Each STGC block deploys two types of pathways to simultaneously capture complex regional spatial-temporal joint correlations as well as long-range spatial and temporal dependencies: (1) The G3D pathway first constructs spatial-temporal windows, performs disentangled multi-scale graph convolutions on them, and then collapses them with a fully connected layer for window feature readout. The extra dotted G3D pathway (Fig. 3(b)) indicates that the model can learn from multiple spatial-temporal contexts concurrently with different $\tau$ and $d$. (2) The factorized pathway augments the G3D pathway with long-range, spatial-only, and temporal-only modules: the first layer is a multi-scale graph convolutional layer capable of modeling the entire skeleton graph with the maximum $K$; it is then followed by two multi-scale temporal convolution layers to capture extended temporal contexts (discussed below). The outputs from all pathways are aggregated as the STGC block output, which has 96, 192, and 384 feature channels respectively within a typical $r = 3$ block architecture. Batch normalization [14] and ReLU are added at the end of each layer except for the last layer. All STGC blocks, except the first, downsample the temporal dimension with stride-2 temporal convolutions and sliding windows.

Multi-Scale Temporal Modeling. The spatial-temporal windows $G_{(\tau)}$ used by G3D are a closed structure by themselves, which means G3D must be accompanied by temporal modules for cross-window information exchange. Many existing works [50, 18, 33, 32, 21] perform temporal modeling using temporal convolutions with a fixed kernel size $k_t \times 1$ throughout the architecture. As a natural extension to our multi-scale spatial aggregation, we enhance vanilla temporal convolutional layers with multi-scale learning, as illustrated in Fig. 3(c). To lower the computational costs due to the extra branches, we deploy a bottleneck design [37], fix kernel sizes at $3 \times 1$, and use different dilation rates [53] instead of larger kernels for larger receptive fields. We also use residual connections [12] to facilitate training.
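
A sketch of one such multi-scale temporal layer, following our reading of Fig. 3(c); the branch composition (four dilated branches, no pooling branch) and the even channel split are simplifying assumptions rather than the exact released configuration:

```python
import torch
import torch.nn as nn

class MSTCN(nn.Module):
    """Bottlenecked 3x1 temporal convolutions at several dilation rates,
    concatenated and wrapped in a residual connection."""
    def __init__(self, channels: int, dilations=(1, 2, 3, 4)):
        super().__init__()
        assert channels % len(dilations) == 0
        c_b = channels // len(dilations)   # bottleneck width per branch
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels, c_b, kernel_size=1),   # 1x1 bottleneck
                nn.BatchNorm2d(c_b),
                nn.ReLU(inplace=True),
                # 3x1 temporal kernel; dilation d spans 2d+1 frames
                nn.Conv2d(c_b, c_b, kernel_size=(3, 1),
                          padding=(d, 0), dilation=(d, 1)),
            )
            for d in dilations])
        self.bn = nn.BatchNorm2d(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, C, T, N); branches preserve T, concat restores C channels
        out = torch.cat([branch(x) for branch in self.branches], dim=1)
        return torch.relu(self.bn(out) + x)
```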

Adaptive Graphs. To improve the flexibility of graph convolutional layers, which perform homogeneous neighborhood averaging, we add a simple learnable, unconstrained graph residual mask $A^{\mathrm{res}}$, inspired by [33, 32], to every $\tilde{A}_{(k)}$ and $\tilde{A}_{(\tau,k)}$ to strengthen, weaken, add, or remove edges dynamically. For example, Eq. 4 is updated to

$$X_t^{(l+1)} = \sigma\Big(\sum_{k=0}^{K} \tilde{D}_{(k)}^{-\frac{1}{2}} \big(\tilde{A}_{(k)} + A_{(k)}^{\mathrm{res}}\big) \tilde{D}_{(k)}^{-\frac{1}{2}} X_t^{(l)} \Theta_{(k)}^{(l)}\Big). \qquad (8)$$

$A^{\mathrm{res}}$ is initialized with random values around zero and is different for each $k$ and $\tau$, allowing each multi-scale context (either spatial or spatial-temporal) to select the best suited mask. Note also that since $A^{\mathrm{res}}$ is optimized for all possible actions, which may have different optimal edge sets for feature propagation, it is expected to give minor edge corrections and may be insufficient when the graph structures have major deficiencies. In particular, $A^{\mathrm{res}}$ only partially mitigates the biased weighting problem (see Section 4.3).
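
The mask itself is just an unconstrained learnable parameter per scale. A minimal sketch (ours; we add the mask after normalization rather than inside it, which is equivalent up to a fixed rescaling since $A^{\mathrm{res}}$ is unconstrained anyway):

```python
import torch
import torch.nn as nn

class AdaptiveAdjacency(nn.Module):
    """One fixed normalized (k-)adjacency plus a learnable residual mask."""
    def __init__(self, A_hat: torch.Tensor):
        super().__init__()
        self.register_buffer("A_hat", A_hat)   # fixed graph structure
        # Small random values around zero; one mask per scale k (and per tau)
        self.res = nn.Parameter(1e-6 * torch.randn_like(A_hat))

    def forward(self) -> torch.Tensor:
        # Learned entries can strengthen, weaken, add, or remove edges
        return self.A_hat + self.res
```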

[Figure 3: architecture overview with panels (a) Full Architecture, (b) STGC Block, (c) MS-TCN, (d) MS-G3D, (e) MS-GCN-Disentangled.]
Figure 3: (Match components with colors) Architecture overview. "TCN", "GCN", prefix "MS-", and suffix "-D" denote temporal and graph convolutional blocks, and multi-scale and disentangled aggregation, respectively (Section 3.2). Each of the $r$ STGC blocks (b) deploys a multi-pathway design to capture long-range and regional spatial-temporal dependencies simultaneously. Dotted modules, including the extra G3D pathway, $1 \times 1$ convolutions, and strided temporal convolutions, are situational for the model performance/complexity trade-off.

Joint-Bone Two-Stream Fusion. Inspired by the two-stream methods in [33, 32, 34] and the intuition that visualizing bones along with joints can help humans recognize skeleton actions, we use a two-stream framework where a separate model with identical architecture is trained using the bone features, initialized as vector differences of adjacent joints directed away from the body center. The softmax scores from the joint/bone models are summed to obtain the final prediction scores. Since skeleton graphs are trees, we add a zero bone vector at the body center to obtain $N$ bones from $N$ joints and reuse $A$ for the connectivity definition.
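
A sketch of the bone construction and score-level fusion (ours; the `parents` array, which maps each joint to its neighbor toward the body center, is dataset-specific and assumed given):

```python
import numpy as np

def bone_features(joints: np.ndarray, parents: np.ndarray) -> np.ndarray:
    """joints: (T, N, C) coordinates. parents[n] indexes the adjacent joint of
    n toward the body center; the center maps to itself, giving the zero bone,
    so N joints yield N bones and the adjacency A can be reused."""
    return joints - joints[:, parents, :]

# Two-stream fusion at test time: sum the softmax scores of two identically
# structured models trained on joints and bones (model calls illustrative):
#   scores = softmax(joint_model(joints)) + softmax(bone_model(bones))
```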

4. Experiments

4.1. Datasets

NTU RGB+D 60 and NTU RGB+D 120. NTU RGB+D 60 [31] is a large-scale action recognition dataset containing 56,578 skeleton sequences over 60 action classes, captured from 40 distinct subjects and 3 different camera view angles. Each skeleton graph contains $N = 25$ body joints as nodes, with their 3D locations in space as initial features. Each frame of the action contains 1 to 2 subjects. The authors recommend reporting the classification accuracy under two settings: (1) Cross-Subject (X-Sub), where the 40 subjects are split into training and testing groups, yielding 40,091 and 16,487 training and testing examples respectively; (2) Cross-View (X-View), where all 18,932 samples collected from camera 1 are used for testing and the remaining 37,646 samples are used for training. NTU RGB+D 120 [25] extends NTU RGB+D 60 with an additional 57,367 skeleton sequences over 60 extra action classes, totalling 113,945 samples over 120 classes captured from 106 distinct subjects and 32 different camera setups. The authors now recommend replacing the Cross-View setting with a Cross-Setup (X-Set) setting, where 54,468 samples collected from half of the camera setups are used for training and the remaining 59,477 samples for testing. In Cross-Subject, 63,026 samples from a selected group of 53 subjects are used for training, and the remaining 50,919 samples for testing.

Kinetics Skeleton 400. The Kinetics Skeleton 400 dataset is adapted from the Kinetics 400 video dataset [15] using the OpenPose [4] pose estimation toolbox. It contains 240,436 training and 19,796 testing skeleton sequences over 400 classes, where each skeleton graph contains 18 body joints, along with their 2D spatial coordinates and the prediction confidence score from OpenPose as the initial joint features [50]. At each time step, the number of skeletons is capped at 2, and skeletons with lower overall confidence scores are discarded. Following the convention from [15, 50], Top-1 and Top-5 accuracies are reported.

4.2. Implementation Details

Unless otherwise stated, all models have $r = 3$ and are trained with SGD with momentum 0.9, batch size 32 (16 per worker), and an initial learning rate of 0.05 (which can linearly scale up with batch size [9]) for 50, 60, and 65 epochs, with step LR decay by a factor of 0.1 at epochs {30, 40}, {30, 50}, and {45, 55} for NTU RGB+D 60, NTU RGB+D 120, and Kinetics Skeleton 400, respectively. Weight decay is set to 0.0005 for final models and is adjusted accordingly during component studies. All skeleton sequences are padded to $T = 300$ frames by replaying the actions. Inputs are preprocessed with normalization and translation following [33, 32]. No data augmentation is used, for fair performance comparison.

4.3. Component Studies

We analyze the individual components and their configurations in the final architecture. Unless stated otherwise, performance is reported as classification accuracy on the Cross-Subject setting of NTU RGB+D 60 using only the joint data.

Disentangled Multi-Scale Aggregation. We first justify our proposed disentangled multi-scale aggregation scheme by verifying its effectiveness with different numbers of scales over sparse and dense graphs. In Table 1, we do so using the individual pathways of the STGC blocks (Fig. 3(b)), referred to as "GCN" and "G3D" respectively, with suffixes "-E" and "-D" denoting adjacency powering and disentangled aggregation. Here, the maximum $K = 12$ is the diameter of skeleton graphs from NTU RGB+D 60, and we set $\tau = 5$ for G3D modules. To keep consistent normalization, we set $\hat{A} = \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}}$ in Eq. 2 for GCN-E and G3D-E. We first observe that the disentangled formulation can bring as much as a 1.4% gain over simple adjacency powering at $K = 4$, underpinning the necessity of neighborhood disentanglement. In this case, the residual mask $A^{\mathrm{res}}$ partially corrects the weighting imbalance, narrowing the largest gap to 0.4%. However, the same set of experiments on the G3D pathway, where the window graph $G_{(\tau)}$ is denser than the spatial graph $G$, shows wider accuracy gaps between G3D-E and G3D-D, indicating a more severe biased weighting problem. In particular, we see a 0.8% performance gap at $K = 12$ even when residual masks are added. These results verify the effectiveness of the proposed disentangled aggregation scheme for multi-scale learning; it boosts performance across different numbers of scales not only in the spatial domain, but more so in the spatial-temporal domain, where it complements the proposed G3D module. In general, the spatial GCNs benefit more from large $K$ than do the spatial-temporal G3D modules; for final architectures, we empirically set $K \in \{12, 5\}$ for MS-GCN and MS-G3D blocks respectively.

Methods          K = 1   K = 4   K = 8   K = 12
GCN-E            85.1    85.6    86.5    86.6
GCN-D            85.1    87.0    86.9    86.8
GCN-E + Mask     86.1    87.0    87.5    87.7
GCN-D + Mask     86.1    86.9    87.9    87.8
G3D-E            85.1    85.5    85.4    85.5
G3D-D            85.1    86.4    86.5    86.4
G3D-E + Mask     86.6    87.0    86.5    86.2
G3D-D + Mask     86.6    87.4    87.1    87.0

Table 1: Accuracy (%) with multi-scale aggregation on individual pathways of STGC blocks with different $K$. "Mask" refers to the residual masks $A^{\mathrm{res}}$. If $K > 1$, GCN/G3D is multi-scale (MS-).

Effectiveness of G3D. To validate the efficacy of G3D modules in capturing complex spatial-temporal features, we build up the model incrementally with its individual components and show its performance in Table 2. We use the joint stream from 2s-AGCN [33] as the baseline for controlled experiments, and for fair comparison, we replaced its regular temporal convolutional layers with MS-TCN layers and obtained an improvement with fewer parameters. First, we observe that the factorized pathway alone can outperform the baseline due to the powerful disentangled aggregation in MS-GCN. However, if we simply scale up the factorized pathway to a larger capacity (deeper and wider), or duplicate the factorized pathway to learn from different feature subspaces and mimic the multi-pathway design in STGC blocks, we observe limited gains. In contrast, when the G3D pathway is added, we observe consistently better results with similar or fewer parameters, verifying G3D's ability to pick up complex regional spatial-temporal correlations that are previously overlooked by modeling spatial and temporal dependencies in a factorized fashion.

Model Configurations                                   Params   Acc (%)
Baseline (Js-AGCN [33])                                3.5M     86.0
Baseline + MS-TCN                                      1.6M     86.7
MS-GCN (Factorized Pathway) Only                       1.4M     87.8
  with 2.5× Capacity                                   3.5M     88.5
  with Dual Pathway                                    2.8M     88.6
MS-GCN (Factorized Pathway)
  with MS-G3D (τ = 3, d = 1)                           2.7M     89.0
  with MS-G3D (τ = 3, d = 2)                           2.7M     89.1
  with MS-G3D (τ = 3, d = 3)                           2.7M     89.1
  with MS-G3D (τ = 5, d = 1)                           3.2M     89.2
  with MS-G3D (τ = 5, d = 2)                           3.2M     89.2
  with MS-G3D (τ = 7, d = 1)†                          3.0M     89.0
  with 2 MS-G3D Pathways†, τ = (3, 3), d = (1, 2)      2.8M     89.3
  with 2 MS-G3D Pathways†, τ = (3, 5), d = (1, 1)      3.2M     89.4

Table 2: Model accuracy with various settings. MS-GCN and MS-G3D use $K \in \{12, 5\}$ respectively. †Output channels double at the collapse window layer (Fig. 3(d), $C_{\mathrm{mid}}$ to $C_{\mathrm{out}}$) instead of at the graph convolution ($C_{\mathrm{in}}$ to $C_{\mathrm{mid}}$) to maintain a similar budget.

G3D Graph Connectivity              Params   Acc (%)
(1) Grid-like                       2.7M     88.7
(2) Grid-like + dense self-edges    2.7M     88.6
(Eq. 5) Cross-spacetime edges       2.7M     89.1

Table 3: Comparing graph connectivity settings ($\tau = 3$, $d = 2$).

Exploring G3D Configurations. Table 2 also compares various G3D settings, including different values of $\tau$, $d$, and the number of G3D pathways in STGC blocks.
We first observe that all configurations consistently outperform the baseline, confirming the stability of MS-G3D as a robust feature extractor. We also see that $\tau = 5$ gives slightly better results, but the gain diminishes at $\tau = 7$ as the aggregated features become too generic due to the oversized local spatial-temporal neighborhood, thus counteracting the benefits of larger temporal coverage. The dilation rate $d$ has varying effects: (1) when $\tau = 3$, $d = 1$ underperforms $d \in \{2, 3\}$, justifying the need for larger temporal contexts; (2) larger $d$ has marginal benefits, as its larger temporal coverage comes at a cost of temporal resolution (thus coarsened skeleton motions). We thus observe better results when two G3D pathways with $d = (1, 2)$ are combined, and, as expected, we obtain the best results when the temporal resolution is unaltered by setting $\tau = (3, 5)$.

Cross-Spacetime Connectivity. To demonstrate the need for the cross-spacetime edges in $G_{(\tau)}$ defined in Eq. 5 instead of simple, grid-like temporal self-edges (to which G3D also applies), we contrast different connectivity schemes in Table 3 while fixing other parts of the architecture. The first two settings refer to modifying the block adjacency matrix $\tilde{A}_{(\tau)}$ such that: (1) the blocks $\tilde{A}$ on the main diagonal are kept, the blocks on the superdiagonal/subdiagonal are set to $I$, and the rest are set to 0; and (2) all blocks but the main diagonal of $\tilde{A}$ are set to $I$. Intuitively, the first produces "3D grid" graphs and the second includes extra dense self-edges over $\tau$ frames. Clearly, while all settings allow unified spatial-temporal graph convolutions, cross-spacetime edges as skip connections are essential for efficient information flow.

Joint-Bone Two-Stream Fusion. We verify our method under the joint-bone fusion framework on the NTU RGB+D 60 dataset in Table 5. Similar to [33], we obtain the best performance when joint and bone features are fused, indicating the generalizability of our method to other input modalities.

4.4. Comparison against the State-of-the-Art

We compare our full model (Fig. 3(a)) to the state-of-the-art in Tables 4, 5, and 6. Table 4 compares non-graph [26, 27, 16, 28] and graph-based methods [33]. Table 5 compares non-graph methods [23, 20], graph-based methods with spatial edges [18, 21, 33, 34, 32], and methods with spatial-temporal edges [8]. Table 6 compares single-stream [50, 21] and multi-stream [18, 33, 32] methods. On all three large-scale datasets, our method outperforms all existing methods under all evaluation settings. Notably, our method is the first to apply a multi-pathway design to learn both long-range spatial and temporal dependencies and complex regional spatial-temporal correlations from skeleton sequences, and the results verify the effectiveness of our approach.

Methods                              X-Sub (%)   X-Set (%)
ST-LSTM [26]                         55.7        57.9
GCA-LSTM [27]                        61.2        63.3
RotClips + MTCNN [16]                62.2        61.8
Body Pose Evolution Map [28]         64.6        66.9
2s-AGCN [33]                         82.9        84.9
MS-G3D Net                           86.9        88.4

Table 4: Classification accuracy comparison against state-of-the-art methods on the NTU RGB+D 120 Skeleton dataset.

Methods                        X-Sub (%)   X-View (%)
IndRNN [23]                    81.8        88.0
HCN [20]                       86.5        91.1
ST-GR [18]                     86.9        92.3
AS-GCN [21]                    86.8        94.2
2s-AGCN [33]                   88.5        95.1
AGC-LSTM [34]                  89.2        95.0
DGNN [32]                      89.9        96.1
GR-GCN [8]                     87.5        94.3
MS-G3D Net (Joint Only)        89.4        95.0
MS-G3D Net (Bone Only)         90.1        95.3
MS-G3D Net                     91.5        96.2

Table 5: Classification accuracy comparison against state-of-the-art methods on the NTU RGB+D 60 Skeleton dataset.

Methods           Top-1 (%)   Top-5 (%)
ST-GCN [50]       30.7        52.8
AS-GCN [21]       34.8        56.5
ST-GR [18]        33.6        56.1
2s-AGCN [33]      36.1        58.7
DGNN [32]         36.9        59.6
MS-G3D Net        38.0        60.9

Table 6: Classification accuracy comparison against state-of-the-art methods on the Kinetics Skeleton 400 dataset.

5. Conclusion

In this work, we present two methods for improving skeleton-based action recognition: a disentangled multi-scale aggregation scheme for graph convolutions that removes redundant dependencies between different neighborhoods, and G3D, a unified spatial-temporal graph convolutional operator that directly models spatial-temporal dependencies from skeleton graph sequences. By coupling these methods, we derive MS-G3D, a powerful feature extractor that captures multi-scale spatial-temporal features previously overlooked by factorized modeling. With experiments on three large-scale datasets, we show that our model outperforms existing methods by a sizable margin.

Acknowledgements: This work was supported by the Australian Research Council Grant DP200103223. ZL thanks Weiqing Cao for designing figures.

References

[1] Sami Abu-El-Haija, Bryan Perozzi, Amol Kapoor, Nazanin Alipourfard, Kristina Lerman, Hrayr Harutyunyan, Greg Ver Steeg, and Aram Galstyan. MixHop: Higher-order graph convolutional architectures via sparsified neighborhood mixing. In Proceedings of the 36th International Conference on Machine Learning, pages 21–29. PMLR, 2019.
[2] James Atwood and Don Towsley. Diffusion-convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1993–2001, 2016.
[3] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203, 2013.
[4] Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. Realtime multi-person 2d pose estimation using part affinity fields. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[5] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pages 3844–3852, 2016.
[6] Yong Du, Wei Wang, and Liang Wang. Hierarchical recurrent neural network for skeleton based action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1110–1118, 2015.
[7] Hongyang Gao and Shuiwang Ji. Graph u-nets. In Proceedings of the 36th International Conference on Machine Learning, pages 2083–2092, 2019.
[8] Xiang Gao, Wei Hu, Jiaxiang Tang, Jiaying Liu, and Zongming Guo. Optimized skeleton-based action recognition via sparsified graph regression. In Proceedings of the 27th ACM International Conference on Multimedia, pages 601–610. ACM, 2019.
[9] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch sgd: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
[10] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pages 1024–1034, 2017.
[11] David K Hammond, Pierre Vandergheynst, and Rémi Gribonval. Wavelets on graphs via spectral graph theory. Applied and Computational Harmonic Analysis, 30(2):129–150, 2011.
[12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[13] Mikael Henaff, Joan Bruna, and Yann LeCun. Deep convolutional networks on graph-structured data, 2015.
[14] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[15] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, Mustafa Suleyman, and Andrew Zisserman. The kinetics human action video dataset, 2017.
[16] Qiuhong Ke, Mohammed Bennamoun, Senjian An, Ferdous Sohel, and Farid Boussaid. Learning clip representations for skeleton-based 3d action recognition. IEEE Transactions on Image Processing, 27(6):2842–2855, 2018.
[17] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
[18] Bin Li, Xi Li, Zhongfei Zhang, and Fei Wu. Spatio-temporal graph routing for skeleton-based action recognition. In Thirty-Third AAAI Conference on Artificial Intelligence, 2019.
[19] Chaolong Li, Zhen Cui, Wenming Zheng, Chunyan Xu, and Jian Yang. Spatio-temporal graph convolution for skeleton based action recognition. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[20] Chao Li, Qiaoyong Zhong, Di Xie, and Shiliang Pu. Co-occurrence feature learning from skeleton data for action recognition and detection with hierarchical aggregation. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, 2018.
[21] Maosen Li, Siheng Chen, Xu Chen, Ya Zhang, Yanfeng Wang, and Qi Tian. Actional-structural graph convolutional networks for skeleton-based action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3595–3603, 2019.
[22] Ruoyu Li, Sheng Wang, Feiyun Zhu, and Junzhou Huang. Adaptive graph convolutional neural networks. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[23] Shuai Li, Wanqing Li, Chris Cook, Ce Zhu, and Yanbo Gao. Independently recurrent neural network (indrnn): Building a longer and deeper rnn. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018.
[24] Renjie Liao, Zhizhen Zhao, Raquel Urtasun, and Richard S Zemel. Lanczosnet: Multi-scale deep graph convolutional networks. arXiv preprint arXiv:1901.01484, 2019.
[25] Jun Liu, Amir Shahroudy, Mauricio Lisboa Perez, Gang Wang, Ling-Yu Duan, and Alex Kot Chichung. Ntu rgb+d 120: A large-scale benchmark for 3d human activity understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
[26] Jun Liu, Amir Shahroudy, Dong Xu, and Gang Wang. Spatio-temporal lstm with trust gates for 3d human action recognition. In European Conference on Computer Vision, pages 816–833. Springer, 2016.
[27] Jun Liu, Gang Wang, Ling-Yu Duan, Kamila Abdiyeva, and Alex C Kot. Skeleton-based human action recognition with global context-aware attention lstm networks. IEEE Transactions on Image Processing, 27(4):1586–1599, 2017.
[28] Mengyuan Liu and Junsong Yuan. Recognizing human actions as the evolution of pose estimation maps. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1159–1168, 2018.
[29] Sitao Luan, Mingde Zhao, Xiao-Wen Chang, and Doina Precup. Break the ceiling: Stronger multi-scale deep graph convolutional networks. arXiv preprint arXiv:1906.02174, 2019.
[30] Zhaofan Qiu, Ting Yao, and Tao Mei. Learning spatio-temporal representation with pseudo-3d residual networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 5533–5541, 2017.
[31] Amir Shahroudy, Jun Liu, Tian-Tsong Ng, and Gang Wang. Ntu rgb+d: A large scale dataset for 3d human activity analysis. In IEEE Conference on Computer Vision and Pattern Recognition, June 2016.
[32] Lei Shi, Yifan Zhang, Jian Cheng, and Hanqing Lu. Skeleton-based action recognition with directed graph neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7912–7921, 2019.
[33] Lei Shi, Yifan Zhang, Jian Cheng, and Hanqing Lu. Two-stream adaptive graph convolutional networks for skeleton-based action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 12026–12035, 2019.
[34] Chenyang Si, Wentao Chen, Wei Wang, Liang Wang, and Tieniu Tan. An attention enhanced graph convolutional lstm network for skeleton-based action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1227–1236, 2019.
[35] Chenyang Si, Ya Jing, Wei Wang, Liang Wang, and Tieniu Tan. Skeleton-based action recognition with spatial reasoning and temporal stack learning. In Proceedings of the European Conference on Computer Vision (ECCV), pages 103–118, 2018.
[36] Sijie Song, Cuiling Lan, Junliang Xing, Wenjun Zeng, and Jiaying Liu. An end-to-end spatio-temporal attention model for human action recognition from skeleton data. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
[37] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alex Alemi. Inception-v4, inception-resnet and the impact of residual connections on learning, 2016.
[38] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 4489–4497, 2015.
[39] Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6450–6459, 2018.
[40] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks. In International Conference on Learning Representations (ICLR), 2018.
[41] Petar Veličković, William Fedus, William L Hamilton, Pietro Liò, Yoshua Bengio, and R Devon Hjelm. Deep graph infomax. arXiv preprint arXiv:1809.10341, 2018.
[42] Raviteja Vemulapalli, Felipe Arrate, and Rama Chellappa. Human action recognition by representing 3d skeletons as points in a lie group. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 588–595, 2014.
[43] Jiang Wang, Zicheng Liu, Ying Wu, and Junsong Yuan. Mining actionlet ensemble for action recognition with depth cameras. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 1290–1297. IEEE, 2012.
[44] Xiaolong Wang and Abhinav Gupta. Videos as space-time region graphs. In Proceedings of the European Conference on Computer Vision (ECCV), pages 399–417, 2018.
[45] Felix Wu, Amauri Souza, Tianyi Zhang, Christopher Fifty, Tao Yu, and Kilian Weinberger. Simplifying graph convolutional networks. In Proceedings of the 36th International Conference on Machine Learning, pages 6861–6871. PMLR, 2019.
[46] Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and Philip S. Yu. A comprehensive survey on graph neural networks. CoRR, abs/1901.00596, 2019.
[47] Zonghan Wu, Shirui Pan, Guodong Long, Jing Jiang, and Chengqi Zhang. Graph wavenet for deep spatial-temporal graph modeling. arXiv preprint arXiv:1906.00121, 2019.
[48] Chunyu Xie, Ce Li, Baochang Zhang, Chen Chen, Jungong Han, and Jianzhuang Liu. Memory attention networks for skeleton-based action recognition. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, 2018.
[49] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? In International Conference on Learning Representations (ICLR), 2019.
[50] Sijie Yan, Yuanjun Xiong, and Dahua Lin. Spatial temporal graph convolutional networks for skeleton-based action recognition. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[51] Zhitao Ying, Jiaxuan You, Christopher Morris, Xiang Ren, Will Hamilton, and Jure Leskovec. Hierarchical graph representation learning with differentiable pooling. In Advances in Neural Information Processing Systems, pages 4800–4810, 2018.
[52] Bing Yu, Haoteng Yin, and Zhanxing Zhu. Spatio-temporal graph convolutional networks: A deep learning framework for traffic forecasting. arXiv preprint arXiv:1709.04875, 2017.
[53] Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions, 2015.
[54] Pengfei Zhang, Cuiling Lan, Junliang Xing, Wenjun Zeng, Jianru Xue, and Nanning Zheng. View adaptive recurrent neural networks for high performance human action recognition from skeleton data. In Proceedings of the IEEE International Conference on Computer Vision, pages 2117–2126, 2017.
