
3D Human Action Representation Learning via Cross-View Consistency Pursuit

Linguo Li^{1,2,∗}  Minsi Wang^{1,2,∗}  Bingbing Ni^{1,2,∗∗}  Hang Wang^{1,2}  Jiancheng Yang^{1,2}  Wenjun Zhang^{1}
^{1} Shanghai Jiao Tong University, Shanghai 200240, China
^{2} MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University
[email protected], {LLG440982, nibingbing, zhangwenjun}@sjtu.edu.cn
∗ Equal Contribution    ∗∗ Corresponding Author: Bingbing Ni
arXiv:2104.14466v2 [cs.CV] 1 May 2021

Abstract

In this work, we propose a Cross-view Contrastive Learning framework for unsupervised 3D skeleton-based action Representation (CrosSCLR), by leveraging multi-view complementary supervision signal. CrosSCLR consists of both single-view contrastive learning (SkeletonCLR) and cross-view consistent knowledge mining (CVC-KM) modules, integrated in a collaborative learning manner. It is noted that CVC-KM works in such a way that high-confidence positive/negative samples and their distributions are exchanged among views according to their embedding similarity, ensuring cross-view consistency in terms of contrastive context, i.e., similar distributions. Extensive experiments show that CrosSCLR achieves remarkable action recognition results on NTU-60 and NTU-120 datasets under unsupervised settings, with observed higher-quality action representations. Our code is available at https://github.com/LinguoLi/CrosSCLR.

Figure 1. Hand waving in joint and motion form. Two samples are from the same action class. (a) Usual contrastive learning methods regard them as negative pairs. (b) In a multi-view situation, considering their similar motion patterns, they can be positive pairs. This motivates us to introduce cross-view contrastive learning for skeleton representation.

1. Introduction

Human action recognition is an important but challenging task in computer vision research. Due to light-weight and robust estimation algorithms [3, 56], the 3D skeleton has become a popular feature representation for studying human action dynamics. Many 3D action recognition works [6, 60, 17, 25, 44, 28, 16] use a fully-supervised manner and require massive labeled 3D skeleton data. However, annotating data is expensive and time-consuming, which prompts people to explore unsupervised methods [65, 26, 39, 47] on skeleton data. Some unsupervised methods exploit structure completeness within each sample based on pretext tasks, including reconstruction [9, 65], auto-regression [20, 47] and jigsaw puzzles [35, 54], but it is unclear whether the designed pretext tasks generalize well to downstream tasks. Other unsupervised methods are based on contrastive learning [55, 4, 11, 18], aiming to leverage the instance discrimination of samples in latent space.

Although the above approaches improve the skeleton representation capability to some extent, the power of unsupervised methods is far from fully explored. On the one hand, traditional contrastive learning uses only one positive pair generated by data augmentation, and even similar samples are regarded as negative samples. Despite their high similarity, the negative samples are forced away in embedding space, which is unreasonable for clustering. On the other hand, current unsupervised methods [1, 65, 20, 26, 47] have not yet explored the rich intra-supervision information provided by different skeleton modalities. Considering that it is easy to obtain skeleton data in multiple "views", e.g., joint, motion and bone, complementary information preserved in different views can assist the operation to mine positive pairs from similar negative samples. As shown in Figure 1, the same hand waving actions are different in pose (joint), but similar in motion. Usual contrastive learning methods regard them as negative pairs, keeping them away in embedding space. If such complementary information, i.e., different in joint but similar in motion, could be fully utilized and explored, the number of hidden positive pairs in joint can be boosted, enhancing training fidelity. Thus, the cross-view contrastive learning strategy takes advantage of multi-view knowledge, resulting in better-extracted skeleton features.

To this end, we propose a Cross-view Contrastive Learning framework for Skeleton-based action Representation (CrosSCLR), which exploits multi-view information to mine positive samples and pursue cross-view consistency in unsupervised contrastive learning, enabling the model to extract more comprehensive cross-view features. First, parallel contrastive learning is invoked for each single-view Skeleton action Representation (SkeletonCLR), yielding multiple single-view embedding features. Second, inspired by the fact that the distance of samples in embedding space reflects the similarity of the samples in the original space, we refer to the extreme similarity of samples in one view to guide the learning process in another view, as shown in Figure 1. More specifically, a Cross-View Consistent Knowledge Mining (CVC-KM) module is developed to examine the similarity of samples and select the most similar pairs as positive ones to boost the positive set in complementary views, i.e., the embedding distance/similarity (confidence score) serves as the weight of the corresponding mined sample in the contrastive loss. In other words, CVC-KM conveys the most prominent knowledge from one view to others, introduces a complementary pseudo-supervised constraint and promotes information sharing among views. The entire framework excavates positive pairs across views according to the distance between samples in the embedding space to promote knowledge exchange among views, so that the extracted skeleton features contain multi-view knowledge and are more competitive for various downstream tasks. Extensive results on NTU-RGB+D [40, 27] datasets demonstrate that our method indeed boosts 3D action representation learning, benefiting from cross-view consistency. We summarize our contributions as follows:

• We propose CrosSCLR, a cross-view contrastive learning framework for skeleton-based action representation.
• We develop Contrastive Learning for Skeleton-based action Representation (SkeletonCLR) to learn the single-view representations of skeleton data.
• We use parallel SkeletonCLR models and CVC-KM to excavate useful samples across views, enabling the model to capture more comprehensive representations unsupervisedly.
• We evaluate our model on 3D skeleton datasets, e.g., NTU-RGB+D 60/120, and achieve remarkable results under unsupervised settings.

2. Related Work

Self-Supervised Representation Learning. Self-supervised learning aims to learn feature representations from numerous unlabeled data, usually generating supervision by pretext tasks, e.g., jigsaw puzzles [35, 36, 54], colorization [63] and predicting rotation [8, 59]. For sequence data, supervision can be generated by frame orders [32, 7, 21], space-time cubic puzzles [19] and prediction [51, 26], but these methods rely highly on the quality of the pretext tasks. Recently, contrastive methods [37, 55, 48, 18] based on instance discrimination have been proposed for representation learning. MoCo [11] introduces a memory bank to store the embeddings of negative samples, and SimCLR [4] uses a much larger mini-batch size to compute the embeddings in real time, but they cannot capture the cross-view knowledge for 3D action representation. The concurrent work CoCLR [10] leverages multi-modal information for video representation via co-training, but it does not consider the contrastive context. Our CrosSCLR simultaneously trains models in all views by encouraging cross-view consistency, leading to more representative embeddings.

Skeleton-based Action Recognition. To tackle skeleton-based action recognition tasks, early approaches are generally based on hand-crafted features [33, 52, 49, 50]. Recent methods pay more attention to deep neural networks. For the sequence structure of skeleton data, many RNN-based methods [6, 40, 60, 45, 61] were carried out to effectively utilize the temporal feature. Since RNNs suffer from gradient vanishing [14], CNN-based models [17, 22] attract researchers' attention, but they need to convert skeleton data to another form. Further, ST-GCN [57] was proposed to better model the graph structure of skeleton data. Then attention mechanisms [24, 42, 44, 64, 62] and multi-stream structures [25, 41, 42, 53] are applied to adaptively capture multi-stream features based on GCNs. We adopt the widely-used ST-GCN as the backbone to extract the skeleton features.

Unsupervised Skeleton Representation. Many unsupervised methods [46, 29, 31, 23] were proposed to capture action representations in videos. For skeleton data, previous works [58, 1] have achieved some progress in unsupervised representation learning without deep neural networks. Recent deep learning methods [9, 65, 20, 47] are based on the structure of encoder-decoder or generative adversarial networks (GAN). LongT GAN [65] proposed an auto-encoder-based GAN for sequential reconstruction and evaluated it on action recognition tasks. P&C [47] uses a weak decoder in the encoder-decoder model, forcing the encoder to learn more discriminative features. MS2L [26] proposed a multi-task learning scheme for action representation learning. However, these methods highly depend on reconstruction or prediction, and they do not exploit the natural multi-view knowledge of skeleton data. Thus, we introduce CrosSCLR for unsupervised 3D action representation.
Figure 2. Architecture of single-view SkeletonCLR, which is a memory-augmented contrastive learning framework.

3. CrosSCLR

Although the 3D skeleton has shown its importance in action recognition, unsupervised skeleton representation has not been well exploited recently. Since the easily-obtained "multi-view" skeleton information plays a significant role in action recognition, we expect to exploit it to mine positive samples and pursue cross-view consistency in unsupervised contrastive learning, thus giving rise to a Cross-view Contrastive Learning (CrosSCLR) framework for Skeleton-based action Representation.

As shown in Figure 3, CrosSCLR contains two key modules: 1) SkeletonCLR (Section 3.1): a contrastive learning framework to unsupervisedly learn single-view representations, and 2) CVC-KM (Section 3.2): it conveys the most prominent knowledge from one view to others, introduces a complementary pseudo-supervised constraint and promotes information sharing among views. Finally, more discriminating representations can be obtained by cooperative training (Section 3.2).

3.1. Single-View 3D Action Representation

Contrastive learning has been widely used due to its instance discrimination capability, especially for images [4, 11] and videos [10]. Inspired by this, we develop SkeletonCLR to learn single-view 3D action representations, based on the recent advanced practice, MoCo [11].

SkeletonCLR. It is a memory-augmented contrastive learning method for skeleton representation, which considers one sample's different augments as its positive samples and other samples as negative samples. In each training step, the batch embeddings are stored in a first-in-first-out memory to get rid of redundant computation, serving as negative samples for the next steps. The positive samples are embedded close to each other while the embeddings of negative samples are pushed away. As shown in Figure 2, SkeletonCLR consists of the following major components:

• A data augmentation module T that randomly transforms the given skeleton sequence into different augments x, x̂ that are considered as positive pairs. For skeleton data, we adopt Shear and Crop as the augmentation strategy (see Section 3.3 and Appendix).

• Two encoders f and f̂ that embed x and x̂ into hidden space: h = f(x; θ) and ĥ = f̂(x̂; θ̂), where h, ĥ ∈ R^{c_h}. f̂ is the momentum updated version of f: θ̂ ← αθ̂ + (1 − α)θ, where α is a momentum coefficient. SkeletonCLR uses ST-GCN [57] as the backbone (details are in Section 3.3).

• A simple projector g and its momentum updated version ĝ that project the hidden vector to a lower-dimension space: z = g(h), ẑ = ĝ(ĥ), where z, ẑ ∈ R^{c_z}. The projector is a fully-connected (FC) layer with ReLU.

• A memory bank M = {m_i}_{i=1}^{M} that stores negative samples to avoid redundant computation of the embeddings. It is a first-in-first-out queue updated per iteration by ẑ. After each inference step, ẑ will enqueue while the earliest embedding in M will dequeue. During contrastive training, M provides numerous negative embeddings while the newly calculated ẑ is the positive embedding.

• An InfoNCE [37] loss for instance discrimination:

L = -\log \frac{\exp(z \cdot \hat{z}/\tau)}{\exp(z \cdot \hat{z}/\tau) + \sum_{i=1}^{M} \exp(z \cdot m_i/\tau)}    (1)

where m_i ∈ M, τ is the temperature hyper-parameter [12], and the dot product z · ẑ computes their similarity, where z, ẑ are normalized.

Constrained by the contrastive loss L, the model is unsupervisedly trained to discriminate each sample in the training set. At last, we can obtain a strong encoder f that is beneficial for extracting single-view distinguishing representations.
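To make the interplay of these components concrete, below is a minimal PyTorch-style sketch of the momentum update and the InfoNCE loss of Equation (1); the function names, tensor shapes and the default momentum coefficient are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn.functional as F

def momentum_update(f, f_hat, alpha=0.999):
    # theta_hat <- alpha * theta_hat + (1 - alpha) * theta (momentum encoder update)
    for p, p_hat in zip(f.parameters(), f_hat.parameters()):
        p_hat.data.mul_(alpha).add_(p.data, alpha=1 - alpha)

def skeleton_clr_loss(z, z_hat, memory, tau=0.07):
    """InfoNCE of Eq. (1): z, z_hat are (B, C) embeddings of the two augments,
    memory is the (M, C) queue of negative embeddings."""
    z, z_hat = F.normalize(z, dim=1), F.normalize(z_hat, dim=1)
    memory = F.normalize(memory, dim=1)
    pos = torch.exp((z * z_hat).sum(dim=1) / tau)      # exp(z . z_hat / tau), shape (B,)
    neg = torch.exp(z @ memory.t() / tau).sum(dim=1)   # sum_i exp(z . m_i / tau), shape (B,)
    return (-torch.log(pos / (pos + neg))).mean()
```

After each step, ẑ would be enqueued into the memory bank and the oldest entries dequeued, as described above.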
Limitations of Single-View Contrastive Learning. The above SkeletonCLR still suffers from the following limitations: 1) Embedding distribution can provide more reliable information. We expect samples from the same category to be embedded closely. However, instance discrimination in SkeletonCLR uses only one positive pair, and even similar samples are regarded as negative samples. It is unreasonable that the negative samples are forced away in embedding space despite their high embedding similarity. In other words, one positive pair cannot fully describe the relationships of samples, and a more reliable embedding distribution is needed, i.e., the positive/negative setting plus embedding similarity. We aim to mine more representative knowledge to facilitate contrastive learning, which is also the knowledge we want to exchange across views. Thus, we introduce the contrastive context in Section 3.2. 2) Multi-view data can benefit representation learning. SkeletonCLR only relies on single-view data. As shown in Figure 1, since we do not have any annotations, different samples of the same class are inevitably embedded into distinct places far from each other, i.e., they distribute sparsely/irregularly, bringing much difficulty for linear classification. Considering the readily generated multi-view data of the 3D skeleton (see Section 3.3), if such complementary information in Figure 1, i.e., different in joint but similar in motion, could be fully utilized and explored, the number of hidden positive pairs in joint can be boosted, enhancing training fidelity. To this end, we inject this consideration into the unsupervised contrastive learning framework.

3.2. Cross-View Consistent Knowledge Mining

Motivated by the situation in Figure 1 that complementary knowledge is preserved in multiple views, we propose Cross-View Consistent Knowledge Mining (CVC-KM), leveraging the high similarity of samples in one view to guide the learning process in another view. It excavates positive pairs across views according to the embedding similarity to promote knowledge exchange among views; then the number of hidden positive pairs in each view can be boosted and the extracted skeleton features will contain multi-view knowledge, resulting in a more regular embedding space.

In this section, we first clarify the contrastive context as the consistent knowledge across views, then show how to mine high-confidence knowledge, and finally inject its cross-view consistency into single-view SkeletonCLR to further benefit the cross-view unsupervised representation.

Contrastive Context as Consistent Knowledge. As discussed above, the knowledge we want to exchange across views is one sample's contrastive context, which describes this sample's relationships with others (distribution) in embedding space under the settings of contrastive learning. Notice that SkeletonCLR uses a memory bank to store the necessary embeddings. Given one sample's embedding z and the corresponding memory bank M, its contrastive context is a similarity set S among z and M conditioned on a specific knowledge miner Γ that generates the index set N_+ of positive samples,

S = \{s_i\}_{i \in N} = \{z \cdot m_i\}_{i \in N}    (2)

(S_+, N_+) = \Gamma(S)    (3)

where S_+ = \{s_i\}_{i \in N_+} and the dot product "·" computes the similarity s_i between the embeddings z and m_i. N is the index set of embeddings in the memory bank and N_+ is the index set of positive samples selected by the knowledge miner Γ. Thus the contrastive context C(S|N_+) consists of the following two aspects:

• Embedding Context S: the relationship between one sample and others in embedding space, i.e., the distribution;
• Contrastive Setting N_+: the positive setting mined by Γ according to the embedding similarity S;

thus C(S|N_+) = {S_+, S_−} has positive context S_+ and negative context S_−, where S = S_+ ∪ S_−. The contrastive context contains not only the information of the most similar samples but also the detailed relationships of samples (distribution).

In Equation (1), the embedding z has positive context S_+ = {z · ẑ}, which does not consider any neighbors in embedding space except for the augments. Despite the high similarity, the negative samples are forced away in embedding space, and then samples belonging to the same category can hardly be embedded into the same cluster, which is not efficient for building a "regular" embedding space for downstream classification tasks.

High-confidence Knowledge Mining. To solve the above issue, we develop the high-confidence Knowledge Mining mechanism (KM), which selects the most similar pairs as positive ones to boost the positive sets. It shares a similar high-level spirit with neighborhood embedding [13] but performs differently in an unsupervised contrastive manner. Specifically, it is based on the following observation in Figure 4: after single-view contrastive learning, two embeddings most likely belong to the same category if they are embedded closely enough; on the contrary, two embeddings hardly belong to the same class if they are located extremely far from each other in embedding space. Therefore, we can facilitate contrastive learning by setting the most similar embeddings as positive to make the space more clustered:

\Gamma(S) = \mathrm{Topk}(S)    (4)

L_{KM} = -\log \frac{\exp(z \cdot \hat{z}/\tau) + \sum_{i \in N_+} \exp(z \cdot m_i/\tau)}{\exp(z \cdot \hat{z}/\tau) + \sum_{i \in N} \exp(z \cdot m_i/\tau)}    (5)

where Γ = Topk is the function that selects the indices of the top-K most similar embeddings and N_+ is their index set in the memory bank. Compared to Equation (1), Equation (5) leads to a more regular space by pulling close more high-confidence positive samples. Additionally, since we do not have any labels, a larger K may harm the contrastive performance (see Section 4.3).
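As a concrete reading of Equations (4) and (5), the sketch below treats the top-K most similar memory entries as extra positives; the boolean-mask formulation and variable names are assumptions made for exposition.

```python
import torch
import torch.nn.functional as F

def km_loss(z, z_hat, memory, k=1, tau=0.07):
    """High-confidence knowledge mining of Eqs. (4)-(5): the top-K memory
    entries most similar to z are treated as additional positives."""
    z, z_hat = F.normalize(z, dim=1), F.normalize(z_hat, dim=1)
    memory = F.normalize(memory, dim=1)
    sim = z @ memory.t() / tau                               # s_i = z . m_i / tau, (B, M)
    n_plus = sim.topk(k, dim=1).indices                      # Eq. (4): Topk(S)
    pos_mask = torch.zeros_like(sim, dtype=torch.bool).scatter_(1, n_plus, True)
    pos_aug = torch.exp((z * z_hat).sum(dim=1) / tau)        # exp(z . z_hat / tau)
    exp_sim = torch.exp(sim)
    numer = pos_aug + (exp_sim * pos_mask).sum(dim=1)        # Eq. (5) numerator
    denom = pos_aug + exp_sim.sum(dim=1)                     # Eq. (5) denominator
    return (-torch.log(numer / denom)).mean()
```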
Cross-View Consistency Learning. Considering the easily-obtained multi-view skeleton data, complementary information preserved in different views can assist the operation to mine positive pairs from similar negative samples in Figure 1. Then the number of hidden positive pairs can be boosted by cross-view knowledge communication, resulting in better-extracted skeleton features. To this end, we design cross-view consistency learning, which not only mines the high-confidence positive samples from the complementary view but also lets the embedding context be consistent in multiple views. Its two-view case is illustrated in Figure 3 as an example.
Figure 3. (a) CrosSCLR. Given two samples x^u, x^v generated from the same raw data, e.g., joint and motion, SkeletonCLR models produce single-view embeddings while cross-view consistent knowledge mining (CVC-KM) exchanges multi-view complementary knowledge. (b) How L_{v→u} works in embedding space. In step 1, we mine high-confidence knowledge N_+^v from the similarities S^v to boost the positive set of view u, i.e., z^u shares z^v's neighbors; in step 2, we use the similarities S^v to supervise the embedding distribution in view u, so z^u, z^v share similar relationships with others. Thus, the two embedding spaces become similar under the constraint of L_{cross}.

Specifically, samples x^u and x^v are generated from the same raw data x by the view generation method in Section 3.3, where u and v indicate two types of data views. After single-view contrastive learning, the two SkeletonCLR modules obtain the embeddings z^u, z^v and the corresponding memory banks M^u, M^v respectively. We can mine the high-confidence knowledge from the two views by (S_+^u, N_+^u) = Topk(S^u) and (S_+^v, N_+^v) = Topk(S^v), where S_+^u, S_+^v are the positive contexts of z^u, z^v respectively.

CrosSCLR aims to learn a consistent embedding distribution in different views by encouraging the similarity of contrastive contexts, i.e., exchanging high-confidence knowledge across views. In Figure 3 (b), if we want to use the knowledge of view v to guide view u's contrastive learning, it contains two aspects: 1) step 1: we select the most similar pairs (positive) in view v as the positive sets in view u, i.e., S^u, N_+^v → C(S^u|N_+^v). Thus the sample z^u shares z^v's positive neighbors; 2) step 2: we use the embedding similarity s_i^v = z^v · m_i^v in view v as the weight of the corresponding embedding m_i^u in view u to provide the detailed relational information, i.e., m_i^{v→u} = s_i^v m_i^u. Then the similarity is computed by z^u · m_i^{v→u} = s_i^v (z^u · m_i^u) = s_i^u s_i^v, and z^u has embedding context S^{v→u} = {z^u · m_i^{v→u}}_{i∈N} = {s_i^u s_i^v}_{i∈N}. Finally, the overall loss is conducted as:

L_{v \to u} = -\log \frac{\exp(z^u \cdot \hat{z}^u/\tau) + \sum_{i \in N_+^v} \exp(s_i^u s_i^v/\tau)}{\exp(z^u \cdot \hat{z}^u/\tau) + \sum_{i \in N} \exp(s_i^u s_i^v/\tau)}    (6)

L_{cross} = L_{u \to v} + L_{v \to u}    (7)

where L_{v→u} means we transfer the contrastive context of z^v to that of z^u. Since s_i^u, s_i^v are the embedding contexts of z_i^u, z_i^v, we call Equation (7) cross-view contrastive context learning, which constrains the two views to have similar distributions (see Section 4.3, the results of t-SNE). Compared to Equation (5), Equation (6) considers the cross-view information, cooperatively using one view's high-confidence positive samples and its distribution to instruct the other view's contrastive learning, resulting in a more regular space and better extracted skeleton features.

Learning CrosSCLR. For more views, CrosSCLR has the following objective:

L_{cross} = \sum_{u}^{U} \sum_{v}^{U} L_{u \to v}    (8)

where U is the number of views and v ≠ u.

In the early training process, the model is not stable and strong enough to provide reliable cross-view knowledge without the supervision of labels. As the unreliable information may lead the model astray, it is not encouraged to enable cross-view communication too early. We perform two-stage training for CrosSCLR: 1) each view of the model is individually trained with Equation (1) without cross-view communication; 2) then the model can supply high-confidence knowledge, so the loss function is replaced with Equation (8), starting cross-view knowledge mining.
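The following sketch illustrates the cross-view term L_{v→u} of Equation (6) under the same assumptions as the snippets above: positives for view u are mined from view v, and the view-v similarities re-weight the view-u similarities. The commented wrapper at the end corresponds to the two-view case of Equation (7).

```python
import torch
import torch.nn.functional as F

def cross_view_loss(z_u, z_hat_u, mem_u, z_v, mem_v, k=1, tau=0.07):
    """L_{v->u} of Eq. (6): positives of view u are mined in view v,
    and view-v similarities re-weight the view-u similarities."""
    z_u, z_hat_u = F.normalize(z_u, dim=1), F.normalize(z_hat_u, dim=1)
    z_v = F.normalize(z_v, dim=1)
    mem_u, mem_v = F.normalize(mem_u, dim=1), F.normalize(mem_v, dim=1)

    s_u = z_u @ mem_u.t()                                  # s_i^u
    s_v = z_v @ mem_v.t()                                  # s_i^v
    n_plus_v = s_v.topk(k, dim=1).indices                  # positives mined in view v
    pos_mask = torch.zeros_like(s_u, dtype=torch.bool).scatter_(1, n_plus_v, True)

    pos_aug = torch.exp((z_u * z_hat_u).sum(dim=1) / tau)  # exp(z^u . z_hat^u / tau)
    exp_cross = torch.exp(s_u * s_v / tau)                 # exp(s_i^u s_i^v / tau)
    numer = pos_aug + (exp_cross * pos_mask).sum(dim=1)
    denom = pos_aug + exp_cross.sum(dim=1)
    return (-torch.log(numer / denom)).mean()

# Two-view case of Eq. (7): L_cross = L_{u->v} + L_{v->u}
# loss = cross_view_loss(z_u, z_hat_u, mem_u, z_v, mem_v) + \
#        cross_view_loss(z_v, z_hat_v, mem_v, z_u, mem_u)
```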
3.3. Model Details

View Generation of 3D Skeleton. Generally, a 3D human skeleton sequence has T frames with V joints, and each joint has C = 3 coordinate features, which can be noted as x ∈ R^{C×T×V}. Different from videos, the views [41, 42] of a skeleton, e.g., joint, motion, bone and motion of bone, can be easily obtained, which is a natural advantage for skeleton-based representation learning. Motion is represented as the temporal displacement between frames: x_{:,t+1,:} − x_{:,t,:}, and bone is the difference between two neighboring joints in the same frame: x_{:,:,v2} − x_{:,:,v1}. For simplicity, we use three views: joint, motion and bone in the experiments.

Encoder f. We adopt ST-GCN [57] as the encoder, which is suitable for modeling graph-structured skeleton data by exploiting the spatial and temporal relations. After a series of ST-GCN blocks, the output feature x_out ∈ R^{c_out×t_out×V} is processed by an average pooling operation over the spatial and temporal dimensions, obtaining the final representation h ∈ R^{c_h}.
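A small sketch of the view generation described above is given below; the (C, T, V) layout follows the text, while the bone-pair list is left as a placeholder since the actual pairs depend on the NTU skeleton topology.

```python
import torch

def generate_views(x, bone_pairs):
    """x: (C=3, T, V) joint coordinates. Returns the joint, motion
    (temporal displacement) and bone (joint difference) views."""
    joint = x
    motion = torch.zeros_like(x)
    motion[:, :-1, :] = x[:, 1:, :] - x[:, :-1, :]   # x_{:,t+1,:} - x_{:,t,:}
    bone = torch.zeros_like(x)
    for v1, v2 in bone_pairs:                        # e.g., [(0, 1), (1, 20), ...] for NTU
        bone[:, :, v2] = x[:, :, v2] - x[:, :, v1]
    return joint, motion, bone
```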
4. Experiments

4.1. Datasets

NTU-RGB+D 60. The NTU-RGB+D 60 (NTU-60) dataset [40] is a large-scale dataset of 3D joint coordinate sequences for skeleton-based action recognition, containing 56,578 skeleton sequences in 60 action categories. Each skeleton graph contains V = 25 body joints as nodes, and their 3D coordinates are the initial features. Two protocols [40] are recommended: 1) Cross-Subject (xsub): training data and validation data are collected from different subjects; 2) Cross-View (xview): training data and validation data are collected from different camera views.

NTU-RGB+D 120. The NTU-RGB+D 120 (NTU-120) dataset [27] is an extended version of NTU-60, containing 113,945 skeleton sequences in 120 action categories. Two protocols [27] are recommended: 1) Cross-Subject (xsub): training data and validation data are collected from different subjects; 2) Cross-Setup (xset): training data and validation data are collected from different setup IDs.

NTU-RGB+D 61-120. The NTU-RGB+D 61-120 (NTU-61-120) dataset is a subset of the NTU-120 dataset, containing 57,367 skeleton sequences in the last 60 action categories of NTU-120. The categories in NTU-61-120 do not intersect with those in NTU-60. This dataset is used as an external dataset to evaluate the transfer capability of our method.

4.2. Experimental Settings

All experiments are conducted with the PyTorch [38] framework. For data pre-processing, we remove the invalid frames of each skeleton sequence and then resize them to a length of 50 frames by linear interpolation. For optimization, we use SGD with momentum (0.9) and weight decay (0.0001). The mini-batch size is set to 128.

Data Augmentation T. For the skeleton sequence, we choose Shear [39] and Crop [43] as the augmentation strategy. Shear is a linear transformation on the spatial dimension. The transformation matrix is defined as:

A = \begin{pmatrix} 1 & a_{12} & a_{13} \\ a_{21} & 1 & a_{23} \\ a_{31} & a_{32} & 1 \end{pmatrix}    (9)

where a_{12}, a_{13}, a_{21}, a_{23}, a_{31}, a_{32} are shear factors randomly sampled from [−β, β], and β is the shear amplitude. The sequence x is multiplied by the transformation matrix A on the channel dimension. Then the human pose in 3D coordinates is inclined at a random angle.

Crop is an augmentation on the temporal dimension that symmetrically pads some frames to the sequence and then randomly crops it to the original length. The padding length is defined as T/γ, where γ is noted as the padding ratio. The padding operation uses the reflection of the original boundary.
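A minimal sketch of the Shear and Crop augmentations described above (Equation (9)) is shown below; the uniform sampling helper and the (C, T, V) layout are assumptions.

```python
import torch

def shear(x, beta=0.5):
    """Shear (Eq. (9)): multiply the 3D coordinates by a random shear matrix.
    x: (C=3, T, V)."""
    a = (torch.rand(3, 3) * 2 - 1) * beta        # off-diagonal factors in [-beta, beta]
    A = a.fill_diagonal_(1.0)                    # ones on the diagonal
    return torch.einsum('ij,jtv->itv', A, x)     # apply A on the channel dimension

def crop(x, gamma=6):
    """Temporal crop: reflect-pad T/gamma frames on both sides,
    then randomly crop back to the original length T."""
    C, T, V = x.shape
    pad = T // gamma
    left = x[:, 1:pad + 1, :].flip(1)            # reflection of the left boundary
    right = x[:, -pad - 1:-1, :].flip(1)         # reflection of the right boundary
    padded = torch.cat([left, x, right], dim=1)  # (C, T + 2*pad, V)
    start = torch.randint(0, 2 * pad + 1, (1,)).item()
    return padded[:, start:start + T, :]
```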

Method              View                    NTU-60 xsub (%)   NTU-60 xview (%)
SkeletonCLR         Joint                   68.3              76.4
SkeletonCLR         Motion                  53.3              50.8
SkeletonCLR         Bone                    69.4              67.4
2s-SkeletonCLR      Joint + Motion          70.5              77.9
3s-SkeletonCLR      Joint + Motion + Bone   75.0              79.8
CrosSCLR            Joint                   72.9              79.9
CrosSCLR            Motion                  72.7              77.6
CrosSCLR            Bone                    75.2              78.8
2s-CrosSCLR         Joint + Motion          74.5              82.1
3s-CrosSCLR         Joint + Motion + Bone   77.8              83.4

Table 1. Comparisons of SkeletonCLR and CrosSCLR on each view and their ensembles. SkeletonCLR models are trained independently and "+" means the ensemble model.

Unsupervised Pre-training. We generate three views of skeleton sequences, i.e., joint, motion and bone. For the encoder, we adopt ST-GCN [57], but the number of channels in each layer is reduced to 1/4 of the original setting. For the contrastive settings, we follow MoCo v2 [5] but reduce the size of the memory bank M to 30k. For data augmentation, we set the shear amplitude β = 0.5 and the padding ratio γ = 6. The model is trained for 300 epochs with learning rate 0.1 (multiplied by 0.1 at epoch 250). The InfoNCE loss in Equation (1) is used in the first 150 epochs and then replaced with L_{cross} in Equation (8) after the 150-th epoch. We set K = 1 as the default in the knowledge mining mechanism.

Linear Evaluation Protocol. The models are verified by linear evaluation for the action recognition task, i.e., attaching the frozen encoder to a linear classifier (a fully-connected layer followed by a softmax layer) and then training the classifier supervisedly. We train the models for 100 epochs with learning rate 3.0 (multiplied by 0.1 at epoch 80).

Finetune Protocol. We append a linear classifier to the learnable encoder and then train the whole model for the action recognition task, to compare it with fully-supervised methods. We train for 100 epochs with learning rate 0.1 (multiplied by 0.1 at epoch 80).
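For reference, a brief sketch of the linear evaluation setup described above (frozen encoder plus a trainable fully-connected classifier); the feature dimension and encoder interface are assumptions.

```python
import torch
import torch.nn as nn

def build_linear_eval(encoder, feat_dim=256, num_classes=60):
    """Freeze the pre-trained encoder and train only a linear classifier."""
    for p in encoder.parameters():
        p.requires_grad = False            # encoder stays frozen
    classifier = nn.Linear(feat_dim, num_classes)
    optimizer = torch.optim.SGD(classifier.parameters(), lr=3.0, momentum=0.9)
    criterion = nn.CrossEntropyLoss()      # softmax classifier
    return classifier, optimizer, criterion
```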
4.3. Ablation Study

All experiments in this section are conducted on the NTU-60 dataset and follow the unsupervised pre-training and linear evaluation protocols in Section 4.2.

Effectiveness of CrosSCLR. In Table 1, we separately pre-train SkeletonCLR and jointly pre-train CrosSCLR models on different skeleton views, e.g., joint, motion and bone. We adopt linear evaluation on each view of the models. Table 1 reports that 1) CrosSCLR improves the capability of each single-view SkeletonCLR model;
2) CrosSCLR bridges the performance gap between the two views and jointly improves their accuracy, e.g., for SkeletonCLR, joint (76.44) vs. motion (50.82), but for CrosSCLR, joint (79.88) vs. motion (77.59); 3) CrosSCLR improves the multi-view ensemble results via cross-view training. In summary, the cross-view high-confidence knowledge does help the model extract more discriminating representations.

Qualitative Results. We apply t-SNE [30] with fixed settings to show the embedding distributions of SkeletonCLR and CrosSCLR at epochs 150, 200, 250 and 300 during pre-training in Figure 4. Note that the cross-view loss, Equation (8), is available only after epoch 150. From the visual results, we can draw a similar conclusion to that in Table 1. Embeddings of CrosSCLR are clustered more closely than those of SkeletonCLR, which is more discriminating. For CrosSCLR, the distributions of joint and motion are distinct at the 150-th epoch but look very similar at the 300-th epoch, i.e., a consistent distribution. Especially, they both build a more "regular" space than SkeletonCLR, proving the effectiveness of CrosSCLR.

Figure 4. The t-SNE visualization of embeddings at epochs 150, 200, 250 and 300 during pre-training, for SkeletonCLR (joint), CrosSCLR (joint) and CrosSCLR (motion). Embeddings from 10 categories are sampled and visualized with different colors. For CrosSCLR, L_{cross} starts to be available at epoch 150, so its distribution has no difference from that of SkeletonCLR before epoch 150, shown in red boxes.

Effects of Contrastive Setting top-K. As the hyper-parameter K determines the number of mined samples, influencing the depth of knowledge exchange, we study how K impacts the performance in cross-view learning. Table 2 shows that K has a great influence on the performance and achieves the best result when K = 1. However, a larger K decreases the performance, because the less confident information may lead the model astray in an unsupervised case.

top-K   xsub    xview
0       70.5    77.4
1       74.5    82.1
3       73.7    79.9
5       72.4    79.2
7       73.0    78.6
10      64.4    69.9

Table 2. Results of pre-training 2s-CrosSCLR with various K in the knowledge miner Γ (NTU-60, %).

Contrastive Setting N_+ and Embedding Context S. We develop the following models in Table 4 for comparison: 1) SkeletonCLR + L_{KM} is a model with single-view knowledge mining. 2) CrosSCLR w/o. embedding context (EC) is the model only using the contrastive setting N_+ for cross-view learning, which ignores the embedding context/distribution, i.e., s_i^v = 1, ∀i ∈ N in Equation (6). The results of SkeletonCLR + L_{KM} show that KM improves the representation capability of SkeletonCLR. Additionally, CrosSCLR achieves worse performance without using the embedding context (EC), proving the significance of the similarity/distribution among samples.

Method                 Views of Pre-training   xsub    xview
SkeletonCLR            Joint                   68.3    76.4
SkeletonCLR + L_{KM}   Joint                   69.3    77.4
CrosSCLR w/o. EC       Joint + Motion          71.4    78.5
CrosSCLR               Joint + Motion          72.9    79.9

Table 4. Ablation study on the contrastive setting N_+ and embedding context (EC) on NTU-60 (%). The models are linearly evaluated on only the joint view.

Effects of Augmentations. SkeletonCLR and CrosSCLR are based on contrastive learning, but the data augmentation strategy used on skeleton data is rarely explored, especially for the GCN encoder. We verify the effectiveness of data augmentation and the impact of different augmentation intensities in skeleton-based contrastive learning by conducting experiments on SkeletonCLR, as shown in Table 3. It indicates the importance of data augmentation in SkeletonCLR. We choose β = 0.5 and γ = 6 as the default settings according to the mean accuracy on the xsub and xview protocols.

Shear β   Crop γ   xsub    xview
0         0        33.3    26.2
0.2       0        62.7    67.7
0.5       0        66.3    68.8
1.0       0        62.0    66.8
0.5       4        67.6    76.3
0.5       6        68.3    76.4
0.5       8        69.1    74.7

Table 3. Ablation study on different data augmentations for SkeletonCLR (joint) on NTU-60 (%).

4.4. Comparison

We compare CrosSCLR with other methods under the linear evaluation and finetune protocols. Since the backbone in many methods is an RNN-based model, e.g., GRU or LSTM, we additionally use LSTM (following the setting in [39]) as the encoder for a fair comparison, i.e., CrosSCLR (LSTM).

Unsupervised Results on NTU-60. In Table 5, LongT GAN [65] adversarially trains the model with a skeleton inpainting pretext task, MS2L [26] trains the model with a multi-task scheme, i.e., prediction, jigsaw puzzle and instance discrimination, AS-CAL [39] uses a momentum LSTM encoder for contrastive learning with single-view skeleton sequences, P&C [47] trains a stronger encoder by weakening the decoder, and SeBiReNet [34] constructs a human-like GRU
network to utilize view-independent and pose-independent features. Our CrosSCLR exploits the multi-view knowledge by cross-view consistent knowledge mining. Taking a fully-connected layer (FC) as the classifier, our model outperforms the other methods with the same classifier. With an LSTM classifier and LSTM encoder, our model outperforms the above methods on both the xsub and xview protocols.

Method                 Encoder   Classifier   xsub    xview
LongT GAN [65]         GRU       FC           39.1    48.1
MS2L [26]              GRU       GRU          52.6    -
AS-CAL [39]            LSTM      FC           58.5    64.8
P&C [47]               GRU       KNN          50.7    76.3
SeBiReNet [34]         GRU       LSTM         -       79.7
3s-CrosSCLR (LSTM)     LSTM      FC           62.8    69.2
3s-CrosSCLR (LSTM)     LSTM      LSTM         70.4    79.9
3s-CrosSCLR‡           ST-GCN    FC           72.8    80.7
3s-CrosSCLR            ST-GCN    FC           77.8    83.4

Table 5. Unsupervised results on NTU-60 (%). These methods are pre-trained to learn the encoder and then follow the linear evaluation protocol to learn the classifiers. "‡" indicates the model pre-trained on NTU-61-120.

Results on NTU-120. As few unsupervised results are reported on the NTU-120 dataset, we compare our method with both unsupervised and supervised methods. As shown in Table 6, TSRJI [2] supervisedly utilizes an attention LSTM, AS-CAL [39] adopts an LSTM for skeleton modeling, and our method defeats the other unsupervised method and some of the supervised methods.

Method                 Supervision     xsub    xset
Part-Aware LSTM [40]   Supervised      25.5    26.3
Soft RNN [15]          Supervised      36.3    44.9
TSRJI [2]              Supervised      67.9    62.8
ST-GCN [57]            Supervised      79.7    81.3
AS-CAL [39]            Unsupervised    48.6    49.2
3s-CrosSCLR (LSTM)     Unsupervised    53.9    53.2
3s-CrosSCLR            Unsupervised    67.9    66.7

Table 6. Unsupervised results on NTU-120 (%). We show and compare our method with unsupervised and supervised methods.

Linear Classification with Fewer Labels. We follow the same protocol as MS2L [26], i.e., pre-training with all training data and then finetuning the classifier with only 1% and 10% randomly-selected labeled data, respectively. As shown in Table 7, CrosSCLR achieves higher performance than the other methods.

Method            Label Fraction   xsub    xview
LongT GAN [65]    1%               35.2    -
MS2L [26]         1%               33.1    -
3s-CrosSCLR       1%               51.1    50.0
LongT GAN [65]    10%              62.0    -
MS2L [26]         10%              65.2    -
3s-CrosSCLR       10%              74.4    77.8

Table 7. Linear classification with fewer labels on NTU-60 (%).

Finetuned Results on NTU-60 and NTU-120. We first unsupervisedly pre-train our model and follow the finetune protocol for evaluation. For a fair comparison, ST-GCN∗ [57] in Table 8 has the same number of parameters as 3s-CrosSCLR (1/4 channels with three streams). It shows that the finetuned model, CrosSCLR (FT), outperforms the supervised ST-GCN on both the NTU-60 and NTU-120 datasets, indicating the effectiveness of cross-view pre-training.

Method                NTU-60 xsub   NTU-60 xview   NTU-120 xsub   NTU-120 xset
3s-ST-GCN∗ [57]       85.2          91.4           77.2           77.1
3s-CrosSCLR‡ (FT)     85.6          92.0           -              -
3s-CrosSCLR (FT)      86.2          92.5           80.5           80.4

Table 8. Finetuned results on NTU-60 and NTU-120 (%). ST-GCN∗ is the method reproduced from the released code. "‡" indicates the model pre-trained on NTU-61-120. "FT" means the finetune protocol.

Transfer Ability. We first pre-train CrosSCLR on NTU-61-120 and then transfer it to NTU-60 for linear evaluation, noted as CrosSCLR‡. The model trained under the xsub protocol is transferred to the xsub protocol of NTU-60; the model trained under the xset protocol is transferred to the xview protocol of NTU-60. In Table 5, it achieves better results than the other unsupervised methods, and its supervised finetuning result is higher than ST-GCN, as shown in Table 8.

5. Conclusion

In this work, we propose a Cross-view Contrastive Learning framework for unsupervised 3D skeleton-based action representation to exploit multi-view high-confidence knowledge as complementary supervision. It integrates single-view contrastive learning with cross-view consistent knowledge mining modules, which convey the contrastive settings and embedding context among views by high-confidence sample mining. Experiments show remarkable results of CrosSCLR for action recognition on the NTU datasets.

Acknowledgement

This work was supported by the National Science Foundation of China (U20B2072, 61976137). This work was also supported by NSFC (U19B2035) and the Shanghai Municipal Science and Technology Major Project (2021SHZDZX0102).
References

[1] Amor Ben Tanfous, Hassen Drira, and Boulbaba Ben Amor. Coding kendall's shape trajectories for 3d action recognition. In CVPR, pages 2840–2849, 2018.
[2] Carlos Caetano, François Brémond, and William Robson Schwartz. Skeleton image representation for 3d action recognition based on tree structure and reference joints. In SIBGRAPI, pages 16–23. IEEE, 2019.
[3] Zhe Cao, Gines Hidalgo, Tomas Simon, Shih-En Wei, and Yaser Sheikh. Openpose: realtime multi-person 2d pose estimation using part affinity fields. TPAMI, 43(1):172–186, 2019.
[4] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. arXiv:2002.05709, 2020.
[5] Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Improved baselines with momentum contrastive learning. arXiv:2003.04297, 2020.
[6] Yong Du, Wei Wang, and Liang Wang. Hierarchical recurrent neural network for skeleton based action recognition. In CVPR, pages 1110–1118, 2015.
[7] Basura Fernando, Hakan Bilen, Efstratios Gavves, and Stephen Gould. Self-supervised video representation learning with odd-one-out networks. In CVPR, pages 3636–3645, 2017.
[8] Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations. arXiv:1803.07728, 2018.
[9] Liang-Yan Gui, Yu-Xiong Wang, Xiaodan Liang, and José MF Moura. Adversarial geometry-aware human motion prediction. In ECCV, pages 786–803, 2018.
[10] Tengda Han, Weidi Xie, and Andrew Zisserman. Self-supervised co-training for video representation learning. NeurIPS, 33, 2020.
[11] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In CVPR, pages 9729–9738, 2020.
[12] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv:1503.02531, 2015.
[13] Geoffrey E Hinton and Sam Roweis. Stochastic neighbor embedding. NIPS, 15:857–864, 2002.
[14] Sepp Hochreiter, Yoshua Bengio, Paolo Frasconi, Jürgen Schmidhuber, et al. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies, 2001.
[15] Jian-Fang Hu, Wei-Shi Zheng, Lianyang Ma, Gang Wang, Jianhuang Lai, and Jianguo Zhang. Early action prediction by soft regression. TPAMI, 41(11):2568–2583, 2018.
[16] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The kinetics human action video dataset. arXiv:1705.06950, 2017.
[17] Qiuhong Ke, Mohammed Bennamoun, Senjian An, Ferdous Sohel, and Farid Boussaid. A new representation of skeleton sequences for 3d action recognition. In CVPR, pages 3288–3297, 2017.
[18] Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. Supervised contrastive learning. arXiv:2004.11362, 2020.
[19] Dahun Kim, Donghyeon Cho, and In So Kweon. Self-supervised video representation learning with space-time cubic puzzles. In AAAI, volume 33, pages 8545–8552, 2019.
[20] Jogendra Nath Kundu, Maharshi Gor, Phani Krishna Uppala, and Venkatesh Babu Radhakrishnan. Unsupervised feature learning of human actions as trajectories in pose embedding manifold. In WACV, pages 1459–1467. IEEE, 2019.
[21] Hsin-Ying Lee, Jia-Bin Huang, Maneesh Singh, and Ming-Hsuan Yang. Unsupervised representation learning by sorting sequences. In ICCV, pages 667–676, 2017.
[22] Chao Li, Qiaoyong Zhong, Di Xie, and Shiliang Pu. Skeleton-based action recognition with convolutional neural networks. In ICME Workshops, pages 597–600. IEEE, 2017.
[23] Junnan Li, Yongkang Wong, Qi Zhao, and Mohan S Kankanhalli. Unsupervised learning of view-invariant action representations. In NeurIPS, pages 1254–1264, 2018.
[24] Maosen Li, Siheng Chen, Xu Chen, Ya Zhang, Yanfeng Wang, and Qi Tian. Actional-structural graph convolutional networks for skeleton-based action recognition. In CVPR, pages 3595–3603, 2019.
[25] Duohan Liang, Guoliang Fan, Guangfeng Lin, Wanjun Chen, Xiaorong Pan, and Hong Zhu. Three-stream convolutional neural network with multi-task and ensemble learning for 3d action recognition. In CVPR Workshops, 2019.
[26] Lilang Lin, Sijie Song, Wenhan Yang, and Jiaying Liu. Ms2l: Multi-task self-supervised learning for skeleton based action recognition. In ACMMM, pages 2490–2498, 2020.
[27] Jun Liu, Amir Shahroudy, Mauricio Lisboa Perez, Gang Wang, Ling-Yu Duan, and Alex Kot Chichung. Ntu rgb+d 120: A large-scale benchmark for 3d human activity understanding. TPAMI, 2019.
[28] Ziyu Liu, Hongwen Zhang, Zhenghao Chen, Zhiyong Wang, and Wanli Ouyang. Disentangling and unifying graph convolutions for skeleton-based action recognition. In CVPR, pages 143–152, 2020.
[29] Zelun Luo, Boya Peng, De-An Huang, Alexandre Alahi, and Li Fei-Fei. Unsupervised learning of long-term motion dynamics for videos. In CVPR, pages 2203–2212, 2017.
[30] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.
[31] Julieta Martinez, Michael J Black, and Javier Romero. On human motion prediction using recurrent neural networks. In CVPR, pages 2891–2900, 2017.
[32] Ishan Misra, C Lawrence Zitnick, and Martial Hebert. Shuffle and learn: unsupervised learning using temporal order verification. In ECCV, pages 527–544. Springer, 2016.
[33] Bingbing Ni, Gang Wang, and Pierre Moulin. Rgbd-hudaact: A color-depth video database for human daily activity recognition. In ICCV Workshops, pages 1147–1153. IEEE, 2011.
[34] Qiang Nie, Ziwei Liu, and Yunhui Liu. Unsupervised 3d human pose representation with viewpoint and pose disentanglement. ECCV, 2020.
[35] Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV, pages 69–84. Springer, 2016.
[36] Mehdi Noroozi, Ananth Vinjimoor, Paolo Favaro, and Hamed Pirsiavash. Boosting self-supervised learning via knowledge transfer. In CVPR, pages 9359–9367, 2018.
[37] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv:1807.03748, 2018.
[38] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. NeurIPS Workshops, 2017.
[39] Haocong Rao, Shihao Xu, Xiping Hu, Jun Cheng, and Bin Hu. Augmented skeleton based contrastive action learning with momentum lstm for unsupervised action recognition. arXiv:2008.00188, 2020.
[40] Amir Shahroudy, Jun Liu, Tian-Tsong Ng, and Gang Wang. Ntu rgb+d: A large scale dataset for 3d human activity analysis. In CVPR, pages 1010–1019, 2016.
[41] Lei Shi, Yifan Zhang, Jian Cheng, and Hanqing Lu. Skeleton-based action recognition with directed graph neural networks. In CVPR, pages 7912–7921, 2019.
[42] Lei Shi, Yifan Zhang, Jian Cheng, and Hanqing Lu. Two-stream adaptive graph convolutional networks for skeleton-based action recognition. In CVPR, pages 12026–12035, 2019.
[43] Connor Shorten and Taghi M Khoshgoftaar. A survey on image data augmentation for deep learning. Journal of Big Data, 6(1):60, 2019.
[44] Chenyang Si, Wentao Chen, Wei Wang, Liang Wang, and Tieniu Tan. An attention enhanced graph convolutional lstm network for skeleton-based action recognition. In CVPR, pages 1227–1236, 2019.
[45] Sijie Song, Cuiling Lan, Junliang Xing, Wenjun Zeng, and Jiaying Liu. Spatio-temporal attention-based lstm networks for 3d action recognition and detection. TIP, 27(7):3459–3471, 2018.
[46] Nitish Srivastava, Elman Mansimov, and Ruslan Salakhudinov. Unsupervised learning of video representations using lstms. In ICML, pages 843–852, 2015.
[47] Kun Su, Xiulong Liu, and Eli Shlizerman. Predict & cluster: Unsupervised skeleton based action recognition. In CVPR, pages 9631–9640, 2020.
[48] Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive multiview coding. arXiv:1906.05849, 2019.
[49] Raviteja Vemulapalli, Felipe Arrate, and Rama Chellappa. Human action recognition by representing 3d skeletons as points in a lie group. In CVPR, pages 588–595, 2014.
[50] Raviteja Vemulapalli and Rama Chellapa. Rolling rotations for recognizing human actions from 3d skeletal data. In CVPR, pages 4471–4479, 2016.
[51] Jiangliu Wang, Jianbo Jiao, Linchao Bao, Shengfeng He, Yunhui Liu, and Wei Liu. Self-supervised spatio-temporal representation learning for videos by predicting motion and appearance statistics. In CVPR, pages 4006–4015, 2019.
[52] Jiang Wang, Zicheng Liu, Ying Wu, and Junsong Yuan. Mining actionlet ensemble for action recognition with depth cameras. In CVPR, pages 1290–1297. IEEE, 2012.
[53] Minsi Wang, Bingbing Ni, and Xiaokang Yang. Learning multi-view interactional skeleton graph for action recognition. TPAMI, 2020.
[54] Chen Wei, Lingxi Xie, Xutong Ren, Yingda Xia, Chi Su, Jiaying Liu, Qi Tian, and Alan L Yuille. Iterative reorganization with weak spatial constraints: Solving arbitrary jigsaw puzzles for unsupervised representation learning. In CVPR, pages 1910–1919, 2019.
[55] Zhirong Wu, Yuanjun Xiong, Stella X Yu, and Dahua Lin. Unsupervised feature learning via non-parametric instance discrimination. In CVPR, pages 3733–3742, 2018.
[56] Jingwei Xu, Zhenbo Yu, Bingbing Ni, Jiancheng Yang, Xiaokang Yang, and Wenjun Zhang. Deep kinematics analysis for monocular 3d human pose estimation. In CVPR, pages 899–908, 2020.
[57] Sijie Yan, Yuanjun Xiong, and Dahua Lin. Spatial temporal graph convolutional networks for skeleton-based action recognition. In AAAI, 2018.
[58] Mihai Zanfir, Marius Leordeanu, and Cristian Sminchisescu. The moving pose: An efficient 3d kinematics descriptor for low-latency action recognition and detection. In ICCV, pages 2752–2759, 2013.
[59] Xiaohua Zhai, Avital Oliver, Alexander Kolesnikov, and Lucas Beyer. S4l: Self-supervised semi-supervised learning. In ICCV, pages 1476–1485, 2019.
[60] Pengfei Zhang, Cuiling Lan, Junliang Xing, Wenjun Zeng, Jianru Xue, and Nanning Zheng. View adaptive recurrent neural networks for high performance human action recognition from skeleton data. In ICCV, pages 2117–2126, 2017.
[61] Pengfei Zhang, Cuiling Lan, Junliang Xing, Wenjun Zeng, Jianru Xue, and Nanning Zheng. View adaptive neural networks for high performance skeleton-based human action recognition. TPAMI, 41(8):1963–1978, 2019.
[62] Pengfei Zhang, Cuiling Lan, Wenjun Zeng, Junliang Xing, Jianru Xue, and Nanning Zheng. Semantics-guided neural networks for efficient skeleton-based human action recognition. In CVPR, pages 1112–1121, 2020.
[63] Richard Zhang, Phillip Isola, and Alexei A Efros. Colorful image colorization. In ECCV, pages 649–666. Springer, 2016.
[64] Xikun Zhang, Chang Xu, and Dacheng Tao. Context aware graph convolution for skeleton-based action recognition. In CVPR, pages 14333–14342, 2020.
[65] Nenggan Zheng, Jun Wen, Risheng Liu, Liangqu Long, Jianhua Dai, and Zhefeng Gong. Unsupervised representation learning with long-term dynamics for skeleton based action recognition. In AAAI, 2018.
