3D Human Action Representation Learning Via Cross-View Consistency Pursuit
Linguo Li1,2∗  Minsi Wang1,2∗  Bingbing Ni1,2∗∗  Hang Wang1,2  Jiancheng Yang1,2  Wenjun Zhang1
1 Shanghai Jiao Tong University, Shanghai 200240, China
2 MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University
[email protected], {LLG440982, nibingbing, zhangwenjun}@sjtu.edu.cn
…natural multi-view knowledge of skeleton data. Thus, we introduce CrosSCLR for unsupervised 3D action representation.

[Figure 2. Architecture of single-view SkeletonCLR, which is a memory-augmented contrastive learning framework.]

3. CrosSCLR

Although the 3D skeleton has shown its importance in action recognition, unsupervised skeleton representation has not been well exploited recently. Since the easily obtained "multi-view" skeleton information plays a significant role in action recognition, we expect to exploit it to mine positive samples and pursue cross-view consistency in unsupervised contrastive learning, thus giving rise to a Cross-view Contrastive Learning (CrosSCLR) framework for Skeleton-based action Representation.

As shown in Figure 3, CrosSCLR contains two key modules: 1) SkeletonCLR (Section 3.1), a contrastive learning framework that learns single-view representations without supervision; and 2) CVC-KM (Section 3.2), which conveys the most prominent knowledge from one view to the others, introduces a complementary pseudo-supervised constraint, and promotes information sharing among views. Finally, more discriminating representations are obtained by cooperative training (Section 3.2).
3.1. Single-View 3D Action Representation
Contrastive learning has been widely used due to its instance discrimination capability, especially for images [4, 11] and videos [10]. Inspired by this, we develop SkeletonCLR to learn single-view 3D action representations, based on the recent advanced practice, MoCo [11].

SkeletonCLR. It is a memory-augmented contrastive learning method for skeleton representation, which treats different augments of one sample as its positive samples and all other samples as negatives. In each training step, the batch embeddings are stored in a first-in-first-out memory to get rid of redundant computation, serving as negative samples for the following steps. Positive samples are embedded close to each other, while the embeddings of negative samples are pushed away. As shown in Figure 2, SkeletonCLR consists of the following major components:

• A data augmentation module T that randomly transforms a given skeleton sequence into two different augments x, x̂, which are treated as a positive pair. For skeleton data, we adopt Shear and Crop as the augmentation strategy (see Section 3.3 and the Appendix).

• Two encoders f and f̂ that embed x and x̂ into hidden space: h = f(x; θ) and ĥ = f̂(x̂; θ̂), where h, ĥ ∈ R^{c_h}. f̂ is the momentum-updated version of f: θ̂ ← αθ̂ + (1 − α)θ, where α is a momentum coefficient. SkeletonCLR uses ST-GCN [57] as the backbone (details are in Section 3.3).

• A simple projector g and its momentum-updated version ĝ that project the hidden vectors into a lower-dimensional space: z = g(h), ẑ = ĝ(ĥ), where z, ẑ ∈ R^{c_z}. The projector is a fully-connected (FC) layer with ReLU.

• A memory bank M = {m_i}_{i=1}^{M} that stores negative samples to avoid redundant computation of the embeddings. It is a first-in-first-out queue updated per iteration by ẑ: after each inference step, ẑ is enqueued while the earliest embedding in M is dequeued. During contrastive training, M provides numerous negative embeddings, while the newly calculated ẑ is the positive embedding.

• An InfoNCE [37] loss for instance discrimination:

L = −log [ exp(z·ẑ/τ) / ( exp(z·ẑ/τ) + Σ_{i=1}^{M} exp(z·m_i/τ) ) ]    (1)

where m_i ∈ M, τ is the temperature hyper-parameter [12], and the dot product z·ẑ computes the similarity between z and ẑ, which are normalized.
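To make these components concrete, below is a minimal PyTorch-style sketch of Equation (1) together with the momentum update and the queue update. It is an illustrative sketch, not the authors' released code; the queue size, temperature, and the assumption that the queue length is divisible by the batch size are ours.

import torch
import torch.nn.functional as F

def skeleton_clr_loss(z, z_hat, memory, tau=0.07):
    # z, z_hat: (B, C) projections of two augments of the same sequences.
    # memory:   (M, C) queue of past key embeddings, used as negatives.
    z, z_hat = F.normalize(z, dim=1), F.normalize(z_hat, dim=1)
    pos = torch.einsum('bc,bc->b', z, z_hat) / tau        # numerator term of Eq. (1)
    neg = torch.einsum('bc,mc->bm', z, memory) / tau      # z . m_i for all bank entries
    logits = torch.cat([pos.unsqueeze(1), neg], dim=1)    # (B, 1 + M)
    labels = torch.zeros(len(z), dtype=torch.long, device=z.device)  # positive at index 0
    return F.cross_entropy(logits, labels)                # equals Eq. (1)

@torch.no_grad()
def momentum_update(f, f_hat, alpha=0.999):
    # theta_hat <- alpha * theta_hat + (1 - alpha) * theta
    for p, p_hat in zip(f.parameters(), f_hat.parameters()):
        p_hat.mul_(alpha).add_(p, alpha=1 - alpha)

@torch.no_grad()
def dequeue_and_enqueue(memory, z_hat, ptr):
    # First-in-first-out: the new keys z_hat enter, the oldest embeddings leave.
    b = z_hat.shape[0]                    # assumes memory size is divisible by b
    memory[ptr:ptr + b] = F.normalize(z_hat, dim=1)
    return (ptr + b) % memory.shape[0]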
Constrained by the contrastive loss L, the model is trained without supervision to discriminate each sample in the training set. At last, we obtain a strong encoder f that extracts distinguishing single-view representations.

Limitations of Single-View Contrastive Learning. The above SkeletonCLR still suffers from the following limitations:

1) Embedding distribution can provide more reliable information. We expect samples from the same category to be embedded closely. However, instance discrimination in SkeletonCLR uses only one positive pair, and even similar samples are regarded as negatives. It is unreasonable that negative samples are forced apart in embedding space despite their high embedding similarity. In other words, one positive pair cannot fully describe the relationships among samples, and a more reliable embedding distribution is needed, i.e., the positive/negative setting plus the embedding similarity. We aim to mine more representative knowledge to facilitate contrastive learning, which is also the knowledge we want to exchange across views. Thus, we introduce the contrastive context in Section 3.2.

2) Multi-view data can benefit representation learning. SkeletonCLR relies on single-view data only. As shown in Figure 1, since we do not have any annotations, different samples of the same class are inevitably embedded into distinct places far from each other, i.e., they are distributed sparsely and irregularly, which brings much difficulty for linear classification. Considering the readily generated multi-view data of the 3D skeleton (see Section 3.3), if such complementary information in Figure 1, i.e., different in joint but similar in motion, could be fully utilized and explored, the number of hidden positive pairs in the joint view could be boosted, enhancing training fidelity. To this end, we inject this consideration into the unsupervised contrastive learning framework.
3.2. Cross-View Consistent Knowledge Mining
Motivated by the observation in Figure 1 that complementary knowledge is preserved across multiple views, we propose Cross-View Consistent Knowledge Mining (CVC-KM), which leverages the high similarity of samples in one view to guide the learning process in another view. It excavates positive pairs across views according to the embedding similarity to promote knowledge exchange among views; the number of hidden positive pairs in each view can thus be boosted, and the extracted skeleton features will contain multi-view knowledge, resulting in a more regular embedding space.

In this section, we first clarify the contrastive context as the consistent knowledge across views, then show how to mine high-confidence knowledge, and finally inject its cross-view consistency into single-view SkeletonCLR to further benefit cross-view unsupervised representation.
Contrastive Context as Consistent Knowledge. As discussed above, the knowledge we want to exchange across views is one sample's contrastive context, which describes this sample's relationships with the others (distribution) in embedding space under the settings of contrastive learning. Notice that SkeletonCLR uses a memory bank to store the necessary embeddings. Given one sample's embedding z and the corresponding memory bank M, its contrastive context is a similarity set S between z and M, conditioned on a specific knowledge miner Γ that generates the index set N_+ of positive samples:

S = {s_i}_{i∈N} = {z·m_i}_{i∈N}    (2)

(S_+, N_+) = Γ(S)    (3)

where S_+ = {s_i}_{i∈N_+} and the dot product "·" computes the similarity s_i between the embeddings z and m_i. N is the index set of embeddings in the memory bank, and N_+ is the index set of positive samples selected by the knowledge miner Γ. Thus the contrastive context C(S|N_+) consists of the following two aspects:

• Embedding Context S: the relationships between one sample and the others in embedding space, i.e., the distribution;

• Contrastive Setting N_+: the positive setting mined by Γ according to the embedding similarity S;

thus C(S|N_+) = {S_+, S_−} has positive context S_+ and negative context S_−, where S = S_+ ∪ S_−. The contrastive context contains not only the information of the most similar samples but also the detailed relationships among samples (distribution).

In Equation (1), the embedding z has positive context S_+ = {z·ẑ}, which does not consider any neighbors in embedding space except for the augments. Despite their high similarity, the negative samples are forced apart in embedding space, so samples belonging to the same category can hardly be embedded into the same cluster, which is inefficient for building a "regular" embedding space for downstream classification tasks.

High-confidence Knowledge Mining. To solve the above issue, we develop the high-confidence Knowledge Mining mechanism (KM), which selects the most similar pairs as positives to boost the positive sets. It shares a similar high-level spirit with neighborhood embedding [13] but performs differently, in an unsupervised contrastive manner. Specifically, it is based on the following observation in Figure 4: after single-view contrastive learning, two embeddings most likely belong to the same category if they are embedded closely enough; on the contrary, two embeddings hardly belong to the same class if they are located extremely far from each other in embedding space. Therefore, we can facilitate contrastive learning by setting the most similar embeddings as positives to make the space more clustered:

Γ(S) = Topk(S)    (4)

L_KM = −log [ ( exp(z·ẑ/τ) + Σ_{i∈N_+} exp(z·m_i/τ) ) / ( exp(z·ẑ/τ) + Σ_{i∈N} exp(z·m_i/τ) ) ]    (5)

where Γ = Topk is the function that selects the indices of the top-K most similar embeddings, and N_+ is their index set in the memory bank. Compared to Equation (1), Equation (5) leads to a more regular space by pulling close more high-confidence positive samples. Additionally, since we do not have any labels, a larger K may harm the contrastive performance (see Section 4.3).
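A sketch of Equations (2)-(5) under the same conventions as the previous snippet (normalized embeddings, a shared memory bank); the tensor shapes and the default K are illustrative assumptions:

import torch
import torch.nn.functional as F

def knowledge_mining_loss(z, z_hat, memory, k=1, tau=0.07):
    # Eq. (2): contrastive context, similarities S between z and the bank entries.
    z, z_hat = F.normalize(z, dim=1), F.normalize(z_hat, dim=1)
    s = torch.einsum('bc,mc->bm', z, memory)              # S = {z . m_i}, shape (B, M)
    # Eqs. (3)-(4): knowledge miner Gamma = Topk picks the index set N_+.
    topk = s.topk(k, dim=1).indices
    pos_mask = torch.zeros_like(s, dtype=torch.bool).scatter_(1, topk, True)
    exp_all = torch.exp(s / tau)
    exp_aug = torch.exp((z * z_hat).sum(dim=1) / tau)     # the original augment positive
    # Eq. (5): mined neighbors join the numerator as extra positives.
    numer = exp_aug + (exp_all * pos_mask).sum(dim=1)
    denom = exp_aug + exp_all.sum(dim=1)
    return -torch.log(numer / denom).mean()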
Cross-View Consistency Learning. Considering the easily obtained multi-view skeleton data, complementary information preserved in different views can assist the mining of positive pairs from similar negative samples in Figure 1. The number of hidden positive pairs can then be boosted by cross-view knowledge communication, resulting in better extracted skeleton features. To this end, we design cross-view consistency learning, which not only mines high-confidence positive samples from the complementary view but also keeps the embedding context consistent across views. Its two-view case is illustrated in Figure 3.

Specifically, samples x^u and x^v are generated from the same raw data x by the view generation method in Section 3.3, where u and v indicate two types of data views. After single-view contrastive learning, two SkeletonCLR …
[Figure 3. (a) CrosSCLR. Given two samples x^u, x^v generated from the same raw data, e.g., joint and motion, SkeletonCLR models produce single-view embeddings, while cross-view consistent knowledge mining (CVC-KM) exchanges multi-view complementary knowledge. (b) How L_{v→u} works in embedding space. In step 1, we mine high-confidence knowledge N_+^v from the similarities S^v to boost the positive set of view u, i.e., z^u shares z^v's neighbors; in step 2, we use the similarities S^v to supervise the embedding distribution in view u, so that z^u and z^v share similar relationships with the other samples. Thus, the two embedding spaces become similar under the constraint of L_cross.]
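The cross-view losses L_{u→v}, L_{v→u} and Equations (6)-(8) fall outside this excerpt, but step 1 of Figure 3(b) can be sketched as follows: the neighbors N_+^v mined in view v serve as extra positives for view u. This is a hedged illustration of that step only; the function name, the assumption that the two memory banks are index-aligned (updated in lockstep), and all defaults are ours, and the distribution-supervision term of step 2 is omitted.

import torch
import torch.nn.functional as F

def cross_view_mining_loss(z_u, z_hat_u, memory_u, z_v, memory_v, k=1, tau=0.07):
    # Step 1 of Figure 3(b): mine high-confidence neighbors N_+^v in view v ...
    s_v = torch.einsum('bc,mc->bm', F.normalize(z_v, dim=1), memory_v)   # S^v
    nplus_v = s_v.topk(k, dim=1).indices
    # ... and let z^u share z^v's neighbors: N_+^v becomes the positive set in view u.
    z_u, z_hat_u = F.normalize(z_u, dim=1), F.normalize(z_hat_u, dim=1)
    s_u = torch.einsum('bc,mc->bm', z_u, memory_u)   # memory_u, memory_v index-aligned
    pos_mask = torch.zeros_like(s_u, dtype=torch.bool).scatter_(1, nplus_v, True)
    exp_all = torch.exp(s_u / tau)
    exp_aug = torch.exp((z_u * z_hat_u).sum(dim=1) / tau)
    numer = exp_aug + (exp_all * pos_mask).sum(dim=1)
    denom = exp_aug + exp_all.sum(dim=1)
    return -torch.log(numer / denom).mean()   # L_{v->u}; swap the views for L_{u->v}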
where a_{12}, a_{13}, a_{21}, a_{23}, a_{31}, a_{32} are shear factors randomly sampled from [−β, β], and β is the shear amplitude. The sequence x is multiplied by the transformation matrix A on the channel dimension; the human pose in 3D coordinates is thus inclined at a random angle.

Crop is an augmentation on the temporal dimension that symmetrically pads some frames to the sequence and then randomly crops it to the original length. The padding length is defined as T/γ, where γ is denoted the padding ratio. The padding operation uses the reflection of the original boundary.
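A sketch of the two augmentations with the parameters β and γ defined above; the tensor layout (3 coordinate channels, T frames, V joints, M persons) and the sampling details are assumptions, and the padding length is taken to be at least 1:

import numpy as np

def shear(x, beta=0.5):
    # x: (3, T, V, M) skeleton sequence; shear the 3D coordinates by a random matrix A.
    a = np.random.uniform(-beta, beta, size=6)  # a12, a13, a21, a23, a31, a32
    A = np.array([[1.0,  a[0], a[1]],
                  [a[2], 1.0,  a[3]],
                  [a[4], a[5], 1.0 ]])
    return np.einsum('ij,jtvm->itvm', A, x)     # multiply on the channel dimension

def crop(x, gamma=6):
    # Reflect-pad T/gamma frames on both temporal ends, then crop back to T frames.
    T = x.shape[1]
    pad = max(T // gamma, 1)
    x_pad = np.concatenate(
        [x[:, pad:0:-1], x, x[:, -2:-pad - 2:-1]], axis=1)  # reflection of the boundary
    start = np.random.randint(0, 2 * pad + 1)
    return x_pad[:, start:start + T]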
4.3. Ablation Study

All experiments in this section are conducted on the NTU-60 dataset and follow the unsupervised pre-training and linear evaluation protocol described in Section 4.2.
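For reference, linear evaluation freezes the pre-trained encoder and trains only a linear classifier on top of it. A minimal sketch; the optimizer settings and the out_dim attribute are assumptions:

import torch
import torch.nn as nn
import torch.nn.functional as F

def linear_evaluation(encoder, train_loader, num_classes=60, epochs=100):
    encoder.eval()                                  # the encoder stays frozen
    for p in encoder.parameters():
        p.requires_grad = False
    clf = nn.Linear(encoder.out_dim, num_classes)   # out_dim: hidden size c_h (assumed)
    opt = torch.optim.SGD(clf.parameters(), lr=0.1, momentum=0.9)
    for _ in range(epochs):
        for x, y in train_loader:
            with torch.no_grad():
                h = encoder(x)                      # frozen representation
            loss = F.cross_entropy(clf(h), y)
            opt.zero_grad(); loss.backward(); opt.step()
    return clf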
Effectiveness of CrosSCLR. In Table 1, we separately pre-train SkeletonCLR and jointly pre-train CrosSCLR models on different skeleton views, e.g., joint, motion, and bone, and adopt linear evaluation on each view of the models. Table 1 reports that: 1) CrosSCLR improves the capability of each single SkeletonCLR model, e.g., CrosSCLR-joint (79.88) vs. SkeletonCLR-joint (76.44) on the xview protocol; 2) CrosSCLR bridges the performance gap between the two views and jointly improves their accuracy, e.g., for SkeletonCLR, joint (76.44) vs. motion (50.82), but for CrosSCLR, joint (79.88) vs. motion (77.59); 3) CrosSCLR improves the multi-view ensemble results via cross-view training. In summary, the cross-view high-confidence knowledge does help the model extract more discriminating representations.

Qualitative Results. We apply t-SNE [30] with fixed settings to show the embedding distributions of SkeletonCLR and CrosSCLR at epochs 150, 200, 250, and 300 during pre-training in Figure 4. Note that the cross-view loss, Equation (8), becomes available only after epoch 150. From the visual results, we can draw a conclusion similar to that of Table 1. Embeddings of CrosSCLR are clustered more closely than those of SkeletonCLR, i.e., they are more discriminating. For CrosSCLR, the distributions of joint and motion are distinct at epoch 150 but look very similar at epoch 300, i.e., the distributions become consistent. In particular, both build a more "regular" space than SkeletonCLR, proving the effectiveness of CrosSCLR.
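The visualization can be reproduced with off-the-shelf t-SNE; a sketch with fixed settings (the perplexity and figure styling are assumptions):

import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(z, labels, epoch):
    # z: (N, C) embeddings of samples from 10 categories; labels: (N,) category ids.
    xy = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(z)
    plt.scatter(xy[:, 0], xy[:, 1], c=labels, cmap='tab10', s=4)
    plt.title(f'Epoch {epoch}')
    plt.savefig(f'tsne_epoch_{epoch}.png')
    plt.close()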
[Figure 4. The t-SNE visualization of embeddings at different epochs during pre-training (rows: SkeletonCLR (joint), CrosSCLR (joint), CrosSCLR (motion); columns: epochs 150, 200, 250, 300). Embeddings from 10 categories are sampled and visualized with different colors. For CrosSCLR, L_cross starts to be available at epoch 150, so its distribution shows no difference from that of SkeletonCLR before epoch 150, shown in red boxes.]

top-K   xsub   xview
0       70.5   77.4
1       74.5   82.1
3       73.7   79.9
5       72.4   79.2
7       73.0   78.6
10      64.4   69.9

Table 2. Results of pre-training 2s-CrosSCLR with various K in the knowledge miner Γ (NTU-60, %).

Shear β   Crop γ   xsub   xview
0         0        33.3   26.2
0.2       0        62.7   67.7
0.5       0        66.3   68.8
1.0       0        62.0   66.8
0.5       4        67.6   76.3
0.5       6        68.3   76.4
0.5       8        69.1   74.7

Table 3. Ablation study on different data augmentations for SkeletonCLR (joint) (NTU-60, %).
Effects of Contrastive Setting top-K. As the hyper-parameter K determines the number of mined samples, influencing the depth of knowledge exchange, we study how K impacts the performance of cross-view learning. Table 2 shows that K has a great influence on the performance, with the best result achieved at K = 1. A larger K decreases the performance, because less confident information may lead the model astray in the unsupervised case.

Contrastive Setting N_+ and Embedding Context S. We develop the following models in Table 4 for comparison: 1) SkeletonCLR + L_KM, a model with single-view knowledge mining; 2) CrosSCLR w/o embedding context (EC), a model that uses only the contrastive setting N_+ for cross-view learning and ignores the embedding context/distribution, i.e., S_i^v = 1, ∀i ∈ N in Equation (6). The results of SkeletonCLR + L_KM show that KM improves the representation capability of SkeletonCLR. Additionally, CrosSCLR achieves worse performance without the embedding context (EC), proving the significance of the similarity/distribution among samples.

Method               Pre-training     xsub   xview
SkeletonCLR          Joint            68.3   76.4
SkeletonCLR + L_KM   Joint            69.3   77.4
CrosSCLR w/o EC      Joint + Motion   71.4   78.5
CrosSCLR             Joint + Motion   72.9   79.9

Table 4. Ablation study on the contrastive setting N_+ and the embedding context (EC) (NTU-60, %). The models are linearly evaluated on joint only.

Effects of Augmentations. SkeletonCLR and CrosSCLR are based on contrastive learning, but the data augmentation strategy for skeleton data has rarely been explored, especially for GCN encoders. We verify the effectiveness of data augmentation and the impact of different augmentation intensities in skeleton-based contrastive learning by conducting experiments on SkeletonCLR, as shown in Table 3, which indicates the importance of data augmentation in SkeletonCLR. We choose β = 0.5 and γ = 6 as the default settings according to the mean accuracy on the xsub and xview protocols.
4.4. Comparison

We compare CrosSCLR with other methods under the linear evaluation and finetune protocols. Since the backbone in many methods is an RNN-based model, e.g., a GRU or LSTM, we additionally use an LSTM (following the setting in [39]) as the encoder for a fair comparison, i.e., CrosSCLR (LSTM).

Unsupervised Results on NTU-60. In Table 5, LongT GAN [65] adversarially trains the model with a skeleton-inpainting pretext task; MS²L [26] trains the model with a multi-task scheme, i.e., prediction, jigsaw puzzle, and instance discrimination; AS-CAL [39] uses a momentum LSTM encoder for contrastive learning with single-view skeleton sequences; P&C [47] trains a stronger encoder by weakening the decoder; and SeBiReNet [34] constructs a human-like GRU …
Method               Encoder   Classifier   xsub   xview
LongT GAN [65]       GRU       FC           39.1   48.1
MS²L [26]            GRU       GRU          52.6   -
AS-CAL [39]          LSTM      FC           58.5   64.8
P&C [47]             GRU       KNN          50.7   76.3
SeBiReNet [34]       GRU       LSTM         -      79.7
3s-CrosSCLR (LSTM)   LSTM      FC           62.8   69.2
3s-CrosSCLR (LSTM)   LSTM      LSTM         70.4   79.9
3s-CrosSCLR‡         ST-GCN    FC           72.8   80.7
3s-CrosSCLR          ST-GCN    FC           77.8   83.4

Table 5. Unsupervised results on NTU-60 (%). These methods are pre-trained to learn the encoder and then follow the linear evaluation protocol to learn the classifiers. "‡" indicates the model pre-trained on NTU-61-120.

Method           Label Fraction   xsub   xview
LongT GAN [65]   1%               35.2   -
MS²L [26]        1%               33.1   -
3s-CrosSCLR      1%               51.1   50.0
LongT GAN [65]   10%              62.0   -
MS²L [26]        10%              65.2   -
3s-CrosSCLR      10%              74.4   77.8

Table 7. Linear classification with fewer labels on NTU-60 (%).

                      NTU-60 (%)       NTU-120 (%)
Method                xsub    xview    xsub    xset
3s-ST-GCN∗ [57]       85.2    91.4     77.2    77.1
3s-CrosSCLR‡ (FT)     85.6    92.0     -       -
3s-CrosSCLR (FT)      86.2    92.5     80.5    80.4
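For reference, the "3s-" entries above fuse the joint, motion, and bone streams at the score level; a minimal late-fusion sketch, with equal stream weights as an assumption:

import numpy as np

def three_stream_ensemble(scores_joint, scores_motion, scores_bone,
                          weights=(1.0, 1.0, 1.0)):
    # scores_*: (N, num_classes) classification scores from the three skeleton views.
    fused = (weights[0] * scores_joint
             + weights[1] * scores_motion
             + weights[2] * scores_bone)
    return fused.argmax(axis=1)                     # predicted action per sample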