3D Human Action Representation Learning Via Cross-View Consistency Pursuit
Linguo Li1,2∗  Minsi Wang1,2∗  Bingbing Ni1,2∗∗  Hang Wang1,2  Jiancheng Yang1,2  Wenjun Zhang1
1 Shanghai Jiao Tong University, Shanghai 200240, China
2 MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University
[email protected], {LLG440982, nibingbing, zhangwenjun}@sjtu.edu.cn
…natural multi-view knowledge of skeleton data. Thus, we introduce CrosSCLR for unsupervised 3D action representation.

[Figure 2. Architecture of single-view SkeletonCLR, which is a memory-augmented contrastive learning framework.]

3. CrosSCLR

Although the 3D skeleton has shown its importance in action recognition, unsupervised skeleton representation has not been well exploited recently. Since the easily obtained "multi-view" skeleton information plays a significant role in action recognition, we expect to exploit it to mine positive samples and pursue cross-view consistency in unsupervised contrastive learning, thus giving rise to a Cross-view Contrastive Learning (CrosSCLR) framework for Skeleton-based action Representation.

As shown in Figure 3, CrosSCLR contains two key modules: 1) SkeletonCLR (Section 3.1), a contrastive learning framework that learns single-view representations without supervision; and 2) CVC-KM (Section 3.2), which conveys the most prominent knowledge from one view to the others, introduces a complementary pseudo-supervised constraint, and promotes information sharing among views. Finally, more discriminating representations are obtained by cooperative training (Section 3.2).
3.1. Single-View 3D Action Representation
Contrastive learning has been widely used due to its instance discrimination capability, especially for images [4, 11] and videos [10]. Inspired by this, we develop SkeletonCLR to learn single-view 3D action representations, based on the recent advanced practice, MoCo [11].

SkeletonCLR. It is a memory-augmented contrastive learning method for skeleton representation, which treats different augments of one sample as its positive samples and all other samples as negatives. In each training step, the batch embeddings are stored in a first-in-first-out memory to get rid of redundant computation, serving as negative samples for the following steps. Positive samples are embedded close to each other, while the embeddings of negative samples are pushed away. As shown in Figure 2, SkeletonCLR consists of the following major components:

• A data augmentation module T that randomly transforms a given skeleton sequence into two different augments x, x̂, which are treated as a positive pair. For skeleton data, we adopt Shear and Crop as the augmentation strategy (see Section 3.3 and the Appendix).

• Two encoders f and f̂ that embed x and x̂ into hidden space: h = f(x; θ) and ĥ = f̂(x̂; θ̂), where h, ĥ ∈ R^{c_h}. f̂ is the momentum-updated version of f: θ̂ ← αθ̂ + (1 − α)θ, where α is a momentum coefficient. SkeletonCLR uses ST-GCN [57] as the backbone (details are in Section 3.3).

• A simple projector g and its momentum-updated version ĝ that project the hidden vectors into a lower-dimensional space: z = g(h), ẑ = ĝ(ĥ), where z, ẑ ∈ R^{c_z}. The projector is a fully-connected (FC) layer with ReLU.

• A memory bank M = {m_i}_{i=1}^{M} that stores negative samples to avoid redundant computation of the embeddings. It is a first-in-first-out queue updated per iteration by ẑ: after each inference step, ẑ is enqueued while the earliest embedding in M is dequeued. During contrastive training, M provides numerous negative embeddings, while the newly calculated ẑ is the positive embedding.

• An InfoNCE [37] loss for instance discrimination:

L = −log [ exp(z·ẑ/τ) / ( exp(z·ẑ/τ) + Σ_{i=1}^{M} exp(z·m_i/τ) ) ]    (1)

where m_i ∈ M, τ is the temperature hyper-parameter [12], and the dot product z·ẑ computes the similarity between z and ẑ, which are normalized.
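To make these components concrete, below is a minimal PyTorch-style sketch of Equation (1) together with the momentum update and the queue update. It is an illustrative sketch, not the authors' released code; the queue size, temperature, and the assumption that the queue length is divisible by the batch size are ours.

import torch
import torch.nn.functional as F

def skeleton_clr_loss(z, z_hat, memory, tau=0.07):
    # z, z_hat: (B, C) projections of two augments of the same sequences.
    # memory:   (M, C) queue of past key embeddings, used as negatives.
    z, z_hat = F.normalize(z, dim=1), F.normalize(z_hat, dim=1)
    pos = torch.einsum('bc,bc->b', z, z_hat) / tau        # numerator term of Eq. (1)
    neg = torch.einsum('bc,mc->bm', z, memory) / tau      # z . m_i for all bank entries
    logits = torch.cat([pos.unsqueeze(1), neg], dim=1)    # (B, 1 + M)
    labels = torch.zeros(len(z), dtype=torch.long, device=z.device)  # positive at index 0
    return F.cross_entropy(logits, labels)                # equals Eq. (1)

@torch.no_grad()
def momentum_update(f, f_hat, alpha=0.999):
    # theta_hat <- alpha * theta_hat + (1 - alpha) * theta
    for p, p_hat in zip(f.parameters(), f_hat.parameters()):
        p_hat.mul_(alpha).add_(p, alpha=1 - alpha)

@torch.no_grad()
def dequeue_and_enqueue(memory, z_hat, ptr):
    # First-in-first-out: the new keys z_hat enter, the oldest embeddings leave.
    b = z_hat.shape[0]                    # assumes memory size is divisible by b
    memory[ptr:ptr + b] = F.normalize(z_hat, dim=1)
    return (ptr + b) % memory.shape[0]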
Constrained by the contrastive loss L, the model is trained without supervision to discriminate each sample in the training set. At last, we obtain a strong encoder f that extracts distinguishing single-view representations.

Limitations of Single-View Contrastive Learning. The above SkeletonCLR still suffers from the following limitations:

1) Embedding distribution can provide more reliable information. We expect samples from the same category to be embedded closely. However, instance discrimination in SkeletonCLR uses only one positive pair, and even similar samples are regarded as negatives. It is unreasonable that negative samples are forced apart in embedding space despite their high embedding similarity. In other words, one positive pair cannot fully describe the relationships among samples, and a more reliable embedding distribution is needed, i.e., the positive/negative setting plus the embedding similarity. We aim to mine more representative knowledge to facilitate contrastive learning, which is also the knowledge we want to exchange across views. Thus, we introduce the contrastive context in Section 3.2.

2) Multi-view data can benefit representation learning. SkeletonCLR relies on single-view data only. As shown in Figure 1, since we do not have any annotations, different samples of the same class are inevitably embedded into distinct places far from each other, i.e., they are distributed sparsely and irregularly, which brings much difficulty for linear classification. Considering the readily generated multi-view data of the 3D skeleton (see Section 3.3), if such complementary information in Figure 1, i.e., different in joint but similar in motion, could be fully utilized and explored, the number of hidden positive pairs in the joint view could be boosted, enhancing training fidelity. To this end, we inject this consideration into the unsupervised contrastive learning framework.
3.2. Cross-View Consistent Knowledge Mining
Motivated by the observation in Figure 1 that complementary knowledge is preserved across multiple views, we propose Cross-View Consistent Knowledge Mining (CVC-KM), which leverages the high similarity of samples in one view to guide the learning process in another view. It excavates positive pairs across views according to the embedding similarity to promote knowledge exchange among views; the number of hidden positive pairs in each view can thus be boosted, and the extracted skeleton features will contain multi-view knowledge, resulting in a more regular embedding space.

In this section, we first clarify the contrastive context as the consistent knowledge across views, then show how to mine high-confidence knowledge, and finally inject its cross-view consistency into single-view SkeletonCLR to further benefit cross-view unsupervised representation.
Contrastive Context as Consistent Knowledge. As discussed above, the knowledge we want to exchange across views is one sample's contrastive context, which describes this sample's relationships with the others (distribution) in embedding space under the settings of contrastive learning. Notice that SkeletonCLR uses a memory bank to store the necessary embeddings. Given one sample's embedding z and the corresponding memory bank M, its contrastive context is a similarity set S between z and M, conditioned on a specific knowledge miner Γ that generates the index set N_+ of positive samples:

S = {s_i}_{i∈N} = {z·m_i}_{i∈N}    (2)

(S_+, N_+) = Γ(S)    (3)

where S_+ = {s_i}_{i∈N_+} and the dot product "·" computes the similarity s_i between the embeddings z and m_i. N is the index set of embeddings in the memory bank, and N_+ is the index set of positive samples selected by the knowledge miner Γ. Thus the contrastive context C(S|N_+) consists of the following two aspects:

• Embedding Context S: the relationships between one sample and the others in embedding space, i.e., the distribution;

• Contrastive Setting N_+: the positive setting mined by Γ according to the embedding similarity S;

thus C(S|N_+) = {S_+, S_−} has positive context S_+ and negative context S_−, where S = S_+ ∪ S_−. The contrastive context contains not only the information of the most similar samples but also the detailed relationships among samples (distribution).

In Equation (1), the embedding z has positive context S_+ = {z·ẑ}, which does not consider any neighbors in embedding space except for the augments. Despite their high similarity, the negative samples are forced apart in embedding space, so samples belonging to the same category can hardly be embedded into the same cluster, which is inefficient for building a "regular" embedding space for downstream classification tasks.

High-confidence Knowledge Mining. To solve the above issue, we develop the high-confidence Knowledge Mining mechanism (KM), which selects the most similar pairs as positives to boost the positive sets. It shares a similar high-level spirit with neighborhood embedding [13] but performs differently, in an unsupervised contrastive manner. Specifically, it is based on the following observation in Figure 4: after single-view contrastive learning, two embeddings most likely belong to the same category if they are embedded closely enough; on the contrary, two embeddings hardly belong to the same class if they are located extremely far from each other in embedding space. Therefore, we can facilitate contrastive learning by setting the most similar embeddings as positives to make the space more clustered:

Γ(S) = Topk(S)    (4)

L_KM = −log [ ( exp(z·ẑ/τ) + Σ_{i∈N_+} exp(z·m_i/τ) ) / ( exp(z·ẑ/τ) + Σ_{i∈N} exp(z·m_i/τ) ) ]    (5)

where Γ = Topk is the function that selects the indices of the top-K most similar embeddings, and N_+ is their index set in the memory bank. Compared to Equation (1), Equation (5) leads to a more regular space by pulling close more high-confidence positive samples. Additionally, since we do not have any labels, a larger K may harm the contrastive performance (see Section 4.3).
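A sketch of Equations (2)-(5) under the same conventions as the previous snippet (normalized embeddings, a shared memory bank); the tensor shapes and the default K are illustrative assumptions:

import torch
import torch.nn.functional as F

def knowledge_mining_loss(z, z_hat, memory, k=1, tau=0.07):
    # Eq. (2): contrastive context, similarities S between z and the bank entries.
    z, z_hat = F.normalize(z, dim=1), F.normalize(z_hat, dim=1)
    s = torch.einsum('bc,mc->bm', z, memory)              # S = {z . m_i}, shape (B, M)
    # Eqs. (3)-(4): knowledge miner Gamma = Topk picks the index set N_+.
    topk = s.topk(k, dim=1).indices
    pos_mask = torch.zeros_like(s, dtype=torch.bool).scatter_(1, topk, True)
    exp_all = torch.exp(s / tau)
    exp_aug = torch.exp((z * z_hat).sum(dim=1) / tau)     # the original augment positive
    # Eq. (5): mined neighbors join the numerator as extra positives.
    numer = exp_aug + (exp_all * pos_mask).sum(dim=1)
    denom = exp_aug + exp_all.sum(dim=1)
    return -torch.log(numer / denom).mean()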
Cross-View Consistency Learning. Considering the easily obtained multi-view skeleton data, complementary information preserved in different views can assist the mining of positive pairs from similar negative samples in Figure 1. The number of hidden positive pairs can then be boosted by cross-view knowledge communication, resulting in better extracted skeleton features. To this end, we design cross-view consistency learning, which not only mines high-confidence positive samples from the complementary view but also keeps the embedding context consistent across views. Its two-view case is illustrated in Figure 3.

Specifically, samples x^u and x^v are generated from the same raw data x by the view generation method in Section 3.3, where u and v indicate two types of data views. After single-view contrastive learning, two SkeletonCLR …
[Figure 3. (a) CrosSCLR. Given two samples x^u, x^v generated from the same raw data, e.g., joint and motion, SkeletonCLR models produce single-view embeddings, while cross-view consistent knowledge mining (CVC-KM) exchanges multi-view complementary knowledge. (b) How L_{v→u} works in embedding space. In step 1, we mine high-confidence knowledge N_+^v from the similarities S^v to boost the positive set of view u, i.e., z^u shares z^v's neighbors; in step 2, we use the similarities S^v to supervise the embedding distribution in view u, so that z^u and z^v share similar relationships with the other samples. Thus, the two embedding spaces become similar under the constraint of L_cross.]
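The cross-view losses L_{u→v}, L_{v→u} and Equations (6)-(8) fall outside this excerpt, but step 1 of Figure 3(b) can be sketched as follows: the neighbors N_+^v mined in view v serve as extra positives for view u. This is a hedged illustration of that step only; the function name, the assumption that the two memory banks are index-aligned (updated in lockstep), and all defaults are ours, and the distribution-supervision term of step 2 is omitted.

import torch
import torch.nn.functional as F

def cross_view_mining_loss(z_u, z_hat_u, memory_u, z_v, memory_v, k=1, tau=0.07):
    # Step 1 of Figure 3(b): mine high-confidence neighbors N_+^v in view v ...
    s_v = torch.einsum('bc,mc->bm', F.normalize(z_v, dim=1), memory_v)   # S^v
    nplus_v = s_v.topk(k, dim=1).indices
    # ... and let z^u share z^v's neighbors: N_+^v becomes the positive set in view u.
    z_u, z_hat_u = F.normalize(z_u, dim=1), F.normalize(z_hat_u, dim=1)
    s_u = torch.einsum('bc,mc->bm', z_u, memory_u)   # memory_u, memory_v index-aligned
    pos_mask = torch.zeros_like(s_u, dtype=torch.bool).scatter_(1, nplus_v, True)
    exp_all = torch.exp(s_u / tau)
    exp_aug = torch.exp((z_u * z_hat_u).sum(dim=1) / tau)
    numer = exp_aug + (exp_all * pos_mask).sum(dim=1)
    denom = exp_aug + exp_all.sum(dim=1)
    return -torch.log(numer / denom).mean()   # L_{v->u}; swap the views for L_{u->v}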
where a_{12}, a_{13}, a_{21}, a_{23}, a_{31}, a_{32} are shear factors randomly sampled from [−β, β], and β is the shear amplitude. The sequence x is multiplied by the transformation matrix A on the channel dimension; the human pose in 3D coordinates is thus inclined at a random angle.

Crop is an augmentation on the temporal dimension that symmetrically pads some frames to the sequence and then randomly crops it to the original length. The padding length is defined as T/γ, where γ is denoted the padding ratio. The padding operation uses the reflection of the original boundary.
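A sketch of the two augmentations with the parameters β and γ defined above; the tensor layout (3 coordinate channels, T frames, V joints, M persons) and the sampling details are assumptions, and the padding length is taken to be at least 1:

import numpy as np

def shear(x, beta=0.5):
    # x: (3, T, V, M) skeleton sequence; shear the 3D coordinates by a random matrix A.
    a = np.random.uniform(-beta, beta, size=6)  # a12, a13, a21, a23, a31, a32
    A = np.array([[1.0,  a[0], a[1]],
                  [a[2], 1.0,  a[3]],
                  [a[4], a[5], 1.0 ]])
    return np.einsum('ij,jtvm->itvm', A, x)     # multiply on the channel dimension

def crop(x, gamma=6):
    # Reflect-pad T/gamma frames on both temporal ends, then crop back to T frames.
    T = x.shape[1]
    pad = max(T // gamma, 1)
    x_pad = np.concatenate(
        [x[:, pad:0:-1], x, x[:, -2:-pad - 2:-1]], axis=1)  # reflection of the boundary
    start = np.random.randint(0, 2 * pad + 1)
    return x_pad[:, start:start + T]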
4.3. Ablation Study

All experiments in this section are conducted on the NTU-60 dataset and follow the unsupervised pre-training and linear evaluation protocol described in Section 4.2.
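For reference, linear evaluation freezes the pre-trained encoder and trains only a linear classifier on top of it. A minimal sketch; the optimizer settings and the out_dim attribute are assumptions:

import torch
import torch.nn as nn
import torch.nn.functional as F

def linear_evaluation(encoder, train_loader, num_classes=60, epochs=100):
    encoder.eval()                                  # the encoder stays frozen
    for p in encoder.parameters():
        p.requires_grad = False
    clf = nn.Linear(encoder.out_dim, num_classes)   # out_dim: hidden size c_h (assumed)
    opt = torch.optim.SGD(clf.parameters(), lr=0.1, momentum=0.9)
    for _ in range(epochs):
        for x, y in train_loader:
            with torch.no_grad():
                h = encoder(x)                      # frozen representation
            loss = F.cross_entropy(clf(h), y)
            opt.zero_grad(); loss.backward(); opt.step()
    return clf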
Effectiveness of CrosSCLR. In Table 1, we separately pre-train SkeletonCLR and jointly pre-train CrosSCLR models on different skeleton views, e.g., joint, motion, and bone, and adopt linear evaluation on each view of the models. Table 1 reports that: 1) CrosSCLR improves the capability of each single SkeletonCLR model, e.g., CrosSCLR-joint (79.88) vs. SkeletonCLR-joint (76.44) on the xview protocol; 2) CrosSCLR bridges the performance gap between the two views and jointly improves their accuracy, e.g., for SkeletonCLR, joint (76.44) vs. motion (50.82), but for CrosSCLR, joint (79.88) vs. motion (77.59); 3) CrosSCLR improves the multi-view ensemble results via cross-view training. In summary, the cross-view high-confidence knowledge does help the model extract more discriminating representations.

Qualitative Results. We apply t-SNE [30] with fixed settings to show the embedding distributions of SkeletonCLR and CrosSCLR at epochs 150, 200, 250, and 300 during pre-training in Figure 4. Note that the cross-view loss, Equation (8), becomes available only after epoch 150. From the visual results, we can draw a conclusion similar to that of Table 1. Embeddings of CrosSCLR are clustered more closely than those of SkeletonCLR, i.e., they are more discriminating. For CrosSCLR, the distributions of joint and motion are distinct at epoch 150 but look very similar at epoch 300, i.e., the distributions become consistent. In particular, both build a more "regular" space than SkeletonCLR, proving the effectiveness of CrosSCLR.
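The visualization can be reproduced with off-the-shelf t-SNE; a sketch with fixed settings (the perplexity and figure styling are assumptions):

import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(z, labels, epoch):
    # z: (N, C) embeddings of samples from 10 categories; labels: (N,) category ids.
    xy = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(z)
    plt.scatter(xy[:, 0], xy[:, 1], c=labels, cmap='tab10', s=4)
    plt.title(f'Epoch {epoch}')
    plt.savefig(f'tsne_epoch_{epoch}.png')
    plt.close()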
[Figure 4. The t-SNE visualization of embeddings at different epochs during pre-training (rows: SkeletonCLR (joint), CrosSCLR (joint), CrosSCLR (motion); columns: epochs 150, 200, 250, 300). Embeddings from 10 categories are sampled and visualized with different colors. For CrosSCLR, L_cross starts to be available at epoch 150, so its distribution shows no difference from that of SkeletonCLR before epoch 150, shown in red boxes.]

top-K   xsub   xview
0       70.5   77.4
1       74.5   82.1
3       73.7   79.9
5       72.4   79.2
7       73.0   78.6
10      64.4   69.9

Table 2. Results of pre-training 2s-CrosSCLR with various K in the knowledge miner Γ (NTU-60, %).

Shear β   Crop γ   xsub   xview
0         0        33.3   26.2
0.2       0        62.7   67.7
0.5       0        66.3   68.8
1.0       0        62.0   66.8
0.5       4        67.6   76.3
0.5       6        68.3   76.4
0.5       8        69.1   74.7

Table 3. Ablation study on different data augmentations for SkeletonCLR (joint) (NTU-60, %).
Effects of Contrastive Setting top-K. As the hyper-parameter K determines the number of mined samples, influencing the depth of knowledge exchange, we study how K impacts the performance of cross-view learning. Table 2 shows that K has a great influence on the performance, with the best result achieved at K = 1. A larger K decreases the performance, because less confident information may lead the model astray in the unsupervised case.

Contrastive Setting N_+ and Embedding Context S. We develop the following models in Table 4 for comparison: 1) SkeletonCLR + L_KM, a model with single-view knowledge mining; 2) CrosSCLR w/o embedding context (EC), a model that uses only the contrastive setting N_+ for cross-view learning and ignores the embedding context/distribution, i.e., S_i^v = 1, ∀i ∈ N in Equation (6). The results of SkeletonCLR + L_KM show that KM improves the representation capability of SkeletonCLR. Additionally, CrosSCLR achieves worse performance without the embedding context (EC), proving the significance of the similarity/distribution among samples.

Method               Pre-training     xsub   xview
SkeletonCLR          Joint            68.3   76.4
SkeletonCLR + L_KM   Joint            69.3   77.4
CrosSCLR w/o EC      Joint + Motion   71.4   78.5
CrosSCLR             Joint + Motion   72.9   79.9

Table 4. Ablation study on the contrastive setting N_+ and the embedding context (EC) (NTU-60, %). The models are linearly evaluated on joint only.

Effects of Augmentations. SkeletonCLR and CrosSCLR are based on contrastive learning, but the data augmentation strategy for skeleton data has rarely been explored, especially for GCN encoders. We verify the effectiveness of data augmentation and the impact of different augmentation intensities in skeleton-based contrastive learning by conducting experiments on SkeletonCLR, as shown in Table 3, which indicates the importance of data augmentation in SkeletonCLR. We choose β = 0.5 and γ = 6 as the default settings according to the mean accuracy on the xsub and xview protocols.
4.4. Comparison

We compare CrosSCLR with other methods under the linear evaluation and finetune protocols. Since the backbone in many methods is an RNN-based model, e.g., a GRU or LSTM, we additionally use an LSTM (following the setting in [39]) as the encoder for a fair comparison, i.e., CrosSCLR (LSTM).

Unsupervised Results on NTU-60. In Table 5, LongT GAN [65] adversarially trains the model with a skeleton-inpainting pretext task; MS²L [26] trains the model with a multi-task scheme, i.e., prediction, jigsaw puzzle, and instance discrimination; AS-CAL [39] uses a momentum LSTM encoder for contrastive learning with single-view skeleton sequences; P&C [47] trains a stronger encoder by weakening the decoder; and SeBiReNet [34] constructs a human-like GRU …
Method               Encoder   Classifier   xsub   xview
LongT GAN [65]       GRU       FC           39.1   48.1
MS²L [26]            GRU       GRU          52.6   -
AS-CAL [39]          LSTM      FC           58.5   64.8
P&C [47]             GRU       KNN          50.7   76.3
SeBiReNet [34]       GRU       LSTM         -      79.7
3s-CrosSCLR (LSTM)   LSTM      FC           62.8   69.2
3s-CrosSCLR (LSTM)   LSTM      LSTM         70.4   79.9
3s-CrosSCLR‡         ST-GCN    FC           72.8   80.7
3s-CrosSCLR          ST-GCN    FC           77.8   83.4

Table 5. Unsupervised results on NTU-60 (%). These methods are pre-trained to learn the encoder and then follow the linear evaluation protocol to learn the classifiers. "‡" indicates the model pre-trained on NTU-61-120.

Method           Label Fraction   xsub   xview
LongT GAN [65]   1%               35.2   -
MS²L [26]        1%               33.1   -
3s-CrosSCLR      1%               51.1   50.0
LongT GAN [65]   10%              62.0   -
MS²L [26]        10%              65.2   -
3s-CrosSCLR      10%              74.4   77.8

Table 7. Linear classification with fewer labels on NTU-60 (%).

                      NTU-60 (%)       NTU-120 (%)
Method                xsub    xview    xsub    xset
3s-ST-GCN∗ [57]       85.2    91.4     77.2    77.1
3s-CrosSCLR‡ (FT)     85.6    92.0     -       -
3s-CrosSCLR (FT)      86.2    92.5     80.5    80.4
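For reference, the "3s-" entries above fuse the joint, motion, and bone streams at the score level; a minimal late-fusion sketch, with equal stream weights as an assumption:

import numpy as np

def three_stream_ensemble(scores_joint, scores_motion, scores_bone,
                          weights=(1.0, 1.0, 1.0)):
    # scores_*: (N, num_classes) classification scores from the three skeleton views.
    fused = (weights[0] * scores_joint
             + weights[1] * scores_motion
             + weights[2] * scores_bone)
    return fused.argmax(axis=1)                     # predicted action per sample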