1. Introduction
Action recognition [1,2,3,4,5] constitutes a pivotal branch within the computer vision field, dedicated to identifying human or object behaviors and actions through the analysis of visual information contained in video sequences or real-time video streams. This technology plays a crucial role in diverse applications such as human–computer interaction [6,7,8,9,10], health rehabilitation [11,12,13], and sports analysis [14,15,16]. The advent of depth sensors, exemplified by the Kinect [17], has facilitated easy access to human skeleton joint data. Currently, skeleton-based action recognition has garnered substantial interest for its computational efficiency and inherent robustness against variations in lighting conditions, viewpoints, and background noise.
Research on skeleton-based action recognition [18,19,20,21] can, from the perspective of network architecture, be broadly categorized into four types: methods based on Recurrent Neural Networks (RNNs) [22,23,24], methods based on Convolutional Neural Networks (CNNs) [15,25], methods based on Graph Convolutional Networks (GCNs) [26,27,28], and Transformer-based methods [9,19,29]. A frequently employed pipeline converts raw skeleton data into point-sequence or graph-structured formats and then applies the aforementioned deep learning techniques for feature extraction. RNN-based methods [30,31] recursively process data sequences and effectively capture temporal dependencies, but they struggle with complex spatio-temporal data and long-term dependencies. CNN-based methods [18,32] perform convolutional operations within designated spatial or spatio-temporal windows to progressively extract higher-level features, exhibiting translation invariance. GCN-based methods [33,34,35,36] leverage the graph topology of the human skeleton to capture the relationships between different nodes; however, they are constrained in their ability to identify relationships between nodes that are not directly edge-connected (e.g., “head” and “feet”). Transformer-based methods [20,29] benefit from the self-attention mechanism, offering advantages in modeling long-distance dependencies and unconnected nodes, and have gradually become one of the most popular research frameworks in the community. Consequently, this work aims to explore a more effective Transformer-based skeleton activity representation (Figure 1).
To enhance the skeleton-based activity representation, researchers often introduce additional modalities, such as video (RGB) and depth image sequences [37,38,39], as supplementary information. Nevertheless, processing and computing on these additional modalities incurs extra computational overhead. We therefore seek a learning strategy that balances performance and cost while effectively representing skeleton activity. Xiang et al. [21] proposed a cross-modal skeleton activity recognition method called Generative Action-description Prompts (GAP), which introduces a pre-trained large language model to generate textual descriptions of body parts’ actions and uses them as supervised information to constrain the optimization of different body parts in the skeleton modality. On the one hand, GAP prompts further reflection on the role of textual descriptions in skeleton-based action recognition. There are visual semantic similarities among different body actions; for instance, “side kick” and “kicking” both involve leg movements, but skeleton data alone fails to capture the nuanced motion patterns of these fine-grained behaviors [4]. Language, however, can provide a more nuanced and discerning form of guidance. On the other hand, there is implicit synergy among local body movements when a specific action occurs; for instance, the “head” and “hands” undergo simultaneous spatio-temporal displacements during the action “sneeze”. Consequently, sufficiently mining the semantic associations among these local body movements poses a significant challenge.
To alleviate the above two problems, we propose a fine-grained cross-modal skeleton action recognition approach, namely Linguistic-Driven Partial Semantic Relevance Learning (LPSR), which consists of two major components: the Partial Semantic Consistency Constraints (PSCC) and the Cyclic Attention Interaction Module (CAIM). In PSCC, we leverage a current state-of-the-art large language model to generate detailed local body movement descriptions, as well as a global description of the action, using skeleton point visualizations and text labels as inputs. The multiple local body descriptions guide the model to learn finer-grained representations of skeleton body movements, where a Kullback–Leibler (KL) consistency loss constructs local semantic consistency associations across modalities. The global textual description is then associated (as key and value) with the global skeleton feature (as query) to learn a more discriminative action feature via cross-attention. Furthermore, considering the semantic synergy between local body movements, we design the CAIM module to model the implicit relations between them. The local body parts studied in this paper are the “head”, “arm”, “hand”, “hip”, “leg”, and “foot”; this selection follows the dataset’s division of the human body into 25 joints, from which we segment the body into local parts. In summary, the main contributions of this paper are as follows:
We propose a novel Linguistic-Driven Partial Semantic Relevance Learning (LPSR) framework for skeleton-based action recognition. The framework leverages the powerful zero-shot capability of multi-modal large language models to generate global and local textual descriptions of skeleton actions, and constructs cross-modal partial semantic consistency constraints to guide the model to learn a more discriminative representation;
We propose a Cyclic Attention Interaction Module (CAIM) to mine the implicit semantic associations between different body movements, fully exploiting the potential of synergistic relationships of local body movements in global action understanding.
We conduct extensive ablation studies on two popular benchmarks, NTU RGB+D 60 and NTU RGB+D 120, and the experimental results demonstrate the effectiveness of the proposed method. In addition, compared with previous Transformer-based methods, our method achieves state-of-the-art results under the same setup conditions.
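As a concrete illustration of the cross-modal aggregation sketched above, the following NumPy snippet shows single-head cross-attention in which the global skeleton feature serves as the query and the text description embedding as key and value. The projection matrices are randomly initialized and the feature dimensions are hypothetical; the paper's actual encoders and sizes are not specified in this section.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(skel_feat, text_feat, d_k=64, seed=0):
    """Single-head cross-attention: skeleton feature -> query,
    text description tokens -> key/value (illustrative only)."""
    rng = np.random.default_rng(seed)
    d = skel_feat.shape[-1]
    W_q = rng.standard_normal((d, d_k)) / np.sqrt(d)
    W_k = rng.standard_normal((d, d_k)) / np.sqrt(d)
    W_v = rng.standard_normal((d, d_k)) / np.sqrt(d)
    Q, K, V = skel_feat @ W_q, text_feat @ W_k, text_feat @ W_v
    attn = softmax(Q @ K.T / np.sqrt(d_k))   # (n_skel, n_text), rows sum to 1
    return attn @ V                          # text-enriched skeleton feature

skel = np.random.default_rng(1).standard_normal((1, 256))   # global skeleton token
text = np.random.default_rng(2).standard_normal((4, 256))   # description tokens
out = cross_attention(skel, text)
print(out.shape)   # (1, 64)
```

The attention weights here encode how much each text token contributes to the fused action feature.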
2. Related Works
Skeleton-based Action Recognition. Skeleton-based action recognition [35,40,41,42] is a technique for recognizing human movements by capturing and analyzing the motions of human skeleton joints. Human joint trajectories [27,43] offer a detailed perspective on human movement, largely due to the spatial information they encompass and their strong correlation with adjacent joint nodes. However, skeleton information is often sparse and noisy, which makes representing it challenging. This sparsity becomes evident when distinguishing between similar actions, such as “brushing teeth” and “brushing hair”, which are almost identical in body movement and rely heavily on hand movements for accurate identification [4]. Recently, deep learning, propelled by advances in high-performance computing, has shown remarkable capabilities in extracting complex features. One area where deep learning is particularly effective is in processing time-series data through Recurrent Neural Networks (RNNs) [44,45,46,47], which excel at learning dynamic dependencies within such data. However, RNNs face limitations in modeling spatial dependencies among skeleton joints. To address this, Du et al. [24] proposed an end-to-end hierarchical RNN framework. Complementing this approach, Yang et al. [48] introduced group sparse regularization, which centers on investigating the concurrent characteristics of skeleton joints, providing a more profound comprehension of their interrelations.
In addition to RNN-based approaches, Convolutional Neural Networks (CNNs) [25,32,43,49] are well-regarded for their capability in extracting features and learning spatial structure, and have been successfully utilized to process spatio-temporal data in skeleton analysis. Wang et al. [43] and Li et al. [18] encode skeleton sequence data into images and feed them into a CNN for action recognition, producing a skeleton spectrogram and a joint trajectory map, respectively. Wang et al. [50] converted skeleton joints into multiple 2D pseudo-images to suit the CNN’s input requirements, enabling the network to capture spatio-temporal characteristics. Additionally, Xu et al. [49] introduced a purely CNN-based structure known as Topology-aware CNN, designed to enhance the modeling of irregular skeleton topologies by CNNs.
Yet, the aforementioned techniques struggle to grasp inter-joint correlations. To address this, Yan et al. [26] depict the human body as a graph, characterizing joint connections with an adjacency matrix, and introduce the Spatio-Temporal Graph Convolutional Network (ST-GCN), which convolves over both the temporal and spatial dimensions to model skeleton data efficiently. In addition, combining semantic information of human joints and frames [21,51] has been shown to enrich the expressiveness of skeleton features, thus improving recognition accuracy. Diverging from these graph-centric methods, our approach models skeleton data using Linguistic-Driven Partial Semantic Relevance Learning, offering a distinctive outlook that could yield novel insights in action recognition and pose estimation.
Transformer-based Action Recognition. In recent years, there has been a notable shift in Natural Language Processing (NLP) [1,51,52] towards Transformer architectures [53] as a replacement for traditional networks. Owing to the powerful long-range temporal modeling capability of the self-attention module, there has been growing interest in applying Transformers to action recognition. While most existing approaches use video frames as input tokens [54,55], a limited number of techniques integrate skeleton data [9,19] within the Transformer architecture. Nonetheless, the computational demands of Transformer-based action recognition are substantial, given the self-attention mechanism’s application to numerous 3D tokens in videos. Self-attention has become increasingly popular in computer vision and has been applied to a variety of tasks, including image classification and segmentation [56,57], object detection [58], and action recognition [20,52]. In video action recognition, Ref. [52] used self-attention to learn spatio-temporal features from frame-level patch sequences, and Ref. [20] uses self-attention in skeleton-based action recognition instead of regular graph convolution. In contrast, our approach relies solely on self-attention to model skeleton data and calculates the correlations of all joints across multiple consecutive frames simultaneously.
Language Model in Skeleton-Based Action Recognition. Significant progress has been made in natural language processing with the introduction of models such as Bidirectional Encoder Representations from Transformers (BERT) [59]. These models are pre-trained to understand and generate complex text [60,61,62,63], capturing linguistic nuances and deeper meanings. Despite its effectiveness, BERT was initially constrained to single-task adaptations, which limited its efficiency. In response, Prompt Learning (PL) was introduced; this technique [63,64] enhances the adaptability of pre-trained LLMs to multiple tasks by adding specific textual parameters to the model’s input.
The principles of PL and Transformer-based learning have been extended to skeleton-based action recognition. A notable example is GAP [21], which adopts the Contrastive Language–Image Pretraining (CLIP) training paradigm for skeleton action recognition and incorporates an additional Transformer layer that significantly improves skeleton-based action recognition. In this framework, prompt learning is employed to construct skeleton-to-text correspondences: textual prompts allow GPT-3 [61] to generate detailed descriptions for different skeleton action categories for multimodal representation learning. This advancement demonstrates the great potential of Transformer-based modeling and PL techniques for enhancing human action understanding from skeleton data. In contrast, we use GPT-4 [60] as a knowledge engine to enhance the understanding of actions. Textual prompts and intuitive motion dynamics diagrams are input to generate global descriptions of human motion and local descriptions of the different limb motions within an action, further optimizing local behavioral learning and thus improving the quality of the learned representations. In addition, we aggregate global skeleton point representations and textual representations to form a cross-modal behavioral representation with broader applicability.
4. Experiments
In this section, extensive comparative experiments are conducted to demonstrate the effectiveness of our proposed method. The evaluation begins with a detailed description of the datasets utilized in our study. Following this, we outline the experimental setup. Subsequently, we conduct ablation studies on the NTU RGB+D skeleton data to determine the individual contribution of each component of our method. Finally, we compare the proposed method with existing state-of-the-art approaches on both the NTU RGB+D 60 and NTU RGB+D 120 skeleton datasets.
4.1. Datasets
NTU RGB+D 60. The NTU RGB+D 60 dataset [65], a comprehensive resource for 3D human activity analysis, was developed and released by researchers at Nanyang Technological University, Singapore. This large-scale dataset comprises a diverse array of data types, including RGB, depth, infrared, and skeleton data. It encompasses 56,880 samples covering 60 human activity categories. The extensive size and varied nature of this dataset facilitate rigorous cross-subject (X-Sub) and cross-view (X-View) evaluations: X-Sub divides the dataset according to subject ID, with the training and test sets each containing 20 subjects, while X-View divides the dataset according to camera ID. The dataset has contributed substantially to advancements in 3D human activity analysis.
NTU RGB+D 120. The NTU RGB+D 120 dataset [66] extends the NTU RGB+D 60 dataset, retaining all of its data and incorporating 60 additional categories. This expansion yields 120 categories in total, with 57,600 newly added video samples, bringing the aggregate number of samples to 114,480. It features high-resolution RGB videos at 1920 × 1080 pixels, while the depth maps and IR videos are captured at a resolution of 512 × 424. The 3D skeleton data includes the coordinates of 25 body joints per frame. For experimental assessment, the dataset offers two benchmarks: (1) cross-subject (X-Sub) and (2) cross-setup (X-Set). For X-Sub, the 106 subjects are split into training and testing groups of 53 subjects each. X-Set takes samples with even collection-setup IDs as the training set and samples with odd setup IDs as the test set.
4.2. Experimental Setup
We follow the data processing procedure of [34] for NTU RGB+D 60 and NTU RGB+D 120. The skeleton encoder uses STTformer as the backbone network to extract skeletal features and is trained with the Stochastic Gradient Descent (SGD) optimizer with momentum, a standard cross-entropy classification loss, weight decay, and a batch size of 110. The learning rate is reduced by a factor of 10 at epochs 60 and 80, and a warm-up strategy is applied during the first five epochs. For the text encoder, we load pre-trained weights and perform only inference on the text descriptions (without training) to encode the text features; a temperature hyperparameter is used in the contrastive loss. We use PyTorch, and all experiments are conducted on 2×Titan RTX 3090 GPUs. For a fair comparison, all settings are kept the same except for the subject under exploration.
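The warm-up and step-decay schedule described above can be sketched as a simple epoch-to-rate function. The base learning rate shown here (0.1) is hypothetical, since the exact value was not recoverable from the text:

```python
def learning_rate(epoch, base_lr, warmup_epochs=5, milestones=(60, 80), gamma=0.1):
    """Linear warm-up over the first five epochs, then a step decay
    by a factor of 10 at epochs 60 and 80."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    return base_lr * gamma ** sum(epoch >= m for m in milestones)

# With a hypothetical base_lr of 0.1:
print([learning_rate(e, 0.1) for e in (0, 10, 60, 80)])
```

In PyTorch, the decay portion corresponds to `torch.optim.lr_scheduler.MultiStepLR` with `milestones=[60, 80]` and `gamma=0.1`.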
4.3. Ablation Study
In this section, we investigate the effectiveness of the proposed method through several experiments on the bone modality of the NTU RGB+D 60 skeleton dataset.
Ablation study for the Cyclic Attention Interaction Module (CAIM). To validate the potential synergy of limb motions, we design the CAIM module and perform ablation validation; the results are recorded in Table 1. The notation “partial features (mean)” indicates that the global skeleton features are decoupled (into head, hands, arms, hips, legs, and feet) and then directly fused with an average pooling layer, aggregating each partial limb feature over the temporal (T) and joint (V) dimensions. The experimental results validate the effectiveness of the proposed CAIM module. In contrast, the direct fusion of multiple partial limb features (mean) yields limited performance improvement. Using CAIM to mine the synergy of each limb’s motion with other nodes has a positive impact on the action recognition of skeleton sequences.
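The “partial features (mean)” baseline can be sketched as follows. The grouping of the 25 joints into six parts below is hypothetical, as the exact joint indices are not listed in the text:

```python
import numpy as np

# Hypothetical assignment of the 25 NTU joints to six body parts.
PARTS = {
    "head": [2, 3, 20],
    "arms": [4, 5, 8, 9],
    "hands": [6, 7, 10, 11, 21, 22, 23, 24],
    "hips": [0, 1, 12, 16],
    "legs": [13, 14, 17, 18],
    "feet": [15, 19],
}

def partial_features_mean(x):
    """x: skeleton features of shape (C, T, V), V = 25 joints.
    Decouple into per-part features and average-pool over the temporal (T)
    and joint (V) dimensions, giving one C-dim vector per body part."""
    return {name: x[:, :, idx].mean(axis=(1, 2)) for name, idx in PARTS.items()}

feats = np.random.default_rng(0).standard_normal((64, 30, 25))  # C, T, V
parts = partial_features_mean(feats)
print(len(parts), parts["head"].shape)   # 6 (64,)
```

The six pooled vectors would then be averaged (or concatenated) into a single representation, which is the direct-fusion baseline that CAIM improves upon.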
Ablation study for Partial Semantic Consistency Constraints (PSCC). To verify the consistency-constraint effect of local language descriptions on limb motion and the enhancement that global descriptions bring to the global skeleton representation, several ablation experiments are conducted. First, the outcomes of the experiments utilizing partial and global descriptions, respectively, are presented in Table 2. Recognition with skeleton models alone, without accompanying description information, yields the lowest accuracy. After introducing partial descriptions, we observe a significant performance improvement, indicating that more detailed descriptive information about partial motion can effectively guide the model to learn more discriminative skeleton representations. The utilization of global descriptions also enhances recognition performance. Notably, the optimal result is achieved by combining partial and global descriptions.
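As a minimal sketch of the consistency constraint, the snippet below assumes a KL divergence that pulls each skeleton part's class distribution toward the distribution induced by its text description; the exact direction and temperature used in the paper are not given in this section:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl_consistency(skel_logits, text_logits, eps=1e-8):
    """Mean KL(text || skeleton) over samples: aligns the skeleton branch's
    distribution with the one induced by the part's text description."""
    p = softmax(text_logits)   # target distribution from language
    q = softmax(skel_logits)   # distribution from the skeleton branch
    return float((p * (np.log(p + eps) - np.log(q + eps))).sum(axis=-1).mean())

skel = np.array([[2.0, 0.5, -1.0]])
print(kl_consistency(skel, skel))                              # identical -> 0.0
print(kl_consistency(skel, np.array([[0.0, 2.0, 0.0]])) > 0)   # mismatched -> positive
```

Minimizing this term for each of the six body parts drives the skeleton features toward the semantics of the corresponding local description.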
Furthermore, we assess the validity of different partial descriptions for prediction, as shown in Table 3. The results obtained using a single local description are marginally higher than the baseline, and the highest gain is achieved by using all six local partial descriptions corresponding to limb motions.
Finally, we ablate distinct text encoders and record the results in Table 4. A comparison is conducted between four text encoders: BERT [59], DistilBERT [67], RoBERTa [68], and CLIP [63]. RoBERTa exhibits the best performance; given its commendable balance between efficiency and accuracy, RoBERTa is selected as the text encoder for this study.
Ablation studies for different modules. Having performed distinct ablation studies on the separate sub-components of the Cyclic Attention Interaction Module (CAIM) and the Partial Semantic Consistency Constraints (PSCC) in the prior experiments, this part provides ablation confirmation of the overall framework, as shown in Table 5. Integrating the CAIM module into the baseline model enhances its performance, indicating that cyclic attention interaction improves the model’s effectiveness. This improvement can be attributed to the CAIM module’s capacity to effectively explore the implicit semantic relationships between different limb motions, thereby fully leveraging the synergistic potential of local limb motions within the global action context. Furthermore, the PSCC module improves performance by capitalizing on linguistic supervision and domain-specific knowledge of the global action and local limb motions, enabling the model to learn more discriminative representations of skeleton actions. The complete LPSR approach achieves optimal performance on both X-Sub and X-View. While each component of LPSR contributes differently to the overall performance, their combined effect significantly enhances the model’s accuracy on skeleton data.
Visualization results. To illustrate the efficacy of our methodology more visually, we selected 20 action categories each from NTU-60 and NTU-120 and compare the baseline and our method using confusion matrices, as illustrated in Figure 4. In NTU-60, actions such as “reading”, “taking off a shoe”, “playing with a phone”, and “typing on a keyboard” exhibited poorer classification performance. Our method significantly outperforms the baseline on these actions thanks to the text branch, which generates descriptions for the different body parts involved. However, the performance of our method degrades for actions such as “tear up paper”, “phone call”, and “cutting paper”, probably due to the difficulty of recognizing objects from the skeleton. The generated text descriptions are mainly related to objects and local limbs; for example, “cutting paper” and “tear up paper” both involve paper and hand descriptions, but because of the fine-grained nature of the skeletal data (small discriminative differences between actions), the final prediction may be guided by the text toward incorrect categories sharing the same objects or local limbs. Overall, our language-assisted action recognition method shows a marked improvement.
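For reference, the per-class counts underlying such confusion matrices can be computed with a few lines of code (a generic sketch, not the paper's plotting code):

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Row i, column j counts samples of true class i predicted as class j;
    the diagonal holds the correctly classified samples."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

y_true = [0, 0, 1, 2, 2, 2]
y_pred = [0, 1, 1, 2, 2, 0]
cm = confusion_matrix(y_true, y_pred, 3)
print(cm.diagonal().sum() / cm.sum())   # overall accuracy = 4/6
```

Off-diagonal mass between visually similar classes (e.g., “cutting paper” vs. “tear up paper”) is exactly what the matrices in Figure 4 make visible.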
4.4. Comparison with the State-of-the-Art Methods
We compare the performance of our LPSR method with current state-of-the-art methods on two datasets, NTU RGB+D 60 and NTU RGB+D 120; the recognition accuracies are shown in Table 6. In our study, four data streams are used: bone, bone motion, joint, and joint motion. We compare against other state-of-the-art methods, including those based on LSTMs, GCNs, and Transformers.
Compared with LSTM-based approaches, our proposed LPSR framework shows a marked improvement, as the core limitation of LSTM-based methods lies in their struggle to effectively capture the spatial relationships between joints and bodily segments. GCN-based methods, on the other hand, adeptly leverage the spatio-temporal characteristics of skeleton data, leading to superior recognition capability. Compared with GCN-based approaches, our LPSR methodology demonstrates distinct advantages, primarily due to the linguistic supervision that steers behavior recognition: this supervision harnesses actionable insights from the interplay of movements and body parts, enriching the model’s representational power. Moreover, LPSR sets a new benchmark against Transformer-based counterparts. Ultimately, the consistent outperformance of LPSR across varied datasets underscores its efficacy and robustness as a state-of-the-art method in behavior recognition.
5. Conclusions
This study proposes a novel Linguistic-Driven Partial Semantic Relevance Learning (LPSR) framework for skeleton-based action recognition, which contains two major sub-modules: the Cyclic Attention Interaction Module (CAIM) and the Partial Semantic Consistency Constraints (PSCC). In comparison to previous methods, we introduce a more comprehensive multi-modal large language model to generate more detailed linguistic descriptions of global actions and partial limb motions. In PSCC, we generate multiple local body descriptions to guide the model to learn finer-grained representations of skeleton body motions. In addition, considering the semantic synergy between partial body motions, we propose the CAIM module to model the implicit relations between them. Extensive ablation experiments demonstrate the efficacy of the method presented in this paper, which achieves performance comparable to current state-of-the-art methods.
One limitation of our current approach is its reliance on fully supervised training, which constrains its applicability in real-world scenarios where annotated data may be scarce. Future research will explore recognizing skeletal behaviors under weakly supervised or unsupervised conditions to broaden the practical utility of our methods. Another limitation is that the training and test distributions in our skeletal action recognition task differ only slightly, so the model’s performance suffers when generalizing to new, unseen action classes. Consequently, enhancing the classification performance and generalization capability of our model in zero-shot skeletal behavior recognition will be a primary focus of future work.