
GaitPT: Skeletons Are All You Need For Gait Recognition

Andy Catruna, Adrian Cosma, Emilian Radoi


University Politehnica of Bucharest
[email protected], ioan [email protected], [email protected]

Abstract

The analysis of patterns of walking is an important area of research that has numerous applications in security, healthcare, sports and human-computer interaction. Lately, walking patterns have been regarded as a unique fingerprinting method for automatic person identification at a distance. In this work, we propose a novel gait recognition architecture called Gait Pyramid Transformer (GaitPT) that leverages pose estimation skeletons to capture unique walking patterns, without relying on appearance information. GaitPT adopts a hierarchical transformer architecture that effectively extracts both spatial and temporal features of movement in an anatomically consistent manner, guided by the structure of the human skeleton. Our results show that GaitPT achieves state-of-the-art performance compared to other skeleton-based gait recognition works, in both controlled and in-the-wild scenarios. GaitPT obtains 82.6% average accuracy on CASIA-B, surpassing other works by a margin of 6%. Moreover, it obtains 52.16% Rank-1 accuracy on GREW, outperforming both skeleton-based and appearance-based approaches.

1. Introduction

Person re-identification is a long-standing problem in the field of computer vision and biometrics, with prevalent real-world applications in areas such as public monitoring, security systems, and access control [3, 20]. Currently, the most widespread methods of person identification typically involve processing biometric identifiers such as facial features [33], fingerprints [32], or iris patterns [2]. While these methods are effective in controlled environments, they are heavily reliant on the active cooperation of the subjects, which is often not feasible in real-world scenarios, especially in public spaces. Facial recognition systems require the face of the individual to be fully visible at reasonably high resolution to work reliably [23], and methods that rely on clothing matching [59] are bound to be ineffective across multiple days.
Gait analysis, the study of a person's unique walking pattern, has emerged as a promising alternative for person recognition [41]. Its study has gained significant attention [17], as walking and movement have the potential to be used as a unique fingerprint enabling individual recognition in a wide range of uncooperative scenarios. Gait has been used so far to identify individuals [39], estimate demographics [6, 31] and estimate emotions [36].

Most approaches for gait-based person recognition have relied on the use of human silhouettes [7, 14, 26], which contain both gait and appearance information. The use of silhouettes raises major privacy concerns [35], as they encode appearance information such as body composition, clothing and hairstyle. It is unclear how much the appearance information contained in silhouettes contributes to the successful recognition of a person on top of the gait information [61], which by definition implies temporal variation in the form of movement. For instance, Xu et al. [56] were able to construct a gait model that estimates the gender of an individual using only a single silhouette image. Such a model cannot be considered a gait-processing model, since it does not encode any form of temporal variation.

The use of human skeletons extracted with pretrained pose estimation models [15, 57] was proposed as a way to obtain gait information containing predominantly movement data [10, 22, 47], removing all appearance information except for limb length. This approach makes gait recognition more privacy-friendly in comparison to silhouette-based analysis. Methods typically use graph networks [29, 58] to process human skeleton sequences, borrowing methods from human action recognition tasks. Concurrently, there has been growing interest in the use of vision transformers for computer vision tasks [13, 51]. Models based on the transformer architecture [49] have achieved state-of-the-art results on a wide range of problems, including image classification [28], object detection [5], and segmentation [55]. However, their application to gait-based person recognition remains relatively underdeveloped, with limited success reported in the literature [9, 10].

Inspired by the recent advances in vision transformers [28, 51], we propose a hierarchical transformer model for skeleton gait recognition, which we call GaitPT. It is designed to capture both spatial and temporal information of human movement in an anatomically consistent manner, by modelling the micro and macro movements of individual joints and iteratively combining them until the motion of the entire body is processed. The hierarchical structure of our model enables a gradual abstraction of the gait information, resulting in a more effective representation of the full movement pattern compared to previous graph-based [22, 46] and transformer-based [10] approaches.

We evaluate our model on three popular gait recognition benchmarks: CASIA-B [60], a restrictive dataset captured in controlled scenarios, GREW [63], one of the largest gait recognition benchmarks in the wild, and Gait3D [62], an in-the-wild dataset containing uncooperative and erratic gait sequences. Our experiments show that GaitPT outperforms previous skeleton-based state-of-the-art methods [10, 22, 24, 25, 47]. The results indicate that our proposed hierarchical transformer architecture is a promising approach for gait-based person recognition.

This paper makes the following contributions:

• We propose Gait Pyramid Transformer (GaitPT), a novel skeleton-based gait processing architecture that achieves state-of-the-art results for skeleton gait recognition. We use anatomical priors in designing spatial and temporal attention blocks, which enable the hierarchical processing of human movement.

• We perform an extensive evaluation of gait recognition performance in both laboratory-controlled scenarios and realistic surveillance scenarios of erratic and uncooperative walking. We evaluate our architecture on CASIA-B [60], GREW [63] and Gait3D [62], three popular gait recognition benchmarks, obtaining an average accuracy increase of 6% over previous state-of-the-art methods.

• We conduct an ablation study on the architectural choices of GaitPT with the most impact on performance. In addition, we show that, for skeleton-based gait recognition, downstream accuracy is highly correlated with the upstream performance of the pose estimation model: the use of a state-of-the-art pose estimation model can result in upwards of 20% accuracy gain.

2. Related Work

The approaches for gait-based person recognition are classified into two main categories: appearance-based [7, 14, 26, 54] and model-based [10-12, 24, 46, 47]. Recent appearance-based solutions process sequences of human silhouettes, utilizing background subtraction [64] or instance segmentation [8, 18] models. Model-based approaches attempt to fit a custom model, such as an anatomical human model, over detected individuals across video frames. Generally, the anatomical model is the human skeleton obtained from pretrained human pose estimation networks [4, 15, 44, 57].

Appearance-based Approaches. Efforts in appearance-based methods have utilized sequences of silhouettes that are processed by a convolutional network [7, 14, 19, 26, 52]. For instance, the authors of GaitSet [7] propose an approach that represents gait as an unordered set of silhouettes. By adopting this representation, they argue that their approach is more adaptable to varying frame arrangements, walking directions and variations compared to vanilla silhouette sequences. GaitSet leverages convolution layers to extract image-level features from each silhouette and uses Set Pooling to aggregate them into a set-level feature. To produce the final output, the authors employ a custom version of Horizontal Pyramid Matching [16]. Fan et al. [14] observed that different parts of the human silhouette exhibit unique spatio-temporal patterns and thus require their specific expression. The authors proposed an architecture with a specialized type of convolution with a restricted receptive field, which enables processing specific parts of the body, more specifically the head, the upper body, and the legs. GLN [19] utilizes a convolutional feature pyramid to learn compact gait representations. According to Lin et al. [26], existing methods for gait analysis face a trade-off between capturing global versus local information. Methods that focus on global features may overlook important local details such as small movements, while methods that extract local features may not fully capture the relationships between them, losing the global context in the process. To address this limitation, the authors propose the GaitGL architecture, which incorporates a two-stream network that captures both global and local gait features simultaneously.

GaitSet [7], GLN [19] and GaitGL [26] all recognize that efficiently processing the hierarchical relationship between local and global features is an essential characteristic of a performant gait recognition method. Our GaitPT architecture builds upon this idea, utilizing an anatomically informed model based on the human skeleton to process the micro and macro movements of a walking individual. We chose to use a model-based approach since skeletons encode mainly movement data and can be considered privacy-friendly, while the use of silhouettes is more invasive, as the recognition performance can be influenced by appearance features.
Model-based Approaches. Processing a sequence of human skeletons generally entails the use of a graph network, developed primarily for skeleton action recognition problems. Networks such as ST-GCN [58] and MS-G3D [29] have been repurposed for gait recognition in several works [11, 22, 46, 47]. For instance, Li et al. [22] propose JointsGait, an architecture that leverages graph convolutions. They utilize the ST-GCN architecture [58] to capture the spatio-temporal features from sequences of skeletons. Furthermore, they use a Joints Relationship Pyramid Mapping which maps the extracted features into a more discriminative space by exploiting the areas of the body which naturally work together. Teepe et al. [46, 47] employ the ResGCN [43] architecture to capture spatio-temporal features from skeleton sequences. They train their architecture with the supervised contrastive objective [21] and leverage multiple augmentation techniques such as flipping, mirroring, and adding noise to the data. Cosma and Radoi [11] leverage surveillance footage to create a large-scale skeleton dataset, which they use to pretrain an ST-GCN architecture in a self-supervised manner, with good downstream transfer capabilities. Some works defer to using plain CNN / MLP models to process the sequence of skeletons. PoseGait, introduced by Liao et al. [24], involves extracting human keypoints from each video frame and computing hand-crafted features, including the angle of the limbs, the length of the limbs and joint motion, to facilitate the extraction of gait features. The model utilizes a CNN to model the temporal relationship between these features across frames, enabling effective recognition of gait patterns. Lima et al. [12] utilize a multilayer perceptron on individual skeletons to capture the spatial information. They use skeleton normalization based on the corresponding neck coordinate to remove information about the position in the image. The final embedding is obtained by temporally aggregating all the spatial features extracted with the MLP.

With the prevalent use of the transformer model [49] across most areas of deep learning, some works are exploring the use of transformer models for gait recognition [9, 10]. GaitFormer [10] was the first application of the transformer architecture to gait analysis problems, processing sequences of skeletons. However, the authors only employ temporal attention by flattening each skeleton, ignoring spatial relationships and low-level movements. Different from Cosma et al. [10], we utilize an anatomically informed model to construct spatial and temporal attention blocks, such that micro and macro movements are processed hierarchically. Our experiments show that our architecture is effective in modelling gait, outperforming previous state-of-the-art graph-based methods as well as transformer-based methods by a large margin.

3. Method

In line with other works [10, 47] in skeleton-based gait analysis, we operate on sequences of human skeletons extracted from RGB images using pretrained pose estimation models. Human pose estimators return a pose p which represents a set of 17 coordinates of the most important joints of the body. These can be defined as p = {j^1, j^2, ..., j^17}, where j^i stands for the i-th joint in the pose and contains the coordinates (x_i, y_i). We insert an additional 18th coordinate by duplicating the nose coordinate for symmetry purposes in our architecture. A walking sequence is obtained by concatenating consecutive poses Z = {p_1, p_2, ..., p_n}, with Z ∈ R^(n×18×2).
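To make the input format concrete, the following sketch (our own illustration, not the authors' released code) builds the walking-sequence array Z from per-frame 17-keypoint predictions and duplicates the nose as the 18th entry; the COCO joint ordering and the frame-width normalization (used later for CASIA-B) are assumptions carried over from the surrounding text.

```python
import numpy as np

def build_walking_sequence(per_frame_keypoints, frame_width):
    """Stack per-frame 17-joint poses into Z of shape (n, 18, 2).

    per_frame_keypoints: list of n arrays of shape (17, 2) holding (x, y)
    pixel coordinates, assumed to follow the COCO ordering (index 0 = nose).
    """
    poses = []
    for kps in per_frame_keypoints:
        kps = np.asarray(kps, dtype=np.float32)
        # Duplicate the nose coordinate as an 18th joint for symmetry.
        pose = np.concatenate([kps, kps[0:1]], axis=0)        # (18, 2)
        # Minimal preprocessing: normalize coordinates by the frame width.
        poses.append(pose / float(frame_width))
    return np.stack(poses, axis=0)                             # (n, 18, 2)
```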
3.1. Gait Pyramid Transformer

A high-level overview of the proposed GaitPT architecture is shown in Figure 1. Our model operates in a hierarchical manner by first computing the movement of individual joints, followed by individual limbs, then groups of limbs, and finally the full body. This approach introduces an inductive bias to the architecture, enabling the model to capture anatomically informed unique walking patterns. The GaitPT architecture builds upon the spatial and temporal attention-based modules from the work of Plizzari et al. [34]. However, while their architecture only computes interactions at the joint level, our model uses a hierarchical approach similar to that of PVT [51] to capture both local and global unique gait features. The main building blocks of GaitPT are the Spatial Encoder, the Temporal Encoder, and the Joint Merging module.

Figure 1. Overview of the GaitPT architecture. The model uses spatial and temporal attention to incrementally learn the natural motion of the human body. In Stage 1 it computes the spatio-temporal interactions at the joint level, in Stage 2 at the limb level, in Stage 3 at the group-of-limbs level, and in the final Stage it computes the interaction between full skeletons.

Spatial Attention computes the interactions between the joints of individual skeletons by considering each pose as a separate sequence. Given the input sequence Z = {p_1, p_2, ..., p_n}, spatial attention applies the standard scaled dot-product attention [49] to each p_i, i = 1..n. This is done with a Reshape operation that transforms Z into a list of individual poses. Figure 2 shows a visualization of the interactions computed by the spatial encoder of the GaitPT architecture across all stages. In Stage 4, no spatial attention is performed, as the partitioning at that level is the full body.

Figure 2. Visualization of the Spatial Attention across all stages in the GaitPT architecture. Spatial Attention is applied across multiple joints / limbs of the body in the same time step. Spatial Attention is performed at the joint level in Stage 1, at the limb level in Stage 2, and at the level of groups of limbs in Stage 3.

Temporal Attention computes the relation between the same feature vector at different time steps. Given a walking sequence Z = {p_1, p_2, ..., p_n}, where p_i = {j_i^1, j_i^2, ..., j_i^m} (m is the number of feature vectors in a pose, which differs based on the stage), temporal attention applies the same attention computation to sequences of the form z = {j_1^i, j_2^i, ..., j_n^i}, where i = 1..m. In other words, it receives as input sequences of the same feature vector at all the time steps of the gait sequence. This is also made possible with a Reshape operation that transforms the output of the Spatial Attention module into lists that describe the evolution of individual embeddings (corresponding to joints or groups of joints) in time. Figure 3 shows a visualization of the interactions computed across time steps between the same feature vectors.

Figure 3. Visualization of the Temporal Attention across all stages in the GaitPT architecture. Temporal Attention is applied to the same joints / limbs across different time steps. Temporal Attention is performed at the joint level in Stage 1, at the limb level in Stage 2, at the level of groups of limbs in Stage 3, and at the whole-body level in Stage 4.
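The two Reshape operations used by the spatial and temporal encoders can be pictured as simple tensor views; a minimal sketch, assuming an (n frames, m tokens, d channels) layout, of how the same data is exposed once as per-pose sequences and once as per-token trajectories:

```python
import torch

# A walking sequence after the embedding step: n frames, m tokens per pose, d channels.
n, m, d = 60, 18, 32
z = torch.randn(n, m, d)

# Spatial attention treats each of the n poses as its own sequence of m joint tokens.
spatial_view = z                          # shape (n, m, d): n sequences of length m

# Temporal attention treats each of the m tokens as a sequence of n time steps.
temporal_view = z.permute(1, 0, 2)        # shape (m, n, d): m sequences of length n
```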
The Joint Merging module takes as input a pose or intermediary feature map and combines groups of vectors based on their anatomical relationship. These groups are specifically chosen so that they correspond to joints that naturally work together in human motion. As shown in Figure 3, the merging module initially combines individual joints to create limbs. For instance, the left shoulder is merged with the left elbow and left wrist to form the whole left arm. Subsequently, tokens associated with individual limbs are united, such as the merging of the left and right legs to form the lower body region. Lastly, all remaining tokens are merged to form a single token that encapsulates the information of the entire body.
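A sketch of how such an anatomical grouping could be expressed in code. The joint indices assume the standard COCO ordering with the duplicated nose as token 17, and the concatenate-and-project merge is our own assumption (mean pooling would be an equally simple alternative); the paper does not prescribe this exact operator.

```python
import torch
import torch.nn as nn

# Hypothetical Stage 1 -> Stage 2 grouping, assuming COCO joint order
# (0 nose, 1-4 eyes/ears, 5-10 arms, 11-16 legs, 17 duplicated nose).
LIMB_GROUPS = [
    [0, 1, 2, 3, 4, 17],   # head area
    [5, 7, 9],             # left arm (shoulder, elbow, wrist)
    [6, 8, 10],            # right arm
    [11, 13, 15],          # left leg (hip, knee, ankle)
    [12, 14, 16],          # right leg
]

class JointMerge(nn.Module):
    """Merge each token group into a single token by concatenation
    followed by a linear projection (an assumed merge operator)."""
    def __init__(self, groups, dim_in, dim_out):
        super().__init__()
        self.groups = groups
        self.proj = nn.ModuleList(
            [nn.Linear(len(g) * dim_in, dim_out) for g in groups])

    def forward(self, x):                       # x: (batch, frames, tokens, dim_in)
        merged = [p(x[..., g, :].flatten(-2)) for g, p in zip(self.groups, self.proj)]
        return torch.stack(merged, dim=-2)      # (batch, frames, n_groups, dim_out)
```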
In total, the GaitPT architecture has 4 stages. In the first stage, spatio-temporal interactions are computed at the joint level. In the second stage, after the Merging module, the input token sequences of the spatial and temporal encoders correspond to limbs. The third stage introduces the limb-group level of interaction. In this stage the limbs are combined into the following groups: head area, upper body area, and lower body area. This partitioning scheme is similar to that of GaitPart [14]. However, while their approach splits the image based on manually designed values which may not always yield these 3 parts specifically, our partitioning precisely divides the body into these 3 groups, as it is based on the pose estimation skeleton. We study multiple grouping schemes in Section 4.4 to obtain the most suitable candidate. In the final stage, only temporal attention is performed, at the level of the full body.

Given a sequence of tokens Z_l ∈ R^(n×d) as input to a GaitPT layer, the output Z_{l+1} is computed as:

    Z'_l    = Reshape(JointMerge(Z_l))
    Z''_l   = Reshape(SpatialAttention(Z'_l))          (1)
    Z_{l+1} = Reshape(TemporalAttention(Z''_l))

where SpatialAttention denotes the spatial features extracted by the Spatial Encoder module and TemporalAttention the temporal features extracted by the Temporal Encoder. The Spatial and Temporal Attention operations are the standard scaled dot-product attention [49]. The Reshape operation is utilized to obtain the correct token sequence for the corresponding encoder or for the output of the layer. In line with ViT architectures [13, 48, 51], we incorporate an extra class token for each encoder to obtain an embedding that captures discriminative features. The class outputs obtained from each stage are aggregated and projected with a linear layer to a lower dimension to obtain the final embedding.
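Putting the pieces together, one stage of the hierarchy could be sketched as below. Using stock PyTorch transformer encoders (3 blocks with feed-forward width 4×C, as in Section 3.2) mirrors the description, but the concrete module layout is an assumption rather than the reference implementation; class tokens are omitted for brevity.

```python
import torch.nn as nn

class GaitPTStage(nn.Module):
    """One hierarchical stage: JointMerge -> spatial attention -> temporal attention."""
    def __init__(self, merge, dim, heads=4, blocks=3):
        super().__init__()
        encoder = lambda: nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                       dim_feedforward=4 * dim,
                                       batch_first=True),
            num_layers=blocks)
        self.merge = merge        # e.g. the JointMerge module above, projecting to `dim` channels
        self.spatial = encoder()
        self.temporal = encoder()

    def forward(self, z):                      # z: (batch, frames, tokens, dim_in)
        z = self.merge(z)                      # Z'_l: tokens are now grouped
        b, n, m, d = z.shape
        # Reshape so that each pose is a sequence for the spatial encoder.
        z = self.spatial(z.reshape(b * n, m, d)).reshape(b, n, m, d)          # Z''_l
        # Reshape so that each token trajectory is a sequence for the temporal encoder.
        z = z.permute(0, 2, 1, 3).reshape(b * m, n, d)
        z = self.temporal(z).reshape(b, m, n, d).permute(0, 2, 1, 3)          # Z_{l+1}
        return z
```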
3.2. Implementation Details

To obtain gait embeddings that capture discriminative features of walking sequences, we employ the triplet loss objective [38] with a margin of 0.02 to train the GaitPT architecture. The training objective takes as input an embedding triplet consisting of an anchor embedding (a), a positive embedding (p), and a negative embedding (n), yielding a triplet (a, p, n). The objective maximizes the distance between the anchor and negative embeddings, while minimizing the distance between the anchor and positive embedding, constructing a euclidean space where inference can be done based on nearest neighbors. The triplet loss is defined as L(a, p, n) = max(d(a, p) - d(a, n) + m, 0), where m is the margin which enforces a minimum distance between positive and negative embeddings and d is the distance function.
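This matches the standard triplet margin objective; a minimal sketch with PyTorch's built-in loss, where the batch of anchor/positive/negative embeddings is assumed to be sampled from the identity labels in the usual way:

```python
import torch
import torch.nn as nn

triplet_loss = nn.TripletMarginLoss(margin=0.02, p=2)   # Euclidean distance, margin 0.02

# anchor, positive, negative: embeddings of shape (batch, 256)
anchor, positive, negative = torch.randn(3, 16, 256).unbind(0)
loss = triplet_loss(anchor, positive, negative)
```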
Each stage linearly projects the input feature vectors to a higher dimension. After Stage 1, the embedding dimension C is equal to 32 and is multiplied by a factor of 2 in every following stage. The spatial and temporal encoders at each stage consist of 3 transformer encoder blocks, each with an internal feed-forward dimension of 4 × C. The final output of the architecture is an embedding of size 256.

The training is performed using the AdamW optimizer [30] and a cyclic learning rate schedule [42], which starts from a minimum of 0.0001 and reaches a maximum of 0.01, with an exponential decay rate of 0.995 and a step size of 15. We used a single NVIDIA A100 with 40GB of VRAM for our experiments. A complete training run takes 1.5h for CASIA-B, 12h for GREW and 3h for Gait3D. In total, GaitPT has 4M parameters.
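These hyperparameters map onto standard PyTorch components; a sketch in which the placeholder model and the interpretation of the step size as scheduler steps per half-cycle are our own assumptions:

```python
import torch
from torch import optim

model = torch.nn.Linear(36, 256)     # placeholder for the GaitPT network
optimizer = optim.AdamW(model.parameters(), lr=1e-4)
scheduler = optim.lr_scheduler.CyclicLR(
    optimizer, base_lr=1e-4, max_lr=1e-2,
    step_size_up=15, mode="exp_range", gamma=0.995,
    cycle_momentum=False)            # AdamW has no momentum buffer to cycle

for step in range(100):              # training-loop skeleton
    optimizer.zero_grad()
    loss = model(torch.randn(8, 36)).pow(2).mean()
    loss.backward()
    optimizer.step()
    scheduler.step()                 # one scheduler step per optimizer step
```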
4. Experiments and Results

To evaluate the performance of our model, we chose two different scenarios: a controlled laboratory environment and a more realistic, "in-the-wild" scenario. The CASIA-B benchmark [60] is used to evaluate our model against other state-of-the-art approaches in gait recognition for controlled scenarios, while GREW [63] and Gait3D [62] are used to test whether our architecture generalizes to unconstrained environments. CASIA-B, GREW, and Gait3D are available upon request, and can be accessed through a release agreement.

4.1. Evaluation in Controlled Scenarios

CASIA-B consists of walking video data from 124 unique subjects. Each subject has 6 normal (NM) walking sessions, 2 sessions while carrying a bag (BG), and 2 sessions while wearing a coat (CL). Each session consists of 11 videos recorded with multiple synchronized cameras to obtain various walking angles. In total, each subject has 110 videos. We follow the same partitioning of train and test as Teepe et al. [47], where the first 74 subjects are utilized for training while the remaining 50 for testing. The evaluation protocol requires that the first 4 normal sessions (NM#1-4) are utilized as the gallery samples. The remaining normal sessions (NM#5-6) and all the bag carrying (BG#1-2) and coat wearing (CL#1-2) scenarios form 3 separate probe sets. Gallery-probe pairs are constructed based on each individual angle, excluding the same-view scenario. For each gallery-probe pair, the accuracy of the model is measured individually, and the reported result for each specific probe angle is the average accuracy across all the gallery angles. We do minimal preprocessing, consisting of normalizing the pose estimation coordinates by the width of the video frames. We also remove sequences which have less than 60 total frames, in line with the work of Teepe et al. [46, 47].
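The cross-view protocol can be summarized as follows; a sketch assuming nearest-neighbor matching of embeddings in Euclidean space, with gallery and probe sets organized as dictionaries keyed by view angle (this data layout is our assumption):

```python
import numpy as np

def rank1(gallery_emb, gallery_ids, probe_emb, probe_ids):
    # Nearest-neighbor matching in Euclidean space; ids are NumPy arrays.
    d = np.linalg.norm(probe_emb[:, None] - gallery_emb[None], axis=-1)
    return float(np.mean(gallery_ids[d.argmin(axis=1)] == probe_ids))

def casia_b_accuracy(gallery, probe):
    """gallery / probe: dict mapping view angle -> (embeddings, subject_ids)."""
    per_probe_angle = {}
    for pa, (pe, pi) in probe.items():
        accs = [rank1(ge, gi, pe, pi)
                for ga, (ge, gi) in gallery.items() if ga != pa]   # exclude identical views
        per_probe_angle[pa] = float(np.mean(accs))                 # average over gallery angles
    return per_probe_angle
```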
We present a comparative analysis of our model with other state-of-the-art skeleton-based approaches [10, 22, 25, 46, 47] that use qualitatively different types of architectures to model skeleton sequences. PoseGait [25] utilizes a simple MLP at the skeleton level and aggregates information across time. JointsGait [22] and GaitGraph [47] are two graph-based models that utilize as a backbone network an ST-GCN [58] and a ResGCN [43], respectively. GaitGraph2 [46] is a subsequent improvement to GaitGraph, which incorporates pre-computed features from the skeleton information such as the motion velocity, the bone length and the bone angle. GaitFormer [10] is the only other transformer-based architecture for gait recognition, but it only utilizes temporal attention. The authors evaluate their model on a different CASIA-B split than our benchmark; we utilize the open-source implementation and train the model under the same conditions as GaitPT.

In Table 1, we showcase results on the CASIA-B dataset, with methods following the same evaluation protocol. Our proposed architecture, GaitPT, surpasses the previous state-of-the-art, GaitGraph [47], which utilizes graph convolutions for spatio-temporal feature extraction. GaitFormer [10] lags behind the graph-based methods, proving that spatial attention is a crucial component in gait recognition. Our results demonstrate the effectiveness of a hierarchical approach to motion understanding in the context of gait recognition in controlled scenarios.

Probe angles: 0°, 18°, 36°, 54°, 72°, 90°, 108°, 126°, 144°, 162°, 180° | Mean

NM#5-6
PoseGait [24]:    48.5  62.7  66.6  66.2  61.9  59.8  63.6  65.7  66  58  46.5 | 60.5
GaitFormer [10]:  71.0  74.7  77.5  77.1  77.4  77.3  78.1  71.5  79.4  74.0  66.5 | 74.9
GaitGraph [47]:   85.3  88.5  91.0  92.5  87.2  86.5  88.4  89.2  87.9  85.9  81.9 | 87.7
GaitGraph2 [46]:  78.5  82.9  85.8  85.6  83.1  81.5  84.3  83.2  84.2  81.6  71.8 | 82.0
GaitPT (Ours):    93.5±0.5  92.0±1.0  94.3±0.6  93.9±1.7  92.5±0.7  92.3±1.3  92.4±0.9  92.9±1.2  92.9±1.4  92.5±1.6  86.5±1.5 | 92.3±0.7

BG#1-2
PoseGait [24]:    29.1  39.8  46.5  46.8  42.7  42.2  42.7  42.2  42.3  35.2  26.7 | 39.6
GaitFormer [10]:  63.2  64.4  66.0  60.7  62.3  62.0  61.2  54.2  55.9  61.3  51.3 | 60.2
GaitGraph [47]:   75.8  76.7  75.9  76.1  71.4  73.9  78.0  74.7  75.4  75.4  69.2 | 74.8
GaitGraph2 [46]:  69.9  75.9  78.1  79.3  71.4  71.7  74.3  76.2  73.2  73.4  61.7 | 73.2
GaitPT (Ours):    83.4±4.0  80.5±3.9  83.4±1.9  82.1±1.4  76.2±2.0  75.7±2.0  78.3±3.0  79.1±3.4  80.8±2.6  82.0±1.5  74.9±3.0 | 79.7±2.5

CL#1-2
PoseGait [24]:    21.3  28.2  34.7  33.8  33.8  34.9  31  31  32.7  26.3  19.7 | 29.8
GaitFormer [10]:  47.1  45.6  44.6  45.3  48.8  46.2  52.3  41.5  43.8  46.0  45.7 | 46.1
GaitGraph [47]:   69.6  66.1  68.8  67.2  64.5  62.0  69.5  65.6  65.7  66.1  64.3 | 66.3
GaitGraph2 [46]:  57.1  61.1  68.9  66  67.8  65.4  68.1  67.2  63.7  63.6  50.4 | 63.6
GaitPT (Ours):    76.0±2.0  77.5±1.5  76.5±3.4  77.6±1.6  73.3±2.8  76.6±1.4  77.0±3.0  75.3±2.0  73.4±3.0  76.3±3.4  74.8±1.8 | 75.8±2.0

Table 1. GaitPT comparison to other skeleton-based architectures on CASIA-B. We report the average recognition accuracy (%) for individual probe angles, excluding identical-view cases. For our model, we show the mean and standard deviation computed across 3 runs. GaitPT obtains an average improvement of over 6% mean accuracy compared to the previous state-of-the-art.

4.2. Evaluation In the Wild

As the GaitPT architecture achieves adequate recognition performance in controlled settings, we study its capabilities in more difficult real-world scenarios.
Method | R-1 Acc. (%) | R-5 Acc. (%) | R-10 Acc. (%) | R-20 Acc. (%)
GEINet [40]:      6.82 | 13.42 | 16.97 | 21.01
TS-CNN [53]:     13.55 | 24.55 | 30.15 | 37.01
GaitSet [7]:     46.28 | 63.58 | 70.26 | 76.82
GaitPart [14]:   44.01 | 60.68 | 67.25 | 73.47
PoseGait [24]:    0.23 |  1.05 |  2.23 |  4.28
GaitGraph [47]:   1.31 |  3.46 |  5.08 |  7.51
GaitPT (Ours):  52.16±0.5 | 68.44±0.6 | 74.07±0.5 | 78.33±0.4

Table 2. Comparison between GaitPT and other methods on the GREW benchmark in terms of Rank-1, Rank-5, Rank-10, and Rank-20 recognition accuracy. For our model, we show the mean and standard deviation computed across 3 runs. GaitPT manages to outperform by a significant margin both skeleton-based and appearance-based state-of-the-art methods for in-the-wild scenarios. Table adapted from [63].

GREW [63] is one of the largest benchmarks for gait recognition in the wild, containing 26,000 unique identities and over 128,000 gait sequences. The videos for this dataset were obtained from 882 cameras in public spaces and the labelling was done by 20 annotators for over 3 months. GREW releases silhouettes, optical flow, 3D skeletons, and 2D skeletons for the recorded walking sequences.

We train the GaitPT architecture on the provided 2D skeletons, which are normalized based on the dimensions of the first image in the sequence. Based on the smallest relevant walking sequences of the dataset, we utilize a sequence length of 30 for both training and testing, which is in line with other works in gait recognition [62]. As the GREW authors do not release the labels for the testing set, we report the results obtained on the public leaderboard of the GREW Competition (https://codalab.lisn.upsaclay.fr/competitions/3409#results).

Table 2 displays the comparison between GaitPT and other methods, including both skeleton-based and appearance-based approaches, on the GREW test set in terms of Rank-1, Rank-5, Rank-10, and Rank-20 accuracy. Rank-K accuracy computes the percentage of samples for which the correct label is among the top K predictions made by the model. The results of the other models are taken from the GREW [63] paper. GaitPT outperforms skeleton-based approaches such as PoseGait [24] and GaitGraph [47] by over 50% in Rank-1 accuracy. Moreover, it manages to outperform appearance-based state-of-the-art methods such as GaitSet [7] and GaitPart [14] by approximately 6% and 8% in terms of Rank-1 accuracy. These results demonstrate the capabilities of GaitPT to generalize in unconstrained settings and the fact that skeleton-based data can be advantageous for in-the-wild scenarios.
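For reference, Rank-K accuracy can be computed as in the sketch below, assuming Euclidean distances between probe and gallery embeddings:

```python
import numpy as np

def rank_k_accuracy(gallery_emb, gallery_ids, probe_emb, probe_ids, k=5):
    """Fraction of probes whose true identity appears among the k
    nearest gallery embeddings (Euclidean distance)."""
    dists = np.linalg.norm(probe_emb[:, None] - gallery_emb[None], axis=-1)
    topk = np.argsort(dists, axis=1)[:, :k]          # indices of the k closest gallery samples
    hits = [probe_ids[i] in gallery_ids[topk[i]] for i in range(len(probe_ids))]
    return float(np.mean(hits))
```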
Gait3D [62] is a large-scale dataset obtained in unconstrained settings, consisting of 4,000 unique subjects and over 25,000 walking sequences. The dataset includes 3D meshes, 3D skeletons, 2D skeletons, and silhouette images obtained from all recorded sequences.

We train our architecture using the provided 2D skeletons obtained through HRNet [44], following the same evaluation protocol as Zheng et al. [62], in which 3000 subjects are placed in the training set and the remaining 1000 in the gallery-probe sets. Similarly to the evaluation on CASIA-B, we normalize the 2D skeletons based on the dimensions of the image. In line with the methodology employed by the authors of Gait3D, we utilize a sequence length of 30 during training and testing.

Table 3 displays the results on the Gait3D test set in terms of Rank-1 and Rank-5 accuracy. Results for PoseGait [25] and GaitGraph [47] are taken directly from the Gait3D [62] results benchmark. We train both GaitPT and GaitFormer [10] using this protocol, with the same hyperparameters detailed in Section 3.2. GaitPT obtains an average increase of 6.25% in Rank-1 accuracy and an average improvement of 8.58% in Rank-5 accuracy compared to the previous skeleton-based state-of-the-art. These results demonstrate that our architecture generalizes effectively even in real-world scenarios where accurate recognition is challenging.

Method | R-1 Accuracy (%) | R-5 Accuracy (%)
PoseGait [24]:    0.24 |  1.08
GaitGraph [47]:   6.25 | 16.23
GaitFormer [10]:  6.94 | 15.56
GaitPT (Ours):  13.19±0.7 | 24.14±2.1

Table 3. Comparison between our architecture and other skeleton-based models on the Gait3D benchmark in terms of Rank-1 and Rank-5 recognition accuracy. All methods are trained on the same pose estimation data. For our model, we show the mean and standard deviation computed across 3 runs. GaitPT obtains an average increase of 6.25% in Rank-1 accuracy over the previous state-of-the-art.
4.3. Ablation on GaitPT Stages

To understand the capabilities of the GaitPT architecture and the necessity of each stage, we perform an ablation study in which we train the model with different stages activated. Table 4 displays the results for each stage configuration. The results show that each stage is crucial in modelling increasingly complex movement patterns to achieve good recognition performance. There is a significant (pairwise Welch's t-test p < 0.05) increase in performance after the addition of each stage. As Stage 4 does not have any type of spatial attention by itself, its performance is improved tremendously by the addition of Stage 1, which models spatial interactions. Stages 2 and 3 add incremental performance by increasing the complexity of movement combinations.
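The significance test used here and in Table 4 is a two-sample t-test without the equal-variance assumption; a sketch with SciPy, using hypothetical per-run accuracies rather than the actual run results:

```python
from scipy import stats

# Hypothetical per-run mean accuracies for two stage configurations (illustrative only).
acc_stages_123_4 = [78.1, 80.2, 79.5, 77.4, 79.0, 78.6, 80.0, 78.3, 79.9, 77.5]
acc_stages_12_4  = [76.2, 77.9, 75.4, 78.8, 76.5, 77.1, 75.9, 78.2, 76.8, 79.2]

t, p = stats.ttest_ind(acc_stages_123_4, acc_stages_12_4, equal_var=False)  # Welch's t-test
print(f"t = {t:.3f}, p = {p:.4f}")   # pairwise comparisons use the p < 0.05 threshold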
Stage 1 | Stage 2 | Stage 3 | Stage 4 | Mean Accuracy
  -     |   -     |   -     |   ✓     | 57.42 ± 1.5
  ✓     |   -     |   -     |   ✓     | 73.48 ± 3.8
  ✓     |   ✓     |   -     |   ✓     | 76.80 ± 2.5
  ✓     |   ✓     |   ✓     |   ✓     | 78.85 ± 1.4

Table 4. Ablation study for the GaitPT architecture on CASIA-B. We show mean accuracy for 10 runs, alongside standard deviation. For brevity, we only report the average accuracy across all walking angles and variations. There is a significant (pairwise Welch's t-test p < 0.05) increase in performance with each added stage.

Among all stages in the GaitPT architecture, only the third stage lacks a clear and intuitive approach for merging limb embeddings into groups that naturally work together during movement. Unlike Stage 1, which functions at the joint level, and Stage 2, which operates at the limb level, the limb-group tokens of Stage 3 cannot be easily defined based on human anatomy. Consequently, we conducted experiments with various partitioning schemes that used different strategies for merging limb tokens.

The first scheme groups all embeddings for the head area, upper body, and lower body into three distinct feature vectors. The second scheme also obtains the head-area vector, while combining all joints on the left side of the body into one embedding and the corresponding joints on the right side into another. In the third scheme, we merge opposite limbs, more specifically the left arm with the right leg and the right arm with the left leg, as typically the opposite limbs coordinate with each other in human walking. The final partitioning scheme forces the model to learn interactions between all the combinations described above. Table 5 showcases mean accuracy across 5 runs on the different partitioning schemes that we explored. We found that there are no significant differences (pairwise Welch's t-test p > 0.05) between the variants. We chose to use the first partitioning scheme (Head + Upper Body + Legs) in our experiments to reduce computational time when compared to the "all combinations" partitioning scheme.

Stage 3 Partitioning | Mean Accuracy
Head + Upper Body + Legs:      79.71 ± 1.3
Head + Left + Right:           78.57 ± 0.7
Head + Opposite Limb Pairing:  79.17 ± 1.0
All combinations:              78.48 ± 1.6

Table 5. Performance on CASIA-B for different partitioning schemes for the Stage 3 module in the GaitPT architecture. We report mean and standard deviation for 5 runs. For brevity, we only show the average accuracy across all walking angles and variations. There are no significant differences between partitioning schemes (pairwise Welch's t-test p > 0.05).
4.4. Pose Estimator Effect on Gait Performance

Skeleton-based gait recognition is heavily dependent on the performance of the underlying pretrained pose estimation model. Analysis of the patterns of walking requires precise estimation of joint positions across time, in both training and inference. However, to our knowledge, the performance of the pose estimation model and the corresponding quality of its skeletons have not been analysed in regard to gait recognition accuracy. Currently, there is no consensus on the usage of a pose estimator. For instance, GaitGraph [47] uses HRNet [44], while GaitFormer [10] uses AlphaPose [15]. Some datasets also do not release raw RGB videos [10, 45, 63], and instead release only 2D poses, which can rapidly become obsolete as the pose estimation state of the art improves. We further present an analysis of the effect the underlying pose estimator has on downstream gait recognition performance. Consequently, we obtained skeletons from all the RGB videos in the CASIA-B dataset, using multiple pose estimation models.

For this analysis we selected 4 popular pose estimation architectures: OpenPose [4], AlphaPose [15], HRNet [44] and ViTPose [57], and trained the GaitPT architecture on the data obtained from each model. We selected these models based on their performance levels on the COCO pose estimation benchmark [27], ranging from moderate results (OpenPose) to state-of-the-art performance (ViTPose). More recent pose estimation models typically rely on a bounding box of the individuals in the image to make predictions. By using more accurate bounding boxes, we can achieve more precise joint coordinate predictions. For this reason, we opted to utilize YOLOv3 [37] to obtain inputs for the HRNet model, similarly to Teepe et al. [47], and YOLOv7 [50] for the ViTPose architecture.

Table 6 showcases our results. Across all walking scenarios, the results indicate that higher quality data leads to better recognition accuracy. Remarkably, the accuracy gap between the most inaccurate data, obtained with OpenPose, and the most reliable data, from ViTPose, is over 20% in the normal walking (NM) and bag carrying (BG) scenarios and 25% for the clothing (CL) scenario, which is regarded as the most difficult. This significant disparity highlights the importance of using a pose estimation model that minimizes data noise. The final gait recognition performance of the models is highly correlated (Pearson's r = 0.919) with the performance of the underlying pose estimation architectures on keypoint detection benchmarks such as COCO [27] and MPII [1].
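This correlation can be checked directly from the values reported in Table 6; in the sketch below, averaging the three walking variations for each pose estimator is our own assumption about how the final performance was aggregated, and it yields a coefficient close to the reported r = 0.919:

```python
from scipy import stats

coco_map = [64.2, 72.3, 77.0, 79.8]                  # OpenPose, AlphaPose, HRNet, ViTPose
gait_acc = [(70.8 + 59.1 + 49.4) / 3,                # mean over NM / BG / CL from Table 6
            (72.7 + 60.7 + 55.3) / 3,
            (90.0 + 76.6 + 71.8) / 3,
            (92.3 + 79.7 + 75.8) / 3]

r, p = stats.pearsonr(coco_map, gait_acc)
print(f"Pearson r = {r:.3f}")                        # close to the reported r = 0.919
```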
Pose Estimation Model | COCO mAP | Accuracy per Gait Variation (%)
OpenPose [4]               | 64.2 | NM 70.8 ± 1.0, BG 59.1 ± 0.5, CL 49.4 ± 2.1
AlphaPose [15]             | 72.3 | NM 72.7 ± 2.0, BG 60.7 ± 1.7, CL 55.3 ± 3.2
YOLOv3 [37] + HRNet [44]   | 77.0 | NM 90.0 ± 1.8, BG 76.6 ± 3.0, CL 71.8 ± 1.7
YOLOv7 [50] + ViTPose [57] | 79.8 | NM 92.3 ± 0.7, BG 79.7 ± 2.5, CL 75.8 ± 2.0

Table 6. Results of GaitPT trained on skeletal data obtained with different pose estimation models from CASIA-B footage. We report the mean and standard deviation computed across 3 runs. For brevity, we only show the average accuracy across all angles for each walking variation. The final downstream gait recognition accuracy is highly correlated (Pearson's r = 0.919) with the upstream mAP performance of the pretrained pose estimator on the COCO benchmark [27].

5. Conclusions, Limitations and Societal Impact

In this paper we propose GaitPT, a transformer model designed to process skeleton sequences for gait-based person identification. Compared to other works in the area of gait recognition [10, 47], GaitPT uses spatial and temporal attention to process the skeleton sequence in a hierarchical manner, guided by the anatomical configuration of the human skeleton. GaitPT is able to process both micro and macro walking movements, which are crucial for fine-grained gait recognition [26].

We evaluate our architecture on three datasets, corresponding to two scenarios: a controlled laboratory walking scenario on CASIA-B [60], to validate the robustness to walking confounding factors, and realistic, real-world walking scenarios on GREW [63] and Gait3D [62], to test the model's ability to generalize. We obtain state-of-the-art performance in both scenarios by a large margin: on CASIA-B we obtain 82.6% average accuracy (a +6% increase over the previous state-of-the-art [47]), on GREW we obtain 52.16% Rank-1 accuracy (a +5.88% increase over the previous silhouette-based state-of-the-art [7]) and on Gait3D we obtain 13.19% average Rank-1 accuracy (a +6.25% increase over the previous best [10]).

We conduct ablation studies on the most relevant design decisions for GaitPT and prove that each stage in the hierarchical pipeline is necessary to obtain good downstream gait recognition performance. We also show that there is a high correlation between the performance of the underlying pose estimation model and the downstream performance of the gait recognition model.

This work has several limitations. GaitPT is trained and evaluated on three datasets that include only a subset of all walking variations present in the real world. Good performance on walking variations captured in laboratory conditions or certain public places does not imply that GaitPT generalizes to the wide range of walking manners across the general population. The training data does not reflect the real-world diversity of individuals, as CASIA-B, GREW, and Gait3D mostly contain people of Asian descent. Regarding potential negative societal impact, gait-based person identification might be used for surveillance and behaviour monitoring. Developments in gait recognition might enable the identification of individuals without their consent. Nevertheless, the development of gait-processing models might aid in the detection of pathological or abnormal gait [17].
References

[1] Mykhaylo Andriluka, Leonid Pishchulin, Peter Gehler, and Bernt Schiele. 2d human pose estimation: New benchmark and state of the art analysis. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2014. 8
[2] Shefali Arora and MP S Bhatia. A computer vision system for iris recognition based on deep learning. In 2018 IEEE 8th International Advance Computing Conference (IACC), pages 157–161. IEEE, 2018. 1
[3] Dmitry Bryliuk and Valery Starovoitov. Access control by face recognition using neural networks. Institute of Engineering Cybernetics, Laboratory of Image Processing and Recognition, 4, 2002. 1
[4] Z. Cao, G. Hidalgo Martinez, T. Simon, S. Wei, and Y. A. Sheikh. Openpose: Realtime multi-person 2d pose estimation using part affinity fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019. 2, 8
[5] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part I 16, pages 213–229. Springer, 2020. 1
[6] Andy Catruna, Adrian Cosma, and Ion Emilian Radoi. From face to gait: Weakly-supervised learning of gender information from walking patterns. In 2021 16th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2021), pages 1–5. IEEE, 2021. 1
[7] Hanqing Chao, Yiwei He, Junping Zhang, and Jianfeng Feng. Gaitset: Regarding gait as a set for cross-view gait recognition. Proceedings of the AAAI Conference on Artificial Intelligence, 33(01):8126–8133, Jul. 2019. 1, 2, 6, 8
[8] Kai Chen, Jiangmiao Pang, Jiaqi Wang, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jianping Shi, Wanli Ouyang, et al. Hybrid task cascade for instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4974–4983, 2019. 2
[9] Adrian Cosma, Andy Catruna, and Emilian Radoi. Exploring self-supervised vision transformers for gait recognition in the wild. Sensors, 23(5), 2023. 1, 3
[10] Adrian Cosma and Emilian Radoi. Learning gait representations with noisy multi-task learning. Sensors, 22(18):6803, 2022. 1, 2, 3, 5, 6, 7, 8
[11] Adrian Cosma and Ion Emilian Radoi. Wildgait: Learning gait representations from raw surveillance streams. Sensors, 21(24):8387, 2021. 2, 3, 7
[12] Vitor C de Lima, Victor HC Melo, and William R Schwartz. Simple and efficient pose-based gait recognition method for challenging environments. Pattern Analysis and Applications, 24(2):497–507, 2021. 2, 3
[13] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020. 1, 4
[14] Chao Fan, Yunjie Peng, Chunshui Cao, Xu Liu, Saihui Hou, Jiannan Chi, Yongzhen Huang, Qing Li, and Zhiqiang He. Gaitpart: Temporal part-based model for gait recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020. 1, 2, 4, 6
[15] Hao-Shu Fang, Jiefeng Li, Hongyang Tang, Chao Xu, Haoyi Zhu, Yuliang Xiu, Yong-Lu Li, and Cewu Lu. Alphapose: Whole-body regional multi-person pose estimation and tracking in real-time. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022. 1, 2, 8
[16] Yang Fu, Yunchao Wei, Yuqian Zhou, Honghui Shi, Gao Huang, Xinchao Wang, Zhiqiang Yao, and Thomas Huang. Horizontal pyramid matching for person re-identification. In Proceedings of the AAAI conference on artificial intelligence, volume 33, pages 8295–8302, 2019. 2
[17] Elsa J. Harris, I-Hung Khoo, and Emel Demircan. A survey of human gait-based artificial intelligence applications. Frontiers in Robotics and AI, 8, 2022. 1, 8
[18] Kaiming He, Georgia Gkioxari, Piotr Dollar, and Ross Girshick. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Oct 2017. 2
[19] Saihui Hou, Chunshui Cao, Xu Liu, and Yongzhen Huang. Gait lateral network: Learning discriminative and compact representations for gait recognition. In Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm, editors, Computer Vision – ECCV 2020, pages 382–398, Cham, 2020. Springer International Publishing. 2
[20] Anil K Jain and Ajay Kumar. Biometric recognition: an overview. Second generation biometrics: The ethical, legal and social context, pages 49–79, 2012. 1
[21] Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. Supervised contrastive learning. Advances in Neural Information Processing Systems, 33:18661–18673, 2020. 3
[22] Na Li, Xinbo Zhao, and Chong Ma. Jointsgait: A model-based gait recognition method based on gait graph convolutional networks and joints relationship pyramid mapping. arXiv preprint arXiv:2005.08625, 2020. 1, 2, 5
[23] Pei Li, Loreto Prieto, Domingo Mery, and Patrick Flynn. Face recognition in low quality images: A survey. arXiv preprint arXiv:1805.11519, 2018. 1
[24] Rijun Liao, Shiqi Yu, Weizhi An, and Yongzhen Huang. A model-based gait recognition method with body pose and human prior knowledge. Pattern Recognition, 98:107069, 2020. 2, 3, 6, 7
[25] Vítor C de Lima, Victor HC Melo, and William R Schwartz. Simple and efficient pose-based gait recognition method for challenging environments. Pattern Analysis and Applications, 24:497–507, 2021. 2, 5, 7
[26] Beibei Lin, Shunli Zhang, Ming Wang, Lincheng Li, and Xin Yu. Gaitgl: Learning discriminative global-local feature representations for gait recognition. arXiv preprint arXiv:2208.01380, 2022. 1, 2, 8
[27] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision – ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014. 8
[28] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10012–10022, 2021. 1
[29] Ziyu Liu, Hongwen Zhang, Zhenghao Chen, Zhiyong Wang, and Wanli Ouyang. Disentangling and unifying graph convolutions for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 143–152, 2020. 1, 2
[30] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017. 5
[31] Jiwen Lu and Yap-Peng Tan. Gait-based human age estimation. IEEE Transactions on Information Forensics and Security, 5(4):761–770, 2010. 1
[32] Shervin Minaee, Elham Azimi, and Amirali Abdolrashidi. Fingernet: Pushing the limits of fingerprint recognition using convolutional neural network. arXiv preprint arXiv:1907.12956, 2019. 1
[33] Omkar M Parkhi, Andrea Vedaldi, and Andrew Zisserman. Deep face recognition. 2015. 1
[34] Chiara Plizzari, Marco Cannici, and Matteo Matteucci. Spatial temporal transformer network for skeleton-based action recognition. In Pattern Recognition. ICPR International Workshops and Challenges: Virtual Event, January 10–15, 2021, Proceedings, Part III, pages 694–701. Springer, 2021. 3
[35] Salil Prabhakar, Sharath Pankanti, and Anil K Jain. Biometric recognition: Security and privacy concerns. IEEE security & privacy, 1(2):33–42, 2003. 1
[36] Tanmay Randhavane, Uttaran Bhattacharya, Kyra Kapsaskis, Kurt Gray, Aniket Bera, and Dinesh Manocha. Learning perceived emotion using affective and deep features for mental health applications. In 2019 IEEE International Symposium on Mixed and Augmented Reality Adjunct (ISMAR-Adjunct), pages 395–399, 2019. 1
[37] Joseph Redmon and Ali Farhadi. Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018. 8
[38] Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 815–823, 2015. 4
[39] Alireza Sepas-Moghaddam and Ali Etemad. Deep gait recognition: A survey. IEEE transactions on pattern analysis and machine intelligence, 45(1):264–284, 2022. 1
[40] Kohei Shiraga, Yasushi Makihara, Daigo Muramatsu, Tomio Echigo, and Yasushi Yagi. Geinet: View-invariant gait recognition using a convolutional neural network. In 2016 international conference on biometrics (ICB), pages 1–8. IEEE, 2016. 6
[41] Jasvinder Pal Singh, Sanjeev Jain, Sakshi Arora, and Uday Pratap Singh. Vision-based gait recognition: A survey. Ieee Access, 6:70497–70527, 2018. 1
[42] Leslie N Smith. Cyclical learning rates for training neural networks. In 2017 IEEE winter conference on applications of computer vision (WACV), pages 464–472. IEEE, 2017. 5
[43] Yi-Fan Song, Zhang Zhang, Caifeng Shan, and Liang Wang. Stronger, faster and more explainable: A graph convolutional baseline for skeleton-based action recognition. In proceedings of the 28th ACM international conference on multimedia, pages 1625–1633, 2020. 3, 5
[44] Ke Sun, Bin Xiao, Dong Liu, and Jingdong Wang. Deep high-resolution representation learning for human pose estimation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5693–5703, 2019. 2, 6, 7, 8
[45] Noriko Takemura, Yasushi Makihara, Daigo Muramatsu, Tomio Echigo, and Yasushi Yagi. Multi-view large population gait dataset and its performance evaluation for cross-view gait recognition. IPSJ Transactions on Computer Vision and Applications, 10(1):1–14, 2018. 8
[46] Torben Teepe, Johannes Gilg, Fabian Herzog, Stefan Hörmann, and Gerhard Rigoll. Towards a deeper understanding of skeleton-based gait recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1569–1577, 2022. 2, 3, 5, 6
[47] Torben Teepe, Ali Khan, Johannes Gilg, Fabian Herzog, Stefan Hörmann, and Gerhard Rigoll. Gaitgraph: Graph convolutional network for skeleton-based gait recognition. In 2021 IEEE International Conference on Image Processing (ICIP), pages 2314–2318. IEEE, 2021. 1, 2, 3, 5, 6, 7, 8
[48] Hugo Touvron, Matthieu Cord, Alexandre Sablayrolles, Gabriel Synnaeve, and Hervé Jégou. Going deeper with image transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 32–42, 2021. 4
[49] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017. 1, 3
[50] Chien-Yao Wang, Alexey Bochkovskiy, and Hong-Yuan Mark Liao. Yolov7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. arXiv preprint arXiv:2207.02696, 2022. 8
[51] Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In Proceedings of the IEEE/CVF international conference on computer vision, pages 568–578, 2021. 1, 3, 4
[52] Thomas Wolf, Mohammadreza Babaee, and Gerhard Rigoll. Multi-view gait recognition using 3d convolutional neural networks. In 2016 IEEE International Conference on Image Processing (ICIP), pages 4165–4169, 2016. 2
[53] Zifeng Wu, Yongzhen Huang, Liang Wang, Xiaogang Wang, and Tieniu Tan. A comprehensive study on cross-view gait based human identification with deep cnns. IEEE transactions on pattern analysis and machine intelligence, 39(2):209–226, 2016. 6
[54] Zifeng Wu, Yongzhen Huang, Liang Wang, Xiaogang Wang, and Tieniu Tan. A comprehensive study on cross-view
gait based human identification with deep cnns. IEEE
Transactions on Pattern Analysis and Machine Intelligence,
39(2):209–226, 2017. 2
[55] Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar,
Jose M Alvarez, and Ping Luo. Segformer: Simple and
efficient design for semantic segmentation with transform-
ers. Advances in Neural Information Processing Systems,
34:12077–12090, 2021. 1
[56] Chi Xu, Yasushi Makihara, Ruochen Liao, Hirotaka Niit-
suma, Xiang Li, Yasushi Yagi, and Jianfeng Lu. Real-time
gait-based age estimation and gender classification from a
single image. In Proceedings of the IEEE/CVF winter con-
ference on applications of computer vision, pages 3460–
3470, 2021. 1
[57] Yufei Xu, Jing Zhang, Qiming Zhang, and Dacheng Tao. Vit-
pose: Simple vision transformer baselines for human pose
estimation. arXiv preprint arXiv:2204.12484, 2022. 1, 2, 8
[58] Sijie Yan, Yuanjun Xiong, and Dahua Lin. Spatial tempo-
ral graph convolutional networks for skeleton-based action
recognition. In Proceedings of the AAAI conference on arti-
ficial intelligence, volume 32, 2018. 1, 2, 3, 5
[59] Mang Ye, Jianbing Shen, Gaojie Lin, Tao Xiang, Ling
Shao, and Steven CH Hoi. Deep learning for person re-
identification: A survey and outlook. IEEE transactions on
pattern analysis and machine intelligence, 44(6):2872–2893,
2021. 1
[60] Shiqi Yu, Daoliang Tan, and Tieniu Tan. A framework for
evaluating the effect of view angle, clothing and carrying
condition on gait recognition. In 18th International Con-
ference on Pattern Recognition (ICPR’06), volume 4, pages
441–444, 2006. 2, 5, 8
[61] Shaoxiong Zhang, Yunhong Wang, Tianrui Chai, Annan Li,
and Anil K Jain. Realgait: Gait recognition for person re-
identification. arXiv preprint arXiv:2201.04806, 2022. 1
[62] Jinkai Zheng, Xinchen Liu, Wu Liu, Lingxiao He, Cheng-
gang Yan, and Tao Mei. Gait recognition in the wild with
dense 3d representations and a benchmark. In IEEE Confer-
ence on Computer Vision and Pattern Recognition (CVPR),
2022. 2, 5, 6, 7, 8
[63] Zheng Zhu, Xianda Guo, Tian Yang, Junjie Huang, Jiankang
Deng, Guan Huang, Dalong Du, Jiwen Lu, and Jie Zhou.
Gait recognition in the wild: A benchmark. In Proceedings
of the IEEE/CVF International Conference on Computer Vi-
sion, pages 14789–14799, 2021. 2, 5, 6, 8
[64] Zoran Zivkovic. Improved adaptive gaussian mixture model
for background subtraction. In Proceedings of the 17th In-
ternational Conference on Pattern Recognition, 2004. ICPR
2004., volume 2, pages 28–31. IEEE, 2004. 2
