2020-Gait Challenges

Neurocomputing
journal homepage: www.elsevier.com/locate/neucom

Article history: Received 8 August 2019; Revised 27 March 2020; Accepted 31 March 2020; Available online 11 April 2020
Communicated by Dr Iosifidis Alexandros
Keywords: Cross-view gait recognition; View transformations; Spatiotemporal features

Abstract: Gait has emerged as an important biometric feature which is capable of identifying individuals at a distance without requiring any interaction with the system. Various factors such as clothing, shoes, and walking surface can affect the performance of gait recognition. However, cross-view gait recognition is particularly challenging as the appearance of an individual's walk drastically changes with the change in the viewpoint. In this paper, we present a novel view-invariant gait representation for cross-view gait recognition using the spatiotemporal motion characteristics of human walk. The proposed technique trains a deep fully connected neural network to transform the gait descriptors from multiple viewpoints to a single canonical view. It learns a single model for all the videos captured from different viewpoints and finds a shared high-level virtual path to project them onto a single canonical view. The proposed deep neural network is learned only once using the spatiotemporal gait representation and applied to testing gait sequences to construct their view-invariant gait descriptors, which are used for cross-view gait recognition. The experimental evaluation is carried out on two large benchmark cross-view gait datasets, CASIA-B and OU-ISIR large population, and the results are compared with current state-of-the-art methods. The results show that the proposed algorithm outperforms the state-of-the-art methods in cross-view gait recognition.

https://doi.org/10.1016/j.neucom.2020.03.101
© 2020 Elsevier B.V. All rights reserved.
… fitting techniques [16]; and the motion parameters of the subject, e.g., joint angle trajectories [17], rotation patterns of the hip and thigh [19], etc. However, the computation of such parameters requires the localization of the torso, which is difficult from the low-resolution images captured at a distance in real surveillance systems. Although 3D model-based approaches [20,24] have demonstrated better cross-view recognition results due to their view-invariant nature, they require high-resolution images and are computationally expensive [25]. This paper presents an appearance-based solution for cross-view gait recognition. However, it is worth mentioning that the proposed gait representation algorithm is different from the existing appearance-based approaches as it does not require any silhouette segmentation or gait-cycle estimation.

Lately, cross-view gait recognition has received significant research effort due to its applications in surveillance systems. Numerous cross-view gait recognition techniques have been proposed which can be divided into three categories: (1) view-invariant gait descriptors [25–30], (2) construction of 3D gait descriptors [31–33], and (3) view transformation-based features [4,12,13,15,34–38]. The first family of techniques develops a view-invariant gait descriptor by transforming the gait sequences of different views into a common space. These approaches perform well in specific scenarios and are hard to generalize to other cases. Moreover, their feature extraction phase is also disrupted by self occlusion [4]. The approaches in the second category construct a 3D gait descriptor using multiple calibrated cameras. Such approaches perform well in a fully controlled and calibrated multi-camera environment, which is costly and computationally expensive [29]. The approaches in the third category construct a model to learn a mapping/projection of gait sequences perceived from multiple views. The cross-view descriptors are constructed using the learned model. In contrast to the first two categories, the approaches in the third category have demonstrated excellent recognition results. Moreover, they can be directly applied to viewpoints which are significantly different from the side view, e.g., the frontal or back view. However, most of these techniques construct multiple mapping matrices, one for each pair of viewpoints.

The proposed method belongs to the third category. It constructs a deep neural network that learns a single model to transfer the knowledge of gait sequences from multiple viewpoints to one canonical view. Our learning scheme is based on the observation that the gait characteristics of a person from different viewpoints still exhibit a common structure which makes them different from others. Therefore, the gait-related features should be separated from the viewpoint-related features, which is not linearly possible. The proposed method works in three steps. First, a spatiotemporal gait representation is computed directly from the video sequences. Second, a deep neural network is trained which finds a shared high-level virtual path to map the gait descriptors from different viewpoints to a single canonical view. The spatiotemporal gait descriptor of the side-view gait sequences is used as the canonical view. Third, the gallery and the probe sequences are transformed using the trained model in order to obtain their view-invariant gait representation and fed to a subsequent classifier with their respective labels. We used a simple linear support vector machine (SVM) [39] as classifier. The major advantages of the proposed method are:

• Unlike most existing cross-view gait recognition methods, e.g., [4,12,29,36,40,41], which require silhouette segmentation to form a gait representation, the proposed method is based upon a spatiotemporal gait representation which is directly computed from the gait sequences. Thus, our method does not require silhouette segmentation or gait-cycle estimation.
• We learned a single model to map the gait sequences from all viewpoints to the canonical view using a relatively small set of multi-view gait video sequences. Moreover, the proposed network does not require information about the viewpoints or other variations in the gait during the training of the network and at the construction of the cross-view gait descriptors.
• The performance of the proposed algorithm is evaluated on two large benchmark cross-view gait databases: CASIA-B [42] and OU-ISIR large population (OULP) [43]. The recognition results and the comparison with the state-of-the-art techniques confirm the effectiveness of the proposed method. It is worth mentioning here that our algorithm achieved excellent results using a simple linear support vector machine (SVM) [39] as classifier, which demonstrates the strength of the proposed cross-view gait descriptors.

2. Related work

2.1. Gait representations

Numerous methods have been proposed to construct a gait representation from images and gait video sequences. In general, they can be categorized into two groups: model-based approaches and appearance-based approaches.

Model-based methods aim to build a gait representation using human body structure and motion models. Several human body parts and joint positions are tracked over time and used to identify the walkers. Lee et al. [16] proposed modeling the human silhouette structure using seven different ellipses representing the various human body regions. They computed several statistical measurements on these regions over time to form a gait descriptor. The authors in [17] locate the joint positions and compute the joint angle trajectories at these locations to form a gait representation. Chai et al. [18] split the structure of the human body region into three parts, and the variance of these parts over time is combined to obtain a gait feature. In [19], a gait representation using the angular motion of the hip and thigh is presented. Recent studies [10,20] have shown that these approaches highly depend on the localization of the torso and require high-resolution images. They are also sensitive to video quality and are computationally expensive [10].

Appearance-based approaches do not build any structural or motion model; instead they operate on the recorded sequence of gait images directly. Usually, they extract human silhouettes from the images or video and derive various information for gait identification, e.g., construct a template from silhouette images [5,6,8], extract various gait parameters [7,21], or exploit shape [9] or projection analysis [44] on silhouettes. Most of them extract the human silhouettes from the images and combine them over the gait-cycle to obtain a template image which is used for person identification. Among them, the gait energy image (GEI) [5] has been extensively used due to its simplicity and effectiveness. The representation is obtained by averaging the segmented silhouettes of a subject over the gait-cycle. The authors in [6] used the average of the difference images between two adjacent silhouette frames as gait descriptor. Similarly, the authors in [8] used the average of silhouette body contours as gait representation. Goffredo et al. [7] proposed the use of height and width features from the normalized and scaled silhouette region to construct a gait representation. The authors in [21] used a radial basis function (RBF) network and deterministic learning on the height and width ratio and the centroid of the contour to approximate an individual's gait. In [45], a comparative study of different convolutional neural network (CNN) architectures is proposed to recognize the gait.
Different low-level features such as optical flow, gray pixels and depth maps are computed from the gait sequences; these are stacked into spatio-temporal volumes separately and passed to three different CNNs. Finally, their outputs are fused to obtain the gait signature. The authors in [46] proposed a framework for gait recognition using the combination of two CNNs which are modeled in one joint learning procedure and can be trained jointly. Besides, silhouette projection [44], shape analysis [9] and motion information [10,47–49] have also been exploited to construct a gait representation.

The appearance-based approaches are capable of recognizing individuals even from low-resolution images and they are also computationally efficient compared to their counterpart model-based approaches [10,21,50]. However, their recognition accuracy depends heavily on the precise segmentation of the silhouette from the background, which is still a challenging problem in the literature. An inaccurate segmentation of the silhouette shape may disrupt the construction of the gait descriptor and degrade the recognition accuracy [21]. Conversely, the proposed method computes the spatiotemporal motion characteristics of an individual to represent his/her gait. It requires neither silhouette segmentation nor the estimation of the gait-cycle. Similar to our gait representation, the authors in [51,52] computed motion descriptors from densely sampled points in a video sequence, and their higher-level representation is obtained using histogram-based techniques. The aforementioned techniques have shown good recognition results when used within the same-view scenario but performed rather poorly in cross-view situations, because the gait recognition performance severely suffers from the appearance variance caused by the view change [4]. In the following section, we review the approaches for cross-view gait recognition.

2.2. Cross-view gait recognition

The existing cross-view gait recognition approaches can be categorized into three groups: view-invariant gait features, 3D gait descriptor-based approaches, and view transformation model-based techniques.

The approaches in the first category construct a view-invariant gait descriptor by transforming the gait sequences of different views into a common space. They can be further categorized into geometry-based [26,27], subspace learning-based [25,28] and metric learning-based [28] approaches. The geometry-based approaches construct a view-invariant gait descriptor using the geometrical properties of the gait sequences. For example, Kale et al. [26] proposed a perspective projection model to obtain side-view gait images of an individual from any arbitrary viewpoint by assuming that the walking person is a 2D planar object in the sagittal plane. The authors in [27] proposed the transformation of motion trajectories from any arbitrary view to a standard plane, and their similarities are compared to identify the individuals. The subspace learning-based approaches learn a joint subspace using the gait features from training data. The view-invariant features for testing sequences are obtained by projecting them on the learned subspace. The authors in [25] employ subspace learning and used direct linear discriminant analysis (DLDA) to create a single projection model for classification. Liu et al. [28] proposed to learn a joint subspace of the gait feature using joint principal component analysis (JPCA) to pair with different view angles. The authors in [15,35] used canonical correlation analysis (CCA) to project each pair of gait sequences into two subspaces with maximal correlation. However, these techniques construct multiple mapping matrices, i.e., one for each pair of viewpoints. The authors in [29] proposed discriminative projection with list-wise constraints and rectification (DPLCR) to measure the similarity of gait features across the views.

The authors in [53,54] proposed a method for activity-based person identification, including walk, using fuzzy vector quantization. They exploited dynemes [55] to estimate the static body information, and the temporal information is preserved by calculating the similarity of each test body pose with all the dynemes. Later, a joint subspace is learned using linear discriminant analysis (LDA) to obtain a view-invariant representation of activities for classification. The method proposed in [54] employed the human body poses from different viewing angles to train a Self Organizing Map (SOM) network to determine the body pose prototypes, which are subsequently used to describe the training actions by calculating the fuzzy similarities between all the prototypes and the human body poses appearing in each training action video. This action representation and its information is exploited to train two feed-forward neural networks for person identification and action recognition, respectively. Metric learning-based approaches compute a weighting vector comprising the similarity score related to each feature, which is used to estimate the recognition score. The authors in [30] proposed the use of the pairwise RankSVM algorithm [56] to improve gait recognition under several variations, such as view, clothing and carrying. The methods in [57,58] learned the transformation matrices using a pair of gait features from two different viewpoints rather than learning a single mapping for all the viewpoints. Similarly, the technique in [59] proposed a tensor representation framework which employed a coupled metric learning technique for cross-view gait recognition. They extract Gabor-based representations from GEIs of different views and project them to a common subspace for recognition. In [4,60], a deep CNN using the GEI is proposed to measure the similarity between the gait features perceived from different viewpoints. Yan et al. [12] employed GEI with a CNN to predict multiple attributes for cross-view gait recognition. These approaches perform well for limited scenarios, particularly when the view change is not large, but usually they are hard to generalize to other cases and their feature extraction phase is also disrupted by self occlusion [4].

The approaches in the second category assume that temporally synchronized images of a walking subject are available from multiple cameras. They construct a 3D gait descriptor using multi-view synchronized images. Bodor et al. [31] proposed a 3D visual hull model to construct the view-invariant gait features using the input images from multiple cameras. In [32], a 3-D linear model is proposed to construct a view-independent gait feature using Bayesian rules. Zhao et al. [33] constructed a 3D human model for gait recognition using the video sequences captured from multiple viewpoints. The 3D gait descriptor-based techniques require a costly setup of multiple calibrated cameras and they are computationally expensive. Moreover, they can only be used in a controlled environment and therefore they are considered unsuitable for real applications [25].

View transformation model (VTM) based approaches learn a mapping or a transformation relationship among the gait features perceived from different viewpoints. Later, the learned relationship is used to construct the cross-view gait descriptors prior to measuring their similarity. These approaches can deal with view variations without relying on multiple cameras or camera calibration. Makihara et al. [38] proposed a singular value decomposition (SVD) based VTM to project frequency-domain based features from one view to another. Kusakunniran et al. [36] applied LDA for optimizing the gait features and to train a VTM for each pair of views. Instead of using SVD, [37] computes local motion gait features and builds VTMs using support vector regression (SVR). Hu et al. [13] proposed a gait feature known as enhanced Gabor gait (EGG) which uses a non-linear mapping to encode the statistical and structural characteristics of gait across the views. They also exploit regularized local tensor discriminant analysis (RLTDA) to capture the nonlinear manifolds. In [40], a VTM using GEI and principal component analysis is proposed for view-invariant feature extraction.
Kusakunniran et al. [41] considered the VTM construction as a regression problem by adopting GEI and a Multi-Layer Perceptron (MLP) network to seek the motion information from the source view, which is used to estimate the gait in the target view. The authors in [34] proposed a unitary linear projection to construct a cross-view gait descriptor. The technique proposed in [61] used a deep learning-based method to transform the gait descriptors of multiple viewpoints to a canonical view. They used a side-view GEI as the canonical view and generative adversarial networks (GAN) as a regressor. In [62], a gait-related loss function is proposed for deep learning-based cross-view recognition techniques to compute the discriminative features. It used a spatial transformer network to localize the horizontal parts of the walker, and a long short-term memory model to encode their temporal attention. The method in [63] exploited a deep learning model to compute the cross-view gait representation using a set of independent frames. It extracted local features from each silhouette and aggregated them into a single set-level feature which is mapped into a discriminative space to obtain the final representation.

In contrast to the previous two categories, VTM-based approaches have shown excellent recognition accuracy and can be directly applied to viewpoints which are significantly different from the side view [61]. They are applicable to both cross-view and multi-view gait recognition problems and they do not require multiple calibrated cameras either. Moreover, they are computationally fast and therefore suitable for real-time applications [41]. However, the majority of existing methods in this category, e.g., [41,61], use silhouettes of walkers to construct the gait features and build a VTM to transform the gait features from one view to another. Other VTM-based methods, e.g., [38,40,41], learned multiple transformations, one for each pair of viewpoints. That is, they can only transform from one specific viewpoint to another view. In contrast to the existing techniques, the proposed method does not require silhouette segmentation and trains a deep network to learn a single model for the transformation of knowledge from multiple viewpoints to a single canonical view; later this model is used to construct the cross-view gait descriptors to identify the walker.

Preliminary results of this research are published in [64]. In this paper a number of improvements over [64] are proposed, including a detailed literature review of the state-of-the-art gait recognition techniques and their classification into various categories, and a comprehensive explanation of the proposed architecture. More extensive experimental evaluations are performed. Two large benchmark cross-view gait databases are used to assess the performance of the proposed algorithm. Moreover, the recognition results are compared with several well-known gait recognition techniques.

3. The proposed technique

In the proposed gait recognition algorithm, a spatiotemporal gait representation [47] is computed from the video sequences. Then, a deep neural network is constructed to learn the transformation of the spatiotemporal gait features from different source viewpoints to the canonical view. The learned model is used to construct the cross-view gait descriptors which are fed to the subsequent classifier.

3.1. Spatiotemporal based gait representation

The proposed method constructs a cross-view gait descriptor using the spatiotemporal features of gait to characterize the distinct motion information of individuals. We recall that these features are extracted directly from the video sequences within a space-time volume [65] and do not involve any human-body segmentation from the background. We computed several local motion descriptors such as the Motion Boundary Histogram (MBH) [66] and the Histogram of Oriented Gradients (HOG) [67] from the video sequences to construct a gait representation. The HOG represents the person's static appearance by capturing the magnitude and gradient information from the gait images. Similarly, the MBH is constructed by computing the gradients over the horizontal and the vertical components of the optical flow and encodes the relative motion information between the pixels along the respective axis. The evaluation of various motion descriptors demonstrated that HOG and MBH together outperform the other features for gait recognition [68]. Since the HOG captures the person's static appearance and the MBH highlights the information about the changes in the optical flow field (i.e., motion boundaries), when used collectively they have a greater impact in identifying a person using his/her appearance and local motion characteristics. Hence, we selected HOG and MBH as local descriptors for gait recognition.
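To make the descriptor-extraction step concrete, the following is a minimal sketch of how per-frame HOG and MBH features can be computed with OpenCV. It assumes simple frame-pair processing; the authors' actual pipeline samples the descriptors within a space-time volume along trajectories [65,66], so the function name, frame size and sampling strategy below are illustrative only.

```python
# Minimal sketch of HOG + MBH local descriptor extraction (OpenCV + NumPy).
# NOT the authors' exact pipeline: it only shows how the two cues are obtained
# from a pair of consecutive frames.
import cv2
import numpy as np

def orientation_histogram(dx, dy, bins=8):
    """Quantize gradient orientations into a magnitude-weighted, normalized histogram."""
    mag = np.sqrt(dx ** 2 + dy ** 2)
    ang = np.arctan2(dy, dx)                     # orientations in [-pi, pi]
    hist, _ = np.histogram(ang, bins=bins, range=(-np.pi, np.pi), weights=mag)
    return hist / (np.linalg.norm(hist) + 1e-12)

def hog_mbh_descriptors(video_path, size=(64, 128)):
    hog = cv2.HOGDescriptor()                    # default 64x128 window, 9 orientation bins
    cap = cv2.VideoCapture(video_path)
    prev, feats = None, []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.resize(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY), size)
        if prev is not None:
            # HOG: static appearance of the current frame
            hog_vec = hog.compute(gray).ravel()
            # MBH: orientation histograms of the spatial gradients of the flow components
            flow = cv2.calcOpticalFlowFarneback(prev, gray, None,
                                                0.5, 3, 15, 3, 5, 1.2, 0)
            fx, fy = flow[..., 0], flow[..., 1]
            mbh_x = orientation_histogram(cv2.Sobel(fx, cv2.CV_32F, 1, 0),
                                          cv2.Sobel(fx, cv2.CV_32F, 0, 1))
            mbh_y = orientation_histogram(cv2.Sobel(fy, cv2.CV_32F, 1, 0),
                                          cv2.Sobel(fy, cv2.CV_32F, 0, 1))
            feats.append((hog_vec, mbh_x, mbh_y))
        prev = gray
    cap.release()
    return feats                                 # one (HOG, MBHx, MBHy) tuple per frame pair
```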
Usually, the local descriptors are used to build a signature (i.e., a feature encoding) to characterize an image or a video sequence. In this work, the local motion descriptors are encoded using Fisher vector [69] encoding and a codebook based on a Gaussian mixture model (GMM). The GMM describes the distribution over the feature space and can be expressed as

p(X | θ) = Σ_{k=1}^{K} w_k N(x | μ_k, Σ_k),   (1)

where K is the number of components (i.e., clusters) in the GMM and θ = {w_k, μ_k, Σ_k | k = 1, 2, ..., K} represents the set of model parameters, with w_k the weight, μ_k the mean vector and Σ_k the covariance matrix of the kth cluster. Moreover, N(x | μ_k, Σ_k) is the D-dimensional Gaussian distribution. Each mixture component is assumed to represent a specific appearance and motion pattern shared by the local descriptors. For a given feature set X = {x_t | t = 1, ..., T}, the soft assignment of data x_t to cluster k is defined as

q_t(k) = w_k N(x_t | μ_k, Σ_k) / Σ_{j=1}^{K} w_j N(x_t | μ_j, Σ_j).   (2)

Eq. (2) assigns the local descriptor to multiple clusters in a weighted manner using the posterior component probability given by the descriptors. We used one million randomly selected local descriptors from the training set to build a codebook with the GMM. The number of components K in the GMM is empirically computed and set to 2^8. The Fisher vector (FV) representation consists of the deviation of the local descriptors from the generative model. This deviation can be computed using the gradient of the descriptor log-likelihood with respect to the model parameters θ [69]. That is, the gradient vectors with respect to the mean μ_k and the covariance Σ_k are computed as:

u_k = (1 / (T √w_k)) Σ_{t=1}^{T} q_t(k) (x_t − μ_k) / σ_k,   (3)

v_k = (1 / (T √(2 w_k))) Σ_{t=1}^{T} q_t(k) [ (x_t − μ_k)^2 / σ_k^2 − 1 ].   (4)

The Fisher encoding for the set of local descriptors X is computed by concatenating u_k and v_k for all k = 1, 2, ..., K components. That is,

f = [u_1, v_1, u_2, v_2, ..., u_K, v_K].

The MBHx, MBHy and HOG descriptors are encoded as described above and they are fused using representation-level fusion [70]. The length of each spatiotemporal gait feature is set to 2000 using principal component analysis (PCA). The proposed gait features demonstrated excellent performance in lateral-view gait recognition [47], which encouraged us to choose this gait representation to construct the cross-view gait descriptor.
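The encoding of Eqs. (1)-(4) can be sketched as follows with a diagonal-covariance GMM from scikit-learn. The codebook size (K = 2^8) and the subsequent 2000-dimensional PCA reduction follow the values reported above; the function names and the final L2 normalization are illustrative assumptions rather than the authors' exact implementation.

```python
# Sketch of the Fisher-vector encoding of Eqs. (1)-(4), assuming scikit-learn and NumPy.
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_codebook(training_descriptors, K=2 ** 8, seed=0):
    """Fit a diagonal-covariance GMM codebook on sampled local descriptors, Eq. (1)."""
    gmm = GaussianMixture(n_components=K, covariance_type="diag", random_state=seed)
    gmm.fit(training_descriptors)            # rows = local descriptors (e.g., HOG/MBH)
    return gmm

def fisher_vector(descriptors, gmm):
    """Encode a descriptor set X = {x_t} of one video into f = [u_1, v_1, ..., u_K, v_K]."""
    X = np.atleast_2d(descriptors)
    T = X.shape[0]
    q = gmm.predict_proba(X)                 # soft assignments q_t(k), Eq. (2)
    w, mu, var = gmm.weights_, gmm.means_, gmm.covariances_
    sigma = np.sqrt(var)                     # per-dimension standard deviations
    parts = []
    for k in range(gmm.n_components):
        diff = (X - mu[k]) / sigma[k]
        u_k = (q[:, k, None] * diff).sum(axis=0) / (T * np.sqrt(w[k]))               # Eq. (3)
        v_k = (q[:, k, None] * (diff ** 2 - 1)).sum(axis=0) / (T * np.sqrt(2 * w[k]))  # Eq. (4)
        parts.extend([u_k, v_k])
    f = np.concatenate(parts)
    return f / (np.linalg.norm(f) + 1e-12)   # L2 normalization (an illustrative choice)
```

The MBHx, MBHy and HOG encodings would then be concatenated (representation-level fusion [70]) and reduced to 2000 dimensions, e.g., with sklearn.decomposition.PCA.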
Fig. 1. Training of the proposed non-linear deep neural network (NDNN). (a) The gait video sequences observed from unknown viewpoints are transformed to the canonical view. (b) The proposed network forces the m different virtual paths to learn a single, high-level and shared virtual path (dotted line) which connects all the source viewpoints to the canonical view.

Fig. 2. Construction of the cross-view descriptor and classification using a linear SVM. The output of the last layer is used as the cross-view gait descriptor.

… the training process. It computes the element-wise difference and uses its squared term if the error falls below a threshold, which is set to 1, and the absolute L1 distance otherwise. It is less sensitive to outliers compared to the mean squared error, which simply squares the difference and may result in exploding gradients. However, we did not observe any improvement in the training accuracy, which could be due to the nature of the local descriptors used in the gait encoding. These descriptors are normalized with the L2-norm to make sure that they take similar ranges of values and are not too large. We used the mini-batch stochastic gradient descent method through back-propagation to minimize the objective function over all training samples.

3.3. Cross-view gait descriptor and classification

We consider t(X_ij) (Eq. (8)) as the non-linear transformation function which transforms X_ij to its respective canonical view X_cj. In particular, this function provides the canonical view representation of a gait sequence obtained from any unknown viewpoint. It is worth mentioning that the proposed network does not require viewpoint information during the training of the network and at the construction of the cross-view gait descriptors. Therefore, the gait sequences from the testing dataset are propagated through the trained network, and the final output of the network, which consists of a set of non-linear transformations Y_1, Y_2, ..., Y_4 from the source viewpoint to the canonical viewpoint (Figs. 1 and 2), is chosen as the cross-view gait descriptor because it encodes the influence of all these transformations and provides the canonical view representation of the gait regardless of its input view. Let X_ij be the jth testing instance from any unknown ith viewpoint; the final cross-view gait descriptor is constructed as

t(X_ij) = f(W_L (... f(W_2 (f(W_1 X_ij + b_1)) + b_2) ...) + b_L),   (10)

where t(X_ij) ≈ X_cj. For classification, the gait sequences of the gallery set are selected and their cross-view gait descriptors are obtained using the learned NDNN (i.e., by propagating them through the trained network); these are used to train a classifier along with their respective labels. At testing, the cross-view descriptors of the probe sequences are computed using Eq. (10) and fed to the SVM to identify the walker. We used a simple linear SVM [39] for this purpose and it achieved excellent results, which shows the strength and robustness of our deep network. The distribution of the gallery (to train the SVM) and the probe sets is outlined in the experimental evaluation of the respective gait database.
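A compact sketch of this classification stage is given below. It assumes that `ndnn` is the trained transformation model of Section 4.1 exposing a `predict` method and that the gallery labels are subject identities; a liblinear-backed linear SVM [39] is used as in the text, while the variable names and the value of C are illustrative.

```python
# Sketch of the classification stage built on Eq. (10): gallery and probe descriptors are
# pushed through the trained network and a linear SVM is fitted on the gallery set.
import numpy as np
from sklearn.svm import LinearSVC

def cross_view_descriptors(ndnn, gait_features):
    """t(X) for a batch of spatiotemporal gait features, Eq. (10)."""
    return ndnn.predict(np.asarray(gait_features))

def identify_walkers(ndnn, gallery_x, gallery_ids, probe_x):
    gallery_t = cross_view_descriptors(ndnn, gallery_x)   # canonical-view gallery descriptors
    probe_t = cross_view_descriptors(ndnn, probe_x)       # canonical-view probe descriptors
    svm = LinearSVC(C=1.0)                                 # simple linear SVM, as in the text
    svm.fit(gallery_t, gallery_ids)                        # one identity label per gallery sequence
    return svm.predict(probe_t)                            # predicted identity for each probe
```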
4. Experiments and results

The proposed gait recognition algorithm is evaluated on two large benchmark cross-view gait datasets: CASIA-B [42] and OULP [43]. Several experiments are performed and the recognition results are compared with the existing state-of-the-art methods.

4.1. Implementation details

The first step in the proposed algorithm is to compute the gait descriptors. The spatiotemporal gait representation is computed from all the walking video sequences using the algorithm described in Section 3.1. The proposed NDNN is implemented using the Keras framework [75] with Tensorflow [76] as back-end, and trained using the RMSprop variation of the gradient descent algorithm [77] with its default parameters (e.g., the learning rate is initialized with 0.001), for 150 epochs. The weights are updated with a mini-batch of size 33. We used LeakyReLU as the activation function with α = 0.01, and the mean squared error as the loss function for the model. The network is trained using back-propagation with a logistic regression loss. Due to the small number of layers, the network parameters θ = {W_L, b_L | L = 1, 2, ..., 4} are initialized using a simple random initialization method [73]. In particular, all the bias terms b_L are initialized with zero, and the weight matrices W_L are initialized using a Gaussian distribution with zero mean and 0.05 standard deviation. The value of λ_w is set to 0.0001. We used multi-resolution search [73] to find the optimal values of the hyper-parameters of the NDNN. That is, first the parameter values are tested over a larger range and a few best configurations are selected. Then, a narrower search space is explored around these values to select the optimal values in the second step. For example, as suggested by the authors in [78], we evaluated the performance of the network with an increasing number of hidden layers and stopped when a peak performance was obtained on the validation data. It is empirically concluded that increasing the number of hidden layers beyond two did not improve the performance. Similarly, the hidden layer sizes are evaluated in the range [500, 1500]. We train the network in a regular way using a gradient descent based algorithm, and then the gallery and probe gait sequences are propagated through the trained network in order to obtain their cross-view gait representation (i.e., an approximation of the respective canonical view). Our cross-view gait representation is a 2000-dimensional descriptor; the value is selected empirically. We evaluated the performance of the proposed method on the CASIA-B gait dataset using different values of PCA. During classification, the gallery set (θg) contains the gait sequences of viewpoint 90° and the rest of the viewpoints are used in probe sets (θp) separately. The average recognition accuracies across all the viewpoints are presented in Table 1. The experimental evaluation reveals that the best results are achieved when the descriptor dimension is 2000.
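Under the reported settings, the NDNN can be sketched in Keras as follows. The hidden layer width of 1024 (the text only states that two hidden layers with sizes evaluated in [500, 1500] worked best) and the interpretation of λ_w as an L2 weight penalty are assumptions; the activation, initialization, optimizer, loss, epoch and batch settings follow the values given above.

```python
# A minimal Keras sketch of the NDNN: a fully connected network mapping a 2000-dimensional
# gait descriptor from an arbitrary view to the descriptor of the canonical (side) view.
from tensorflow import keras
from tensorflow.keras import layers, regularizers, initializers

def build_ndnn(dim=2000, hidden=1024, lambda_w=1e-4):
    init_w = initializers.RandomNormal(mean=0.0, stddev=0.05)   # N(0, 0.05) weight init
    model = keras.Sequential()
    model.add(keras.Input(shape=(dim,)))
    for _ in range(2):                                           # two hidden layers
        model.add(layers.Dense(hidden,
                               kernel_initializer=init_w,
                               bias_initializer="zeros",
                               kernel_regularizer=regularizers.l2(lambda_w)))
        model.add(layers.LeakyReLU(alpha=0.01))
    model.add(layers.Dense(dim, kernel_initializer=init_w, bias_initializer="zeros"))
    # MSE loss as reported; a smooth-L1 (Huber-like) loss was also tried (Section 3.2).
    model.compile(optimizer=keras.optimizers.RMSprop(learning_rate=0.001), loss="mse")
    return model

# X_any_view: descriptors from all viewpoints; X_canonical: matching 90-degree descriptors.
# ndnn = build_ndnn()
# ndnn.fit(X_any_view, X_canonical, epochs=150, batch_size=33)
```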
Table 1
Proposed gait descriptor dimension selection: recognition results (%) on the CASIA-B gait database using different values of PCA in feature encoding.

Dimension   1500    2000    4000    8000    16,000
Accuracy    75.15   83.20   75.90   64.70   44.80

The computational complexity of the proposed algorithm is computed using the gait sequences of the CASIA-B gait database. We carefully analyzed the computation time of each step involved in the proposed algorithm. The local descriptors are extracted from the entire video sequence at full resolution (i.e., 320 × 240) using 50 frames. This step is computationally the most expensive and requires on average 2.4 sec. The reported time is highly dependent on the frame resolution and the length of the gait sequence, and can be optimized considering these factors. The feature encoding and its cross-view gait representation are obtained in 0.2 sec, and the final classification step took 0.02 sec. The experimental evaluation is carried out on a machine with an Intel i7-6700K CPU, 64GB RAM and an NVidia GTX TITAN X GPU.

4.2. Performance evaluation on CASIA-B dataset

The CASIA-B database comprises the gait video sequences of 124 subjects. The gait sequences are collected in a controlled indoor environment using eleven different viewpoints: 0°, 18°, 36°, ..., 180°. Sample images from each viewpoint are shown in Fig. 3 to demonstrate the appearance changes due to the change in the viewing angle. The database contains ten walk sequences for each subject with three variations, namely: normal walk (nm), walk with bag (bg), and walk with coat (cl). Among these ten sequences, six belong to nm and two to each of bg and cl. Fig. 4 presents sample images from viewing angle 90° to demonstrate the gait under various conditions. The viewpoint 90° is selected as the canonical view during the network training.

Fig. 3. Sample images of the CASIA-B dataset from each viewpoint 0°–180° (left to right).

Cross-view gait recognition on the CASIA-B database is challenging, particularly when the cross-view angle is large. It is even more difficult when the probe and gallery sets belong to different walking scenarios [25]. Similar to the techniques in [4,15], the normal gait sequences of 24 subjects are randomly chosen to train the proposed deep network and the remaining gait sequences of 100 subjects are used to evaluate the performance of the cross-view gait recognition algorithms. In all the experiments, the first four normal walk sequences (i.e., nm1−nm4) of the 100 subjects in the dataset are used to form a gallery and the rest are used in different probe sets. Similar to the recent state-of-the-art techniques [4,25,61], three different types of experiments are performed and the achieved results are compared with the existing techniques.

In the first set of experiments, we evaluated the cross-view gait recognition performance of the proposed NDNN on each view pair. That is, for each experiment the gallery and the probe gait sequences are selected from different viewpoints. We first construct a gallery set (θg) by taking each view iteratively from {0°, 18°, ..., 180°} and construct a probe set (θp) from the rest of the ten views, separately, excluding the identical view in the gallery set. Fig. 5 presents the performance of the proposed NDNN in each experiment. The plots show that our method achieves better performance than the similar GaitGAN [61] algorithm.

Since most existing approaches [4,15,25,29] reported recognition accuracies on the 54°, 90° and 126° views in a probe set, separately, we selected the same views for performance comparison. The recognition results are presented in Table 3, where the gallery (0°–180°) represents the average performance over all gallery viewing angles ranging from 0° to 180° except the identical viewpoint which is used in the probe set. Similarly, the gallery (36°–144°) represents the average performance over all gallery viewpoints ranging from 36° to 144° except the identical viewpoint which is used in the probe set. The results show that the proposed method outperforms the compared methods in most experiments.

Table 3
Comparison of recognition results (%) on the CASIA-B gait database. The results are obtained by averaging the accuracies on all gallery views except the identical view in the probe set. The best results are marked in bold.

θg: nm1−nm4        0°–180°                        36°–144°
θp: nm5−nm6        0°      54°     90°     126°   54°     90°     126°
SVR [37]           −       28      29      34     35      44      45
TSVD [36]          −       39      33      42     49      50      54
Method [15]        46.3    52.4    48.3    56.9   65      67.8    69.7
ViDP [34]          −       59.1    50.2    57.5   83.5    76.7    80.7
EGG-RLTDA [13]     −       −       −       −      69.8    74.4    73.9
JDLDA [25]         −       27.16   25.7    29.9   39      40.9    44.6
GII [29]           −       63      55      62.1   80.1    78      80
GaitGAN [61]       41.9    64.5    58.1    65.7   78.6    77.3    81.7
CNN [4]            54.8    77.8    64.9    76.1   90.8    85.8    90.4
Ours               74.8    80.9    83.2    84     83.9    88.7    85.8

In the second set of experiments, during SVM classification, according to the recommendations of the researchers in [15,28], the gallery set (θg) contains the gait sequences of viewpoint 90° and the rest of the viewpoints are used in probe sets (θp) separately. We recall that the gait sequences of all the different viewpoints are used to train the proposed NDNN, while the viewpoint 90° is selected as the canonical view during the network training.

Table 2
Comparison of recognition results (%) on the CASIA-B gait database with gallery view θg = 90° during SVM classification. The best results are marked in bold.

θp: nm5−nm6      0°      18°     36°     54°     72°     108°    126°    144°    162°    180°
JDLDA [25]       20      25      37      58      94      −       −       −       −       −
MvDA [79]        17      27      36      64      95      −       −       −       −       −
JSL [28]         20.5    35.5    56.5    81.5    96.5    96      89.5    50      34.5    21.5
Method [15]      −       −       42      70      95      96      70      41      −       −
Method [42]      0.4     2.4     4.8     17.7    82.3    82.3    15.3    5.2     3.6     1.2
GEI+TSVD [36]    15      18      22      52      −       90      55      −       20      10
GaitGAN [61]     22.58   37.1    54.8    74.2    98.4    96.8    75.8    57.3    35.5    21.8
Ours             75      74.5    78      88.5    97      98.5    91.5    75.5    74.5    75.5
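Referring back to the first set of experiments (every view used once as gallery, with each remaining view probed separately), the protocol can be summarized by the following sketch. The `descriptors_by_view` dictionary (view angle mapped to features and labels) and the reuse of the `identify_walkers` helper from the earlier sketch are illustrative assumptions.

```python
# Sketch of the cross-view evaluation protocol on CASIA-B (first set of experiments).
import numpy as np

views = list(range(0, 181, 18))             # 0, 18, ..., 180 degrees

def cross_view_accuracy(ndnn, descriptors_by_view):
    results = {}
    for g in views:
        gal_x, gal_ids = descriptors_by_view[g]
        for p in views:
            if p == g:                      # identical view excluded from the probe set
                continue
            probe_x, probe_ids = descriptors_by_view[p]
            pred = identify_walkers(ndnn, gal_x, gal_ids, probe_x)
            results[(g, p)] = np.mean(pred == np.asarray(probe_ids))
    return results                          # rank-1 accuracy for every (gallery, probe) pair
```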
Fig. 4. Sample images from the CASIA-B dataset demonstrating the variations in walk. (a) Normal walk, (b) walk with coat, and (c) walk with bag.

Fig. 5. The recognition accuracy (%) of the proposed NDNN and GaitGAN [61] using the gallery set (θg) from all the 11 views, separately. The probe viewing angles are (a) 0°, (b) 18°, (c) 36°, (d) 54°, (e) 72°, (f) 90°, (g) 108°, (h) 126°, (i) 144°, (j) 162°, and (k) 180°.

The recognition results and the comparison with the existing methods are presented in Table 2. The results show that the proposed method outperforms the state-of-the-art in all experiments except at viewing angle 72°, where GaitGAN [61] performs slightly better than our method. It may be noted that in some cases the proposed NDNN performs exceptionally well compared to the existing methods. For example, it achieves superior performance over the state-of-the-art when the gallery and probe viewpoints are significantly different from each other (e.g., θg = 90° and θp = 0°); see Table 2.

The GaitGAN [61] method exploits the GEI as gait representation and uses a GAN to learn a single VTM for the mapping of different viewpoints. However, its performance degrades heavily when the gallery and probe viewing angles are significantly different from each other, as shown in Fig. 5 and Table 2. The superior performance of the proposed NDNN comes from two factors, an effective gait representation to encode the within-class geometry and the efficient view transformation model, while the competing methods (e.g., [15,36]) incorporate the information from limited viewing angles.
Fig. 6. The recognition accuracy (%) of the proposed NDNN and CNN [4]. (a) and (b) The results are obtained under the similar-walking-condition by averaging the accuracies on all gallery views except the identical view in the probe set. (c) and (d) The results are obtained under the cross-walking-condition.

Fig. 7. Sample images from the OULP gait dataset. (a) 55°, (b) 65°, (c) 75°, and (d) 85°.

… walking sequences of a person carrying a bag during the walk (bg), and the sequences of a walker wearing a coat (cl) under viewing angles of 54°, 90° and 126°, separately. The results are documented in Table 4. The statistics show that our method outperforms the compared methods in most experiments.

Table 4
Comparison of recognition results (%) on the CASIA-B gait database for the bg and cl probe sets at viewing angles 54°, 90° and 126° (see text).

Method           54°          90°          126°
Method [80]      76.4/87.9    73.7/91.1    76.9/86.2
RLTDA [13]       80.8/69.4    76.5/72.1    72.3/64.4
R-VTM [40]       40.7/35.4    58.2/50.3    59.4/61.3
FT-SVD [38]      26.5/19.8    33.1/20.6    38.6/32.0
CNN [4]          92.7/49.7    88.9/75.6    86.0/51.4
Method [42]      24.2/16.5    44.0/27.8    31.9/18.1
GaitGAN [61]     53.2/43.6    62.1/43.6    66.1/41.1
Ours             93/84.5      94.5/88.5    91.5/82.5

We further performed a set of experiments using a different number of instances in the network training, as carried out in [4]. The same configuration of the network parameters is selected as reported in the first set of experiments, except the number of training instances. In particular, the normal walk sequences of 74 subjects are used to train the proposed NDNN and the remaining normal gait sequences of 50 subjects are used to evaluate the performance of the cross-view gait descriptors. We named this experiment the similar-walk-condition [4]. The recognition accuracy of our method increases by 5% when 74 subjects are used for training, while the method in [4] reported an increase of 20% in recognition accuracy. The comparison of recognition results using 24 and 74 subjects in the network training is presented in Fig. 6(a) and (b), respectively. In the case of the cross-walk-condition, as outlined in the third set of experiments, the gallery set comprises the normal gait sequences of viewing angles 36°, 108° and 144°, whereas the probe set consists of walking sequences of a person carrying a bag during the walk (bg), and the sequences of a walker wearing a coat (cl) under viewing angles 54°, 90° and 126°, separately. The recognition results under both conditions (i.e., bg and cl) are presented in Fig. 6(c) and (d), respectively. The proposed method is proven to be robust under the cross-walk-condition experiments and achieves better recognition accuracy than the method in [4]; the difference is up to 20%. The results of these experiments reveal that our method outperforms in most of the experiments under both similar and cross walking conditions, even when the network is trained using a limited number of subjects, i.e., 24. Achieving high detection accuracy under a limited training dataset is no doubt a challenging task and the performance of the proposed method in this scenario is appreciable. Furthermore, another advantage of the proposed method is its simple gait representation which requires neither silhouette segmentation nor gait-cycle estimation, and it can be computed directly from the gait video sequences.

4.3. Performance evaluation on OULP dataset

The OU-ISIR large population (OULP) is the largest cross-view gait database, comprising the gait sequences of more than 4,000 subjects. The gait sequences are recorded in an indoor environment at 30 frames-per-second, under four viewing angles: 55°, 65°, 75° and 85°. Each subject was asked to walk along a course twice in a natural manner. Fig. 7 presents some example images of a walking subject with the four viewing angles. The viewpoint 85° is selected as the canonical view during the network training. The researchers in, e.g., [60,81,82] designed two different types of experiments on this dataset: the same-view experiment and the cross-view experiment. In the same-view experimental setting, the gallery and probe gait sequences belong to the same viewing angle, while in the cross-view setting they belong to different viewing angles. Particularly, we first construct a gallery by picking each view iteratively from {55°, 65°, 75°, 85°} and construct a probe from the rest of the three views, separately.

Table 5
Performance evaluation (%) and comparison with the existing methods on the OULP gait dataset under same- and cross-view settings. The values in parentheses represent the recognition results within the same-view settings. The best results are marked in bold. Note that a hyphen (−) means that either the results are not available or the respective approach cannot be evaluated under those experiments.

Gallery   Method          Probe: 55°   65°      75°      85°
55°       DeepGait [81]   (97.4)       96.1     93.4     88.7
          wQVTM [83]      −            78.3     64.0     48.6
          GEINet [60]     (94.7)       93.2     89.7     79.9
          Method [82]     (95.2)       93.6     81.2     62.2
          PdVS [84]       −            76.2     61.4     45.5
          AVTM [84]       −            77.7     64.5     42.7
          Method [4]      (98.8)       98.3     96.0     80.5
          Ours            (100)        95.1     94.9     97.5
65°       DeepGait [81]   97.3         (97.6)   97.2     95.4
          wQVTM [83]      81.5         −        79.2     67.5
          GEINet [60]     93.7         (95.1)   93.8     90.6
          Method [82]     90.9         (95.3)   95.5     90.2
          PdVS [84]       76           −        77.1     65.5
          AVTM [84]       75.6         −        76.4     62.8
          Method [4]      96.3         (98.9)   97.3     83.3
          Ours            94.8         (100)    95.5     97.5
75°       DeepGait [81]   93.3         97.5     (97.7)   97.6
          wQVTM [83]      70.2         80.0     −        78.2
          GEINet [60]     90.1         94.1     (95.2)   93.8
          Method [82]     77.5         94.4     (96.0)   96.0
          PdVS [84]       60.3         76.2     −        76.5
          AVTM [84]       59.9         74.9     −        76.3
          Method [4]      94.2         97.8     (98.9)   85.1
          Ours            95.1         96.0     (100)    97.7
85°       DeepGait [81]   89.3         96.4     98.3     (98.3)
          wQVTM [83]      51.1         68.5     79.0     −
          GEINet [60]     81.4         91.2     94.6     (94.7)
          Method [82]     55.4         87.1     94.8     (94.7)
          PdVS [84]       40.5         60.6     73.1     −
          AVTM [84]       40.2         61.9     74.3     −
          Method [4]      90.0         96.0     98.4     (98.9)
          Ours            98.0         97.1     97.7     (100)
The OULP dataset is divided into 5 parts and the distribution is publicly available.¹ Each part contains the division of 1912 subjects into two equal disjoint sets. Five 2-fold cross validations are performed in both types of experiments to meet the benchmark protocols as in [60,81–83]. That is, for each part a 2-fold cross validation is adopted in which the gallery and the probe sets are exchanged. In each run, we consider one set as testing and train a network with the remaining set, and vice versa. The average recognition accuracies obtained by the proposed method and the other compared methods are reported in Table 5. The recognition results reveal that our method outperforms the compared methods in most experiments.

One can conclude from the results presented in Tables 2–5 and Figs. 5–6 that the proposed method performed consistently better than the state-of-the-art cross-view gait recognition techniques in many experiments. We observed that the superior performance of the proposed algorithm is due to two factors: the effective gait representation and the efficient view transformation model. The proposed gait representation is based upon the spatiotemporal characteristics of gait; in particular, it comprises two features, one capturing the static appearance of the individual and the other considering the local motion information extracted from the changes in the optical flow field. To support this argument, an investigation study is also carried out to evaluate the performance of the proposed algorithm using the gait representations proposed in [51,52]. The gait features computed using [51,52] are employed as input to the proposed NDNN framework. The experimental evaluation showed that the recognition accuracy of the proposed algorithm is drastically degraded, by up to 26%, when the gait descriptors of [51,52] are used as input to the network. Although these features are also computed using local spatiotemporal descriptors within a space-time volume along the trajectories, several features are exploited to construct the final gait representation. Moreover, the view transformation model is built through a non-linear deep neural network which is able to capture the non-linear manifolds in gait and learns a multi-step virtual path between the source viewpoints and their respective canonical view. Due to these advantages, our method achieved higher cross-view gait recognition accuracies than the existing methods.

5. Conclusion

In this paper, a novel cross-view gait recognition algorithm is proposed using a non-linear deep neural network. Spatiotemporal motion cues of the human walk are extracted and used to construct a gait descriptor. The network is trained using these gait descriptors and transfers the knowledge of an individual's gait sequences from different viewpoints to a single canonical view. The cross-view gait representation of a testing instance is obtained by a forward propagation through the trained network. A simple linear support vector machine is used to classify the cross-view descriptors. The experimental evaluations performed on two benchmark cross-view gait datasets and the comparisons with the state-of-the-art methods confirm the effectiveness of the proposed algorithm.

¹ http://www.am.sanken.osaka-u.ac.jp/BiometricDB/dataset/GaitLP/Benchmarks.html
Declaration of Competing Interest [26] A. Kale, A.R. Chowdhury, R. Chellappa, Towards a view invariant gait recogni-
tion algorithm, in: Proceedings Conference Advanced Video and Signal Based
Surveillance, IEEE, 2003, pp. 143–150.
The authors declare that they have no known competing finan- [27] F. Jean, R. Bergevin, A. Albu, Computing and evaluating view-normalized body
cial interests or personal relationships that could have appeared to part trajectories, Image Vis. Comput. 27 (9) (2009) 1272–1284.
influence the work reported in this paper. [28] N. Liu, J. Lu, Y. Tan, Joint subspace learning for view-invariant gait recognition,
IEEE Signal Process. Lett. 18 (7) (2011) 431–434.
[29] Z. Zhang, et al., Gii representation-based cross-view gait recognition by dis-
CRediT authorship contribution statement criminative projection with list-wise constraints, IEEE Trans. Cybern. (2017).
[30] R. Martín-Félez, T. Xiang, Gait recognition by ranking, in: ECCV, Springer, 2012,
pp. 328–341.
Muhammad Hassan Khan: Conceptualization, Methodology,
[31] R. Bodor, et al., View-independent human motion classification using im-
Investigation, Writing - original draft, Writing - review & editing. age-based reconstruction, Int. J. Comput. Vis. 27 (8) (2009) 1194–1206.
Muhammad Shahid Farid: Methodology, Investigation, Writing - [32] Z. Zhang, N.F. Troje, View-independent person identification from human gait,
Neurocomputing 69 (1–3) (2005) 250–256.
original draft, Writing - review & editing. Marcin Grzegorzek: Con-
[33] G. Zhao, et al., 3d gait recognition using multiple cameras, in: Proceedings
ceptualization, Writing - review & editing. of the International Conference Automatic Face and Gesture Recognition, IEEE,
2006, pp. 529–534.
References [34] M. Hu, et al., View-invariant discriminative projection for multi-view
gait-based human identification, IEEE Trans. Inf. Forensics Secur. 8 (12) (2013)
[1] M. Ye, C. Liang, Y. Yu, Z. Wang, Q. Leng, C. Xiao, J. Chen, R. Hu, Person reidenti- 2034–2045.
fication via ranking aggregation of similarity pulling and dissimilarity pushing, [35] K. Bashir, T. Xiang, S. Gong, Cross view gait recognition using correlation
IEEE Trans. Multimed. 18 (12) (2016) 2553–2566. strength, in: BMVC, 2010, pp. 1–11.
[2] Y. Sun, M. Zhang, Z. Sun, T. Tan, Demographic analysis from biometric data: [36] W. Kusakunniran, et al., Multiple views gait recognition using view trans-
achievements, challenges, and new frontiers, IEEE Trans. Pattern Anal. Mach. formation model based on optimized gait energy image, in: Proceedings of
Intell. 40 (2) (2018) 332–351. the IEEE International Conference on Computer Vision (ICCV), IEEE, 2009,
[3] J.K. Pillai, M. Puertas, R. Chellappa, Cross-sensor iris recognition through kernel pp. 1058–1064.
learning, IEEE Trans. Pattern Anal. Mach. Intell. 36 (1) (2014) 73–85. [37] W. Kusakunniran, et al., Support vector regression for multi-view gait recogni-
[4] Z. Wu, et al., A comprehensive study on cross-view gait based human identi- tion based on local motion feature selection, in: Proceedings of the IEEE Com-
fication with deep CNNs, IEEE Trans. Pattern Anal. Mach. Intell. 39 (2) (2017) puter Society Conference on Computer Vision and Pattern Recognition (CVPR),
209–226. IEEE, 2010, pp. 974–981.
[5] J. Man, B. Bhanu, Individual recognition using gait energy image, IEEE Trans. [38] Y. Makihara, et al., in: Gait Recognition Using a View Transformation Model in
Pattern Anal. Mach. Intell. 28 (2) (2006) 316–322. the Frequency Domain, Springer, 2006, pp. 151–163.
[6] E. Zhang, Y. Zhao, W. Xiong, Active energy image plus 2DLPP for gait recogni- [39] R.-E. Fan, et al., Liblinear: a library for large linear classification, J. Mach. Learn.
tion, Signal Process. 90 (7) (2010) 2295–2302. Res. 9 (Aug) (2008) 1871–1874.
[7] M. Goffredo, J.N. Carter, M.S. Nixon, Front-view gait recognition, in: Proceed- [40] S. Zheng, et al., Robust view transformation model for gait recognition, in: Pro-
ings of the International Conference Biometrics: Theory, Applications and Sys- ceedings of the International Conference on Image Processing (ICIP), IEEE, 2011,
tems, IEEE, 2008, pp. 1–6. pp. 2073–2076.
[8] C. Wang, et al., Human identification using temporal information preserving [41] W. Kusakunniran, Q. Wu, J. Zhang, H. Li, Cross-view and multi-view gait recog-
gait template, IEEE Trans. Pattern Anal. Mach. Intell. 34 (11) (2012) 2164–2176. nitions based on view transformation model using multi-layer perceptron, Pat-
[9] W. Kusakunniran, et al., Pairwise shape configuration-based PSA for gait recognition under small viewing angle change, in: Proceedings of the International Conference Advanced Video and Signal-Based Surveillance (AVSS), IEEE, 2011, pp. 17–22.
[10] Y. Yang, D. Tu, G. Li, Gait recognition using flow histogram energy image, in: Proceedings of the International Conference on Pattern Recognition (ICPR), 2014, pp. 444–449.
[11] S. Yu, H. Chen, Q. Wang, L. Shen, Y. Huang, Invariant feature extraction for gait recognition using only one uniform model, Neurocomputing 239 (2017) 81–93.
[12] C. Yan, B. Zhang, F. Coenen, Multi-attributes gait identification by convolutional neural networks, in: Proceedings of the International Congress on Image and Signal Processing (CISP), IEEE, 2015, pp. 642–647.
[13] H. Hu, Enhanced Gabor feature based classification using a regularized locally tensor discriminant model for multiview gait recognition, IEEE Trans. Circuits Syst. Video Technol. 23 (7) (2013) 1274–1286.
[14] W. Xu, C. Luo, A. Ji, C. Zhu, Coupled locality preserving projections for cross-view gait recognition, Neurocomputing 224 (2017) 37–44.
[15] W. Kusakunniran, et al., Recognizing gaits across views through correlated motion co-clustering, IEEE Trans. Image Process. 23 (2) (2014) 696–709.
[16] L. Lee, W.E.L. Grimson, Gait analysis for recognition and classification, in: Proceedings of the International Conference Automatic Face and Gesture Recognition, IEEE, 2002, pp. 155–162.
[17] L. Wang, H. Ning, T. Tan, W. Hu, Fusion of static and dynamic body biometrics for gait recognition, IEEE Trans. Circuits Syst. Video Technol. 14 (2) (2004) 149–158.
[18] Y. Chai, et al., A novel human gait recognition method by segmenting and extracting the region variance feature, in: Proceedings of the International Conference Pattern Recognition (ICPR), 4, 2006, pp. 425–428.
[19] D. Cunado, M.S. Nixon, J.N. Carter, Automatic extraction and description of human gait models for recognition purposes, Comput. Vis. Image Underst. 90 (1) (2003) 1–41.
[20] G. Ariyanto, M.S. Nixon, Marionette mass-spring model for 3D gait biometrics, in: Proceedings of the International Conference Biometrics, IEEE, 2012, pp. 354–359.
[21] W. Zeng, C. Wang, F. Yang, Silhouette-based gait recognition via deterministic learning, Pattern Recognit. 47 (11) (2014) 3568–3584.
[22] M.H. Khan, Human Activity Analysis in Visual Surveillance and Healthcare, 45, Logos Verlag Berlin GmbH, 2018.
[23] Y. Makihara, T. Tanoue, D. Muramatsu, Y. Yagi, S. Mori, Y. Utsumi, M. Iwamura, K. Kise, Individuality-preserving silhouette extraction for gait recognition, IPSJ Trans. Comput. Vis. Appl. 7 (2015) 74–78.
[24] H. Yang, S. Lee, Reconstruction of 3D human body pose for gait recognition, in: Proceedings of the International Conference Biometrics, Springer, 2006, pp. 619–625.
[25] J. Portillo-Portillo, et al., Cross view gait recognition using joint-direct linear discriminant analysis, Sensors 17 (1) (2016) 6.
… Pattern Recognit. Lett. 33 (7) (2012) 882–889.
[42] S. Yu, D. Tan, T. Tan, A framework for evaluating the effect of view angle, clothing and carrying condition on gait recognition, in: Proceedings of the Conference Pattern Recognition (ICPR), 4, IEEE, 2006, pp. 441–444.
[43] H. Iwama, et al., The OU-ISIR gait database comprising the large population dataset and performance evaluation of gait recognition, IEEE Trans. Inf. Forensics Secur. 7 (5) (2012) 1511–1521.
[44] D. Tan, et al., Uniprojective features for gait recognition, in: Proceedings of the International Conference Biometrics, Springer, 2007, pp. 673–682.
[45] F.M. Castro, M.J. Marín-Jiménez, N. Guil, N.P. de la Blanca, Multimodal feature fusion for CNN-based gait recognition: an empirical comparison, arXiv:1806.07753 (2018).
[46] C. Song, Y. Huang, Y. Huang, N. Jia, L. Wang, GaitNet: an end-to-end network for gait based human identification, Pattern Recognit. 96 (2019) 106988.
[47] M.H. Khan, M.S. Farid, M. Grzegorzek, Spatiotemporal features of human motion for gait recognition, Signal Image Video Process. 13 (2) (2019) 369–377.
[48] M.H. Khan, M.S. Farid, M. Grzegorzek, Using a generic model for codebook-based gait recognition algorithms, in: Proceedings of the International Workshop Biometrics Forensics (IWBF), IEEE, 2018, pp. 1–7.
[49] M.H. Khan, et al., Gait recognition using motion trajectory analysis, in: Proceedings of the International Conference Computer Recognition Systems (CORES), Springer, 2017, pp. 73–82.
[50] M.H. Khan, M.S. Farid, M. Grzegorzek, A generic codebook based approach for gait recognition, Multimed. Tools Appl. (2019) 1–24.
[51] M.J. Marín-Jiménez, F.M. Castro, Á. Carmona-Poyato, N. Guil, On how to improve tracklet-based gait recognition systems, Pattern Recognit. Lett. 68 (2015) 103–110.
[52] W. Gong, M. Sapienza, F. Cuzzolin, Fisher tensor decomposition for unconstrained gait recognition, Training 2 (3) (2013).
[53] A. Iosifidis, A. Tefas, I. Pitas, Activity-based person identification using fuzzy representation and discriminant learning, IEEE Trans. Inf. Forensics Secur. 7 (2) (2011) 530–542.
[54] A. Iosifidis, A. Tefas, I. Pitas, Person identification from actions based on artificial neural networks, in: Proceedings of the IEEE Symposium Computational Intelligent Biometrics Identity Management (CIBIM), IEEE, 2013, pp. 7–13.
[55] A. Iosifidis, N. Nikolaidis, I. Pitas, Movement recognition exploiting multi-view information, in: Proceedings of the International Workshop Multimedia Signal Processing (MMSP), IEEE, 2010, pp. 427–431.
[56] O. Chapelle, S.S. Keerthi, Efficient algorithms for ranking with SVMs, Inf. Retr. 13 (3) (2010) 201–215.
[57] X. Ben, C. Gong, P. Zhang, R. Yan, Q. Wu, W. Meng, Coupled bilinear discriminant projection for cross-view gait recognition, IEEE Trans. Circuits Syst. Video Technol. (2019a).
[58] X. Ben, C. Gong, P. Zhang, X. Jia, Q. Wu, W. Meng, Coupled patch alignment for matching cross-view gaits, IEEE Trans. Image Process. 28 (6) (2019b) 3142–3157.
[59] X. Ben, P. Zhang, Z. Lai, R. Yan, X. Zhai, W. Meng, A general tensor representation framework for cross-view gait recognition, Pattern Recognit. 90 (2019c) 87–98.
[60] K. Shiraga, et al., GEINet: view-invariant gait recognition using a convolutional neural network, in: Proceedings of the International Conference Biometrics, IEEE, 2016, pp. 1–8.
[61] S. Yu, H. Chen, E.B.G. Reyes, N. Poh, GaitGAN: invariant gait feature extraction using generative adversarial networks, in: CVPR Workshops, 2017, pp. 532–539.
[62] Y. Zhang, Y. Huang, S. Yu, L. Wang, Cross-view gait recognition by discriminative feature learning, IEEE Trans. Image Process. 29 (2019) 1001–1015.
[63] H. Chao, Y. He, J. Zhang, J. Feng, GaitSet: regarding gait as a set for cross-view gait recognition, in: Proceedings of the AAAI Conference on Artificial Intelligence, 33, 2019, pp. 8126–8133.
[64] M.H. Khan, M.S. Farid, M. Zahoor, M. Grzegorzek, Cross-view gait recognition using non-linear view transformations of spatiotemporal features, in: Proceedings of the International Conference Image Processing (ICIP), IEEE, 2018, pp. 773–777.
[65] I. Laptev, T. Lindeberg, Space-time interest points, in: Proceedings of the IEEE International Conference Computer Vision (ICCV), 1, 2003, pp. 432–439, doi:10.1109/ICCV.2003.1238378.
[66] H. Wang, A. Kläser, C. Schmid, C.-L. Liu, Dense trajectories and motion boundary descriptors for action recognition, Int. J. Comput. Vis. 103 (1) (2013) 60–79.
[67] N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, in: Proceedings of the IEEE Conference Computer Vision and Pattern Recognition (CVPR), 2005, pp. 886–893.
[68] M.H. Khan, M.S. Farid, M. Grzegorzek, Person identification using spatiotemporal motion characteristics, in: Proceedings of the International Conference Image Processing (ICIP), IEEE, 2017, pp. 166–170.
[69] J. Sánchez, et al., Image classification with the Fisher vector: theory and practice, Int. J. Comput. Vis. 105 (3) (2013) 222–245.
[70] X. Peng, L. Wang, X. Wang, Y. Qiao, Bag of visual words and fusion methods for action recognition: comprehensive study and good practice, Comput. Vis. Image Underst. 150 (2016) 109–125.
[71] H. Rahmani, A. Mian, M. Shah, Learning a deep model for human action recognition from novel viewpoints, IEEE Trans. Pattern Anal. Mach. Intell. (2017).
[72] A.L. Maas, A.Y. Hannun, A.Y. Ng, Rectifier nonlinearities improve neural network acoustic models, in: Proceedings of the ICML, 30, 2013, p. 3.
[73] Y. Bengio, Practical recommendations for gradient-based training of deep architectures, in: Neural Networks: Tricks of the Trade, Springer, 2012, pp. 437–478.
[74] R. Girshick, Fast R-CNN, in: Proceedings of the IEEE International Conference Computer Vision (ICCV), 2015, pp. 1440–1448.
[75] F. Chollet, et al., Keras, 2015.
[76] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al., TensorFlow: a system for large-scale machine learning, in: OSDI, 16, 2016, pp. 265–283.
[77] T. Tieleman, G. Hinton, Lecture 6.5-RMSProp: divide the gradient by a running average of its recent magnitude, COURSERA: Neural Netw. Mach. Learn. 4 (2) (2012) 26–31.
[78] H. Larochelle, D. Erhan, A. Courville, J. Bergstra, Y. Bengio, An empirical evaluation of deep architectures on problems with many factors of variation, in: Proceedings of the ICML, ACM, 2007, pp. 473–480.
[79] A. Mansur, Y. Makihara, D. Muramatsu, Y. Yagi, Cross-view gait recognition using view-dependent discriminative analysis, in: Proceedings of the International Conference on Biometrics (IJCB), IEEE, 2014, pp. 1–8.
[80] I. Rida, X. Jiang, G.L. Marcialis, Human body part selection by group lasso of motion for model-free gait recognition, IEEE Signal Process. Lett. 23 (1) (2016) 154–158.
[81] C. Li, et al., DeepGait: a learning deep convolutional representation for view-invariant gait recognition using joint Bayesian, Appl. Sci. 7 (3) (2017) 210.
[82] Q. Chen, Y. Wang, Z. Liu, Q. Liu, D. Huang, Feature map pooling for cross-view gait recognition based on silhouette sequence images, in: Proceedings of the IEEE International Joint Conference on Biometrics (IJCB), 2017, pp. 54–61, doi:10.1109/BTAS.2017.8272682.
[83] D. Muramatsu, Y. Makihara, Y. Yagi, View transformation model incorporating quality measures for cross-view gait recognition, IEEE Trans. Cybern. 46 (7) (2016) 1602–1615.
[84] D. Muramatsu, et al., Gait-based person recognition using arbitrary view transformation model, IEEE Trans. Image Process. 24 (1) (2015) 140–154.

M. H. Khan obtained the B.S. and M.Phil. degrees in Computer Science from BZ University and the University of the Punjab, Pakistan, respectively, and the Ph.D. in computer science from the University of Siegen, Germany. He is currently an Assistant Professor with the College of Information Technology, University of the Punjab, Lahore, Pakistan. His research interests include Pattern Recognition and Machine Learning techniques, particularly visual surveillance, vision-based biometric recognition, and human activity recognition in challenging healthcare and therapeutic intervention systems, where algorithmically monitored health conditions are used to stimulate physical and mental well-being.

M. S. Farid received the B.S., M.Sc., and M.Phil. (Hons.) degrees in computer science from the University of the Punjab, Lahore, Pakistan, in 2004, 2006, and 2009, respectively, and the Ph.D. in computer science from the University of Torino, Turin, Italy, in 2015. He worked as a postdoctoral researcher for a short term in 2017 at the Department of Computer Science and Engineering, Qatar University, Doha, Qatar. He is currently an Assistant Professor with the College of Information Technology, University of the Punjab, Lahore, Pakistan. He has worked on 3D television technology, particularly on efficient representation and coding of multiview videos, novel view synthesis techniques, and quality assessment of 3D videos. His research interests also include information fusion, biometrics, image segmentation, and medical image analysis. He has been a member of the Technical Program Committee of several international conferences, including IEEE FIT, ICACS, and ICOSST, and has served as a reviewer for numerous journals.

M. Grzegorzek is Professor of Medical Informatics with a research focus on Medical Data Science at the University of Lübeck, Germany. He led the Research Group for Pattern Recognition at the University of Siegen, Germany, from October 2010 to September 2018. He studied Computer Science at the Silesian University of Technology, received his Ph.D. at the Pattern Recognition Lab of the University of Erlangen-Nuremberg, worked as a postdoctoral researcher in the Multimedia and Vision Research Group at Queen Mary University of London and at the Institute for Web Science and Technologies at the University of Koblenz-Landau, and completed his habilitation at the AGH University of Science and Technology in Kraków. He has published more than 100 papers in pattern recognition, image processing, machine learning, and multimedia analysis.