Graph Sequence RNN For Vision-Based Freezing of Gait Detection
Abstract—Freezing of gait (FoG) is one of the most common symptoms of Parkinson's disease (PD), a neurodegenerative disorder which impacts millions of people around the world. Accurate assessment of FoG is critical for the management of PD and to evaluate the efficacy of treatments. Currently, the assessment of FoG requires well-trained experts to perform time-consuming annotations via vision-based observations. Thus, automatic FoG detection algorithms are needed. In this study, we formulate vision-based FoG detection as a fine-grained graph sequence modelling task, by representing the anatomic joints in each temporal segment with a directed graph, since FoG events can be observed through the motion patterns of joints. A novel deep learning method, namely the graph sequence recurrent neural network (GS-RNN), is proposed to characterize FoG patterns by devising graph recurrent cells, which take graph sequences of dynamic structures as inputs. For the cases where prior edge annotations are not available, a data-driven adjacency estimation method is further proposed. To the best of our knowledge, this is one of the first studies on vision-based FoG detection using deep neural networks designed for graph sequences of dynamic structures. Experimental results on more than 150 videos collected from 45 patients demonstrated promising performance of the proposed GS-RNN for FoG detection, with an AUC value of 0.90.

Index Terms—Parkinson's disease, freezing of gait detection, deep learning, recurrent neural network, graph sequence

K. Hu, Z. Wang, and D. D. Feng are with the School of Computer Science, The University of Sydney, NSW 2006, Australia (e-mail: [email protected]; [email protected]; [email protected]).
W. Wang, L. Wang, and T. Tan are with the Center for Research on Intelligent Perception and Computing (CRIPAC), National Laboratory of Pattern Recognition (NLPR), Institute of Automation Chinese Academy of Sciences (CASIA), Beijing 100190, China, and University of Chinese Academy of Sciences (UCAS). L. Wang and T. Tan are also with the Center for Excellence in Brain Science and Intelligence Technology (CEBSIT), Institute of Automation Chinese Academy of Sciences (CASIA) (e-mail: [email protected]; [email protected]; [email protected]).
K. A. Ehgoetz Martens and S. J. G. Lewis are with the Parkinson's Disease Research Clinic, Brain and Mind Centre, The University of Sydney, Sydney, NSW 2050, Australia (e-mail: [email protected]; [email protected]).

I. INTRODUCTION

Parkinson's disease (PD) is a neurodegenerative disorder characterized by motor symptoms resulting from dopaminergic loss in the substantia nigra [1], [2]. Freezing of gait (FoG) is a debilitating symptom of PD, presenting as a sudden and brief episode in which a patient's feet get stuck to the floor and movement ceases despite the intention to keep walking [3], [4]. As the disease progresses, FoG becomes more frequent and severe, posing a major risk for falls [5], [6] and eventually affecting mobility, independence and quality of life [7]. Early detection and quantification of FoG events are of great importance in clinical practice and could be used for the evaluation of treatment efficacy for FoG [8]. However, current FoG annotations heavily rely on subjective scoring by well-trained experts, which is extremely time-consuming. Therefore, computer-aided intelligent solutions are needed to establish objective and timely FoG detection and quantification.

Fig. 1: Illustration of a graph sequence (b) produced by a gait video segment (a). Panel (a) shows sample frames of a gait video segment and panel (b) the graph representation of each video frame. The graph vertices are associated with the human anatomical joints obtained from a human pose estimation algorithm, and the edges among these vertices are further identified by the proposed method.

Since observing PD subjects has been the gold standard for identifying when FoG events happen in clinical assessments [9], we can formulate FoG detection as a task which classifies each short segment of a long assessment video into two classes: FoG and non-FoG. To this end, vision-based FoG detection methods have rarely been studied, although a few vision-based Parkinsonian gait analysis methods have been proposed [10]–[13]. These PD gait analysis methods were mainly devised to characterize Parkinsonian gaits at a coarse level (e.g., categorizing a given gait video as normal or abnormal) and are not intended for accurately reporting individual FoG events in a video. In addition, following a traditional machine learning pipeline, these methods rely on extracting hand-crafted features by assuming that a video contains only a patient walking independently. However, in clinical settings, supporting staff are often involved to ensure the safety of PD subjects. As a result, multiple persons can appear in recorded videos, which may violate the assumptions of those methods where only a patient appears.

Recent years have witnessed the ground-breaking success of deep learning techniques for many vision tasks, such as object recognition, video classification and human action recognition.
Fig. 2: Illustration of the proposed GS-RNN architecture for modelling gait videos with graph sequences. The adjacency estimation layer estimates the edge weights using a bilinear transform, the graph RNN layer is designed to track and propagate temporal graph patterns, the vertex-wise RNN layer helps to reduce model complexity by taking fewer vertex relations into account, and the graph pooling layer generates graph-level predictions by referring to the vertices with the highest likelihood of contributing to FoG patterns.
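As a rough orientation for the caption above, the following minimal NumPy sketch shows one way the four named layers could be composed over a graph sequence. All weights, shapes and the simple stand-in computations are assumptions for illustration only and are not the authors' implementation.

```python
# Illustrative composition of the Fig. 2 pipeline over a sequence of graphs whose
# vertex counts may change from segment to segment (assumed shapes and toy weights).
import numpy as np

rng = np.random.default_rng(0)
d, p = 16, 8                                   # feature size, hidden size (illustrative)
W_a = rng.standard_normal((d, d)) * 0.1        # stand-in weights for edge estimation
W_g = rng.standard_normal((d, p)) * 0.1        # stand-in graph-RNN input transform
W_v = rng.standard_normal((p, p)) * 0.1        # stand-in vertex-wise transform
w_o = rng.standard_normal(p) * 0.1             # stand-in output weights

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gs_rnn_forward(graph_sequence):
    """graph_sequence: list of (n_t, d) vertex feature matrices, n_t may vary over time."""
    scores, h_prev = [], np.zeros(p)
    for X in graph_sequence:
        A = sigmoid(X @ W_a @ X.T)             # adjacency estimation layer (edge weights)
        H = np.tanh(A @ X @ W_g + h_prev)      # graph RNN layer: exchange + recurrence
        H = np.tanh(H @ W_v)                   # vertex-wise RNN layer: per-vertex update
        h_prev = H.max(axis=0)                 # pool over vertices to carry state forward
        scores.append(sigmoid(H @ w_o).max())  # graph pooling: top-vertex FoG likelihood
    return scores

segments = [rng.standard_normal((n, d)) for n in (5, 7, 6)]   # toy dynamic graph sequence
print(gs_rnn_forward(segments))
```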
These techniques provide a unique opportunity to develop deep learning based FoG detection methods to address the limitations of the existing PD gait analysis methods. Although many methods [14] have been proposed for generic video classification problems, which involve significant variation between different classes (e.g., kicking and jumping) and in which each video frame is generally represented as a whole unit, they may neglect the subtle dynamics of FoG events, considering that the variation among different subjects could be higher than that between FoG and non-FoG events. Several recent studies (e.g., Pose-CNN [15], [16] and our recent work [17]) have been conducted to model the subtle variation using region or patch based representations. However, the relationships among patches and the entire temporal sequence have not been adequately explored.

Therefore, for the first time, the current study aims to formulate FoG detection as a fine-grained graph sequence modelling task by representing each temporal video segment collected from a clinical assessment with a graph. As illustrated in Fig. 1, a number of consecutive temporal segments of an assessment video are organized in sequential order: for each segment, the anatomical joints are extracted and characterized as vertices of a directed graph, which is in line with the clinical practice where the joints of the knees and feet are particularly attended to. As a result, a graph sequence is obtained to represent the input video. Note that the spatial structures of the graph sequence are dynamic, since the detected joints (vertices) vary over time (i.e., the locations of the subject and the supporting clinical staff could change, and joints could be occluded from view during recording).

Traditionally, recurrent neural networks (RNNs), including the long short-term memory (LSTM) and the gated recurrent unit (GRU) network, have been widely used to model sequential vector inputs with promising results [18], [19]. Although several studies have been proposed to address sequential graph inputs [20], [21], it is not trivial to apply them to general graph sequences, especially when the structures are dynamic (i.e., vertices and edges can change over time). In this study, we propose a novel RNN architecture, namely the graph sequence RNN (GS-RNN), to deal with general sequential graphs of dynamic structures. In particular, to leverage the success of gated mechanisms, which alleviate the gradient vanishing and exploding issues of the original RNN, GS-LSTM and GS-GRU are implemented. Computational operators, gated mechanisms and memory states of GS-RNN cells are devised to track sequential graph patterns while being compatible with dynamic graph structures. Experimental results demonstrate the effectiveness of the proposed GS-RNN architecture for the FoG detection task and the benefits of utilizing the graph sequence representation. Moreover, graph sequence representations provide additional localization hints for clinical assessments.

In summary, the major contributions of this paper are three-fold:
• We formulate FoG detection as a fine-grained graph sequence modelling task, which is one of the first studies to implement vision-based FoG detection. Instead of characterizing each video temporal segment as a whole unit or as patches of individual joints, we represent each video with a graph sequence where the vertices are associated with the anatomical joints, which enables fine-grained characterization of the dynamic patterns of FoG
lights on FoG detection by analyzing temporal structure data. The key advantages of GCNNs are implemented by graph convolution layers, which address graph inputs of varying structures. By stacking multiple graph convolution layers, it is feasible to construct deep neural networks. Nonetheless, GCNNs are designed for independent graph inputs and are not able to formulate sequential or temporal patterns from graph sequences.

Generally, RNNs have been widely used to model sequential data. In particular, the LSTM and GRU methods were proposed to address the gradient vanishing and exploding issues of the original RNNs by introducing gated mechanisms [18], [19]. GRU involves less computation than LSTM while keeping similar performance and improving the efficiency of the original RNNs. Moreover, GRU has shown better classification performance on smaller datasets [40]. However, these RNNs are designed for general sequential inputs whose input vectors are of a fixed length, whilst graph sequences usually describe more complex structures over time and are more challenging to learn. Although structural graph RNNs [20], [21], [41]–[43] have been proposed to take graph sequences of fixed graph structures as inputs, dynamic graph sequences, which are more general for a wide range of applications, have not been fully explored in the past. Therefore, advanced RNNs are needed to represent and model these complex patterns conveyed through graph sequences of dynamic graph structures.

extended to the remaining frames of $V_t$. Hence, the pixels within the $i$-th window can be extracted as an anatomic joint proposal $v_t^i$ to characterize the local patterns around the $i$-th joint.

By treating $v_t^i$ as the $i$-th vertex of a graph $G_t$, $V_t = \{v_t^i\}$ becomes the set of vertices of $G_t$. In addition, $E_t$ denotes a set of ordered vertex pairs (i.e., edges or arrows) to represent the relations between any two vertices. Hence, a directed graph $G_t = (V_t, E_t)$ is derived to represent the video segment $V_t$. To characterize the edges in $E_t$, a weighted adjacency matrix $A_t = (a_t^{ij}) \in \mathbb{R}^{n \times n}$ is introduced. Note that $A_t$ may be asymmetric, as the interactions among joints can be of either action or reaction. To further characterize the graph $G_t$, let $X_t = (x_t^1, x_t^2, \ldots, x_t^n)^T$, where $x_t^i \in \mathbb{R}^d$ denotes the vertex feature vector computed from $v_t^i$ via pre-trained neural networks to represent joint appearance and motion; let $y_t^i$ be a binary response indicating whether FoG occurs within the $i$-th anatomic joint proposal at temporal index $t$ (i.e., 1 for FoG and 0 for non-FoG). In particular, denote $y_t = \max_i y_t^i$ as the graph-level response. Note that at least one joint contributes to an FoG event if a graph-level response is annotated as FoG.
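To make the formulation concrete, the following sketch encodes one video segment as a graph object under the notation above; the class layout, feature dimensions and toy values are assumptions for illustration rather than the authors' data structures.

```python
# One segment graph G_t: vertices are anatomic joint proposals with features x_t^i,
# vertex labels y_t^i, and a graph-level label y_t = max_i y_t^i.
from dataclasses import dataclass
import numpy as np

@dataclass
class SegmentGraph:
    vertex_features: np.ndarray        # X_t, shape (n_t, d): one feature row per joint proposal
    vertex_labels: np.ndarray          # y_t^i in {0, 1}, shape (n_t,)
    adjacency: np.ndarray = None       # A_t, shape (n_t, n_t); estimated later if unknown

    @property
    def graph_label(self) -> int:      # y_t = max_i y_t^i
        return int(self.vertex_labels.max())

# Toy example: a 3-joint graph where one joint proposal is annotated as FoG.
g = SegmentGraph(vertex_features=np.zeros((3, 8)), vertex_labels=np.array([0, 1, 0]))
print(g.graph_label)                   # -> 1: the whole segment is treated as FoG
```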
B. Adjacency Matrix Estimation

When prior knowledge of a dataset is not available, an adjacency estimation layer is adopted in a GS-RNN to learn a weighted adjacency matrix $A_t$, of which the element $a_t^{ij}$ represents the relationship between the joint proposals $v_t^i$ and $v_t^j$. This layer introduces a bilinear transformation to obtain the edge weight estimation $a_t^{ij}$, which explores the vertex features $x_t^i$ and $x_t^j$. In addition, a bilinear operation is able to address inconsistent dimensions of the two inputs, thus $x_t^i$ and $x_t^j$ can be features of different modalities. Adding superscripts to $x_t^i$ and $x_t^j$ to denote the different feature modalities, $x_t^{i(c)} \in \mathbb{R}^p$ is the feature vector extracted from the pre-trained C3D and $x_t^{j(r)} \in \mathbb{R}^q$ is extracted from the pre-trained ResNet-50. In particular, $a_t^{ij}$ is estimated as:

$$a_t^{ij} = g\big(x_t^{i(c)} M \odot x_t^{j(r)}\big), \qquad (1)$$

where $M \in \mathbb{R}^{p \times q}$ is a matrix of parameters which can be updated during backward propagation, $\odot$ is an element-wise vector multiplication operator, and $g(\cdot)$ represents a function of $\mathbb{R}^q \to \mathbb{R}$, which is chosen as a linear function with trainable weights and bias in this study. Note that the estimated adjacency matrix $A_t$ is asymmetric, i.e., $a_t^{ij} \neq a_t^{ji}$, due to the asymmetry of $M$ and the different feature modalities of the vertices $v_t^i$ and $v_t^j$. Such asymmetry is helpful to represent different interactions (i.e., action and reaction) between vertices. Note that the computation of $A_t$ is differentiable and $A_t$ can be learned through both forward and backward propagations.

C. Graph RNN Cell

Graph RNN cells in this study introduce gate mechanisms similar to LSTM and GRU. Based on these two types of mechanisms, we introduce the graph LSTM cell and the graph GRU cell as shown in Fig. 4. The computations of the graph LSTM cell shown in Fig. 4 (a) are as follows:

$$i_t := \sigma(W_{xi} A X_t + W_{hi} h_{t-1} + \mathbf{1}_t b_i), \qquad (2)$$
$$f_t := \sigma(W_{xf} A X_t + W_{hf} h_{t-1} + \mathbf{1}_t b_f), \qquad (3)$$
$$o_t := \sigma(W_{xo} A X_t + W_{ho} h_{t-1} + \mathbf{1}_t b_o), \qquad (4)$$
$$\dot{c}_t := \tanh(W_{xc} A X_t + W_{hc} h_{t-1} + \mathbf{1}_t b_c), \qquad (5)$$
$$\ddot{c}_t := f_t \odot c_{t-1} + i_t \odot \dot{c}_t, \qquad (6)$$
$$\dot{h}_t := o_t \odot \tanh(\ddot{c}_t), \qquad (7)$$
$$c_t := \mathbf{1}_{t+1} \max_{v \in V}(\ddot{c}_t), \quad h_t := \mathbf{1}_{t+1} \max_{v \in V}(\dot{h}_t). \qquad (8)$$

Similar to general LSTM cells, Eq. (2) computes the input gate $i_t \in \mathbb{R}^{|V_t| \times p}$ controlling the extent to which new patterns are introduced to the cell, where $p$ is the hidden size. All the vertices and the hidden state from the last temporal step are involved, where $\sigma$ is a vertex-wise sigmoid function and $\mathbf{1}_t$ is an all-one $|V_t|$-dimensional vector for broadcast purposes. First, the vertex features $X_t$ are multiplied with the adjacency matrix $A$, which can be interpreted as information exchange among the vertices in line with the edge weights. Let $\tilde{X}_t = A_t X_t$ denote the exchanged vertex features of the input graph. The original feature and the exchanged feature of the $i$-th vertex can be written as $x_t^i = (x_t^{i1}, x_t^{i2}, \ldots, x_t^{id})$ and $\tilde{x}_t^i = (\tilde{x}_t^{i1}, \tilde{x}_t^{i2}, \ldots, \tilde{x}_t^{id})$, which correspond to the $i$-th rows of $X_t$ and $\tilde{X}_t$, respectively:

$$\tilde{x}_t^{ij} = \sum_k a_t^{ik} x_t^{kj}. \qquad (9)$$

In detail, Eq. (9) implies that the $j$-th exchanged feature of vertex $i$ in $\tilde{X}_t$ acquires information from the $j$-th original features of its direct predecessors (including itself), $x_t^{kj}$, with the weights $a_t^{ik}$ of the corresponding edges. Exchanging information between the vertices helps capture useful patterns, such as simultaneously occurring abnormalities, as references to enhance representation learning. Next, linear transforms $W_{xi}$ and $W_{hi}$ are applied to the exchanged features of the vertices and to the hidden state from the last step, which is broadcast to each vertex.

The forget gate $f_t \in \mathbb{R}^{|V_t| \times p}$ controls the extent to which existing patterns should be kept, the output gate $o_t \in \mathbb{R}^{|V_t| \times p}$ controls the extent to which the cell state is involved in computing the output, and the candidate cell state $\dot{c}_t \in \mathbb{R}^{|V_t| \times p}$ is computed in a similar manner. Note that the graph-level cell state $\ddot{c}_t \in \mathbb{R}^{|V_t| \times p}$ and the graph-level hidden state $\dot{h}_t \in \mathbb{R}^{|V_t| \times p}$ are obtained vertex-wisely. The graph-level states are matrices of which each row contains the state of a particular vertex. Given the potential for a temporally changing graph structure, it is not always feasible to find the corresponding vertices of the next graph. Thus, maximum pooling operators are introduced for $\ddot{c}_t$ and $\dot{h}_t$ over the vertices, and the pooled states are further broadcast to the vertices of the next graph. Hence, the graph LSTM cell is able to keep track of the dependencies among the graphs of dynamic structures in the sequence.

Similar strategies are also applied to the computation of graph GRU cells, as shown in Fig. 4 (c). Graph GRU cells involve less computation than graph LSTM cells and are therefore expected to be more efficient for training and prediction:

$$z_t := \sigma(W_{xz} A X_t + W_{hz} h_{t-1} + \mathbf{1}_t b_z), \qquad (10)$$
$$r_t := \sigma(W_{xr} A X_t + W_{hr} h_{t-1} + \mathbf{1}_t b_r), \qquad (11)$$
$$\dot{h}_t := \tanh(W_{xh} A X_t + r_t \odot W_{hh} h_{t-1} + \mathbf{1}_t b_h), \qquad (12)$$
$$\ddot{h}_t := (1 - z_t) \odot \dot{h}_t + z_t \odot h_{t-1}, \qquad (13)$$
$$h_t := \mathbf{1}_{t+1} \max_{v \in V} \sigma(\ddot{h}_t). \qquad (14)$$

D. Vertex-wise RNN Cell

In general, stacking multiple graph RNN layers in a GS-RNN involves expensive non-linear computation, which significantly increases the model complexity. Therefore, it tends to result in over-fitting issues. To alleviate such issues and to build deep GS-RNNs, vertex-wise RNN cells are proposed, which apply shared linear transformations on each vertex separately. As a result, no pattern exchange is conducted in the vertex-wise RNN cells as in graph RNN cells. Vertex-wise RNN cells are implemented as vertex-wise LSTM cells and vertex-wise GRU cells. The computations of the vertex-wise LSTM cell shown in Fig. 4 (b) are formulated from Eq. (15) to Eq. (21):

$$i_t := \sigma(W_{xi} X_t + W_{hi} h_{t-1} + \mathbf{1}_t b_i), \qquad (15)$$
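The two core operations above, the bilinear edge-weight estimation of Eq. (1) and one graph LSTM step of Eqs. (2)–(8), can be sketched compactly in NumPy. The shapes, parameter names and toy dimensions below are assumptions for illustration and do not reproduce the released implementation.

```python
# Minimal sketch: adjacency estimation (Eq. 1) followed by one graph LSTM step (Eqs. 2-8).
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# --- Adjacency estimation, Eq. (1): a_t^{ij} = g(x^{i(c)} M (elementwise) x^{j(r)}) ---
p_dim, q_dim = 32, 16                       # toy sizes (8192 for C3D, 2048 for ResNet-50 in the paper)
M = rng.standard_normal((p_dim, q_dim)) * 0.1        # trainable bilinear parameters
w_g, b_g = rng.standard_normal(q_dim) * 0.1, 0.0     # g(.): linear map R^q -> R with bias

def estimate_adjacency(X_c, X_r):
    """X_c: (n, p_dim) C3D vertex features; X_r: (n, q_dim) ResNet-50 vertex features."""
    n = X_c.shape[0]
    A = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            A[i, j] = w_g @ ((X_c[i] @ M) * X_r[j]) + b_g   # generally A[i, j] != A[j, i]
    return A

# --- One graph LSTM step, Eqs. (2)-(8) ---
d, p = p_dim, 8                             # vertex feature size and hidden size
names = ["i", "f", "o", "c"]
params = {f"Wx{k}": rng.standard_normal((d, p)) * 0.1 for k in names}
params |= {f"Wh{k}": rng.standard_normal((p, p)) * 0.1 for k in names}
params |= {f"b{k}": np.zeros(p) for k in names}

def graph_lstm_step(A_t, X_t, h_prev, c_prev):
    """h_prev, c_prev: pooled (p,) states from the previous graph, broadcast to every vertex."""
    AX = A_t @ X_t                                          # information exchange along edges
    gate = lambda k, act: act(AX @ params[f"Wx{k}"] + h_prev @ params[f"Wh{k}"] + params[f"b{k}"])
    i, f, o = gate("i", sigmoid), gate("f", sigmoid), gate("o", sigmoid)   # Eqs. (2)-(4)
    c_cand = gate("c", np.tanh)                             # Eq. (5)
    c_vert = f * c_prev + i * c_cand                        # Eq. (6), per-vertex cell state
    h_vert = o * np.tanh(c_vert)                            # Eq. (7)
    return h_vert, c_vert.max(axis=0), h_vert.max(axis=0)   # Eq. (8): max-pool over vertices

# Toy usage on a 5-vertex graph; the pooled outputs feed the next (possibly resized) graph.
X_c, X_r = rng.standard_normal((5, p_dim)), rng.standard_normal((5, q_dim))
A_t = estimate_adjacency(X_c, X_r)
h_vert, c_pool, h_pool = graph_lstm_step(A_t, X_c, np.zeros(p), np.zeros(p))
print(A_t.shape, h_vert.shape, c_pool.shape)                # (5, 5) (5, 8) (8,)
```

Because the pooled states are plain vectors, they can be broadcast to the next graph even when the number of vertices changes, which is what makes the cell compatible with dynamic graph structures.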
indicate the $n$-th training sample. Therefore, the loss function of the proposed GS-RNN is defined as:

$$J = -\sum_n \sum_t \Big[ y^{(n)} \log\big(\hat{y}_t^{g(n)}\big) + \big(1 - y^{(n)}\big) \log\big(1 - \hat{y}_t^{g(n)}\big) \Big]. \qquad (28)$$

F. Context Fusion

The GS-RNN $f_g$ computes the prediction $\hat{y}_t^g$ based on the input graph sequence. Nonetheless, the global context is absent when only the joint proposals are involved. To further help accurately characterize FoG patterns, a context model $f_c$ is applied to take the entire video segment sequence $\{V_t\}$ as the input. The prediction $\hat{y}_t^c = f_c(V_t)$ is derived and further fused with $\hat{y}_t^g$. The context model $f_c$ is chosen as a pre-trained C3D network, which is applied to each video segment $V_t$, followed by an RNN network, which formulates the temporal relations between the C3D features of these temporal segments.

Without increasing the model complexity, the graph sequence based model and the context model are trained independently. Fusing $\hat{y}_t^g$ and $\hat{y}_t^c$ helps to better predict FoG events. Three fusion strategies, including the product fusion, the maximum pooling fusion and the linear fusion, are listed from Eq. (29) to Eq. (31):

$$\hat{y}_t = \hat{y}_t^c \, \hat{y}_t^g, \qquad (29)$$
$$\hat{y}_t = \max(\hat{y}_t^c, \hat{y}_t^g), \qquad (30)$$
$$\hat{y}_t = \gamma \hat{y}_t^c + (1 - \gamma) \hat{y}_t^g, \quad \gamma \in (0, 1). \qquad (31)$$

The outputs of these three fusion functions are in $[0, 1]$, representing the probability of an FoG event occurring in $V_t$.

IV. EXPERIMENTAL RESULTS AND DISCUSSIONS

A. FoG Dataset & Evaluation Metrics

The dataset in this study consists of videos collected from 45 subjects who underwent clinical assessment. During the clinical assessment, each subject completed a Timed Up and Go (TUG) test used for functional mobility assessment [46]. The TUG tests were recorded as frontal view videos at 25 frames per second with a frame resolution of 720 × 576. FoG events within these videos were annotated by well-trained experts on a per-frame basis. Note that clinical staff were involved in the video recording processes to ensure the safety of the subjects during the TUG tests. Furthermore, the angle of the camera was set to capture the body parts from chest to feet to meet ethical requirements.

In summary, 167 videos totalling 25.5 hours in duration were acquired, where 8.7% of the total hours collected contained FoG patterns. This indicates a highly imbalanced dataset. For FoG detection, these videos were divided into 91,559 one-second long non-overlapped video segments. If any frame of a segment was annotated as FoG in the ground truth, the segment was labelled as FoG; otherwise it was labelled as non-FoG. The demographics of the dataset are listed in Table I, including age, years since diagnosis, education, cognitive function (evaluated by MMSE), and Hoehn and Yahr stages [47], a common metric to describe the progression of PD symptoms. This dataset has the largest number of subjects in the literature of vision-based Parkinsonian gait analysis; for example, 11 subjects were involved in the work of [10] and 30 subjects in [11]. It is also the first such dataset for FoG detection.

TABLE I: Demographics of the FoG video dataset

Age, years               68.49 (7.7)
Years since diagnosis     9.94 (5.8)
Education, years         13.80 (3.2)
MMSE                     27.94 (1.9)

Hoehn and Yahr stages    1: 2%   2: 38%   3: 18%   4: 33%   5: 10%

To evaluate the performance of the proposed method over this dataset comprehensively, a 5-fold cross validation was introduced. The 45 subjects were randomly and evenly partitioned into 5 groups. For each fold, the video segments of 4 groups were chosen for training and validation purposes, and the video segments of the remaining group were used for test purposes. Therefore, the videos of each subject only appeared in either the training or the test partition, which helps to statistically estimate the performance for unseen subjects.

A number of metrics were adopted to comprehensively evaluate the FoG detection performance. Firstly, as the prediction $\hat{y}_t$ is continuous in $[0, 1]$, indicating the FoG probability, a threshold $\theta$ should be identified in line with a specified use case. If $\hat{y}_t > \theta$, the corresponding video segment was marked as an FoG event; otherwise the segment was marked as a non-FoG event. Next, accuracy (the percentage of samples correctly classified over the total sample size), sensitivity (true positive rate) and specificity (true negative rate) associated with this threshold were computed. By varying the threshold and plotting the corresponding sensitivity against 1-specificity, the receiver operating characteristic (ROC) curve and the area under the curve (AUC) were further utilized to evaluate the effectiveness of the proposed GS-RNNs.
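A subject-wise split and the threshold-based metrics described above can be reproduced as sketched below. The arrays are synthetic placeholders, and scikit-learn's GroupKFold is used as one possible way to keep each subject's segments in a single partition; it is not necessarily how the authors generated their folds.

```python
# Sketch of subject-wise 5-fold cross validation and threshold-based metrics.
import numpy as np
from sklearn.model_selection import GroupKFold
from sklearn.metrics import roc_auc_score, confusion_matrix

rng = np.random.default_rng(0)
n_segments = 1000
subject_id = rng.integers(0, 45, size=n_segments)     # which of the 45 subjects a segment is from
labels = rng.integers(0, 2, size=n_segments)          # 1 = FoG, 0 = non-FoG (toy labels)
features = rng.standard_normal((n_segments, 8))       # stand-in segment features

for train_idx, test_idx in GroupKFold(n_splits=5).split(features, labels, groups=subject_id):
    assert not set(subject_id[train_idx]) & set(subject_id[test_idx])   # unseen subjects at test

def metrics_at_threshold(y_true, y_hat, theta):
    """Sensitivity, specificity and accuracy for the binary decision y_hat > theta."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_hat > theta).ravel()
    sensitivity, specificity = tp / (tp + fn), tn / (tn + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return sensitivity, specificity, accuracy

y_hat = rng.random(n_segments)                         # placeholder predictions in [0, 1]
print(roc_auc_score(labels, y_hat), metrics_at_threshold(labels, y_hat, 0.5))
```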
B. Implementation Details

In this study, three types of pre-trained features were applied, including ResNet-50 vertex features, C3D vertex features and C3D context features. The details of obtaining these features are as follows:
• ResNet-50 vertex feature: Setting the bounding window to 50×50 pixels, the size of each anatomic joint proposal was 25 × 50 × 50 × 3, where 25 was the frame rate and 3 represented the RGB channels. A pre-trained ResNet-50 with a size of (2, 2) at the last pooling layer was applied frame by frame. A maximum pooling operation was then applied along the temporal axis to reduce the model complexity and the computational cost. The dimension of the ResNet-50 vertex feature vector was 1 × 2048.
• C3D vertex feature: The purpose of the C3D vertex feature differs from that of the ResNet-50 feature. The ResNet-50 feature was only applied in the adjacency matrix estimation, whilst the C3D vertex feature was also used for modelling motion patterns. A higher precision of the C3D feature was expected, and the proposal size was set to 100×100. The dimension of the C3D vertex feature vector was 1 × 8192.
• C3D context feature: The video frames were first resized with a fixed aspect ratio and cropped into temporal clips with a size of 25 × 224 × 224 × 3. These temporal clips were then fed to a pre-trained C3D network to compute spatial-temporal features. The dimension of the C3D context feature was 1 × 32768.

Given the imbalanced dataset, all the positive samples were used, with an equivalent number of negative samples randomly selected from the training set in each epoch for model training. The initial learning rate was set to 0.001, using the stochastic gradient descent optimizer. With an Nvidia GTX 1080Ti GPU card, training the bi-directional GS-GRU for each epoch (containing videos from 40 subjects) was completed in 50 minutes; for testing, each one-second video segment was processed within 0.5 sec.

C. FoG Detection Performance of GS-RNNs

GS-RNNs were evaluated by applying different types of GS-RNN cells, including GS-LSTM cells and GS-GRU cells, with the directions set as forward or bi-directional. Similar to general RNNs, the forward direction involves only the past graphs and can be utilized for online predictions. The bi-directional network utilizes both the previous and the future graphs for accurate characterization and prediction of the graph FoG patterns. In detail, we implemented the forward GS-LSTM, the forward GS-GRU, the bi-directional GS-LSTM and the bi-directional GS-GRU. Each model contained an adjacency matrix estimation layer, a graph RNN layer, a vertex-wise layer and a graph pooling layer. Fig. 5 (a) shows the ROC curves of the proposed architectures and Table II lists the AUC of these curves. The bi-directional GS-GRU achieved the highest AUC value of 0.884 compared with the other GS-RNN architectures. As expected, the bi-directional GS-GRU outperformed the forward GS-GRU. For GS-LSTM, the gated mechanisms are much more complex than those of GS-GRU, thus the model complexity of the bi-directional GS-LSTM increases, which negatively impacted the model performance.

Fig. 5: ROC curves of GS-RNNs (true positive rate against false positive rate). (a) Comparison among different GS-RNN architectures. (b) Comparison among different strategies involving structural and sequential patterns.

TABLE II: Comparison of different GS-RNN architectures for FoG detection

GS-RNN Architectures        AUC
Bi-directional GS-GRU       0.884
Bi-directional GS-LSTM      0.869
Forward GS-GRU              0.875
Forward GS-LSTM             0.883

TABLE III: Comparison of different strategies to represent structural and sequential graph patterns

Methods                                                     AUC
Bi-directional GS-GRU                                       0.884
Bi-directional GS-GRU w/o Adjacency Patterns                0.879
Sequential Prediction w/o Structural Patterns (C3D-GRU)     0.878
Graph Prediction w/o Sequential Patterns (GCNN)             0.878
Prediction w/o Structural and Sequential Patterns (C3D)     0.874

GS-RNNs take both the structural and the temporal graph patterns into account simultaneously. To evaluate how these patterns contributed to FoG detection, we further compared the bi-directional GS-GRU with different methods which either partially utilize or do not utilize these patterns. The first is the bi-directional GS-GRU without the adjacency matrix, which only utilizes two vertex-wise GRU layers. The second (C3D-GRU) utilized the entire video segment with C3D, ignoring the anatomic graph as input and representing temporal patterns with GRU. The third (GCNN) ignored sequential patterns and applied a GCNN to process each graph independently. The last
(C3D) generated predictions on entire temporal segments with C3D independently. Note that the second and the last methods can be viewed as context-level models, as they utilize the entire video segment in a straightforward manner. Hence, for simplicity, the context model, which is further fused with the GS-RNN for complementary purposes, was selected from these two methods by referring to their performance in this study.

Fig. 5 (b) shows the ROC curves of these methods and Table III lists the AUC values of the ROC curves. The results indicate that simultaneously characterizing the structural and the sequential graph patterns enhanced the performance of FoG detection in terms of AUC. In addition, DeLong's test was applied to evaluate the statistical significance of the improvements [48], [49]. DeLong's test is a nonparametric approach using generalized $U$-statistics for the ROC curves. Table IV lists the results of the DeLong's test. The $p$-values comparing the bi-directional GS-GRU with the remaining methods indicate that the improvements were statistically significant ($\alpha = 0.001$). Therefore, characterizing the structural and the sequential graph patterns of gaits was clearly helpful for FoG detection.

TABLE IV: DeLong's test for different strategies to represent structural and sequential graph patterns

Methods                                              AUC Diff.   Z       p
Sequential Prediction w/o Structural Patterns        0.007       4.04    5.33E-05
Bi-directional GS-GRU w/o Adjacency Patterns         0.005       4.54    5.51E-06
Prediction w/o Structural and Sequential Patterns    0.012       7.23    4.68E-13
Graph Prediction w/o Sequential Patterns             0.015       11.34   < 2.2E-16

D. FoG Detection Performance of Fusion Strategies

To further improve the performance of FoG detection, the context prediction $\hat{y}_t^c$ was fused with the GS-RNN prediction $\hat{y}_t^g$. The context model was chosen as a general GRU network for sequential predictions, which utilized the C3D context features as input. Note that the GS-RNN model and the context model were trained independently, avoiding an increase in model complexity. In terms of the fusion strategies, the linear fusion, the product fusion and the maximum pooling fusion were investigated. For the linear fusion, $\gamma = 1/2$ and $\gamma = 2/3$ were selected, which achieved the best image classification performance in [50].

Fig. 6: Comparison of the detection results in terms of ROC curves (true positive rate against false positive rate). (a) Comparison among different fusion strategies. (b) Comparison between the best fusion strategy and the cases where the graph representation and the context model are used independently.

Fig. 6 shows the ROC curves of the different fusion methods and Table V lists the AUC values of these ROC curves. Fig. 6 (a) illustrates the curves of the different fusion strategies. The linear fusion strategy with $\gamma = 1/2$ achieved the best overall performance of 0.900 in terms of AUC. In Fig. 6 (b), the curve of the best fusion strategy is compared with the cases where the graph sequence model and the context model are utilized independently. The fusion predictions improved the detection performance by taking advantage of the two independent methods. The improvements in terms of AUC of the best fusion method over $\hat{y}_t^c$ and $\hat{y}_t^g$ were 0.016 and 0.029, respectively.

In particular, the sensitivity, the specificity and the accuracy related to a threshold $\hat{\theta}$ were also adopted for the evaluation, where $\hat{\theta}$ maximizes the following Youden's $J$ statistic [51]:

$$\hat{\theta} = \arg\max_{\theta} \; \big(\text{sensitivity} + \text{specificity} - 1\big). \qquad (32)$$

By maximizing this statistic, a threshold can be derived that treats the sensitivity and the specificity with equal importance. These evaluation metrics and the associated $J$ statistics are also listed in Table V. For the best fusion strategy, the sensitivity, specificity and accuracy values related to this threshold achieved 83.8%, 82.3% and 82.5%, respectively.
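The fusion rules of Eqs. (29)–(31) and the threshold of Eq. (32) are simple to reproduce. The sketch below applies them to placeholder score vectors using scikit-learn's ROC utilities; it is illustrative and not the authors' evaluation code.

```python
# Sketch of the score fusion (Eqs. 29-31) and Youden's J threshold selection (Eq. 32).
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

def fuse(y_context, y_graph, strategy="linear", gamma=0.5):
    """Combine the context prediction and the GS-RNN prediction for each segment."""
    if strategy == "product":                           # Eq. (29)
        return y_context * y_graph
    if strategy == "max":                               # Eq. (30)
        return np.maximum(y_context, y_graph)
    return gamma * y_context + (1.0 - gamma) * y_graph  # Eq. (31), gamma in (0, 1)

def youden_threshold(y_true, y_score):
    """Threshold maximizing sensitivity + specificity - 1 (equivalently TPR - FPR)."""
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    return thresholds[np.argmax(tpr - fpr)]

# Toy usage with random scores; with real predictions this reproduces the reported metrics.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)
y_hat = fuse(rng.random(200), rng.random(200), strategy="linear", gamma=0.5)
theta = youden_threshold(y_true, y_hat)
print(roc_auc_score(y_true, y_hat), theta, float((y_hat > theta).mean()))
```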
prediction phases without compromising the model's learning capacity.

ACKNOWLEDGEMENT

This research was partially supported by Australian Research Council (ARC) Grant DP160103675, NHMRC-ARC Dementia Fellowship (#1110414), the National Health and Medical Research Council (NHMRC) of Australia Program Grant (#1037746), Dementia Research Team Grant (#1095127) and NeuroSleep Centre of Research Excellence (#1060992), the ARC Centre of Excellence in Cognition and its Disorders Memory Program (#CE110001021), the Sydney Research Excellence Initiative (SREI) 2020 of the University of Sydney, Natural Science Foundation of China Grant (#61420106015), and Parkinson Canada. The ethical approval of this research was obtained from the University of Sydney Human Ethics Board (#2014/255). We thank our patients who participated in the data collection; all of them provided written informed consent. We would like to acknowledge and thank Moran Gilat, Julie Hall, Alana Muller, Jennifer Szeto and Courtney Walton for conducting and scoring the freezing of gait assessments. We would also like to acknowledge ForeFront, a large collaborative research group dedicated to the study of neurodegenerative diseases.

REFERENCES

[1] E. R. Dorsey, A. Elbaz, E. Nichols, F. Abd-Allah, A. Abdelalim, J. C. Adsuar, M. G. Ansha, C. Brayne, J.-Y. J. Choi, D. Collado-Mateo, et al., "Global, regional, and national burden of Parkinson's disease, 1990–2016: a systematic analysis for the global burden of disease study 2016," The Lancet Neurology, vol. 17, no. 11, pp. 939–953, 2018.
[2] J. Jankovic, "Parkinson's disease: clinical features and diagnosis," Journal of Neurology, Neurosurgery & Psychiatry, vol. 79, no. 4, pp. 368–376, 2008.
[3] M. A. Hely, W. G. J. Reid, M. A. Adena, G. M. Halliday, and J. G. L. Morris, "The Sydney multicenter study of Parkinson's disease: the inevitability of dementia at 20 years," Movement Disorders, vol. 23, no. 6, pp. 837–844, 2008.
[4] M. Macht, Y. Kaussner, J. C. Möller, K. Stiasny-Kolster, K. M. Eggert, H.-P. Krüger, and H. Ellgring, "Predictors of freezing in Parkinson's disease: a survey of 6,620 patients," Movement Disorders, vol. 22, no. 7, pp. 953–956, 2007.
[5] B. R. Bloem, J. M. Hausdorff, J. E. Visser, and N. Giladi, "Falls and freezing of gait in Parkinson's disease: a review of two interconnected, episodic phenomena," Movement Disorders, vol. 19, no. 8, pp. 871–884, 2004.
[6] S. J. G. Lewis and R. A. Barker, "A pathophysiological model of freezing of gait in Parkinson's disease," Parkinsonism & Related Disorders, vol. 15, no. 5, pp. 333–338, 2009.
[7] J. D. Schaafsma, Y. Balash, T. Gurevich, A. L. Bartels, J. M. Hausdorff, and N. Giladi, "Characterization of freezing of gait subtypes and the response of each to levodopa in Parkinson's disease," European Journal of Neurology, vol. 10, no. 4, pp. 391–398, 2003.
[8] S. Donovan, C. Lim, N. Diaz, N. Browner, P. Rose, L. R. Sudarsky, D. Tarsy, S. Fahn, and D. K. Simon, "Laserlight cues for gait freezing in Parkinson's disease: an open-label study," Parkinsonism & Related Disorders, vol. 17, no. 4, pp. 240–245, 2011.
[9] T. R. Morris, C. Cho, V. Dilda, J. M. Shine, S. L. Naismith, S. J. Lewis, and S. T. Moore, "Clinical assessment of freezing of gait in Parkinson's disease from computer-generated animation," Gait & Posture, vol. 38, no. 2, pp. 326–329, 2013.
[10] T. Khan, J. Westin, and M. Dougherty, "Motion cue analysis for Parkinsonian gait recognition," The Open Biomedical Engineering Journal, vol. 7, p. 1, 2013.
[11] M. Nieto-Hidalgo, F. J. Ferrández-Pastor, R. J. Valdivieso-Sarabia, J. Mora-Pascual, and J. M. García-Chamizo, "A vision based proposal for classification of normal and abnormal gait using RGB camera," Journal of Biomedical Informatics, vol. 63, pp. 82–89, 2016.
[12] M. Nieto-Hidalgo, F. J. Ferrández-Pastor, R. J. Valdivieso-Sarabia, J. Mora-Pascual, and J. M. García-Chamizo, "Vision based gait analysis for frontal view gait sequences using RGB camera," in International Conference on Ubiquitous Computing and Ambient Intelligence, pp. 26–37, Springer, 2016.
[13] S. Sedai, M. Bennamoun, D. Q. Huynh, A. El-Sallam, S. Foo, J. Alderson, and C. Lind, "3D human pose tracking using Gaussian process regression and particle filter applied to gait analysis of Parkinson's disease patients," in 2013 IEEE 8th Conference on Industrial Electronics and Applications (ICIEA), pp. 1636–1642, June 2013.
[14] S. Xie, C. Sun, J. Huang, Z. Tu, and K. Murphy, "Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification," in European Conference on Computer Vision, pp. 318–335, Springer, 2018.
[15] G. Chéron, I. Laptev, and C. Schmid, "P-CNN: Pose-based CNN features for action recognition," in IEEE International Conference on Computer Vision, pp. 3218–3226, 2015.
[16] X. Wang and A. Gupta, "Videos as space-time region graphs," in European Conference on Computer Vision, pp. 399–417, 2018.
[17] K. Hu, Z. Wang, K. Ehgoetz Martens, and S. Lewis, "Vision-based freezing of gait detection with anatomic patch based representation," in Asian Conference on Computer Vision, pp. 564–576, Springer, 2018.
[18] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[19] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, "Learning phrase representations using RNN encoder-decoder for statistical machine translation," arXiv preprint arXiv:1406.1078, 2014.
[20] A. Jain, A. R. Zamir, S. Savarese, and A. Saxena, "Structural-RNN: Deep learning on spatio-temporal graphs," in IEEE Conference on Computer Vision and Pattern Recognition, pp. 5308–5317, 2016.
[21] S. Yan, Y. Xiong, and D. Lin, "Spatial temporal graph convolutional networks for skeleton-based action recognition," in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[22] Y. Du, Y. Fu, and L. Wang, "Representation learning of temporal dynamics for skeleton-based action recognition," IEEE Transactions on Image Processing, vol. 25, no. 7, pp. 3010–3022, 2016.
[23] J. Liu, G. Wang, L.-Y. Duan, K. Abdiyeva, and A. C. Kot, "Skeleton-based human action recognition with global context-aware attention LSTM networks," IEEE Transactions on Image Processing, vol. 27, no. 4, pp. 1586–1599, 2018.
[24] J. Liu, A. Shahroudy, D. Xu, A. C. Kot, and G. Wang, "Skeleton-based action recognition using spatio-temporal LSTM network with trust gates," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 12, pp. 3007–3021, 2018.
[25] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, "Large-scale video classification with convolutional neural networks," in IEEE Conference on Computer Vision and Pattern Recognition, pp. 1725–1732, 2014.
[26] K. Simonyan and A. Zisserman, "Two-stream convolutional networks for action recognition in videos," in Advances in Neural Information Processing Systems, pp. 568–576, 2014.
[27] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, "Learning spatiotemporal features with 3D convolutional networks," in IEEE International Conference on Computer Vision, pp. 4489–4497, IEEE, 2015.
[28] Z. Qiu, T. Yao, and T. Mei, "Learning spatio-temporal representation with pseudo-3D residual networks," in IEEE International Conference on Computer Vision, pp. 5534–5542, IEEE, 2017.
[29] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell, "Long-term recurrent convolutional networks for visual recognition and description," in IEEE Conference on Computer Vision and Pattern Recognition, pp. 2625–2634, 2015.
[30] X. Shi, Z. Chen, H. Wang, D.-Y. Yeung, W.-k. Wong, and W.-c. Woo, "Convolutional LSTM network: A machine learning approach for precipitation nowcasting," in Advances in Neural Information Processing Systems, pp. 802–810, 2015.
[31] C. Feichtenhofer, A. Pinz, and A. Zisserman, "Convolutional two-stream network fusion for video action recognition," in IEEE Conference on Computer Vision and Pattern Recognition, pp. 1933–1941, 2016.
[32] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini, "The graph neural network model," IEEE Transactions on Neural Networks, vol. 20, no. 1, pp. 61–80, 2009.
[33] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun, "Spectral networks and locally connected networks on graphs," arXiv preprint arXiv:1312.6203, 2013.
[34] D. K. Duvenaud, D. Maclaurin, J. Iparraguirre, R. Bombarell, T. Hirzel, A. Aspuru-Guzik, and R. P. Adams, "Convolutional networks on graphs for learning molecular fingerprints," in Advances in Neural Information Processing Systems, pp. 2224–2232, 2015.
[35] M. Henaff, J. Bruna, and Y. LeCun, "Deep convolutional networks on graph-structured data," arXiv preprint arXiv:1506.05163, 2015.
[36] T. N. Kipf and M. Welling, "Semi-supervised classification with graph convolutional networks," arXiv preprint arXiv:1609.02907, 2016.
[37] L. Zhao, X. Peng, Y. Tian, M. Kapadia, and D. N. Metaxas, "Semantic graph convolutional networks for 3D human pose regression," in IEEE Conference on Computer Vision and Pattern Recognition, pp. 3425–3435, 2019.
[38] M. Li, S. Chen, X. Chen, Y. Zhang, Y. Wang, and Q. Tian, "Actional-structural graph convolutional networks for skeleton-based action recognition," in IEEE Conference on Computer Vision and Pattern Recognition, pp. 3595–3603, 2019.
[39] X. Liu, W. Liu, M. Zhang, J. Chen, L. Gao, C. Yan, and T. Mei, "Social relation recognition from videos via multi-scale spatial-temporal reasoning," in IEEE Conference on Computer Vision and Pattern Recognition, pp. 3566–3574, 2019.
[40] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, "Empirical evaluation of gated recurrent neural networks on sequence modeling," arXiv preprint arXiv:1412.3555, 2014.
[41] Y. Li, D. Tarlow, M. Brockschmidt, and R. Zemel, "Gated graph sequence neural networks," arXiv preprint arXiv:1511.05493, 2015.
[42] Y. Seo, M. Defferrard, P. Vandergheynst, and X. Bresson, "Structured sequence modeling with graph convolutional recurrent networks," in International Conference on Neural Information Processing, pp. 362–373, Springer, 2018.
[43] C. Si, W. Chen, W. Wang, L. Wang, and T. Tan, "An attention enhanced graph convolutional LSTM network for skeleton-based action recognition," in IEEE Conference on Computer Vision and Pattern Recognition, June 2019.
[44] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh, "Realtime multi-person 2D pose estimation using part affinity fields," in IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[45] S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh, "Convolutional pose machines," in IEEE Conference on Computer Vision and Pattern Recognition, pp. 4724–4732, 2016.
[46] A. Shumway-Cook, S. Brauer, and M. Woollacott, "Predicting the probability for falls in community-dwelling older adults using the timed up & go test," Physical Therapy, vol. 80, no. 9, pp. 896–903, 2000.
[47] M. M. Hoehn and M. D. Yahr, "Parkinsonism: onset, progression, and mortality," Neurology, vol. 17, no. 5, pp. 427–427, 1967.
[48] E. R. DeLong, D. M. DeLong, and D. L. Clarke-Pearson, "Comparing the areas under two or more correlated receiver operating characteristic curves: A nonparametric approach," Biometrics, vol. 44, no. 3, pp. 837–845, 1988.
[49] X. Robin, N. Turck, A. Hainard, N. Tiberti, F. Lisacek, J.-C. Sanchez, and M. Müller, "pROC: an open-source package for R and S+ to analyze and compare ROC curves," BMC Bioinformatics, vol. 12, no. 1, p. 77, 2011.
[50] Y. Peng, X. He, and J. Zhao, "Object-part attention model for fine-grained image classification," IEEE Transactions on Image Processing, vol. 27, no. 3, pp. 1487–1500, 2018.
[51] W. J. Youden, "Index for rating diagnostic tests," Cancer, vol. 3, no. 1, pp. 32–35, 1950.