0% found this document useful (0 votes)
15 views13 pages

Graph Sequence RNN For Vision-Based Freezing of Gait Detection

Uploaded by

allenwu0731
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
15 views13 pages

Graph Sequence RNN For Vision-Based Freezing of Gait Detection

Uploaded by

allenwu0731
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 13

This article has been accepted for publication in a future issue of this journal, but has not been

fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TIP.2019.2946469, IEEE
Transactions on Image Processing
1

Graph Sequence Recurrent Neural Network for


Vision-based Freezing of Gait Detection
Kun Hu, Zhiyong Wang, Member, IEEE, Wei Wang, Member, IEEE, Kaylena A. Ehgoetz Martens,
Liang Wang, Fellow, IEEE, Tieniu Tan, Fellow, IEEE, Simon J. G. Lewis, and David Dagan Feng, Fellow, IEEE

Abstract—Freezing of gait (FoG) is one of the most common of the life [7]. Early detection and quantification of FoG
symptoms of Parkinson’s disease (PD), a neurodegenerative events are of great importance in clinical practice and could
disorder which impacts millions of people around the world. be used for the evaluation of treatment efficacy for FoG [8].
Accurate assessment of FoG is critical for the management
of PD and to evaluate the efficacy of treatments. Currently, However, current FoG annotations heavily rely on subjective
the assessment of FoG requires well-trained experts to perform scoring by well-trained experts, which is extremely time-
time-consuming annotations via vision-based observations. Thus, consuming. Therefore, computer-aided intelligent solutions are
automatic FoG detection algorithms are needed. In this study, we needed to establish objective and timely FoG detection and
formulate vision-based FoG detection, as a fine-grained graph quantification.
sequence modelling task, by representing the anatomic joints
in each temporal segment with a directed graph, since FoG
events can be observed through the motion patterns of joints. A
novel deep learning method is proposed, namely graph sequence
recurrent neural network (GS-RNN), to characterize the FoG
patterns by devising graph recurrent cells, which take graph
sequences of dynamic structures as inputs. For the cases of which (a) Sample frames of a gait video segment
prior edge annotations are not available, a data-driven based
adjacency estimation method is further proposed. To the best of
our knowledge, this is one of the first studies on vision-based
FoG detection using deep neural networks designed for graph
sequences of dynamic structures. Experimental results on more (b) Graph representation of each video frame
than 150 videos collected from 45 patients demonstrated promis-
ing performance of the proposed GS-RNN for FoG detection with Fig. 1: Illustration of a graph sequence (b) produced by a
an AUC value of 0.90. gait video segment (a). The graph vertices are associated with
Index Terms—Parkinson’s disease, freezing of gait detection, the human anatomical joints which are obtained from a human
deep learning, recurrent neural network, graph sequence pose estimation algorithm, and the edges among these vertices
are further identified by the proposed method.
I. I NTRODUCTION Since observing PD subjects has been the gold standard of
Parkinson’s disease (PD) is a neurodegenerative disorder, identifying when FoG events happen in clinical assessments
characterized by motor symptoms as a result of dopaminergic [9], we can formulate FoG detection as a task which classifies
loss in the substantia nigra [1], [2]. Freezing of gait (FoG) each short segment of a long assessment video into two
is a debilitating symptom of PD, presenting as a sudden and classes: FoG and non-FoG. To this end, vision-based FoG
brief episode where patients feet get stuck to the floor, and detection methods have been rarely studied, although a few
a cessation of movement results despite the intention to keep vision-based Parkinsonian gait analysis methods have been
walking [3], [4]. As the disease progresses, FoG becomes more proposed [10]–[13]. These PD gait analysis methods were
frequent and severe, posing a major risk for falls [5], [6] and mainly devised to characterize Parkinsonian gaits at a coarse
eventually affecting the mobility, independence and quality level (e.g., categorizing a given gait video as normal or abnor-
mal) and are not intended for accurately reporting individual
K. Hu, Z. Wang, and D. D. Feng are with the School of Com- FoG events in a video. In addition, following a traditional
puter Science, The University of Sydney, NSW 2006, Australia (e-
mail: [email protected]; [email protected]; da- machine learning pipeline, these methods rely on extracting
[email protected]). hand-crafted features by assuming that a video contains only
W. Wang, L. Wang, and T. Tan are with the Center for Research on a patient walking independently. However, in clinical settings,
Intelligent Perception and Computing (CRIPAC), National Laboratory of
Pattern Recognition (NLPR), Institute of Automation Chinese Academy supporting staff are often involved to ensure the safety of PD
of Sciences (CASIA) Beijing 100190, China, and University of Chinese subjects. As a result, multiple persons can appear in recorded
Academy of Sciences (UCAS). L. Wang and T. Tan are also with the Center videos, which may violate the assumptions of those methods
for Excellence in Brain Science and Intelligence Technology (CEBSIT),
Institute of Automation Chinese Academy of Sciences (CASIA) (e-mail: where only a patient appears.
[email protected];[email protected]; [email protected]). Recent years have witnessed the ground-breaking success of
K. A. Ehgoetz Martens and S. J. G. Lewis are with the Parkinsons Disease deep learning techniques for many vision tasks, such as object
Research Clinic, Brain and Mind Centre, The University of Sydney, Syd-
ney, NSW 2050, Australia (e-mail: [email protected]; recognition, video classification and human action recognition.
[email protected]). These techniques provide a unique opportunity to develop

1057-7149 (c) 2019 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TIP.2019.2946469, IEEE
Transactions on Image Processing
2

Fig. 2: Illustration of the proposed GS-RNN architecture for modelling gait videos with graph sequences. The adjacency
estimation layer estimates the edge weights by utilizing the bilinear transform, the graph RNN layer is designed to track
and propagate temporal graph patterns, the vertex-wise RNN layer helps to reduce model complexity by taking fewer vertex
relations into account, and the graph pooling layer generates graph-level predictions referring to the vertices with top likelihood
which contribute to the FoG patterns.

deep learning based FoG detection methods to address the the long short term memory (LSTM) and the gated recurrent
limitations of the existing PD gait analysis methods. Although units network (GRU) have been widely used to model sequen-
many methods [14] have been proposed for generic video tial vector inputs with promising results [18], [19]. Although
classification problems which involve significant variation several studies have been proposed to address sequential graph
between different classes (e.g., kicking and jumping) and each inputs [20], [21], it is not trivial to apply them to general
video frame is generally represented as a whole unit, they graph sequences especially when the structures are dynamic
may neglect the subtle dynamics of FoG events, considering (i.e. vertices and edges can change over time). In this study,
the variation among different subjects could be higher than we propose a novel RNN architecture, namely graph sequence
that between FoG and non-FoG events. Several recent studies RNN (GS-RNN), to deal with general sequential graphs of
(e.g. Pose-CNN [15], [16] and our recent one [17]) have dynamic structures. In particular, to leverage the success of
been conducted to model the subtle variation using region or gated mechanisms, which alleviates the gradient vanishing
patch based representations. However, the relationship among and exploding issues of the original RNN, GS-LSTM and
patches and the entire temporal sequence have not been GS-GRU are implemented. Computational operators, gated
adequately explored. mechanisms and memory states of GS-RNN cells are devised
Therefore, for the first time, the current study aims to formu- to track sequential graph patterns while being compatible with
late FoG detection as a fine-grained graph sequence modelling dynamic graph structures. Experimental results demonstrate
task by representing each temporal video segment collected the effectiveness of the proposed GS-RNN architecture for the
from a clinical assessment with a graph. As illustrated in FoG detection task and the benefits of utilizing graph sequence
Fig. 1, a number of consecutive temporal segments of an representation. Moreover, graph sequence representations pro-
assessment video are organized in sequential order: for each vide additional localization hints for clinical assessments.
segment, the anatomical joints are extracted and characterized In summary, the major contributions of this paper are three-
as vertices of a directed graph, which is in line with the clinical fold:
practice where the joints of the knees and feet are particularly • We formulate FoG detection as a fine-grained graph
attended to. As a result, a graph sequence is obtained to sequence modelling task, which is one of the first studies
represent this input video. Note that the spatial structures to implement vision-based FoG detection. Instead of
of the graph sequence are dynamic since the detected joints characterizing each video temporal segment as a whole
(vertices) vary over time (i.e. the locations of the subject and unit or the patches of individual joints, we represent
the supporting clinical staff could change, and the joints could each video with a graph sequence where the vertices are
be occluded from the view in recording procedures). associated with the anatomical joints, which enables fine-
Traditionally, recurrent neural networks (RNNs) including grained characterization of the dynamic patterns of FoG

1057-7149 (c) 2019 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TIP.2019.2946469, IEEE
Transactions on Image Processing
3

events. However, extracting hand-crafted features usually requires


• A novel recurrent neural network architecture GS-RNN strong assumptions, which may not be feasible in realistic
is proposed to learn from graph sequences of dynamic scenarios. For example, these methods often assume that only
structures. More specifically, GS-LSTM and GS-GRU are a patient appears in a video and can walk independently. This
implemented to leverage the success of gated mecha- ignores the fact that the patient may have mobility difficulties
nisms. and require external support to prevent possible falls. To
• A large video dataset was created during the clinical as- address these limitations, deep learning techniques provide a
sessments of 45 PD subjects to evaluate the effectiveness great opportunity for developing real-world applicable FoG
of our proposed methods. detection methods built on the ground-breaking success of
The paper is organized as follows. Section II reviews the many visual understanding tasks.
related works including Parkinsonian gait analysis and various
deep learning techniques. Section III introduces the details of B. Deep Learning based Video Classification
our proposed methods. Section IV presents comprehensive ex- Deep learning techniques have been widely utilized for
perimental results to evaluate the effectiveness of our proposed video classification due to their great success in many visual
GS-RNN for FoG detection. Lastly, Section V concludes our understanding tasks. At first, single stream [25] and two-
study with discussions for future work. stream methods [26] were proposed. The single-stream based
method applies pre-trained 2D convolution filters frame by
II. R ELATED W ORK frame and different temporal fusion strategies are investigated.
In this section, related studies are reviewed from three The two-stream based method takes the advantage of the
aspects: vision-based Parkinsonian gait analysis methods, deep appearance and optical flow features obtained by 2D convolu-
learning based video classification methods, and neural net- tions to form spatial and temporal representations. Based on
works for graph data. Note traditional hand-crafted feature these pioneering studies, three major types of deep learning
based recognition methods are omitted, as deep learning based based methods are currently utilized to recognize human
methods have achieved the state-of-the-art recognition per- actions in video: convolution neural network (CNN), recurrent
formance. Skeleton based human action recognition methods neural network (RNN) and two-stream based methods. The
(e.g., [22]–[24]) are also omitted, as they generally rely on first type in general extends the 2D CNN architecture to its
accurate pose information and cannot be directly applied 3D counterpart by which the convolution filters are extended
to our FoG detection task where pose information may be to filter 3-dimensional video data, such as C3D [27], P3D [28]
incomplete. and I3D [14]. By considering an input video as a 2D image
sequence, the second type aims to model the temporal struc-
A. Vision-based Parkinsonian Gait Analysis ture with recurrent neural networks such as long short term
memory (LSTM) or gated recurrent units (GRU) [29], [30].
Several vision-based Parkinsonian gait analysis methods
The last type which is based on the pioneering two-stream
have been proposed [10]–[12]. At first the sagittal view was
approach represents video content with both appearance and
applied to record human gait by placing a camera laterally to
motion features [31].
human subjects. In [10], the stride cycle and the posture lean
However, the intra-class variation could be higher than the
related features were introduced to characterize gait patterns.
inter-class variation for FoG detection, while these above-
The motion cue matching was be computed by the cosine
mentioned methods mainly address generic human action
similarity between normal and abnormal gait and the matching
recognition problems which involve significant inter-class vari-
percentage was used to predict the label for an entire walking
ation. Therefore, novel fine-grained recognition methods are
behaviour. However, temporal localization is not available to
needed to take the characteristics of FoG videos in clinical
accurately detect abnormal patterns within a video. In [11],
assessments into consideration. Several recent studies (e.g.
gait patterns were characterized by various motion features
Pose-CNN [15], [16] and [17]) model the subtle variation with
including stride length, leg angle, and average cycle time. To
region or patch based representations. However, the relations
identify abnormal gait patterns, traditional binary classifiers
among patches and an entire temporal sequence have not been
were utilized. Besides the sagittal view, frontal view videos
adequately explored.
have also been explored due to the convenience of the space
saving set-up, where a subject is required to walk towards and
away from a recording camera [12]. The set-up is similar to C. Neural Networks for Graph Data
clinical assessments and can avoid the issue that one leg is Graph neural network (GNN) [32] was proposed as one of
occluded by the other. the first graph-based neural networks to process the data con-
Note that these methods were not devised specifically for taining graph structures. Recently, graph convolution neural
identifying FoG events and only perform crude gait analysis network (GCNN) has been proposed to exploit the structure
of PD patients. In addition, the traditional pattern recognition context of input data as an extension of the convolution
pipeline is followed in these studies, whereby hand-crafted neural network [33]. It helps solve many challenging problems
features are extracted and fed into a machine learning model including material design which involves molecular struc-
such as support vector machine (SVM), generalized linear tures [34], [35], social network analysis [36], pose-based
model or ensemble learning methods to obtain predictions. applications [37], [38], video analysis [16], [39], and sheds

1057-7149 (c) 2019 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TIP.2019.2946469, IEEE
Transactions on Image Processing
4

lights on FoG detection by analyzing temporal structure data. extended to the remaining frames of Vt . Hence, the pixels
The key advantages of GCNNs are implemented by graph within the i-th window can be extracted as an anatomic joint
convolution layers which address graph inputs of varying proposal vti to characterize the local patterns around the i-th
structures. By stacking multiple graph convolution layers, it joint.
is feasible to construct deep neural networks. Nonetheless,
By treating vti as the i-th vertex of a graph Gt , and thus
GCNNs are designed for independent graph inputs and not
Vt = {vti } is the set of vertices of Gt . In addition, Et denotes
available to formulate sequential or temporal patterns from
a set of ordered vertex pairs (i.e., edges or arrows) to represent
graph sequences.
the relations between any two vertices. Hence, a directed graph
Generally, RNNs have been widely used to model sequential
Gt = (Vt , Et ) is derived to represent the video segment Vt .
data. In particular, the LSTM and GRU methods were pro-
To characterize the edges in Et , a weighted adjacency matrix
posed to address the gradient vanishing and exploding issues
At = (aij t ) ∈ R
n×n
is introduced. Note that At is possible
of the original RNNs by introducing gated mechanisms [18],
to be asymmetrical as the interactions among joints can be
[19]. GRU involves less computation compared with LSTM
of either action or reaction. To further characterize the graph
while keeping similar performance and improving the effi-
Gt , let Xt = (x1t , x2t , ..., xnt )T , where xit ∈ Rd denotes the
ciency of the original RNNs. Moreover, GRU has shown better
vertex feature vector computed using vti via pre-trained neural
classification performance on smaller datasets [40]. However,
networks to represent joint appearance and motion; let yti
these RNNs are designed for general sequential inputs of
be a binary response to indicate whether FoG occurs within
which the input vectors are of a fixed length, whilst graph
the i-th anatomic joint proposal at the temporal index t or
sequences usually describe more complex structures over time
not (i.e., 1 for FoG and 0 for non-FoG). In particular, denote
and it is more challenging to learn. Although structural graph
yt = maxi yti as the graph-level response. Note that at least
RNNs [20], [21], [41]–[43] have been proposed to take graph
one joint contributes to an FoG event if a graph-level response
sequences of fixed graph structures as the inputs, dynamic
is annotated as FoG.
graph sequences, which are more general for a wide range
of applications, have not been fully explored in the past.
Therefore, advanced RNNs are needed to represent and model
these complex patterns conveyed through the graph sequences
of dynamic graph structures.

III. P ROPOSED M ETHOD


The major components of our novel GS-RNN architecture
are illustrated in Fig. 2, including the adjacency estimation
layer, the graph RNN layer, the vertex-wise RNN layer and the
graph pooling layer: the adjacency estimation layer aims to es-
timate the weights of the edges of each graph; the graph RNN
layer is designed to track and propagate the graph patterns of Fig. 3: Illustration of the construction of Gt from the input
the input graph sequence; the vertex-wise RNN layer helps temporal video segment Vt . Joints are detected by applying
reduce model complexity by involving less vertex relations; convolution pose machines to the middle frame of Vt and
the graph pooling layer generates graph-level predictions, proposals are extracted by the bounding windows centered
which refers to the vertices that have the highest likelihood at their associated joints. By treating the proposals {vti } as
of contributing to FoG patterns. Similar to general RNNs, the vertices of a graph and computing the pre-trained feature
deep representation can be achieved by stacking multiple graph of each proposal vti as xit , the adjacency matrix A can be
RNN layers. Furthermore, the architecture can be extended as estimated edge-wisely.
bi-directional GS-RNNs to take both the past and the future
graphs for modelling.
According to the above discussions, an FoG assessment
A. Anatomic Joint Graph Sequence video V can be represented as a dynamic graph sequence
As FoG events can be observed from anatomic regions, {Gt }. In terms of the dynamic characteristics, there are two
anatomic joint proposals are extracted to construct a graph as unique attributes. Firstly, the joints can be occluded during
illustrated in Fig. 3 from each temporal segment of an input a trial, and it is not ensured that all the joints could be
video by adopting the convolution pose machines [44], [45]. In accurately tracked across all temporal indices. Secondly, as a
particular, a clinical assessment video is treated as a sequence computer-aided joint estimation method, some joints could be
V = {Vt } of which the element Vt is a temporal segment incorrectly identified by the convolution pose machine. As a
of a fixed duration and t indicates its temporal index. For result, the vertices and the edges of Gt could change along the
each Vt , convolution pose machines take the middle frame to temporal indices. Therefore, advanced sequential modelling
compute anatomic joint locations. Hence, square windows can methods are required to process dynamic graph sequences by
be identified with their centres located at each joint position effectively characterizing the dynamic structural patterns for
of Vt . These windows obtained from the middle frame are FoG detection.

1057-7149 (c) 2019 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TIP.2019.2946469, IEEE
Transactions on Image Processing
5

B. Adjacency Matrix Estimation the i-th vertex can be written as xit = (xi1 i2 id
t , xt , ..., xt ) and
i i1 i2 id
When the prior knowledge of a dataset is not available, an x̃t = (x̃t , x̃t , ..., x̃t ) which correspond to the i-th row of
adjacency estimation layer is adopted in a GS-RNN to learn Xt and ˜Xt , respectively.
a weighted adjacency matrix At of which the element aij
X
t x̃ij
t = aik kj
t xt . (9)
represents the relationship between the joint proposals vti and k
vtj . This layer introduces a bilinear transformation to obtain In detail, Eq. (9) implies that the j-th exchanged feature of
edge weight estimation aij t , which explores the vertex features vertex i in x̃t acquires information from the j-th original
xit and xjt . In addition, a bilinear operation is able to address features of its direct predecessors (including itself) xkj
t with
the inconsistent dimensions of the two inputs, thus xit and the weight aik t of the corresponding edges. Exchanging infor-
xjt can be the features of different modalities. By adding mation between the vertices helps capture useful patterns such
superscripts to xit and xjt to denote different feature modalities, as simultaneously occurring abnormalities as references to
i(c)
xt ∈ Rp is the feature vector extracted from the pre-trained enhance representation learning. Next, linear transforms Wxi
j(r)
C3D and xt ∈ Rq is extracted from the pre-trained ResNet- and Whi are applied to the exchanged features of the vertices
50. In particular, aij
t is estimated as: and the hidden state from the last step which is broadcast to
aij
i(c) j(r) each vertex.
t = g(xt M xt ), (1)
The forget gate ft ∈ R|Vt |×p controls the extent to which the
where M ∈ R p×q
is a matrix of parameters which can be existing patterns should be kept, the output gate ot ∈ R|Vt |×p
updated during the backward propagation, is an element- controls the extent to which the cell state is involved into
wise vector multiplication operator, and g(·) represents a computing the output, and the candidate cell state ċt ∈ R|Vt |×p
function of Rq → R which is chosen as a linear function with are computed in the similar manner. Note that the graph-level
trainable weights and bias in this study. Note that the estimated cell state c̈t ∈ R|Vt |×p and the graph-level hidden state ḣt ∈
adjacency matrix At is asymmetrical, i.e., aij ji
t 6= at , due to the R|Vt |×p are obtained vertex-wisely. The graph-level states are
asymmetry of M and the different feature modalities of vertex matrices of which each row contains the state of a particular
vit and vjt . Such asymmetry is helpful to represent different vertex. Given the potential for a temporally changing graph
interactions (i.e., action and reaction) between vertices. Note structure, it is not always feasible to find the corresponding
that the computation of At is differentiable and At can be vertices of the next graph. Thus maximum pooling operators
learned through both forward and backward propagations. are introduced for c̈t and ḣt on vertices and the pooling states
are further broadcast to the vertices of the next graph. Hence,
C. Graph RNN Cell the graph LSTM cell is able to keep track of the dependencies
among the graphs of the dynamic structures in the sequence.
Graph RNN cells in this study introduce gate mechanisms
Similar strategies are also applied to the computation of
similar to LSTM and GRU. Based on these two types of
graph GRU cells as shown in Fig. 4 (c). Graph GRU cells
mechanisms, we introduce the graph LSTM cell and the graph
involve less computation than graph LSTM cells, which are
GRU cell as shown in Fig. 4. The computations of the Graph
expected to be more efficient for training and prediction.
LSTM cell shown in Fig. 4 (a) are as follows:
zt := σ(Wxz AXt + Whz ht−1 + 1t bz ), (10)
it := σ(Wxi AXt + Whi ht−1 + 1t bi ), (2)
rt := σ(Wxr AXt + Whr ht−1 + 1t br ), (11)
ft := σ(Wxf AXt + Whf ht−1 + 1t bf ), (3)
ḣt := tanh(Wxh AXt + rt Whh ht−1 + 1t bh ), (12)
ot := σ(Wxo AXt + Who ht−1 + 1t bo ), (4)
ḧt := (1 − zt )ḣt + zt ht−1 , (13)
c˙t := tanh(Wxc AXt + Whc ht−1 + 1t bc ), (5)
ht := 1t+1 max σ(ḧt ). (14)
v∈V
c̈t := ft ct−1 + it ċt , (6)
ḣt := ot tanh(c̈t ), (7) D. Vertex-wise RNN Cell
In general, stacking multiple Graph RNN layers in a GS-
ct := 1t+1 max(c̈t ), ht := 1t+1 max(ḣt ). (8) RNN involves expensive non-linear computation, which sig-
v∈V v∈V
nificantly increases the model complexity. Therefore, it tends
Similar to general LSTM cells, Eq. (2) computes the input
to result in over-fitting issues. To alleviate such issues and to
gate it ∈ R|Vt |×p controlling the extent to which the new
build deep GS-RNNs, vertex-wise RNN cells are proposed,
patterns are introduced to the cell, where p is the hidden size.
which apply shared linear transformations on each vertex
All the vertices and the hidden state from the last temporal
separately. As a result, no pattern exchange is conducted in
step are involved, where σ is an vertex-wise sigmoid function
the vertex-wise RNN cells as in graph RNN cells. Vertex-wise
and 1t is an all-one |Vt | dimensional vector for broadcast
RNN cells are implemented as vertex-wise LSTM cells and
purpose. First, vertex features xt is multiplied with the adja-
vertex-wise GRU cells. The computation of the Vertex-wise
cency matrix A, which can be interpreted as the information
LSTM cell shown in Fig. 4 (b) are formulated from Eq. (15)
exchange among the vertices in line with the edge weights.
to Eq. (21).
Let X̃t = At Xt denote the exchanged vertex features of the
input graph. The original feature and the exchanged feature of it := σ(Wxi Xt + Whi ht−1 + 1t bi ), (15)

1057-7149 (c) 2019 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TIP.2019.2946469, IEEE
Transactions on Image Processing
6

(a) Graph LSTM Cell (c) Graph GRU Cell

(b) Vertex-wise LSTM Cell (d) Vertex-wise GRU Cell


Fig. 4: Illustration of the gate mechanism of the proposed Graph RNN cells based on LSTM and GRU for graph sequence of
dynamic structures. Two kinds of cells are designed: the graph RNN cell is applied to represent the vertex patterns with their
adjacency relations; the vertex-wise RNN cell is devised to reduce the model complexity, which helps build deep GS-RNNs.

ft := σ(Wxf Xt + Whf ht−1 + 1t bf ), (16) E. Graph Pooling


The state ḣt of the graph RNN cell and the Vertex-
ot := σ(Wxo Xt + Who ht−1 + 1t bo ), (17)
wise RNN cell can be viewed as hidden vertex features of
the corresponding layer. Vertex-wise fully connected layers
c˙t := tanh(Wxc Xt + Whc ht−1 + 1t bc ), (18)
and activation functions can be applied to compute the FoG
probability of each vertex, which indicates whether FoG
c̈t := ft ct−1 + it ċt , (19)
happens in regard to the vertex or not. In general, for a
graph Gt = (Vt , Et ), let ŷ ∈ R|Vp | denote the outputs of
ḣt := ot tanh(c̈t ), (20) the last fully connected layer, in which the i-th component
ŷi of ŷ is the estimation of the response of the i-th vertex
ct := 1t+1 max(c̈t ), ht := 1t+1 max(ḣt ). (21) yi and |Vp | is the number of vertices. Since the vertex-level
v∈V v∈V
annotation is not available, i.e. no prior knowledge of yti , a
Note that the vertex-wise LSTM cell can be applied to a graph pooling strategy is applied to produce the response yt of an
with an arbitrary number of vertices. Indeed, the proposed entire temporal segment. Note that the pooling operation also
vertex-wise dense layer can be viewed as a special case of the eliminates the impact of the inconsistent structures of the entire
graph RNN cell with At = I. graph sequence. The basic assumption of graph pooling is that
Similarly, the vertex-wise GRU cell shown in Fig. 4 (d) can at least one vertex contributes to the FoG event when FoG is
be formulated from Eq. (22) to Eq. (26). annotated on the entire graph (the video segment). Hence, the
maximum elements of ŷt can be viewed as an estimation of
zt := σ(Wxz Xt + Whz ht−1 + 1t bz ), (22)
the graph-level response yt , which is estimated as:
rt := σ(Wxr Xt + Whr ht−1 + 1t br ), (23) ŷtg = maxi (ŷti ), (27)
ḣt := tanh(Wxh Xt + rt Whh ht−1 + 1t bh ), (24) where ŷtg represents the response of a graph.
Let fg denote the computation of the forward propagation
ḧt := (1 − zt )ḣt + zt ht−1 , (25) of the proposed GS-RNN. As a binary classification problem,
a cross-entropy loss function is used to optimize the model fg .
ht := 1t+1 max σ(ḧt ). (26) A superscript n is used for the variables mentioned above to
v∈V

1057-7149 (c) 2019 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TIP.2019.2946469, IEEE
Transactions on Image Processing
7

indicate the n-th training sample. Therefore, the loss function cognitive function (evaluated by MMSE), and Hoehn and
of the proposed GS-RNN is defined as: Yahr stages [47], which is a common metric to describe the
symptoms of PD progress. This dataset has the largest number
XX g(n) g(n) of subjects in the literature of vision-based Parkinsonian gait
J =− [y (n) log(ŷt ) + (1 − y (n) )log(1 − ŷt )]. analysis. For example, 11 subjects were involved in the work
n t
(28) of [10] and 30 subjects in [11]. It is also the first one for FoG
detection.
F. Context Fusion TABLE I: Demographics of the FoG video dataset
GS-RNN fg computes the prediction based on the inputŷtg Age, years Years since diagnosis Education, years MMSE
graph sequence. Nonetheless, the global context is absent by
68.49 (7.7) 9.94 (5.8) 13.80 (3.2) 27.94 (1.9)
involving the joint proposals only. To further help accurately
characterize FoG patterns, a context model fc is applied to 1 2%
2 38%
take the entire video segment sequence {Vt } as the input. Hoehn and Yahr stages 3 18%
The prediction ŷtc = ft (Vt ) is derived and further fused 4 33%
with ŷtg . The context model fc is chosen as a pre-trained 5 10%
C3D network which is adopted for each video segment Vt
and an RNN network which is further applied to formulate To evaluate the performance of the proposed method over
temporal relations between the C3D features of these temporal this dataset comprehensively, a 5-fold cross validation was
segments. introduced. The 45 subjects were randomly and evenly par-
Without increasing the model complexity, the graph se- titioned into 5 groups. For each fold, the video segments of 4
quence based model and the context model are trained in- groups were chosen for the training and validation purposes,
dependently. Fusing ŷtg and ŷtc helps to better predict FoG and the video segments of the remaining one group were used
events. Three fusion strategies, including the product fusion, for test purposes. Therefore, the videos of each subject only
the maximum pooling fusion and the linear fusion, are listed appeared in either the training or test partition, which helps to
from Eq. (29) to Eq. (31). statistically estimate the performance for unseen subjects.
A number of metrics were adopted to comprehensively eval-
ŷt = ŷtc ŷtg , (29)
uate the FoG detection performance. Firstly, as the prediction
ŷt = max(ŷtc , ŷtg ). (30) ŷt is continuous in [0, 1], which indicates the FoG probability,
a threshold θ should be identified in line with a specified use
ŷt = γ ŷtc + (1 − γ)ŷtg , γ ∈ (0, 1). (31) case. If ŷt > θ, the corresponding video segment was marked
The outputs of these three fusing functions are in [0, 1], as an FoG event; otherwise the segment was marked as a
representing the probability of an FoG event occurring in Vt . non-FoG event. Next, accuracy (the percentage of the samples
correctly classified over the total sample size), sensitivity (true
positive rate) and specificity (true negative rate) associated
IV. E XPERIMENTAL R ESULTS AND D ISCUSSIONS
with this threshold were computed. By varying the threshold
A. FoG Dataset & Evaluation Metrics and plotting the corresponding sensitivity against 1-specificity,
The dataset in this study consists of videos collected from receiver operating characteristic (ROC) curve, and area under
45 subjects who underwent clinical assessment. During the curve (AUC) were further utilized to evaluate the effectiveness
clinical assessment, each subject completed a Timed Up and of the proposed GS-RNNs.
Go (TUG) test used for functional mobility assessment [46].
The TUG tests were recorded by frontal view videos at 25
B. Implementation Details
frames per second with a frame resolution of 720 × 576. FoG
events within these videos were annotated by well-trained In this study, three types of pre-trained features were applied
experts on a per-frame basis. Note that clinical staff were including Res-Net 50 vertex features, C3D vertex features and
involved in the video recording processes to ensure the safety C3D context features. The details of obtaining these features
of the subjects during the TUG tests. Furthermore, the angle are introduced as follows:
of the camera was set to capture the body parts from chest to • Res-Net 50 vertex feature Setting the bounding window
feet to meet ethical requirements. as 50×50 pixels, the size of each anatomic joint proposal
In summary, 167 videos totalling 25.5 hours in duration was 25 × 50 × 50 × 3, where 25 was the frame rate and
were acquired, where 8.7% of the total hours collected 3 represented the RGB channels. Pre-trained ResNet-50
contained FoG patterns. This indicates a highly imbalanced with a size (2, 2) at last pooling layer was applied frame
dataset. For FoG detection, these videos were divided into by frame. The maximum pooling operation was applied
91,559 one-second long non-overlapped video segments. If along the temporal axis to reduce the model complexity
any frame of a segment was annotated as FoG referring to the and the computational cost. The dimension of Res-Net
ground truth, this segment was labelled as FoG; otherwise it 50 vertex feature vector was 1 × 2048.
was labelled as non-FoG. The demographics of the dataset are • C3D vertex feature The purpose of C3D vertex fea-
listed in Table I including age, year since diagnosis, education, ture differ from the Res-Net 50 feature. The Res-Net

1057-7149 (c) 2019 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TIP.2019.2946469, IEEE
Transactions on Image Processing
8

ROC for Different GS-RNN Architectures ROC for Different Strategies Involving Structural and Sequential Graph Patterns
1 1

0.9 0.9

0.8 0.8

0.7 0.7
True positive rate

True positive rate


0.6 0.6

0.5 0.5

0.4 0.4

0.3 0.3

0.2 0.2
Bi-directional GS-GRU
Bi-directional GS-GRU Bi-directional GS-GRU w/o Adjacency Patterns
0.1 Bi-directional GS-LSTM 0.1 Sequential Prediction w/o Structural Patterns
Forward GS-GRU Graph Prediction w/o Sequential Patterns
Forward GS-LSTM Prediction w/o Structural and Sequential Patterns
0 0
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
False positive rate False positive rate

(a) (b)
Fig. 5: ROC curves of GS-RNNs. (a) Comparison among different GS-RNN architectures. (b) Comparison among different
strategies involving structural and sequential patterns.

50 feature was only applied in the adjacency matrix the AUC of these curves. The bi-directional GS-GRU achieved
estimation whilst the C3D vertex feature was used for the highest AUC value 0.884 compared with the other GS-
modelling motion patterns as well. A higher precision of RNN architectures. As expected, the bi-directional GS-GRU
C3D feature was expected, and the proposal size was set outperformed the forward GS-GRU. For GS-LSTM, the gated
to 100×100. The dimension of C3D vertex feature vector mechanisms are much more complex than GS-GRU, thus the
was 1 × 8192. model complexity of the bi-directional GS-LSTM increases,
• C3D context feature The video frames were first re- which negatively impacted the model performance.
sized with fixed aspect ratio and cropped into temporal
clips with size of 25 × 224 × 224 × 3. These temporal TABLE II: Comparison of different GS-RNN architectures for
clips were then fed to a pre-trained C3D network to FoG detection
compute spatial-temporal features. The dimension of the GS-RNN Architectures AUC
C3D context feature was 1 × 32768.
Bi-directional GS-GRU 0.884
Given the imbalanced dataset, all the positive samples were Bi-directional GS-LSTM 0.869
used, with an equivalent number of negative samples randomly Forward GS-GRU 0.875
Forward GS-LSTM 0.883
selected from the training set in each epoch for model training.
The initial learning rate was set to 0.001, utilizing the stochas-
tic gradient decent optimizer. With an Nvidia GTX 1080Ti TABLE III: Comparison of different strategies to represent
GPU card, training of the bi-directional GS-GRU for each structural and sequential graph patterns
epoch was completed in 50 minutes (containing videos from
40 subjects); for the testing, each one-second video segment Methods AUC
was computed within 0.5 sec. Bi-directional GS-GRU 0.884
Bi-directional GS-GRU w/o Adjacency Patterns 0.879
C. FoG Detection Performance of GS-RNNs Sequential Prediction w/o Structural Patterns (C3D-GRU) 0.878
Graph Prediction w/o Sequential Patterns (GCNN) 0.878
GS-RNNs were evaluated by applying different types of GS- Prediction w/o Structural and Sequential Patterns (C3D) 0.874
RNN cells including GS-LSTM cells or GS-GRU cells, and
the directions were set as forward or bi-directional. Similar to GS-RNNs take both the structural and temporal graph
general RNNs, the forward direction involved only the past patterns into account simultaneously. To evaluate how these
graphs and can be utilized for online predictions. The bi- patterns contributed to FoG detection, we further compared the
directional network utilized both the previous and the future bi-directional GS-GRU with different methods, which either
graphs for accurate characterization and prediction of the partially or do not utilize these patterns. The first one is the
graph FoG patterns. In detail, we implemented the forward bi-directional GS-GRU without the adjacency matrix, which
GS-LSTM, the forward GS-GRU, the bi-directional GS-LSTM only utilizes two vertex-wise GRU layers. The second (C3D-
and the bi-directional GS-GRU. Each model contained an GRU) utilized the entire video segment with C3D by ignoring
adjacency matrix estimation layer, a graph RNN layer, a the anatomic graph as input, representing temporal patterns
vertex-wise layer and a graph pooling layer. Fig. 5 (a) shows with GRU. The third (GCNN) ignored sequential patterns and
the ROC curves of the proposed architectures and Table II lists applied GCNN to process each graph independently. The last

1057-7149 (c) 2019 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TIP.2019.2946469, IEEE
Transactions on Image Processing
9

TABLE IV: DeLong’s test for different strategies to represent structural and sequential graph patterns
Methods AUC Diff. Z p
Sequential Prediction w/o Structural Patterns 0.007 4.04 5.33E-05
Bi-directional GS-GRU w/o Adjacency Patterns 0.005 4.54 5.51E-06
Prediction w/o Structural and Sequential Patterns 0.012 7.23 4.68E-13
Graph Prediction w/o Sequential Patterns 0.015 11.34 < 2.2E − 16

ROC for Different Fusion Strategies ROC for Graph Series and Context Predictions
1 1

0.9 0.9

0.8 0.8

0.7 0.7
True positive rate

True positive rate


0.6 0.6

0.5 0.5

0.4 0.4

0.3 0.3

0.2 0.2
Fusion of 1/2 Graph Series and 1/2 Context
0.1 Fusion of 1/3 Graph Series and 2/3 Context 0.1 Fusion of 1/2 Graph Series and 1/2 Context
Multiplication Fusion Context Only
Maximum Pooling Fusion Graph Series Only
0 0
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
False positive rate False positive rate

(a) (b)
Fig. 6: Comparison of the detection results in terms of ROC Curves. (a) Comparison among different fusion strategies. (b)
Comparison among the best fusion strategy and the cases that graph representation and context model are used independently.

(C3D) generated the prediction on entire temporal segments model were trained independently avoiding an increase in
with C3D independently. Note that the second and the last model complexity. In terms of the fusion strategies, the linear
methods can be viewed as context-level models as they utilized fusion, the product fusion and the maximum pooling fusion
the entire video segment in a straightforward manner. Hence, were investigated. For the linear fusion, γ = 12 and γ = 23
for simplicity, the context model, which is further fused with were selected, which achieved the best image classification
GS-RNN for the complementary purpose, was selected from performance in [50].
these two methods by referring to their performance in this Fig. 6 shows the ROC curves of different fusion methods
study. and Table V lists the AUC values of these ROC curves.
Fig. 5 (b) shows the ROC curves of these methods and Fig. 6 (a) illustrates the curves of different fusion strategies.
Table III lists the AUC values of the ROC curves. The results The linear fusion strategy with γ = 12 achieved the best
indicate that simultaneously characterizing the structural and overall performance 0.900 in terms of AUC. In Fig. 6 (b),
the sequential graph patterns enhanced the performance of the curve of the best fusion strategy was compared with the
FoG detection in terms of AUC. In addition, DeLong’s test cases that the graph sequence model and the context model
was applied to evaluate the statistical significance of the are utilized independently. The fusion predictions improved
improvements [48], [49]. DeLong’s test is a nonparametric the detection performance by taking advantage of the two
approach by using the generalized U -statistics for the ROC independent methods. The improvement in terms of AUC to
curves. Table IV lists the results of the DeLong’s test. The the best fusion method compared with ŷtc and ŷtg are 0.016
p-values of the bi-directional GS-GRU and the rest methods and 0.029, respectively.
indicate that the improvements were statistically significant In particular, the sensitivity, the specificity and the accuracy
(α = 0.001). Therefore, characterizing the structural and the related to a threshold θ̂ were also adopted for the evaluation,
sequential graph patterns of gaits was clearly helpful for FoG which maximized the following Youden’s J statistic [51]:
detection.
θ̂ = arg min sensitivity + specif icity − 1. (32)
θ
D. FoG Detection Performance of Fusion Strategies By maximizing this statistic, a threshold can be derived to treat
To further improve the performance of FoG detection, the the sensitivity and the specificity with equal importance. These
context prediction ŷtc was fused with the GS-RNN prediction evaluation metrics and associated J statistics are also listed in
ŷtg . The context model was chosen as a general GRU network Table V. For the best fusion strategy, the sensitivity, specificity
for sequential predictions, which utilized C3D context features and accuracy values related to this threshold achieved 83.8%,
as the input. Note that the GS-RNN model and the context 82.3% and 82.5%, respectively.

1057-7149 (c) 2019 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TIP.2019.2946469, IEEE
Transactions on Image Processing
10

TABLE V: Comparison of different fusion strategies


False False Likelihood Likelihood
AUC Youden’s J Sensitivity Specificity positive negative ratio ratio Accuracy
rate rate positive negative
Linear fusion with γ = 23 0.898 0.66 86.5% 79.6% 20.4% 13.5% 4.24 0.17 80.2%
Linear fusion with γ = 12 0.900 0.66 83.8% 82.3% 17.7% 16.2% 4.75 0.20 82.5%
Maximum pooling fusion 0.898 0.66 83.7% 82.1% 17.9% 16.3% 4.67 0.20 82.2%
Product fusion 0.897 0.65 84.3% 80.4% 19.6% 15.7% 4.31 0.20 80.8%
Bi-directional GS-GRU 0.884 0.64 84.4% 79.2% 20.8% 15.6% 4.05 0.20 79.6%
Context model 0.878 0.61 77.4% 83.4% 16.6% 22.6% 4.65 0.27 82.8%

E. Comparison with State-of-the-art Methods F. Key Vertex Localization


In order to further understand how the sequential graph
representations contribute to FoG detection, Fig. 7 visualizes
TABLE VI: Comparisons with existing methods the vertex-level responses of the bi-directional GS-GRU and
GCNN methods. Note that the bi-directional GS-GRU utilizes
Method AUC Sens. Spec. Acc. sequential patterns while GCNN treats input graphs individu-
C3D [27] 0.874 80.2 80.2 80.2 ally. Two positive 6-second FoG video clips are visualized and
P3D [28] 0.819 77.1 74.1 74.5 one frame is selected per second for the illustration purpose.
Spatial + Dilated Temporal CNN [52] 0.844 80.0 78.9 79.0
2D CNN (ResNet-50) + LSTM [29] 0.863 84.9 74.8 75.7 The vertices which achieve the top-5 FoG scores in a graph
Bilinear Attention Pooling [53] 0.848 85.1 75.0 75.9 are highlighted in red and defined as key vertices.
Space-Time Region Graphs CNN [16] 0.846 78.3 77.5 77.6 In general, multiple persons can appear in a video and the
C3DAN [54] - 68.2 80.8 79.3
Global-local 2D CNN + LSTM [17] 0.869 84.5 76.5 77.2 anatomic joint proposals (vertices) are produced collectively.
Bi-directional GS-GRU 0.884 84.4 79.2 79.6 For positive FoG video clips, the key vertices should be
GS-GRU Fused with Context 0.900 83.8 82.3 82.5 correctly located on the patient subjects because FoG events
should not be associated with supporting staffs. Otherwise, the
algorithm would produce incorrect predictions. Therefore, the
We conducted comparisons with 6 existing video classifi- key vertices correctly located on patient subjects are counted
cation methods, including the 3D convolution methods [27], per second. For the bi-directional GS-GRU, most of the key
[28], the dilated temporal CNN method [52], the CNN-LSTM vertices are located on the patient subjects while some top
method [29], the bilinear pooling method [53] and the space- vertices occasionally appear on the supporting staffs, which
time region graphs CNN method [16], and our two recent demonstrates that GS-GRU takes the correct joint proposals
studies, including C3DAN and global-local 2D CNN + LSTM to recognize FoG events. However, for GWN, which does
are listed [17], [54]. As shown in Table VI, C3D and 2D not introduce graph sequential patterns, the count of correctly
CNN (ResNet-50) + LSTM outperformed the others, which located key vertices decreases. It indicates that sequential
suggests that integrating C3D and the ResNet-50 pre-trained graph representations play an important role in improving
neural networks to construct the vertex features and their the performance of FoG detection and can provide additional
relations is reasonable. Although dilated temporal convolutions insights associated with FoG events.
have demonstrated their effectiveness for many temporal data
related tasks, they were not able to capture the dynamic FoG V. C ONCLUSION
patterns adequately. Space-time region graphs have been pro-
posed to characterize the patch relations by utilizing GCNNs, In this study, a novel deep neural network architecture
however, the long-term temporal relations were not explored GS-RNN is presented to process the spatial temporal data
to characterize FoG patterns. Although bilinear methods have represented as dynamic graph sequences. Graph RNN cells
been successfully applied for fine-grained classification prob- and vertex-wise RNN cells are devised as the building blocks
lems, the increased model complexity of bilinear models may of GS-RNNs, which model the structural and the temporal
compromise the performance for FoG detection. graph patterns simultaneously. To this end, GS-RNNs can be
used to formulate vision-based FoG detection as a fine-grained
Our recent work C3DAN [54] aims to identify attended sequential modelling task. Comprehensive experimental results
regions, but the movements of supporting staff may also be on an in-house dataset, which has the largest number of
considered, which could compromise the performance of FoG subjects in the literature of video-based PD gait analysis,
detection. In [17], the structural patterns among joints were demonstrate the superior performance of the proposed GS-
first explored for FoG detection with promising results. The RNN architectures. In addition, the graph representation of
improvement of GS-RNN over [17] demonstrates that tempo- anatomic joints provides an intuitive interpretation of the de-
ral context is also important for characterizing the temporal tection results by localizing the key vertices of an FoG video,
dynamics of FoG events. which is helpful for clinical assessments in practice. In our
In summary, these comparisons clearly demonstrate the future work, we will focus on simplifying GS-RNN cells and
effectiveness of our proposed GS-RNN for FoG detection. architectures to reduce the computational cost for training and

1057-7149 (c) 2019 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TIP.2019.2946469, IEEE
Transactions on Image Processing
11

(#1-1) Key vertices of GS-GRU on video clip sample #1

(#1-2) Key vertices of GCNN on video clip sample #1

(#2-1) Key vertices of GS-GRU on video clip sample #2

(#2-2) Key vertices of GCNN on video clip sample #2


Fig. 7: Illustration of the key vertex localization task in two FoG video clip samples by utilizing the bi-directional GS-GRU
and GCNN methods. Each clip is of 6-second length, of which one frame is selected per second for the figure, and the top-5
scored vertices (key vertices) are highlighted in red. The count of the key vertices correctly locating on the subjects are noted
on each frame, and it can be observed that GS-GRU is more likely to focus on the subjects, which is benefited from the graph
temporal relations.

ACKNOWLEDGEMENT
This research was partially supported by Australian Research Council (ARC) Grant DP160103675, NHMRC-ARC Dementia Fellowship (#1110414), the National Health and Medical Research Council (NHMRC) of Australia Program Grant (#1037746), Dementia Research Team Grant (#1095127), the NeuroSleep Centre of Research Excellence (#1060992), the ARC Centre of Excellence in Cognition and its Disorders Memory Program (#CE110001021), the Sydney Research Excellence Initiative (SREI) 2020 of the University of Sydney, Natural Science Foundation of China Grant (#61420106015), and Parkinson Canada. The ethical approval of this research was obtained from the University of Sydney Human Ethics Board (#2014/255). We thank our patients who participated in the data collection; all of them provided written informed consent. We would like to acknowledge and thank Moran Gilat, Julie Hall, Alana Muller, Jennifer Szeto and Courtney Walton for conducting and scoring the freezing of gait assessments. We would also like to acknowledge ForeFront, a large collaborative research group dedicated to the study of neurodegenerative diseases.

REFERENCES
[1] E. R. Dorsey, A. Elbaz, E. Nichols, F. Abd-Allah, A. Abdelalim, J. C. Adsuar, M. G. Ansha, C. Brayne, J.-Y. J. Choi, D. Collado-Mateo, et al., "Global, regional, and national burden of Parkinson's disease, 1990–2016: a systematic analysis for the global burden of disease study 2016," The Lancet Neurology, vol. 17, no. 11, pp. 939–953, 2018.
[2] J. Jankovic, "Parkinson's disease: clinical features and diagnosis," Journal of Neurology, Neurosurgery & Psychiatry, vol. 79, no. 4, pp. 368–376, 2008.
[3] M. A. Hely, W. G. J. Reid, M. A. Adena, G. M. Halliday, and J. G. L. Morris, "The Sydney multicenter study of Parkinson's disease: the inevitability of dementia at 20 years," Movement Disorders, vol. 23, no. 6, pp. 837–844, 2008.
[4] M. Macht, Y. Kaussner, J. C. Möller, K. Stiasny-Kolster, K. M. Eggert, H.-P. Krüger, and H. Ellgring, "Predictors of freezing in Parkinson's disease: a survey of 6,620 patients," Movement Disorders, vol. 22, no. 7, pp. 953–956, 2007.
[5] B. R. Bloem, J. M. Hausdorff, J. E. Visser, and N. Giladi, "Falls and freezing of gait in Parkinson's disease: a review of two interconnected, episodic phenomena," Movement Disorders, vol. 19, no. 8, pp. 871–884, 2004.
[6] S. J. G. Lewis and R. A. Barker, "A pathophysiological model of freezing of gait in Parkinson's disease," Parkinsonism & Related Disorders, vol. 15, no. 5, pp. 333–338, 2009.


[7] J. D. Schaafsma, Y. Balash, T. Gurevich, A. L. Bartels, J. M. Hausdorff, and N. Giladi, "Characterization of freezing of gait subtypes and the response of each to levodopa in Parkinson's disease," European Journal of Neurology, vol. 10, no. 4, pp. 391–398, 2003.
[8] S. Donovan, C. Lim, N. Diaz, N. Browner, P. Rose, L. R. Sudarsky, D. Tarsy, S. Fahn, and D. K. Simon, "Laserlight cues for gait freezing in Parkinson's disease: an open-label study," Parkinsonism & Related Disorders, vol. 17, no. 4, pp. 240–245, 2011.
[9] T. R. Morris, C. Cho, V. Dilda, J. M. Shine, S. L. Naismith, S. J. Lewis, and S. T. Moore, "Clinical assessment of freezing of gait in Parkinson's disease from computer-generated animation," Gait & Posture, vol. 38, no. 2, pp. 326–329, 2013.
[10] T. Khan, J. Westin, and M. Dougherty, "Motion cue analysis for Parkinsonian gait recognition," The Open Biomedical Engineering Journal, vol. 7, p. 1, 2013.
[11] M. Nieto-Hidalgo, F. J. Ferrández-Pastor, R. J. Valdivieso-Sarabia, J. Mora-Pascual, and J. M. García-Chamizo, "A vision based proposal for classification of normal and abnormal gait using RGB camera," Journal of Biomedical Informatics, vol. 63, pp. 82–89, 2016.
[12] M. Nieto-Hidalgo, F. J. Ferrández-Pastor, R. J. Valdivieso-Sarabia, J. Mora-Pascual, and J. M. García-Chamizo, "Vision based gait analysis for frontal view gait sequences using RGB camera," in International Conference on Ubiquitous Computing and Ambient Intelligence, pp. 26–37, Springer, 2016.
[13] S. Sedai, M. Bennamoun, D. Q. Huynh, A. El-Sallam, S. Foo, J. Alderson, and C. Lind, "3d human pose tracking using Gaussian process regression and particle filter applied to gait analysis of Parkinson's disease patients," in 2013 IEEE 8th Conference on Industrial Electronics and Applications (ICIEA), pp. 1636–1642, June 2013.
[14] S. Xie, C. Sun, J. Huang, Z. Tu, and K. Murphy, "Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification," in European Conference on Computer Vision, pp. 318–335, Springer, 2018.
[15] G. Chéron, I. Laptev, and C. Schmid, "P-cnn: Pose-based cnn features for action recognition," in IEEE International Conference on Computer Vision, pp. 3218–3226, 2015.
[16] X. Wang and A. Gupta, "Videos as space-time region graphs," in European Conference on Computer Vision, pp. 399–417, 2018.
[17] K. Hu, Z. Wang, K. Ehgoetz Martens, and S. Lewis, "Vision-based freezing of gait detection with anatomic patch based representation," in Asian Conference on Computer Vision, pp. 564–576, Springer, 2018.
[18] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[19] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, "Learning phrase representations using rnn encoder-decoder for statistical machine translation," arXiv preprint arXiv:1406.1078, 2014.
[20] A. Jain, A. R. Zamir, S. Savarese, and A. Saxena, "Structural-RNN: Deep learning on spatio-temporal graphs," in IEEE Conference on Computer Vision and Pattern Recognition, pp. 5308–5317, 2016.
[21] S. Yan, Y. Xiong, and D. Lin, "Spatial temporal graph convolutional networks for skeleton-based action recognition," in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[22] Y. Du, Y. Fu, and L. Wang, "Representation learning of temporal dynamics for skeleton-based action recognition," IEEE Transactions on Image Processing, vol. 25, no. 7, pp. 3010–3022, 2016.
[23] J. Liu, G. Wang, L.-Y. Duan, K. Abdiyeva, and A. C. Kot, "Skeleton-based human action recognition with global context-aware attention lstm networks," IEEE Transactions on Image Processing, vol. 27, no. 4, pp. 1586–1599, 2018.
[24] J. Liu, A. Shahroudy, D. Xu, A. C. Kot, and G. Wang, "Skeleton-based action recognition using spatio-temporal lstm network with trust gates," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 12, pp. 3007–3021, 2018.
[25] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, "Large-scale video classification with convolutional neural networks," in IEEE Conference on Computer Vision and Pattern Recognition, pp. 1725–1732, 2014.
[26] K. Simonyan and A. Zisserman, "Two-stream convolutional networks for action recognition in videos," in Advances in Neural Information Processing Systems, pp. 568–576, 2014.
[27] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, "Learning spatiotemporal features with 3d convolutional networks," in IEEE International Conference on Computer Vision, pp. 4489–4497, IEEE, 2015.
[28] Z. Qiu, T. Yao, and T. Mei, "Learning spatio-temporal representation with pseudo-3d residual networks," in IEEE International Conference on Computer Vision, pp. 5534–5542, IEEE, 2017.
[29] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell, "Long-term recurrent convolutional networks for visual recognition and description," in IEEE Conference on Computer Vision and Pattern Recognition, pp. 2625–2634, 2015.
[30] X. Shi, Z. Chen, H. Wang, D.-Y. Yeung, W.-k. Wong, and W.-c. Woo, "Convolutional LSTM network: A machine learning approach for precipitation nowcasting," in Advances in Neural Information Processing Systems, pp. 802–810, 2015.
[31] C. Feichtenhofer, A. Pinz, and A. Zisserman, "Convolutional two-stream network fusion for video action recognition," in IEEE Conference on Computer Vision and Pattern Recognition, pp. 1933–1941, 2016.
[32] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini, "The graph neural network model," IEEE Transactions on Neural Networks, vol. 20, no. 1, pp. 61–80, 2009.
[33] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun, "Spectral networks and locally connected networks on graphs," arXiv preprint arXiv:1312.6203, 2013.
[34] D. K. Duvenaud, D. Maclaurin, J. Iparraguirre, R. Bombarell, T. Hirzel, A. Aspuru-Guzik, and R. P. Adams, "Convolutional networks on graphs for learning molecular fingerprints," in Advances in Neural Information Processing Systems, pp. 2224–2232, 2015.
[35] M. Henaff, J. Bruna, and Y. LeCun, "Deep convolutional networks on graph-structured data," arXiv preprint arXiv:1506.05163, 2015.
[36] T. N. Kipf and M. Welling, "Semi-supervised classification with graph convolutional networks," arXiv preprint arXiv:1609.02907, 2016.
[37] L. Zhao, X. Peng, Y. Tian, M. Kapadia, and D. N. Metaxas, "Semantic graph convolutional networks for 3d human pose regression," in IEEE Conference on Computer Vision and Pattern Recognition, pp. 3425–3435, 2019.
[38] M. Li, S. Chen, X. Chen, Y. Zhang, Y. Wang, and Q. Tian, "Actional-structural graph convolutional networks for skeleton-based action recognition," in IEEE Conference on Computer Vision and Pattern Recognition, pp. 3595–3603, 2019.
[39] X. Liu, W. Liu, M. Zhang, J. Chen, L. Gao, C. Yan, and T. Mei, "Social relation recognition from videos via multi-scale spatial-temporal reasoning," in IEEE Conference on Computer Vision and Pattern Recognition, pp. 3566–3574, 2019.
[40] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, "Empirical evaluation of gated recurrent neural networks on sequence modeling," arXiv preprint arXiv:1412.3555, 2014.
[41] Y. Li, D. Tarlow, M. Brockschmidt, and R. Zemel, "Gated graph sequence neural networks," arXiv preprint arXiv:1511.05493, 2015.
[42] Y. Seo, M. Defferrard, P. Vandergheynst, and X. Bresson, "Structured sequence modeling with graph convolutional recurrent networks," in International Conference on Neural Information Processing, pp. 362–373, Springer, 2018.
[43] C. Si, W. Chen, W. Wang, L. Wang, and T. Tan, "An attention enhanced graph convolutional lstm network for skeleton-based action recognition," in IEEE Conference on Computer Vision and Pattern Recognition, June 2019.
[44] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh, "Realtime multi-person 2d pose estimation using part affinity fields," in IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[45] S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh, "Convolutional pose machines," in IEEE Conference on Computer Vision and Pattern Recognition, pp. 4724–4732, 2016.
[46] A. Shumway-Cook, S. Brauer, and M. Woollacott, "Predicting the probability for falls in community-dwelling older adults using the timed up & go test," Physical Therapy, vol. 80, no. 9, pp. 896–903, 2000.
[47] M. M. Hoehn and M. D. Yahr, "Parkinsonism: onset, progression, and mortality," Neurology, vol. 17, no. 5, pp. 427–427, 1967.
[48] E. R. DeLong, D. M. DeLong, and D. L. Clarke-Pearson, "Comparing the areas under two or more correlated receiver operating characteristic curves: A nonparametric approach," Biometrics, vol. 44, no. 3, pp. 837–845, 1988.
[49] X. Robin, N. Turck, A. Hainard, N. Tiberti, F. Lisacek, J.-C. Sanchez, and M. Müller, "pROC: an open-source package for R and S+ to analyze and compare roc curves," BMC Bioinformatics, vol. 12, no. 1, p. 77, 2011.
[50] Y. Peng, X. He, and J. Zhao, "Object-part attention model for fine-grained image classification," IEEE Transactions on Image Processing, vol. 27, no. 3, pp. 1487–1500, 2018.
[51] W. J. Youden, "Index for rating diagnostic tests," Cancer, vol. 3, no. 1, pp. 32–35, 1950.


[52] C. Lea, M. D. Flynn, R. Vidal, A. Reiter, and G. D. Hager, "Temporal convolutional networks for action segmentation and detection," in IEEE Conference on Computer Vision and Pattern Recognition, pp. 1003–1012, IEEE, 2017.
[53] R. Girdhar and D. Ramanan, "Attentional pooling for action recognition," in Advances in Neural Information Processing Systems, pp. 33–44, 2017.
[54] R. Sun, Z. Wang, K. E. Martens, and S. Lewis, "Convolutional 3d attention network for video based freezing of gait recognition," in International Conference on Digital Image Computing: Techniques and Applications (DICTA), pp. 1–7, IEEE, 2018.
