Deep Neural Networks Enable Quantitative Movement Analysis Using Single-Camera Videos
https://doi.org/10.1038/s41467-020-17807-z
Many neurological and musculoskeletal diseases impair movement, which limits people’s
function and social participation. Quantitative assessment of motion is critical to medical
decision-making but is currently possible only with expensive motion capture systems and
highly trained personnel. Here, we present a method for predicting clinically relevant motion
parameters from an ordinary video of a patient. Our machine learning models predict
parameters including walking speed (r = 0.73), cadence (r = 0.79), knee flexion angle at
maximum extension (r = 0.83), and Gait Deviation Index (GDI), a comprehensive metric of
gait impairment (r = 0.75). These correlation values approach the theoretical limits for
accuracy imposed by natural variability in these metrics within our patient population. Our
methods for quantifying gait pathology with commodity cameras increase access to quantitative
motion analysis in clinics and at home and enable researchers to conduct large-scale
studies of neurological and musculoskeletal disorders.
1 Department of Bioengineering, Stanford University, Stanford, CA 94305, USA. 2 Center for Gait and Motion Analysis, Gillette Children’s Specialty
Healthcare, St. Paul, MN 55101, USA. 3 Department of Orthopedic Surgery, University of Minnesota, Minneapolis, MN 55454, USA. 4These authors
contributed equally: Łukasz Kidziński, Bryan Yang. ✉email: [email protected]; [email protected]
Gait metrics, such as walking speed, cadence, symmetry, and gait variability are valuable clinical measurements in conditions such as Parkinson’s disease1, osteoarthritis2, stroke3, cerebral palsy4, multiple sclerosis5, and muscular dystrophy6. Laboratory-based optical motion capture is the current gold standard for clinical motion analysis (Fig. 1a); it is used to diagnose pathological motion, plan treatment, and monitor outcomes. Unfortunately, economic and time constraints inhibit the routine collection of this valuable, high-quality data. Further, motion data collected in a laboratory may fail to capture how individuals move in natural settings. Recent advances in machine learning, along with the ubiquity and low cost of wearable sensors and smartphones, have positioned us to overcome the limitations of laboratory-based motion analysis. Researchers have trained machine learning models to estimate gait parameters7,8 or detect the presence of disease9, but current models often rely on data generated by specialized hardware such as optical motion capture equipment, inertial measurement units, or depth cameras10,11.
Standard video has the potential to be a low-cost, easy-to-use alternative to monitor motion. Modern computational methods, including deep learning12, along with large publicly available datasets13 have enabled pose estimation algorithms, such as OpenPose14, to produce estimates of body pose from standard video across varying lighting, activity, age, skin color, and angle-of-view15. Human pose estimation software, including OpenPose, outputs estimates of the two-dimensional (2D) image-plane positions of joints (e.g., ankles and knees) and other anatomical locations (e.g., heels and pelvis) in each frame of a video (Fig. 1b). These estimates of 2D planar projections are too noisy and biased, due to manually annotated ground truth and planar projection errors, to be used directly for extracting clinically meaningful information such as three-dimensional (3D) gait metrics or treatment indications16. Investigators recently predicted cadence from 2D planar projections17, but their study included a population of only two impaired subjects and required carefully engineered features, limiting generalizability. Moreover, for predictions that are not directly explained by physical phenomena, such as clinical decisions, feature engineering is particularly difficult. To overcome these limitations, we used deep neural networks (machine learning models that employ multiple artificial neural network layers to learn complex, and potentially nonlinear, relationships between inputs and outputs), which have been shown to be an effective tool for making robust predictions in an impaired population compared with methods using hand-engineered features18. Our method capitalizes on 2D pose estimates from video to predict (i) quantitative gait metrics commonly used in clinical gait analysis, and (ii) clinical decisions.
We designed machine learning models to predict clinical gait metrics from trajectories of 2D body poses extracted from videos using OpenPose (Fig. 1b and Supplementary Movie 1). Our models were trained on 1792 videos of 1026 unique patients with cerebral palsy. These videos, along with gold-standard optical motion capture data, were collected as part of a clinical gait analysis. Measures derived from the optical motion capture data served as ground-truth labels for each visit (see Methods). We predicted visit-level gait metrics (i.e., values averaged over multiple strides from multiple experimental trials), since the videos and gold-standard optical motion capture were collected contemporaneously but not simultaneously. These visit-level estimates of values, such as average speed or cadence, are widely adopted in clinical practice. We tested convolutional neural network (CNN), random forest (RF), and ridge regression (RR) models, with the same fixed set of input signals for each model. In the CNN models, we input raw time series; in the other two models (which are not designed for time-series input), we input summary statistics such as mean and percentile. We present the
Fig. 1 Comparison of the current clinical workflow with our video-based workflow. a In the current clinical workflow, a physical therapist first takes a
number of anthropometric measurements and places reflective markers on the patient’s body. Several specialized cameras track the positions of these
markers, which are later reconstructed into 3D position time series. These signals are converted to joint angles as a function of time and are subsequently
processed with algorithms and tools unique to each clinic or laboratory. b In our proposed workflow, data are collected using a single commodity camera.
We use the OpenPose14 algorithm to extract trajectories of keypoints from a sagittal-plane video. We present an example input frame, and then the same
frame with detected keypoints overlaid. To illustrate the detected pose, the keypoints are connected. Next, these signals are fed into a neural network that
extracts clinically relevant metrics. Note that this workflow does not require manual data processing or specialized hardware, allowing monitoring at home.
CNN results since in all cases, the CNN performed as well or better than the other models (Fig. 2); however, more thorough feature engineering specific to each prediction task could improve performance for all model types. Our models, trajectories of anatomic keypoints derived using OpenPose, and ground-truth labels are freely shared at http://github.com/stanfordnmbl/mobile-gaitlab/.
[Fig. 2: Model performance. Vertical axis: correlation. Legend: CNN, ridge regression, random forest.]
Results
Predicting common gait metrics. We first sought to determine visit-level average walking speed, cadence, and knee flexion angle at maximum extension from a 15 s sagittal-plane walking video. These gait metrics are routinely used as part of diagnostics and treatment planning for cerebral palsy4 and many other disorders, including Parkinson’s disease19,20, Alzheimer’s disease21,22, osteoarthritis2,23, stroke3,24, non-Alzheimer’s dementia25, multiple sclerosis5,26, and muscular dystrophy6. The walking speed, cadence, and knee flexion at maximum extension predicted from video by our best models had correlations of 0.73, 0.79, and 0.83, respectively, with the ground-truth motion capture data (Table 1 and Fig. 3a–c).
Our model’s predictive performance for walking speed was close to the theoretical upper bound given intra-patient stride-to-stride variability. Variability of gait metrics can be decomposed into inter-patient and intra-patient (stride-to-stride) variability27. The correlation between our model and ground-truth walking speed was 0.73; thus, our model explained 53% of the observed variance. In the cerebral palsy population, intra-patient stride-to-stride variability in walking speed typically accounts for about 25% of the observed variance in walking speed28. Therefore, we do not expect the variance explained to exceed 75% because our video and ground-truth motion capture data were not collected simultaneously, making it infeasible to capture stride-to-stride variability. The remaining 22% of variability likely represented some additional trial-to-trial variability, along with inter-patient variability that the model failed to capture.
Our predictions of knee flexion angle at maximum extension within the gait cycle, a key biomechanical parameter in clinical decision-making, had a correlation of 0.83 with the corresponding ground-truth motion capture data (Fig. 3c). For comparison, the knee flexion angle at maximum extension directly computed from the thigh and shank vectors defined by the hip, knee, and ankle keypoints of OpenPose had a correlation of only 0.51 with the ground-truth value, possibly due in part to the fixed position of the camera and associated projection errors. This implies that information contained in other variables used by our model had substantial predictive power.
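As an illustration of the direct computation mentioned above, the following minimal sketch derives an image-plane knee angle from three 2D keypoints. The function names and the convention that a straight leg corresponds to 0 degrees of flexion are assumptions made for this example; they are not taken from the released code.

```python
import math

def included_angle_deg(hip, knee, ankle):
    """Angle at the knee between the knee->hip and knee->ankle vectors (degrees)."""
    thigh = (hip[0] - knee[0], hip[1] - knee[1])
    shank = (ankle[0] - knee[0], ankle[1] - knee[1])
    dot = thigh[0] * shank[0] + thigh[1] * shank[1]
    norm = math.hypot(*thigh) * math.hypot(*shank)
    if norm == 0.0:
        raise ValueError("degenerate keypoints: zero-length thigh or shank")
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos_angle = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos_angle))

def knee_flexion_deg(hip, knee, ankle):
    # Assumed convention: a fully straight leg has 0 degrees of flexion.
    return 180.0 - included_angle_deg(hip, knee, ankle)

# Hypothetical keypoints in pixel coordinates (x, y).
straight = knee_flexion_deg((0, 0), (0, 1), (0, 2))
bent = knee_flexion_deg((0, 0), (0, 1), (1, 1))
```

Because the angle is measured in the image plane, it inherits the projection errors discussed above whenever the limb is not parallel to the camera sensor.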
Metric                    True vs. predicted correlation (95% CI)    Mean bias (95% CI; p value)    Mean absolute error
Walking speed (m/s)       0.73 (0.66–0.79)                           0.00 (−0.02–0.02; 0.93)        0.13
Cadence (strides/s)       0.79 (0.73–0.84)                           0.01 (0.00–0.02; 0.10)         0.08
Knee flexion (degrees)    0.83 (0.78–0.87)                           0.33 (−0.40–1.06; 0.38)        4.8
Gait Deviation Index      0.75 (0.68–0.81)                           0.54 (−0.33–1.42; 0.22)        6.5
We measured performance of the CNN model for four walking parameters: walking speed, cadence, knee flexion at maximum extension, and Gait Deviation Index (GDI). All statistics were derived from
predictions on the test set, i.e., visits that the model has never seen. Bias was computed by subtracting predicted value from observed value. Correlations are reported with 95% confidence interval (CI).
All predictions had correlations with true values of at least 0.73. For perspective, stride-to-stride correlation for GDI is reported to be 0.73–0.89 (ref. 31), which is comparable with our estimator. We used a two-
sided t-test to check if predictions were biased. In each case there was no statistical evidence for rejecting the null hypothesis (no bias).
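The statistics in this table can be sketched as follows. This is a minimal illustration on made-up numbers, assuming that the bias test is a one-sample t-test on the observed-minus-predicted differences; the helper names are hypothetical, and the p value is omitted because it would require a t-distribution routine from outside the standard library.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def bias_summary(observed, predicted):
    """Mean bias (observed minus predicted), mean absolute error, and t statistic."""
    diffs = [o - p for o, p in zip(observed, predicted)]
    n = len(diffs)
    mean_bias = sum(diffs) / n
    mae = sum(abs(d) for d in diffs) / n
    sd = math.sqrt(sum((d - mean_bias) ** 2 for d in diffs) / (n - 1))
    # Assumed test: one-sample t statistic for the null hypothesis of zero bias.
    t_stat = mean_bias / (sd / math.sqrt(n))
    return mean_bias, mae, t_stat

# Toy values for illustration only (not data from the study).
observed = [1.0, 1.2, 0.8, 1.1]
predicted = [0.9, 1.3, 0.9, 1.0]
mean_bias, mae, t_stat = bias_summary(observed, predicted)
r = pearson_r(observed, predicted)
```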
[Fig. 3 panels, true vs. predicted values: a Walking speed (m/s), r = 0.73; b Cadence (strides/s), r = 0.79; c Knee flexion (degrees), r = 0.83; d GDI, r = 0.75; e Asymmetry, r = 0.43; f Change in knee flexion (degrees), r = 0.83; g Change in GDI, r = 0.59.]
Fig. 3 Convolutional neural network (CNN) model performance. We evaluated the correlation, r, between the true gait metric values from motion capture
data and the predicted values from the video keypoint time-series data and our model. Our model predicted (a) speed, (b) cadence, (c) knee flexion angle
at maximum extension, and (d) Gait Deviation Index. We also did a post-hoc analysis to predict (e) asymmetry in GDI, as well as longitudinal changes in
(f) knee flexion angle at maximum extension and (g) GDI. In all plots, the straight blue line corresponds to the best linear fit to predicted vs. observed data
while light bands correspond to the 95% confidence interval for the regression curve derived using bootstrapping (n = 200 bootstrapping trials).
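The bootstrapped confidence band described in this caption can be approximated with a percentile bootstrap, sketched below for a single query point. The resampling scheme (resampling data points with replacement and refitting an ordinary least-squares line) and all names are assumptions for illustration; the caption specifies only the number of bootstrap trials.

```python
import random

def fit_line(xs, ys):
    """Ordinary least-squares line; returns (slope, intercept) or None if x is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0.0:
        return None
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx

def bootstrap_band(xs, ys, x_query, n_boot=200, level=0.95, seed=0):
    """Percentile interval for the fitted line's value at x_query."""
    rng = random.Random(seed)
    n = len(xs)
    preds = []
    while len(preds) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        fit = fit_line([xs[i] for i in idx], [ys[i] for i in idx])
        if fit is None:  # skip degenerate resamples with a single x value
            continue
        slope, intercept = fit
        preds.append(slope * x_query + intercept)
    preds.sort()
    lo = preds[int((1 - level) / 2 * (n_boot - 1))]
    hi = preds[int((1 + level) / 2 * (n_boot - 1))]
    return lo, hi

# Synthetic data: y = 2x plus small deterministic noise.
xs = [float(x) for x in range(20)]
ys = [2 * x + 0.1 * ((7 * int(x)) % 5 - 2) for x in xs]
lo, hi = bootstrap_band(xs, ys, x_query=10.0)
```

Evaluating `bootstrap_band` over a grid of query points traces out a band like the one drawn around each regression line.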
these joint angles enabled us to predict GDI with high accuracy from 2D video. We predicted GMFCS with weighted kappa of 0.71 (Table 2); inter-rater variability of GMFCS is reported to be 0.76–0.81 (ref. 32), and agreement between a physician and a parent is 0.48–0.67 (ref. 33). The predicted GMFCS scores were correct 66% of the time and always within 1 of the true score. The largest rate of misclassifications occurred while differentiating between GMFCS levels I and II, but this is unsurprising as more information than can be gleaned from a simple 10 m walking task (e.g., about the patient’s mobility over a wider range of tasks, terrain, and time) is typically needed to distinguish between these two levels.
We reasoned that remaining unexplained variability in GDI may be due to unobserved information from the frontal and transverse planes. To test this, we computed correlations between the GDI prediction model’s residuals and parameters that are not captured by OpenPose from the sagittal view. We found that the residuals between true and predicted GDI were correlated with the patient’s mean foot progression angle ($p < 10^{-4}$) and mean hip adduction during gait ($p < 10^{-4}$) as measured by optical motion capture (Fig. 4). This, along with the higher correlation observed for predicting sagittal-plane knee kinematics, suggests that GDI estimation could be improved with additional views of the patient’s gait.
Predicting longitudinal gait changes and surgical events. A post-hoc analysis using the predicted gait metrics from single gait visits showed that we partially captured gait asymmetry and longitudinal changes for individual patients. Gait asymmetry may arise from impairments in motor control, asymmetric orthopedic deformity, and asymmetric pain, and can be used to inform clinical decisions34. Longitudinal changes can inform clinicians about progression of symptoms and long-term benefits of treatment, since the lack of longitudinal data makes analysis of long-term effects of treatment difficult35. We used predicted values from the models described earlier to estimate asymmetry and longitudinal changes, and thus did not train new models for this task. Our predicted gait asymmetry, specifically, the difference in GDI between the two limbs, correlated with the true asymmetry with r = 0.43 (Fig. 3e); this lower correlation is expected because we estimate asymmetry as a difference between two noisy predictions of GDI for the left and right limbs. We predicted longitudinal change assuming the true baselines measured in the clinic are known and future values are to be estimated. This framework approximates the use of videos to monitor patients at home after an initial in-clinic gait analysis. The change in knee flexion angle at maximum extension correlated with the true change with r = 0.83 (Fig. 3f), while the change in GDI over time correlated with r = 0.59 (Fig. 3g). In the case where we did not use baseline GDI in the model, correlations between the difference in model-predicted values and difference in ground-truth clinic-measured values were 0.68 for knee flexion at maximum extension and 0.40 for GDI.
Finally, we sought to predict whether a patient would have surgery in the future, since accurate prediction of treatment might
Table 2 Model accuracy in predicting the Gross Motor Function Classification System (GMFCS) score.
                 True I    True II    True III    True IV
Predicted I      50        21         0           0
Predicted II     26        47         1           0
Predicted III    0         8          22          4
Predicted IV     0         0          1           0
The GMFCS score is derived from an expert clinical rater assessing walking, sitting, and use of assistive devices for mobility. The confusion matrix presents our GMFCS prediction based solely on videos in the test set. Prediction using our CNN model has Cohen’s kappa = 0.71, which is close to the inter-rater variability in GMFCS. In addition, misclassifications were exclusively by only one level (e.g., True I never predicted to be III or IV).
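The agreement statistics reported for the GMFCS predictions can be recomputed from the Table 2 confusion matrix, as in the sketch below. The weighting scheme behind the reported weighted kappa is not stated in the text; quadratic weights are assumed here, and under that assumption the sketch yields a value that rounds to the reported 0.71. The function names are hypothetical.

```python
# Rows: predicted GMFCS level I-IV; columns: true level I-IV (from Table 2).
confusion = [
    [50, 21, 0, 0],
    [26, 47, 1, 0],
    [0, 8, 22, 4],
    [0, 0, 1, 0],
]

def accuracy(matrix):
    """Fraction of cases on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in matrix)
    return sum(matrix[i][i] for i in range(len(matrix))) / total

def weighted_kappa(matrix, power=2):
    """Cohen's weighted kappa with disagreement weights |i - j| ** power."""
    k = len(matrix)
    total = sum(sum(row) for row in matrix)
    row_sums = [sum(row) for row in matrix]
    col_sums = [sum(matrix[i][j] for i in range(k)) for j in range(k)]
    observed = 0.0
    expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = abs(i - j) ** power
            observed += weight * matrix[i][j]
            # Expected count under independence of rows and columns.
            expected += weight * row_sums[i] * col_sums[j] / total
    return 1.0 - observed / expected

acc = accuracy(confusion)
kappa = weighted_kappa(confusion)  # quadratic weights (assumed)
```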
[Fig. 4 panels, residual GDI on the vertical axis: a Mean hip adduction vs. residual GDI, r = 0.27, $p < 10^{-4}$; b Mean foot progression angle at stance vs. residual GDI, r = 0.32, $p < 10^{-4}$.]
Fig. 4 Correlation between GDI prediction residuals and non-sagittal-plane kinematics. The residuals from predicting GDI from video are correlated with
the mean (a) hip adduction and (b) foot progression angles derived from optical motion capture. These correlations suggest that the foot progression and
hip adduction angles, which are inputs to the calculation of ground-truth GDI, are not fully captured in the sagittal-plane video. We tried linear and
quadratic models and chose the better one by the Bayesian Information Criterion. In each plot, the blue curve corresponds to the best quadratic fit to
predicted vs. observed data while the light band corresponds to the 95% confidence interval for the regression curve derived using bootstrapping (n = 200
bootstrapping trials). We tested if each fit is significant by using the F-test and we reported corresponding p values.
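The choice between linear and quadratic fits by the Bayesian Information Criterion can be sketched as follows. The Gaussian-error form of the criterion, BIC = n ln(RSS/n) + k ln(n), and the small normal-equation solver are assumptions made so that the example needs only the standard library; the caption does not specify the implementation.

```python
import math

def polyfit(xs, ys, degree):
    """Least-squares polynomial coefficients (constant term first)."""
    m = degree + 1
    # Augmented normal equations [A^T A | A^T y] for the monomial basis.
    aug = [[sum(x ** (r + c) for x in xs) for c in range(m)]
           + [sum(y * x ** r for x, y in zip(xs, ys))] for r in range(m)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(m):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[r][m] / aug[r][r] for r in range(m)]

def bic(xs, ys, degree):
    """BIC under Gaussian errors: n * ln(RSS / n) + k * ln(n), k = degree + 1."""
    coeffs = polyfit(xs, ys, degree)
    rss = sum((y - sum(c * x ** p for p, c in enumerate(coeffs))) ** 2
              for x, y in zip(xs, ys))
    n = len(xs)
    return n * math.log(rss / n) + (degree + 1) * math.log(n)

def choose_degree(xs, ys, candidates=(1, 2)):
    """Return the candidate polynomial degree with the lowest BIC."""
    return min(candidates, key=lambda d: bic(xs, ys, d))

# Synthetic examples with small alternating noise.
xs = list(range(-5, 6))
noise = [0.1 * (-1) ** i for i in range(len(xs))]
linear_ys = [3 * x + e for x, e in zip(xs, noise)]
quadratic_ys = [x * x + e for x, e in zip(xs, noise)]
```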
[Fig. 5 panels: a ROC curves (true positive rate vs. false positive rate) for CNN (0.71 AUC), ridge regression (0.66), random forest (0.66), logistic regression (GDI) (0.68), and ensemble CNN + GDI (0.73); b GDI vs. predicted SEMLS residual, r = –0.54.]
Fig. 5 Analysis of models for treatment decision prediction. a Our CNN model outperformed ridge regression and random forest models that used
summary statistics of the time series (see Methods) and the logistic regression model using only GDI. b Residuals from the CNN model to predict SEMLS
treatment decisions correlate with GDI. The straight blue line corresponds to the best linear fit to predicted vs. observed data while the light band
corresponds to the 95% confidence interval for the regression curve derived using bootstrapping (n = 200 bootstrapping trials).
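The area under the ROC curve used to compare these models equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below computes it directly from that definition; the function name and the toy labels and scores are illustrative and unrelated to the study data.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy example: 1 = received surgery, 0 = did not (hypothetical values).
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

The pairwise loop is quadratic in the number of cases, which is acceptable for a test set of a few hundred visits; a rank-based formulation would scale better.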
enable remote screenings in locations with limited access to specialty healthcare. We predicted treatment decisions, specifically whether a patient received a single-event multilevel surgery (SEMLS) following the analyzed clinical gait visit. This analysis revealed that patient videos contain information that is distinct from GDI and predictive of SEMLS decisions. Our model predicted whether a patient received a SEMLS with Area Under the Receiver Operating Characteristics Curve (AUC) of 0.71 (Fig. 5a). The CNN model slightly outperformed a logistic regression model based on GDI from motion capture (AUC 0.68). An ensemble of our CNN model and the GDI logistic regression model predicted SEMLS with AUC 0.73, suggesting there is some additional information in GDI compared with our CNN model. We found that residuals of the SEMLS prediction from our CNN model were correlated with GDI with r = 0.51 (Fig. 5b), further validating that the two signals have some uncorrelated predictive information.
Discussion
Our models can help parents and clinicians assess early symptoms of neurological disorders and enable low-cost surveillance of disease progression. For example, GMFCS predictions from our model had better agreement with clinicians’ assessments than did parents’ assessments. Our methods are dramatically lower in cost than optical motion capture and do not require specialized equipment or training. A therapist or technician need not place markers on a patient, and our models allow the use of commodity hardware (i.e., a single video camera). In our experiments, we downsampled the videos to 640 × 480 resolution, a resolution available in most modern mobile phone cameras. In fact, the most recent smartphones are equipped with cameras that record videos in 3840 × 2160 resolution at 60 frames per second.
For a robust, production-ready deployment of our models or to extend our models to other patient populations, practitioners would have to address several limitations of our study. First, to use our current models to assess the same set of gait parameters in children with cerebral palsy, the protocol used in the clinic must be closely followed, including similar camera angles and subject clothing. For deployment under more lax collection protocols, the methods should be tested with new videos recorded by naive users. Second, our study only used sagittal-plane video, making it difficult to capture signals visible mainly in other planes, such as step width. A similar framework to the one we describe in this study could be used to build models that incorporate videos from multiple planes. Third, since videos and motion capture data were collected separately, we could only design our models to capture visit-level parameters. For some applications, stride-wise parameters might be required. With additional data, researchers could test whether our models are suitable for this stride-level prediction, or, if needed, could train new models using a similar framework. In this study, we had access to a large dataset to train our CNN model; if extending our approach to a task where more limited data are available, more extensive feature engineering and classical machine learning models might lead to better results. Finally, the dataset we used was from a single clinical center, and the robustness of our models should be tested with data from other centers. For example, clinical decisions on SEMLS are subjective and must be interpreted in the context of the clinic in which the data were acquired.
Our approach shows the potential of using video-based pose estimation to predict gait metrics, which could enable community-based measurement and fast and easy quantitative motion analysis of patients in their natural environment. We demonstrated the workflow on children with cerebral palsy and a specific set of gait metrics, but the same method can be applied to any patient population and metric (e.g., step width, maximum hip flexion, and metabolic expenditure). Cost-efficient measurements outside of the clinic can complement and improve clinical practice, enabling clinicians to remotely track rehabilitation or post-surgery outcome and researchers to conduct epidemiological-scale clinical studies. This is a significant leap forward from controlled
laboratory tests and allows virtually limitless repeated measures and test sets contained datapoints coming from different patients. For clinical
and longitudinal tracking. metrics that were independent of side (speed, cadence, GMFCS), we trained using
keypoints from both limbs along with side-independent keypoints and each trial
was a single datapoint.
Methods Patients walked back and forth starting with the camera facing their right side.
We analyzed clinical gait analysis videos from patients seen at Gillette Children’s For consistency, and to simplify training, we mirrored the frames and the labels
Specialty Healthcare. For each video, we used OpenPose14 to extract time series of when the patient reversed their walking direction and we kept track of this
anatomical landmarks. Next, we preprocessed these time series to create features orientation. As a result, all the walking was aligned so that the camera was always
for supervised machine learning models. We trained CNN, RF, and RR models to pointing at the right side or a mirrored version of the left side.
predict gait parameters and clinical decisions, and evaluated model performance on
a held-out test set.
Hand-engineered time series. We found two derived time series helpful for
Dataset. We analyzed a dataset of 1792 videos of 1026 unique patients diagnosed improving the performance of the neural network model. The first time series was
with cerebral palsy seen for a clinical gait analysis at Gillette Children’s Specialty the difference between the x-coordinates (horizontal image-plane coordinates) of
Healthcare between 1994 and 2015. Average patient age was 11 years (standard the left and right ankles throughout time, which approximated the 3D distance
deviation, 5.9). Average height and mass were 133 cm (s.d., 22) and 34 kg (s.d., 17), between ankle centers. The second time series was the image-plane angle formed by
respectively. About half (473) of these patients had multiple gait visits, allowing us the ankle, knee, and hip keypoints. Specifically, we computed the angle between the
to assess the ability of our models to detect longitudinal changes in gait. vector from the knee to the hip and the vector from the knee to the ankle. This
For each patient, optical motion capture (Vicon Motion Systems36) data were value approximated the true knee flexion angle.
collected to measure 3D lower extremity joint kinematics and compute gait
metrics37. These motion capture data were used as ground-truth training labels and
were collected at the same visit as the videos, though not simultaneously. While the Architecture and training of CNNs. CNNs are a type of neural network that use
video system in the gait analysis laboratory has changed multiple times, our post- parameter sharing and sparse connectivity to constrain the model architecture and
hoc analysis showed no statistical evidence that these changes affected predictions reduce the number of parameters that need to be learned12. In our case, the CNN
of our models. model is a parameterized mapping from a fixed-length time-series data (i.e., ana-
Ground-truth metrics of walking speed, cadence, knee flexion angle at tomical keypoints) to an outcome metric (e.g., speed). The key advantage of CNNs
maximum extension, and GDI were computed from optical motion capture data over classical machine learning models was the ability to build accurate models
following standard biomechanics practices38,39. The data collection protocol at without extensive feature engineering.
Gillette Children’s Specialty Healthcare is described in detail by Schwartz et al.40. The key building block of our model was a 1-D convolutional layer. The input
Briefly, physical therapists placed reflective markers on patients’ anatomical to a 1-D convolutional layer consisted of a T × D set of neurons, where T was the
landmarks. Specialized, high-frequency cameras and motion capture software number of points in the time dimension and D was the depth (in our case, the
tracked the 3D positions of these markers as patients walked over ground. dimension of the multivariate time-series input into the model). Each 1-D
Engineers semi-manually postprocessed these data to fill missing marker convolutional layer learned the weights of a set of filters of a given length. For
measurements, segment data by gait cycle, and compute 3D joint kinematics. These instance, suppose we chose to learn filters of length F in our convolutional layer.
processed data were used to compute gait metrics of interest—specifically, speed, Each filter connected only the neurons in a local region of time (but extending
cadence, knee flexion angle at maximum extension, and GDI—per patient and through the entire depth) to a given neuron in the output layer. Thus, each filter
per limb. consisted of FD + 1 weights (we included the bias term here), so the total number
The GMFCS score was rated by a physical therapist, based on the observation of of parameters to an output layer of depth D2 was (FD + 1)D2. Our model
the child’s function and an interview with the child’s parents or guardians. For architecture is illustrated in Fig. 6.
some visits, surgical recommendations were also recorded. Each convolutional layer had 32 filters and a filter length of eight. We used the
Videos were collected during the same lab visit as ground-truth motion capture rectified linear unit (ReLU), defined as f(x) = max(0, x), as the activation function
labels, but during a separate walking session without markers. The same protocol after each convolutional layer. After ReLU, we applied batch normalization
was used; i.e., the patient was asked to walk back and forth along a 10 m path 3–5 (empirically, we found this to have slightly better performance than applying batch
times. The patient was recorded with a camera ~3–4 m from the line of walking of normalization before ReLU). We defined a k-convolution block as k 1D
the patient. The camera was operated by an engineer who rotated it along its convolution layers followed by a max pooling layer and a dropout layer with rate
vertical axis to follow the patient. Subjects were asked to wear minimal comfortable 0.5 (see Fig. 6). We used a mini batch size of 32 and RMSProp (implemented in
clothing. keras software; keras.io/optimizers) as the optimizer. We experimented with k ∈ {1,
Raw videos in MP4 format with Advanced Video Coding encoding41 were 2, 3}-convolution blocks to identify sufficient model complexity to capture higher
collected at a resolution of 1280 × 960 and frame rate of 29.97 frames per second. order relations in the time series. After extensive experimentation, we settled on an
We downsampled videos to 640 × 480, imitating lower-end commodity cameras architecture with k = 3.
and matching the resolution of the training data of OpenPose. For each trial we had After selecting the architecture, we did a random search on a small grid to tune
500 frames, corresponding to around 16 s of walking. the initial learning rate of RMSProp and the learning rate decay schedule. We also
The study was approved by the University of Minnesota Institutional Review searched over different values of the L2 regularization weight (λ) to apply to the last
Board (IRB). Patients, and guardians, where appropriate, gave informed written four convolutional layers. We applied early stopping to iterations of the random
consent at the clinical visit for their data to be included. In accordance with IRB guidelines, all patient data were de-identified prior to any analysis.
Extracting keypoints with OpenPose. For each frame in a video, OpenPose returned 2D image-plane coordinates of 25 keypoints together with prediction confidence of each point for each detected person. Reported points were the estimated (x, y) coordinates, in pixels, of the centers of the torso, nose, and pelvis, and centers of the left and right shoulders, elbows, hands, hips, knees, ankles, heels, first and fifth toes, ears, and eyes. Note that OpenPose explicitly distinguished right and left keypoints.
We only analyzed videos with one person visible. After excluding 1443 cases where OpenPose failed to detect patients or where more than one person was visible, the dataset included 1792 videos of 1026 patients. For each video, we worked with a 25-dimensional time series of keypoints across all frames. We centered each univariate time series by subtracting the coordinates of the right hip and scaled all values by dividing by the Euclidean distance between the right hip and the right shoulder. We then smoothed the time series using a one-dimensional unit-variance Gaussian filter. Since some of the downstream machine learning algorithms do not accept missing data, we imputed missing observations using linear interpolation.
For the clinical metrics where values for the right and left limb were computed separately (GDI, knee flexion angle at maximum extension, and SEMLS), we used the time series of keypoints (knee, ankle, heel, and first toe) of the given limb as predictors. Other derived time series, such as the difference in x position between the ipsilateral and contralateral ankle, or joint angles (for knee and ankle), were also computed separately for each limb. We ensured that the training, validation,
search that had problems converging. The final optimal setting of parameters was an initial learning rate of 10^-3, decaying the learning rate by 20% every 10 epochs, and setting λ = 3.16 × 10^3 for the L2 regularization. Regularization (both L2 and dropout) is fundamental for our training procedure since our final CNN model has 47,840 trainable parameters, i.e., on the order of magnitude of the training sample.
Our input volume had dimension 124 × 12. The depth was only 12 because preliminary analysis indicated that dropping several of the time series improved performance. We used the same set of features for all models to further simplify feature engineering. The features we used were the normalized (x, y) image-plane coordinates of ankles, knees, hips, first (big) toes, projected angles of the ankle and knee flexion, the distance between the first toe and ankle, and the distance between left ankle and right ankle. Our interpretation of this finding was that some time series, such as the x-coordinate of the left ear, were too noisy to be helpful.
We trained the CNN on 124-frame segments from the videos. We augmented the time-series data using a method sometimes referred to as window slicing, which allowed us to generate many training segments from each video. By covering a variety of starting timepoints, this approach also made the model more robust to variations in the initial frame. From each input time series, X, with length 500 in the time dimension and an associated clinical metric (e.g., GDI), y, we extracted overlapping segments of 124 frames in length, with each segment separated by 31 frames. Thus for a given datapoint (y, X), we constructed the segments (y, X[:, 0:124]), (y, X[:, 31:155]), …, (y, X[:, 372:496]). Note that each video segment was labeled with the same ground-truth clinical metric (y). We also dropped any segments that had more than 25% of their data missing. For a given video X^(i), we use the notation X_j^(i), j = 1, 2, …, c(i), to refer to its derived segments, where 1 ≤ c(i) ≤ 12 counts the number of segments that are in the dataset.
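The window-slicing step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name `window_slices`, the use of NaN to mark frames that were missing before imputation, and the synthetic input array are all hypothetical. Only the segment length (124), the stride (31), and the 25% missing-data threshold come from the text.

```python
import numpy as np


def window_slices(X, y, length=124, stride=31, max_missing=0.25):
    """Cut overlapping segments from X (shape: features x frames).

    Every segment inherits the video-level label y. Segments whose
    fraction of NaN entries exceeds max_missing are dropped.
    """
    n_frames = X.shape[1]
    segments = []
    for start in range(0, n_frames - length + 1, stride):
        seg = X[:, start:start + length]
        # NaN stands in for "missing" here; this encoding is an assumption.
        if np.isnan(seg).mean() > max_missing:
            continue
        segments.append((y, seg))
    return segments


# Synthetic 12-feature, 500-frame series with the first 120 frames missing.
X = np.random.default_rng(0).normal(size=(12, 500))
X[:, :120] = np.nan
segments = window_slices(X, y=75.0)
print(len(segments), segments[0][1].shape)
```

For a 500-frame series the loop visits start frames 0, 31, …, 372, which matches the slices listed in the text; windows dominated by missing frames are skipped before training.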
[Figure 6: schematic of the four block types, ConvBlock(s, f), MaxPooling(r), Flatten(), and Dense(d2), and of the final model, which stacks ConvBlock(8,32), MaxPooling(2), MaxPooling(3), Flatten(), Dense(10), and Dense(1) blocks.]
Fig. 6 Convolutional neural network architecture. Our CNN is composed of four types of blocks. The convolutional block (ConvBlock) maps a multivariate
time series (w × d) into another multivariate time series (w × f) using f parameterized one-dimensional convolutions (d × s), i.e. sliding filters with learnable
parameters. Convolutions are followed by a nonlinear activation function and a normalization component. The maximum pooling block (MaxPooling)
extracts the maximum value from a sequence of r values, thus reducing the dimensionality from w to w/r. The flattening block (Flatten) changes the shape
of an array of dimensions w × d to a vector of dimension dw. The dense block (Dense) is a multiple linear regression from d1-dimensional space to
d2-dimensional space with a nonlinear function at the output (see Methods). The diagram on the right shows the sequential combination of these blocks used
in our final model.
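To make the caption's shape bookkeeping concrete, the sketch below traces how each block type changes the array shape for a 124 × 12 input. It is an illustration only and performs no learning. The ordering of blocks in `STACK` (two ConvBlocks before each pooling step) is an assumption about the diagram, as is the use of floor division when the width is not divisible by the pooling size; the helper names are hypothetical.

```python
def conv_block(shape, s, f):
    """ConvBlock(s, f): f filters of width s; maps w x d to w x f."""
    w, _ = shape
    return (w, f)


def max_pooling(shape, r):
    """MaxPooling(r): keeps the maximum of every r values, w -> w / r."""
    w, d = shape
    return (w // r, d)  # floor division is an assumption


def flatten(shape):
    """Flatten(): w x d array becomes a vector of length d * w."""
    w, d = shape
    return (w * d,)


def dense(shape, d2):
    """Dense(d2): maps a d1-dimensional vector to d2 dimensions."""
    return (d2,)


# Assumed ordering of the blocks named in the figure.
STACK = [
    (conv_block, (8, 32)), (conv_block, (8, 32)), (max_pooling, (2,)),
    (conv_block, (8, 32)), (conv_block, (8, 32)), (max_pooling, (2,)),
    (conv_block, (8, 32)), (conv_block, (8, 32)), (max_pooling, (3,)),
    (flatten, ()), (dense, (10,)), (dense, (1,)),
]


def trace_shapes(input_shape, stack):
    """Return the list of shapes seen before and after every block."""
    shapes = [input_shape]
    for layer, args in stack:
        shapes.append(layer(shapes[-1], *args))
    return shapes


shapes = trace_shapes((124, 12), STACK)
for shape in shapes:
    print(shape)
```

Under these assumptions the time dimension shrinks from 124 to 62, 31, and then 10 frames before flattening, and the last dense block emits a single value per segment.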
To train the neural network models we used two loss functions: mean squared error (for regression tasks) or cross-entropy (for classification tasks). The mean squared error is the average squared difference between predicted and true labels. The cross-entropy loss, L(y, p), is a distance between the true and predicted distribution defined as
$$L(y, p) = -\left(y \log(p) + (1 - y) \log(1 - p)\right), \qquad (1)$$
where y is a true label and p is a predicted probability.
Since some videos had more segments in the training set than others (due to different amounts of missing data), we slightly modified the mean squared error loss function, MSE'(y_i, ŷ_i), so that videos with more available segments were not overly emphasized during training:
$$\mathrm{MSE}'(y_i, \hat{y}_i) = (y_i - \hat{y}_i)^2 / c(i), \qquad (2)$$
where y_i is a true label, ŷ_i is a predicted label, and c(i) is the number of segments available for the i-th video.
To get the final predicted gait metric for a given video, we averaged the predicted values from the video segments. However, this averaging operation introduced some bias towards video segments that appeared more often in training (e.g., those in the middle of the video). We reduced this bias by fitting a linear model on the training set, regressing true target values on predicted values. We then used this same linear model to remove the bias of the validation set predictions.
Ridge regression and random forest. We compared our deep learning model with classical supervised learning models, including RR and RF. We chose to use RR for its simplicity and its accompanying tools for interpretability and inference, and RF for its robustness in covering nonlinear effects. Both RF and RR require vectors of fixed length as input. The typical way to use these models in the context of time-series data is to first extract high-level characteristics of the time series, then use them as features. In our work, we chose to compute the 10th, 25th, 50th, 75th, and 90th percentiles, and the standard deviation of each of the 12 univariate time series used in the CNNs. Note that for these methods, we used the entire 500-frame multivariate time series from each video rather than 124-frame segments as in the CNNs.
RR is an example of penalized regression that combines L2 regularization with ordinary least squares. It seeks to find weights β that minimize the cost function:
$$\sum_{i=1}^{m} \left(y_i - x_i^T \beta\right)^2 + \alpha \sum_{j=1}^{p} \beta_j^2, \qquad (3)$$
where x_i are the input features, y_i are the true labels, m is the number of observations, and p is the number of input features.
One benefit of RR is that it allows us to trade off between variance and bias; lower values of α correspond to less regularization, hence greater variance and less bias. The reverse is true for higher values of α.
The RF42 is a robust generalization of decision trees. A single decision tree consists of a series of branches where a new observation is put through a series of binary decisions (e.g., median ankle position <0.5 or ≥0.5). The leaves of the tree at the end of each sequence of branches contain filtered training observations that are then used to make a prediction on the new observation (e.g., using the mean value of the filtered training observations). The RF comprises a set of decision trees; for each decision tree in the forest, the variables used to split at each branch (e.g., median ankle position) are stochastically chosen, and the splitting thresholds (e.g., 0.5) are determined accordingly. To build a forest, the user must select hyperparameters, including the depth (i.e., number of sequential branches) of a single tree d and total number of trees n. For inference on a new observation, RF models use the average prediction from all trees. Trees are scale invariant and are often a method of choice by practitioners due to their robustness and ability to capture complex nonlinear relationships between the input features and the label to be predicted43.
We conducted a grid search to tune hyperparameters for the RR and RF models. Instead of doing k-fold cross validation, we used just one validation set to pick the parameters. This was to keep the results consistent with those of the CNN, which only used one validation set for computational reasons. We found the best setting for the RF was n = 200, d = 10, and for the RR α = 0. The fact that α = 0 worked best for the RR suggests that variance was not the main bottleneck in the RR performance.
Evaluation. We split the dataset into training, validation, and test sets, such that the test and validation sets each contained 10% of all patients (i.e., 1091 patients in the training set and 136 patients in each of the test and validation sets). We ensured
that each patient’s videos were only included in one of the sets. For CNNs, after performing window slicing, we ended up with 16,414, 1943, and 1983 segments in the training, validation, and test sets, respectively.
For the regression tasks, we evaluated the goodness of fit for each model using the correlation between true and predicted values in the test set. For the binary classification task (surgery prediction), we used the Receiver Operating Characteristic (ROC) curve to visualize the results and evaluated model performance using the AUC. The ROC curve characterizes how a classifier’s true positive rate varies with the false positive rate, and the AUC is the integral of the ROC curve. For the multiclass classification task (GMFCS), we evaluated model performance using the quadratic-weighted Cohen’s κ defined as
$$\kappa = 1 - \frac{\sum_{i=1}^{k} \sum_{j=1}^{k} w_{ij} x_{ij}}{\sum_{i=1}^{k} \sum_{j=1}^{k} w_{ij} m_{ij}}, \qquad (4)$$
where w_ij, x_ij, and m_ij were weights, observed, and expected (under the null hypothesis of independence) elements of confusion matrices, and k was the number of classes. Quadratic-weighted Cohen’s κ measures disagreement between the true label and predicted label, penalizing large errors quadratically. For ordinal data, quadratic-weighted Cohen’s κ can be interpreted as a discrete version of the normalized mean squared error.
To better understand properties of our predictions we used analysis of variance methodology44. We observed that total variability of parameters across subjects and trials can be decomposed into three components: patient variability, visit variability, and remaining trial variability. If we define SS as a sum of squares of differences between true values and predictions, one can show that it follows
$$SS = SS_P + SS_V + SS_T, \qquad (5)$$
where SS_P is the patient-to-patient sum of squares, SS_V is the visit-to-visit variability for each patient, and SS_T is the trial-to-trial variability for each visit. To assess performance of the model we compare the SS of our model with the SS of the null model (population mean as a predictor). We refer to the ratio of the two as the unexplained variance (or one minus the ratio as the variance explained).
In our work, we were unable to assess SS_T since videos and ground-truth measurements were collected in different trials. However, for most of the gait parameters of interest SS_T is negligible. In fact, if it were large, it would make lab measurements unreliable and such parameters would not be practically useful.
Our metrics based on analysis of variance ignore bias in predictions, so it was important to explicitly check if predictions were unbiased. To that end, for each model we tested if the mean of residuals was significantly different from 0. Each p value was higher than 0.05, indicating there was no statistical evidence of bias at the significance level 0.05. Given a relatively large number of subjects in our study, this also corresponds to tight confidence intervals for the mean of residuals. This reassures us that the bias term can be neglected in the analysis.
Reporting summary. Further information on research design is available in the Nature Research Reporting Summary linked to this article.
Data availability
Video data used in this study were not publicly available due to restrictions on sharing patient health information. These data were processed by Gillette Specialty Healthcare to a de-identified form using OpenPose software as described in the manuscript. The processed de-identified dataset, together with clinical variables used in the paper associated with the processed datapoints, were shared by Gillette Specialty Healthcare and are now publicly available at https://simtk.org/projects/video-gaitlab, https://doi.org/10.18735/j0rz-0k12.
Code availability
We ran OpenPose on a desktop equipped with an NVIDIA Titan X GPU. All other computing was done on a Google Cloud instance with 8 cores and 16 GB of RAM and did not require GPU acceleration. We used scikit-learn (for training the RR and RF models; scikit-learn.org) and keras (for training the CNN; keras.io). SciPy (scipy.org) was also used for smoothing and imputing the time series. Scripts for training machine learning models, the analysis of the results and code used for generating all figures are available in our GitHub repository http://github.com/stanfordnmbl/mobile-gaitlab/.
Received: 27 January 2020; Accepted: 9 July 2020;
References
1. Hanakawa, T., Fukuyama, H., Katsumi, Y., Honda, M. & Shibasaki, H. Enhanced lateral premotor activity during paradoxical gait in Parkinson’s disease. Ann. Neurol. 45, 329–336 (1999).
2. Al-Zahrani, K. S. & Bakheit, A. M. O. A study of the gait characteristics of patients with chronic osteoarthritis of the knee. Disabil. Rehabil. 24, 275–280 (2002).
3. von Schroeder, H. P., Coutts, R. D., Lyden, P. D., Billings, E. Jr & Nickel, V. L. Gait parameters following stroke: a practical assessment. J. Rehabil. Res. Dev. 32, 25–31 (1995).
4. Gage, J. R., Schwartz, M. H., Koop, S. E. & Novacheck, T. F. The identification and treatment of gait problems in cerebral palsy. (John Wiley & Sons, 2009).
5. Martin, C. L. et al. Gait and balance impairment in early multiple sclerosis in the absence of clinical disability. Mult. Scler. 12, 620–628 (2006).
6. D’Angelo, M. G. et al. Gait pattern in Duchenne muscular dystrophy. Gait Posture 29, 36–41 (2009).
7. Barton, G., Lisboa, P., Lees, A. & Attfield, S. Gait quality assessment using self-organising artificial neural networks. Gait Posture 25, 374–379 (2007).
8. Hannink, J. et al. Sensor-based gait parameter extraction with deep convolutional neural networks. IEEE J. Biomed. Health Inf. 21, 85–93 (2017).
9. Wahid, F., Begg, R. K., Hass, C. J., Halgamuge, S. & Ackland, D. C. Classification of Parkinson’s disease gait using spatial-temporal gait features. IEEE J. Biomed. Health Inf. 19, 1794–1802 (2015).
10. Xu et al. Accuracy of the Microsoft Kinect™ for measuring gait parameters during treadmill walking. Gait Posture 42, 145–151 (2015).
11. Luo, Z. et al. Computer vision-based descriptive analytics of seniors’ daily activities for long-term health monitoring. Mach. Learning Healthc. 2, 1–18 (2018).
12. LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444 (2015).
13. Lin, T.-Y. et al. Microsoft COCO: common objects in context. Comput. Vis. 2014, 740–755 (2014).
14. Cao, Z., Simon, T., Wei, S.-E. & Sheikh, Y. Realtime Multi-person 2D Pose Estimation Using Part Affinity Fields. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), https://doi.org/10.1109/cvpr.2017.143 (2017).
15. Pishchulin, L. et al. DeepCut: Joint Subset Partition and Labeling for Multi Person Pose Estimation. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), https://doi.org/10.1109/cvpr.2016.533 (2016).
16. Seethapathi, N., Wang, S., Saluja, R., Blohm, G. & Kording, K. P. Movement science needs different pose tracking algorithms. Preprint at https://arxiv.org/abs/1907.10226 (2019).
17. Sato, K., Nagashima, Y., Mano, T., Iwata, A. & Toda, T. Quantifying normal and parkinsonian gait features from home movies: Practical application of a deep learning–based 2D pose estimator. PLOS ONE 14, e0223549 (2019).
18. Kidziński, Ł., Delp, S. & Schwartz, M. Automatic real-time gait event detection in children using deep neural networks. PLoS One 14, e0211466 (2019).
19. Galli, M., Cimolin, V., De Pandis, M. F., Schwartz, M. H. & Albertini, G. Use of the Gait Deviation Index for the evaluation of patients with Parkinson’s disease. J. Mot. Behav. 44, 161–167 (2012).
20. Bohnen, N. I. et al. Gait speed in Parkinson disease correlates with cholinergic degeneration. Neurology 81, 1611–1616 (2013).
21. O’keeffe, S. T. et al. Gait disturbance in Alzheimer’s disease: a clinical study. Age Ageing 25, 313–316 (1996).
22. Muir, S. W. et al. Gait assessment in mild cognitive impairment and Alzheimer’s disease: the effect of dual-task challenges across the cognitive spectrum. Gait Posture 35, 96–100 (2012).
23. Mündermann, A., Dyrby, C. O., Hurwitz, D. E., Sharma, L. & Andriacchi, T. P. Potential strategies to reduce medial compartment loading in patients with knee osteoarthritis of varying severity: reduced walking speed. Arthritis Rheum. 50, 1172–1178 (2004).
24. Nadeau, S., Gravel, D., Arsenault, A. B. & Bourbonnais, D. Plantarflexor weakness as a limiting factor of gait speed in stroke subjects and the compensating role of hip flexors. Clin. Biomech. 14, 125–135 (1999).
25. Verghese, J. et al. Abnormality of gait as a predictor of non-Alzheimer’s dementia. N. Engl. J. Med. 347, 1761–1768 (2002).
26. White, L. J. et al. Resistance training improves strength and functional capacity in persons with multiple sclerosis. Mult. Scler. 10, 668–674 (2004).
27. Chia, K. & Sangeux, M. Quantifying sources of variability in gait analysis. Gait Posture 56, 68–75 (2017).
28. Prosser, L. A., Lauer, R. T., VanSant, A. F., Barbe, M. F. & Lee, S. C. K. Variability and symmetry of gait in early walkers with and without bilateral cerebral palsy. Gait Posture 31, 522–526 (2010).
29. Schwartz, M. H. & Rozumalski, A. The Gait Deviation Index: a new comprehensive index of gait pathology. Gait Posture 28, 351–357 (2008).
30. Palisano, R. et al. Development and reliability of a system to classify gross motor function in children with cerebral palsy. Dev. Med. Child Neurol. 39, 214–223 (1997).
31. Rasmussen, H. M., Nielsen, D. B., Pedersen, N. W., Overgaard, S. & Holsgaard-Larsen, A. Gait Deviation Index, Gait Profile Score and Gait Variable Score in children with spastic cerebral palsy: Intra-rater reliability and agreement across two repeated sessions. Gait Posture 42, 133–137 (2015).
32. Rackauskaite, G., Thorsen, P., Uldall, P. V. & Ostergaard, J. R. Reliability of GMFCS family report questionnaire. Disabil. Rehabil. 34, 721–724 (2012).
33. McDowell, B. C., Kerr, C. & Parkes, J. Interobserver agreement of the Gross Motor Function Classification System in an ambulant population of children with cerebral palsy. Dev. Med. Child Neurol. 49, 528–533 (2007).
34. Böhm, H. & Döderlein, L. Gait asymmetries in children with cerebral palsy: do they deteriorate with running? Gait Posture 35, 322–327 (2012).
35. Tedroff, K., Hägglund, G. & Miller, F. Long-term effects of selective dorsal rhizotomy in children with cerebral palsy: a systematic review. Dev. Med. Child Neurol. 62, 554–562 (2020).
36. Merriaux, P., Dupuis, Y., Boutteau, R., Vasseur, P. & Savatier, X. A study of vicon system positioning performance. Sensors 17, 1591 https://doi.org/10.3390/s17071591 (2017).
37. Pinzone, O., Schwartz, M. H., Thomason, P. & Baker, R. The comparison of normative reference data from different gait analysis services. Gait Posture 40, 286–290 (2014).
38. Kadaba, M. P., Ramakrishnan, H. K. & Wootten, M. E. Measurement of lower extremity kinematics during level walking. J. Orthop. Res. 8, 383–392 (1990).
39. Davis, R. B., Õunpuu, S., Tyburski, D. & Gage, J. R. A gait analysis data collection and reduction technique. Hum. Mov. Sci. 10, 575–587 (1991).
40. Schwartz, M. H., Trost, J. P. & Wervey, R. A. Measurement and management of errors in quantitative gait data. Gait Posture 20, 196–203 (2004).
41. Sullivan, G. J., Topiwala, P. N. & Luthra, A. The H.264/AVC advanced video coding standard: overview and introduction to the fidelity range extensions. Appl. Digit. Image Process. https://doi.org/10.1117/12.564457 (2004).
42. Breiman, L. Random forests. Mach. Learn. 45, 5–32 (2001).
43. Hastie, T., Tibshirani, R. & Friedman, J. The elements of statistical learning: data mining, inference, and prediction. (Springer Science & Business Media, 2013).
44. Box, G. E. P. Some theorems on quadratic forms applied in the study of analysis of variance problems, II. effects of inequality of variance and of correlation between errors in the two-way classification. Ann. Math. Stat. 25, 484–498 (1954).
Acknowledgements
Our research was supported by the Mobilize Center, a National Institutes of Health Big Data to Knowledge (BD2K) Center of Excellence through Grant U54EB020405, and RESTORE Center, a National Institutes of Health Center through Grant P2CHD10191301.
Author contributions
Conceptualization: L.K., S.L.D., M.H.S. Methodology: L.K., B.Y., J.L.H., A.R., S.L.D., M.H.S. Data curation: L.K., B.Y., A.R., M.H.S. Analysis: L.K., B.Y., J.L.H. Writing: L.K., B.Y., J.L.H., A.R., S.L.D., M.H.S. Funding acquisition: S.L.D., M.H.S.
Competing interests
The authors declare no competing interests.
Additional information
Supplementary information is available for this paper at https://doi.org/10.1038/s41467-020-17807-z.
Correspondence and requests for materials should be addressed to Ł.K. or M.H.S.
Peer review information Nature Communications thanks Elyse Passmore, Reinald Brunner and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.
Reprints and permission information is available at http://www.nature.com/reprints
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
This is a U.S. government work and not under copyright protection in the U.S.; foreign copyright protection may apply 2020