
npj | digital medicine — Article
Published in partnership with Seoul National University Bundang Hospital
https://doi.org/10.1038/s41746-024-01173-x

Quantifying impairment and disease severity using AI models trained on healthy subjects

Boyang Yu1, Aakash Kaku1, Kangning Liu1, Avinash Parnandi2,3, Emily Fokas2, Anita Venkatesan2, Natasha Pandit3, Rajesh Ranganath1,4, Heidi Schambra2,3 & Carlos Fernandez-Granda1,4

Automatic assessment of impairment and disease severity is a key challenge in data-driven medicine. We propose a framework to address this challenge, which leverages AI models trained exclusively on healthy individuals. The COnfidence-Based chaRacterization of Anomalies (COBRA) score exploits the decrease in confidence of these models when presented with impaired or diseased patients to quantify their deviation from the healthy population. We applied the COBRA score to address a key limitation of current clinical evaluation of upper-body impairment in stroke patients. The gold-standard Fugl-Meyer Assessment (FMA) requires in-person administration by a trained assessor for 30–45 minutes, which restricts monitoring frequency and precludes physicians from adapting rehabilitation protocols to the progress of each patient. The COBRA score, computed automatically in under one minute, is shown to be strongly correlated with the FMA on an independent test cohort for two different data modalities: wearable sensors (ρ = 0.814, 95% CI [0.700, 0.888]) and video (ρ = 0.736, 95% CI [0.584, 0.838]). To demonstrate the generalizability of the approach to other conditions, the COBRA score was also applied to quantify severity of knee osteoarthritis from magnetic-resonance imaging scans, again achieving significant correlation with an independent clinical assessment (ρ = 0.644, 95% CI [0.585, 0.696]).

In current clinical practice, assessment of impairment and disease severity typically relies on examinations by medical professionals1,2. As a result, assessment is often qualitative and its frequency is constrained by clinician availability. Developing data-driven quantitative metrics of impairment and disease severity has the potential to enable continuous and objective monitoring of patient recovery or decline. Such monitoring would facilitate personalized treatment and administration of appropriate therapeutic interventions in telehealth and other remotely supervised contexts where ongoing access to clinicians is not readily available3–5.
Artificial-intelligence (AI) models based on machine learning are a natural tool to perform data-driven patient assessment6–19. These models can be trained in a supervised fashion to estimate labels associated with patient data from large curated datasets of examples11,14,20. Unfortunately, it is often very challenging to assemble datasets containing an exhaustive representation of severity or impairment levels, which is necessary to ensure the accuracy of the AI models21–26. Moreover, supervised approaches require the existence of an objective quantitative metric that can be computed for every patient in the dataset, but such metrics do not exist for many medical conditions27,28.
To address these challenges, we consider the problem of performing automatic patient assessment using AI models trained only on data from healthy subjects. This is an anomaly detection problem, where the goal is to identify data points that are systematically different from a reference population29. Existing anomaly-detection methods for medical data are mostly based on generative models30,31. These models are designed to reconstruct high-dimensional data from a learned low-dimensional representation. Once trained, they are typically unable to accurately reconstruct data that are anomalous, due to their inconsistency with the training set. Consequently, the model reconstruction error tends to be higher for anomalies than for normal data, and can therefore be used as an anomaly-detection score. This approach has been applied to identify chronic brain infarcts32, Alzheimer's disease33, microstructural abnormalities in diffusion

1Center for Data Science, New York University, 60 Fifth Ave, New York, NY 10011, USA. 2Department of Neurology, NYU Grossman School of Medicine, 550 1st Ave, New York, NY 10016, USA. 3Department of Rehabilitation Medicine, NYU Grossman School of Medicine, 550 1st Ave, New York, NY 10016, USA. 4Courant Institute of Mathematical Sciences, New York University, 251 Mercer St, New York, NY 10012, USA. e-mail: [email protected]; [email protected]


MRI tractometry34, and abnormalities of cosmetic breast reconstruction in cancer patients35.
Anomaly detection based on generative models has an important disadvantage: it does not constrain the AI model to learn clinically relevant features. Consequently, the model reconstruction error may depend on properties of the data unrelated to the medical condition of interest. Here, we propose a novel anomaly-detection framework that is tailored to a specific medical condition. This is achieved by utilizing an AI model that predicts an attribute of the data, which is directly relevant to the condition (e.g. type of motion primitive performed by the stroke-impaired side, or tissue type for knee osteoarthritis). Crucially, the model is trained exclusively on healthy subjects, using annotated data describing the attribute. When the models are presented with data where the attribute is affected by the medical condition of interest, we observe that the average model confidence tends to decrease proportionally to severity. This yields a quantitative patient-assessment metric, which we call the COnfidence-Based chaRacterization of Anomalies (COBRA) score. Figure 1 provides a schematic description of the proposed framework.
The COBRA score is inspired by a technique proposed in36, which identifies anomalous data points using the confidence of AI models. In this and subsequent works37–41 anomalies were identified based on the loss of confidence of the AI models for a single data point. The effectiveness of this approach depends on the overlap in the distribution of confidences42. In our applications of interest, AI models trained on healthy subjects tend to lose confidence on average when presented with multiple inputs from an impaired or diseased patient. However, the confidence for individual data points is very noisy and results in an unreliable metric, as illustrated by Fig. 2. For this reason, the COBRA score is computed using multiple data points for each subject, corresponding to different motions in the application to stroke and to different voxels in the application to knee osteoarthritis. Aggregating the confidence associated with multiple data via averaging dramatically reduces the noise, resulting in a stable and accurate subject-level metric.
Our proposed framework can be interpreted as a form of normative modeling, where the goal is to quantify individual deviations from a reference population43,44. Existing normative models in neuroscience and psychiatry are based on probabilistic regression, which explicitly captures the normal variation of brain-derived phenotypes45. In contrast, the AI models used to compute the COBRA score perform normative modeling implicitly, by learning features associated with the attribute of interest within the reference population.
We apply the COBRA score to automatically evaluate the impairment level of stroke patients. Stroke commonly causes motor impairment in the upper extremity (UE), characterized by loss of strength, precise control, and intrusive muscle co-activation, which collectively interfere with normal function. Rehabilitation seeks to reduce motor impairment through the repeated practice of functional movements with the UE. In this process, it is crucial to monitor the impairment level of the patient. The gold-standard method of measuring motor impairment is the Fugl-Meyer Assessment (FMA)2. Unfortunately, it requires in-person administration by a trained assessor and is time-consuming (30–45 minutes), which makes it impractical for frequent monitoring. Automatic assessment of motion impairment based on video or wearable-sensor data would address these limitations, facilitating actionable and granular tracking of motor recovery.
Motor impairment evaluation in stroke patients illustrates the difficulty of applying standard supervised AI methodology to patient assessment. An existing study shows the feasibility of the approach46, but only includes 17 patients. Training a supervised model to predict impairment and rigorously evaluating its performance on held-out data requires a database of at least hundreds, and ideally thousands of patients, labeled with the corresponding impairment level. However, the largest such publicly available dataset consists of just 51 patients47. Here, we use this dataset as a held-out test set to evaluate the proposed framework.
In order to assess impairment in stroke patients using the COBRA score, we trained AI models to predict classes of UE motion, known as functional primitives, from video and wearable-sensor data. The models were trained on a cohort of healthy individuals. Crucially, although the healthy cohort is relatively small (25 individuals), the number of labeled primitives per patient is large (typically around 300,000), which provides a rich training dataset with more than 6 million examples. Once trained on the healthy subjects, the models were applied to data from a test cohort of stroke patients and held-out healthy subjects performing nine different stroke rehabilitation activities. The confidence of the motion predictions for each test subject was averaged to compute the corresponding COBRA score. Our results show that the COBRA score is correlated with the Fugl-Meyer Assessment of the patients, obtained in person by trained experts, for both data modalities. The score is computed in under a minute and does not require expert input. This greatly expands on our preliminary findings, which used a similar approach with wearable-sensor data from a single rehabilitation activity48.
To demonstrate the general applicability of the COBRA framework, we show that it can be used to evaluate severity of knee osteoarthritis from magnetic resonance imaging (MRI) scans. Knee osteoarthritis is a musculoskeletal disorder characterized by a progressive loss of knee cartilage. To quantify severity, we trained an AI model to perform segmentation of different knee tissues (including cartilage) on MRIs of healthy knees. We then applied the model to knee MRIs from a test cohort of diseased patients and held-out healthy subjects. The confidence of the tissue predictions for each test subject was averaged to compute the corresponding COBRA score. The resulting COBRA score is again highly correlated with an independent assessment of disease severity (in this case, the Kellgren-Lawrence grade).

Results
Quantification of impairment in stroke patients
The application of the COBRA score to impairment quantification in stroke patients was carried out using the publicly available StrokeRehab dataset47. StrokeRehab contains video and wearable-sensor data from a cohort of 29 healthy individuals and 51 stroke patients performing multiple trials of nine rehabilitation activities (described in Supplementary Tables 1, 2). The impairment level of each patient was quantified via the Fugl-Meyer assessment (FMA)2. The FMA score is a number between 0 (maximum impairment) and 66 (healthy) equal to the sum of itemized scores (each from 0 to 2) for 33 upper body mobility assessments carried out in-clinic by a trained expert. The wearable-sensor and video data are labeled to indicate what functional primitive is being performed by the paretic arm over time: reach (UE motion to make contact with a target object), reposition (UE motion to move into proximity of a target object), transport (UE motion to convey a target object in space), stabilization (minimal UE motion to hold a target object still), and idle (minimal UE motion to stand at the ready near a target object).
The COBRA score was computed based on AI models trained to predict the functional primitives performed by a training cohort, which includes 25 of the 29 healthy individuals (selected at random). The model input was either wearable sensor or video data. Detailed descriptions of these models are provided in the Methods section. The models were applied to a test cohort consisting of the remaining 4 healthy individuals and the 51 stroke patients. Demographic and clinical information about the training and test cohorts is provided in Table 1. The COBRA score is equal to the average of the model confidence for data points identified by the models as corresponding to functional primitives that involve motion (transport, reposition, and reach).
The COBRA score was evaluated by computing its Pearson correlation coefficient with the Fugl-Meyer Assessment (FMA) score2 on the test cohort of 51 stroke patients and 4 healthy individuals (n = 55). The correlation coefficient is 0.814 (95% CI [0.700, 0.888]) for the wearable-sensor data and 0.736 (95% CI [0.584, 0.838]) for the video data. Figure 3a shows scatterplots of the COBRA and FMA scores. For both data modalities, the COBRA score has a strong, statistically significant correlation with the in-clinic assessment. The Supplementary Methods reports additional results on the wearable-sensor data using a completely different AI architecture for


[Figure 1 schematic: two-panel overview of the COBRA workflow, for quantification of motion impairment in stroke patients (top) and of knee-osteoarthritis severity (bottom). Step 1: training on healthy subjects; Step 2: anomaly characterization via the confidence level (COBRA score).]
Fig. 1 | The COnfidence-Based chaRacterization of Anomalies (COBRA) score. In Step 1, an AI model is trained to perform a clinically meaningful task on data from healthy individuals. For impairment quantification in stroke patients, the task is prediction of functional primitive motions from videos or wearable-sensor data (top). For severity quantification of knee osteoarthritis, the task is segmentation of knee tissues from magnetic resonance imaging scans (bottom). In Step 2, the COBRA score is computed based on the confidence of the AI model when performing the task on patient data. Data from patients with higher degrees of impairment or severity differ more from the healthy population used for training, which results in decreased model confidence and hence a lower COBRA score.

primitive prediction. The correlation coefficient between the resulting COBRA score and the FMA score is again high: 0.774 (95% CI [0.636, 0.865]). This indicates that the proposed approach is robust to the choice of underlying AI model.
Figure 4 reports the correlation coefficients between the FMA score and the COBRA score computed using subsets of the data corresponding to individual rehabilitation activities (see Supplementary Tables 1, 2 for a detailed description of the activities). Scatterplots of the FMA and COBRA scores for each activity are provided in Supplementary Figs. 2, 3. For both data modalities, the correlation is higher for more structured activities (moving objects to targets on a table-top or shelf, donning glasses) and is lower for more complex activities (hair-combing, face-washing,


Fig. 2 | Averaging model confidence yields a discriminative subject-level metric. The plots show histograms and kernel density estimates of the confidence of a model trained on healthy subjects when presented with test data from an impaired or diseased patient (red), and from a held-out healthy individual (blue). The confidence distributions overlap, so individual values do not allow discrimination between healthy and impaired individuals. In contrast, the average confidence is systematically higher for healthy subjects, and therefore provides a discriminative subject-level metric. The first and second plots correspond to wearable-sensor and video data associated with the same healthy and impaired individuals from the test cohort for quantification of stroke-induced impairment. The third plot corresponds to MRI scans from a healthy and a diseased individual in the knee-osteoarthritis test cohort.

teeth-brushing, feeding), which tend to involve more heterogeneous motions across individuals. The correlation coefficient with the FMA score is lower for the COBRA score computed from individual activities than for the COBRA score that aggregates all activities. The only exception is the table-top task, which is the most regular and structured activity. The correlation between the corresponding COBRA score, computed from wearable-sensor data, and the FMA score is very high (0.849, 95% CI [0.752, 0.910]), which suggests that it may be possible to obtain accurate impairment assessment from a reduced amount of data using activities that are highly structured.
An important consideration when applying the proposed framework is that extraneous factors may produce a spurious decrease in the confidence of the AI model. Figure 5 shows that this occurs for the table-top activity, which was carried out with light-colored and dark-colored objects by different subjects. Dark objects are much more difficult to detect in videos, which produces a systematic loss of confidence in the video-based AI model that translates to lower COBRA scores. This explains why the correlation between the FMA score and the COBRA score is lower for the table-top video data than for the table-top wearable-sensor data, which is unaffected by this confounding factor. As depicted in Fig. 5, we can correct for the confounding factor by stratifying the subjects according to the object color. This increases the correlation coefficient from 0.615 (95% CI [0.411, 0.760]) to 0.679 (95% CI [0.294, 0.874]) for dark objects and 0.756 (95% CI [0.553, 0.874]) for light objects. For comparison, the correlation of the video-based COBRA score computed from all activities is 0.736 (95% CI [0.584, 0.838]). Supplementary Fig. 4 shows that image quality can also act as a confounding factor: blurring the video images results in a systematic decrease of the COBRA score, which can also be corrected via stratification.
The COBRA score is the average of the AI-model confidence for data points identified by the model as corresponding to functional actions that involve motion (reach, reposition, transport), as opposed to functional actions that do not (idle, stabilize). These data can be considered as clinically relevant to impairment quantification associated with motion. Figure 6 shows that the correlation coefficient between the FMA score and a COBRA score computed from data points identified as non-motion functional actions is low (in fact, for the video data it is not even statistically significant). It also shows that a COBRA score computed from all actions has a lower correlation with the FMA score than the proposed motion-based COBRA score for both data modalities.
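To make this computation concrete, the following sketch illustrates how a motion-restricted COBRA score and its correlation with the FMA could be computed from per-datapoint softmax outputs. It is a minimal illustration rather than the authors' released implementation: the array shapes, the class ordering, and the percentile-bootstrap confidence interval are assumptions made for the example.

```python
import numpy as np
from scipy import stats

# Hypothetical ordering of the five functional primitive classes.
PRIMITIVES = ["reach", "reposition", "transport", "stabilize", "idle"]
MOTION_CLASSES = [0, 1, 2]  # reach, reposition, transport

def cobra_score(probs):
    """COBRA score for one subject.

    probs: array of shape (num_data_points, num_classes) holding the softmax
    outputs of a model trained only on healthy subjects. The score is the mean
    confidence (maximum probability) over the data points whose predicted
    class is a motion primitive.
    """
    predicted = probs.argmax(axis=1)    # predicted class per data point
    confidence = probs.max(axis=1)      # model confidence per data point
    relevant = np.isin(predicted, MOTION_CLASSES)
    return float(confidence[relevant].mean())

def correlate_with_fma(subject_probs, fma_scores, n_boot=1000, seed=0):
    """Pearson correlation between COBRA and FMA scores across subjects,
    with a percentile-bootstrap 95% confidence interval."""
    cobra = np.array([cobra_score(p) for p in subject_probs])
    fma = np.asarray(fma_scores, dtype=float)
    rho = stats.pearsonr(cobra, fma)[0]
    rng = np.random.default_rng(seed)
    resampled = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(fma), size=len(fma))
        resampled.append(stats.pearsonr(cobra[idx], fma[idx])[0])
    low, high = np.percentile(resampled, [2.5, 97.5])
    return rho, (low, high)
```

Stratifying by a known confounding factor, as done above for object color, would simply amount to computing the correlation separately within each stratum of subjects.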
Table 1 | Demographic and clinical characteristics of the training and testing cohorts for the application to quantification of motion impairment in stroke patients

                          Training                      Testing
Number of subjects        25                            55
Trials                    1265                          2183
Age                       62.4 ± 13.1                   57.7 ± 14.0
Sex                       13 male, 12 female            25 male, 30 female
Race^a                    10 W, 12 B, 0 A, 1 AI, 2 O    24 W, 14 B, 9 A, 0 AI, 8 O
Paretic side              n/a                           28 left, 23 right, 4 n/a
Fugl-Meyer Assessment     66                            43.5 ± 16.2
Impairment level^b        25 healthy                    4 healthy, 20 mild, 23 moderate, 8 severe
Time since stroke         n/a                           5.4 ± 6.1 years (for stroke patients)

The mean ± standard deviation is reported for age, Fugl-Meyer assessment and time since stroke.
^a Race: White (W), Black (B), Asian (A), American Indian (AI), Other (O).
^b Based on FMA: 0–25 is severe, 26–52 moderate, 53–65 mild, and 66 healthy.

Quantification of knee-osteoarthritis severity
The application of the COBRA score to the quantification of knee-osteoarthritis (OA) severity was carried out using the publicly available OAI-ZIB dataset49. This dataset provides 3D MRI scans of 101 healthy right knees and 378 right knees affected by knee osteoarthritis, a long-term degenerative joint condition. Each knee is labeled with the corresponding Kellgren-Lawrence (KL) grade50, retrieved from the NIH Osteoarthritis Initiative collection51. The KL grade quantifies OA severity on a scale from 0 (healthy) to 4 (severe), as illustrated in Supplementary Fig. 1. Each voxel in the MRI scans is labeled to indicate the corresponding tissue (tibia bone, tibia cartilage, femur bone, femur cartilage or background).
The COBRA score was computed based on an AI model trained to perform tissue segmentation on a training cohort of 44 healthy individuals (selected at random). A detailed description of the model is provided in the Methods section. The model was applied to a test cohort consisting of the remaining 57 healthy individuals and the 378 patients with knee OA. Demographic and clinical information about the training and test cohorts is provided in Table 2. The COBRA score is equal to the average of the model confidence for data points identified by the model as corresponding to cartilage tissue (tibia cartilage and femur cartilage).
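For the imaging application, the analogous computation operates on voxelwise class probabilities. The sketch below is illustrative only: the class ordering and the class-first array layout are assumptions made for the example, not the published implementation.

```python
import numpy as np

# Hypothetical ordering of the five tissue classes.
TISSUES = ["background", "femur bone", "femur cartilage", "tibia bone", "tibia cartilage"]
CARTILAGE_CLASSES = [2, 4]  # femur cartilage, tibia cartilage

def knee_cobra_score(prob_volume):
    """COBRA score for one knee MRI scan.

    prob_volume: per-voxel class probabilities of shape
    (num_classes, depth, height, width), e.g. the view-averaged softmax
    output of a segmentation network trained on healthy knees. The score
    averages the confidence of voxels predicted as cartilage, the tissue
    degraded by knee osteoarthritis.
    """
    predicted = prob_volume.argmax(axis=0)   # (D, H, W) predicted label map
    confidence = prob_volume.max(axis=0)     # (D, H, W) confidence map
    cartilage = np.isin(predicted, CARTILAGE_CLASSES)
    return float(confidence[cartilage].mean())
```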


Fig. 3 | Correlation between the COBRA score and clinical assessment. a The graphs show scatterplots of the Fugl-Meyer assessment (FMA) score, based on in-person examination by an expert, and the proposed data-driven COBRA score computed from wearable-sensor data (left) and from video data (right). The correlation coefficient ρ between the two scores is high, particularly for the wearable-sensor data. b The graph shows scatterplots and density plots of COBRA scores computed from magnetic-resonance imaging (MRI) knee scans of patients with different Kellgren-Lawrence (KL) grades. The KL grade and the COBRA score exhibit significant inverse correlation.

Fig. 4 | Impairment quantification from individual rehabilitation activities. The graph shows the correlation coefficient (indicated by black markers with 95% confidence intervals) between the Fugl-Meyer score of stroke patients and the COBRA score computed from single activities using wearable-sensor (left) or video (right) data. The number of trials available for each activity is indicated by the yellow bars. Simple, more structured activities (Glasses, Shelf, Table-top) have higher correlation than more complicated activities (Face-wash, Feeding, Combing) for both data modalities.

The COBRA score was evaluated by computing its Pearson correlation coefficient with the Kellgren-Lawrence (KL) grading scores50 on the test cohort of 378 patients with knee OA and 57 healthy subjects (n = 435), which equals –0.644 (95% CI [–0.696, –0.585]). There is therefore a significant inverse correlation between the scores, indicating that the COBRA score quantifies knee OA severity. Figure 3b shows scatterplots and density plots of the COBRA scores corresponding to different KL grades. The Supplementary Methods section reports additional results using a different AI architecture for tissue segmentation. The magnitude of the correlation coefficient between the resulting


Fig. 5 | Object color as a confounding factor for the video-based COBRA score and correction via stratification. The table-top rehabilitation activity in the stroke impairment quantification task involves dark and light-colored objects (top row). The bottom left scatterplot shows the COBRA score computed only using video data from this activity and the corresponding Fugl-Meyer assessment (FMA) score. The dark objects are difficult to detect, which results in a systematic loss of confidence in the video-based AI model, and hence lower COBRA scores (independently from the FMA score). The bottom middle and right scatterplots show that stratifying according to object color corrects for the confounding factor, improving the correlation coefficient ρ between the COBRA and FMA scores.

COBRA score and the KL grade is lower, but still statistically significant: –0.429 (95% CI [–0.503, –0.349]).
The COBRA score is computed as an average of the AI-model confidence for voxels identified by the model as corresponding to cartilage, as opposed to bone tissue. These data can be considered as clinically relevant because knee OA produces gradual degradation of articular cartilage (bone alterations and osteophyte formation may also occur, but are less frequent)52,53. Figure 6 shows that the magnitude of the correlation coefficient between the KL grade and a COBRA score computed from non-cartilage voxels is significantly lower than for cartilage. The magnitude of the correlation coefficient for the COBRA score computed from all voxels is only slightly lower than that of the proposed cartilage-based COBRA score, indicating that including bone is not very detrimental.

Discussion
In this work we introduce the COBRA score, a data-driven anomaly-detection framework for automatic quantification of impairment and disease severity. We show its utility for clinically relevant quantification in two different medical conditions (stroke and knee osteoarthritis) and for three different data modalities (wearable sensors, video and MRI). The framework is suitable for applications where it is challenging to gather large-scale databases of patients with different degrees of impairment or severity, because it only requires data from a healthy cohort of moderate size. The domains of potential applicability are broad, as they encompass any condition affecting patient motion, as in our application to stroke, or producing structural abnormalities in imaged tissues, as in our application to knee osteoarthritis.
From a methodological perspective, our results suggest that fine-grained annotations describing clinically relevant attributes can be useful even if they are only available for healthy subjects. We hypothesize that AI models trained with such annotations can be leveraged in different ways beyond the proposed approach. To illustrate this, an alternative anomaly-detection procedure that does not utilize model confidence is included in the Supplementary Methods section.
Our study identifies a key consideration when applying the proposed framework: confounding factors unrelated to the medical condition of interest (e.g. object color or blurriness in a video) can influence the confidence of the AI models, and therefore distort the COBRA score. This is an instance of a general challenge inherent to the use of deep neural networks: these models are so flexible that they can easily learn spurious structure in high-dimensional data42,54. Our results suggest that the influence of confounding factors can be mitigated by gathering a training set of healthy subjects that is sufficiently diverse with respect to the population of interest. In the case of stroke-induced impairment, we show that this can be achieved by utilizing multiple different rehabilitation activities. In addition, we demonstrate that it is possible to explicitly correct for known confounding factors via stratification. These factors could be identified by monitoring their correlation with the average confidence of the AI models over multiple individuals (under the assumption that the factors are uncorrelated with impairment or disease severity). Nevertheless, automatic identification and control of confounders is an important topic for future research.

Methods
In this section we describe a general framework to estimate impairment and disease severity using AI models trained only on data from healthy subjects. We frame this as an anomaly detection and quantification problem, where the goal is to identify subjects that deviate from the healthy population, and to quantify the extent of this deviation.

Confidence-based characterization of anomalies
The proposed COnfidence-Based chaRacterization of Anomalies (COBRA) framework utilizes a model trained to perform an AI task only on healthy subjects. Intuitively, if the model has low confidence when performing the task on a new subject, this indicates that the subject deviates from the healthy population. In order to ensure that this deviation is due to a certain type of impairment or disease, it is crucial to choose an appropriate AI task. For the quantification of stroke-induced impairment, we predict the functional actions carried out by the subject from wearable sensor or video data. For the


Fig. 6 | The COBRA score exploits clinically-relevant structure. a The first row shows scatterplots of the clinical Fugl-Meyer assessment and the proposed COBRA score, obtained from wearable-sensor data. In the left graph, the COBRA score is computed only using data identified as clinically relevant (i.e. corresponding to motion actions). In the middle graph, the score is computed using the remaining data. In the right graph, it is computed using all of the data. The second row shows the same scatterplots, with the only difference that the COBRA score is obtained from video data. The COBRA score based on clinically-relevant data achieves a higher correlation with the clinical assessment in both cases. b The graphs show scatterplots of the Kellgren-Lawrence grade and the proposed COBRA score, obtained from knee MRI scans. In the left graph, the COBRA score is computed only using data identified as clinically relevant (i.e. corresponding to cartilage tissue). In the middle graph, the score is computed using the remaining data. In the right graph, it is computed using all of the data. The COBRA score using clinically relevant data again achieves a higher correlation with the clinical assessment.


application to knee osteoarthritis, we predict the tissue present in each voxel of a 3D MRI scan.

Table 2 | Demographic and clinical characteristics of study participants for the application to quantification of knee-osteoarthritis severity

                          Training               Testing
Number of individuals     44                     435
Age                       59.2 ± 8.2             62.0 ± 9.4
Sex                       20 male, 24 female     228 male, 207 female
Race^a                    36 W, 7 B, 1 O         339 W, 81 B, 5 A, 10 O
Kellgren-Lawrence grades  44 healthy (KL = 0)    57 healthy (KL = 0), 58 doubtful (KL = 1), 109 minimal (KL = 2), 138 moderate (KL = 3), 73 severe (KL = 4)

The mean ± standard deviation is reported for age.
^a Race: White (W), Black (B), Asian (A), Other (O).

Let us assume that we have access to a training cohort of $N_{\mathrm{train}}$ healthy subjects, and that each of them is associated with a set of annotated data relevant to the medical condition of interest:

$\mathcal{T}_i := \{ (x^{[i]}_1, y^{[i]}_1), \ldots, (x^{[i]}_{M_i}, y^{[i]}_{M_i}) \}, \quad 1 \le i \le N_{\mathrm{train}}.$   (1)

Here $x^{[i]}_j \in \mathbb{R}^L$ denotes the jth data point associated with the ith subject, and $M_i$ is the number of data available for that subject. The label $y^{[i]}_j \in \{1, \ldots, K\}$ assigns $x^{[i]}_j$ to one of $K$ predefined classes. For the stroke application, the label encodes the functional action carried out by the subject at a certain time. The corresponding data point is a segment of wearable-sensor or video data. For the knee-osteoarthritis application, the label encodes the type of tissue at a certain position in the knee, and the corresponding data are the surrounding MRI voxels.
The training dataset

$\mathcal{S}_{\mathrm{train}} := \{ \mathcal{T}_1, \ldots, \mathcal{T}_{N_{\mathrm{train}}} \}$   (2)

is used to train an AI model $f : \mathbb{R}^L \to [0, 1]^K$ to predict the labels from the data. The input to the model is an L-dimensional data point and the output is a K-dimensional vector

$p^{[i]}_j := f(x^{[i]}_j), \quad 1 \le i \le N_{\mathrm{train}}, \; 1 \le j \le M_i,$   (3)

where the kth entry is an estimate of the probability that the data point belongs to the kth class. In our applications of interest, the models are deep neural networks, described in detail below. Crucially, if the dataset associated with each subject is large, then the total number of training examples

$M_{\mathrm{train}} := \sum_{i=1}^{N_{\mathrm{train}}} M_i$   (4)

is orders of magnitude larger than the number of training subjects $N_{\mathrm{train}}$. This enables us to train deep-learning models using relatively small training cohorts.
Let $\mathcal{X}_{\mathrm{test}} := \{ x^{\mathrm{test}}_1, \ldots, x^{\mathrm{test}}_{M_{\mathrm{test}}} \}$ denote a dataset associated with a test subject. We can obtain probabilities corresponding to the jth test data point by applying the trained AI model,

$p^{\mathrm{test}}_j := f(x^{\mathrm{test}}_j), \quad 1 \le j \le M_{\mathrm{test}}.$   (5)

This yields a prediction of the class associated with the data point

$z^{\mathrm{test}}_j := \arg\max_{1 \le k \le K} p^{\mathrm{test}}_j[k],$   (6)

where $p^{\mathrm{test}}_j[k]$ denotes the kth entry of $p^{\mathrm{test}}_j$. The estimated probability that the data point belongs to the predicted class is commonly known as the confidence of the model (see e.g.55),

$c^{\mathrm{test}}_j := \max_{1 \le k \le K} p^{\mathrm{test}}_j[k],$   (7)

because it can be interpreted as an estimate of the probability that the model prediction is correct.
Several existing works propose to use confidence values to perform anomaly detection36–41. Intuitively, if a model is well trained (and there is no inherent uncertainty in the training labels56), it should be able to confidently classify new examples. Therefore low model confidence is evidence that the data point may be anomalous, in the sense that it deviates from the training distribution. Our proposed framework builds upon this idea, incorporating two novel elements. First, multiple data points are aggregated to perform subject-level anomaly detection. As illustrated by Fig. 2, this is critical to achieve accurate anomaly detection in our applications of interest, because the individual confidences are very noisy. Second, we determine which of the classes are most clinically relevant, and restrict our attention to data points predicted to belong to those classes. As reported in Fig. 6, for the stroke application this provides a substantial improvement over using all the data.
Let $\mathcal{CR} \subseteq \{1, \ldots, K\}$ denote the subset of clinically relevant classes, and

$\mathcal{J}_{\mathrm{relevant}} := \{ j : z^{\mathrm{test}}_j \in \mathcal{CR} \}$   (8)

the subset of test data predicted to belong to those classes. We define the COBRA score as the arithmetic average of the confidences associated with the data in $\mathcal{J}_{\mathrm{relevant}}$,

$\mathrm{COBRA}(\mathcal{X}_{\mathrm{test}}) := \frac{1}{|\mathcal{J}_{\mathrm{relevant}}|} \sum_{j \in \mathcal{J}_{\mathrm{relevant}}} c^{\mathrm{test}}_j.$   (9)

The lower the COBRA score, the less confident the AI model is on average when performing the task on the test subject, which indicates a greater degree of impairment or disease.

Estimation of stroke-related motor impairment
In order to apply the COBRA framework to automatic impairment quantification in stroke patients, we propose to utilize auxiliary AI models trained to predict the functional primitive carried out by the subjects' paretic upper extremity (UE) while performing rehabilitation activities. The $K := 5$ primitive classes are reach, reposition, transport, stabilize, and idle. UE motor impairment affects the three functional primitives involving motion,

$\mathcal{CR} := \{ \text{transport}, \text{reposition}, \text{reach} \},$   (10)

rendering them systematically different to those of healthy individuals. Our hypothesis is that this causes AI models trained on healthy subjects to lose confidence when they are applied to stroke patients, and that the loss of confidence is indicative of the degree of impairment. In the following paragraphs, we describe the AI models that we use to test this hypothesis for two different data modalities, wearable sensors and video.
The wearable-sensor data is a 77-dimensional time series, recorded at 100 Hz using nine inertial measurement units (IMUs) attached to the upper body47. The data correspond to kinematic features of 3D linear accelerations, 3D quaternions, joint angles from the upper body, and a binary value that indicates the side (left or right) performing the motion. In order to identify functional primitives from these data, we utilized a Multi-Stage Temporal Convolutional Network (MS-TCN)57. This model was found to be effective for primitive segmentation in a prior study47. In the Supplementary Methods section we report results with a different model architecture, based on a sequence-to-sequence model47,58.
MS-TCN is a state-of-the-art deep-learning model for action segmentation consisting of four convolutional stages, each composed of 10 layers of dilated residual convolutions with 64 output channels. A softmax layer at the end of the network produces the final output, which is a 5-dimensional vector indicating the probability that each entry in the time series corresponds to each functional primitive. The model was trained on the healthy training cohort using the weighted cross-entropy loss function proposed in59. This cost function was minimized for 50 epochs using the Adam optimizer60 with a learning rate of 5 × 10^−3 (selected via cross-validation). The accuracy and precision of the resulting model are reported in Supplementary Table 4.
The video data were acquired with two high-speed (60 Hz), high-definition (1088 × 704 resolution) cameras (Ninox, Noraxon) positioned orthogonally <2 m from the subject. The cameras have a focal length of 4.0 mm and a large viewing window (length: 2.5 m, width: 2.5 m). The videos were then downsampled to a resolution of 256 × 256 to enable efficient processing. To perform functional primitive identification from these data, we utilized the X3D model61, a 3D convolutional neural network designed for primitive classification from video data. The model was pretrained on the Kinetics dataset62, where the labels are high-level activities such as running, climbing, sitting, etc.
Following the approach proposed in47, after pretraining, the X3D model was fine-tuned to perform classification of functional primitives on the rehabilitation activities performed by the healthy training cohort. The input to the model consists of video segments with a duration of two seconds, as suggested in63, and the output is the estimated probability that the central frame corresponds to each of the five functional primitives. Model fine-tuning was carried out by minimizing the cross entropy between these probabilities and the functional primitive labels via stochastic gradient descent with a base learning rate of 0.01 and a cosine learning rate policy. The accuracy and precision of the resulting model on held-out subjects are reported in Supplementary Tables 3, 4.

Estimation of knee-osteoarthritis severity
In order to apply the COBRA framework to automatic quantification of knee-osteoarthritis severity we propose to utilize an auxiliary AI model trained to predict the type of tissue in each voxel of a 3D MRI scan. The $K := 5$ classes for this classification problem are femur bone, femur cartilage, tibia bone, tibia cartilage and background (indicating absence of tissue). Knee osteoarthritis deforms cartilage structure, so the clinically relevant labels are chosen to be

$\mathcal{CR} := \{ \text{femur cartilage}, \text{tibia cartilage} \}.$   (11)

Our hypothesis is that the systematic difference in cartilage structure causes AI models trained on healthy knees to lose confidence when applied to diseased knees, and that the loss of confidence is indicative of disease severity.
In order to predict tissue type we applied a Multi-Planar U-Net64. In the Supplementary Methods section, we report results with a different model architecture, based on a 3D U-Net65. The Multi-Planar U-Net processes the input 3D MRI scan from different views using a version of the 2D U-Net architecture66. The outputs from the different views are then averaged to produce a probability estimate at each 3D voxel. During training, random elastic deformations (RED) are applied to a third of the images in each batch to improve generalization64.
The model was trained by minimizing the cross entropy loss between the estimated probabilities and the 3D voxel-wise labels corresponding to 37 of the 44 healthy individuals in the training cohort. The remaining 7 individuals were used as a validation set. In the cost function, images augmented via RED were downweighted by a factor of 1/3. The Adam optimizer was used for minimization, with an initial learning rate of 5 × 10^−5 that was reduced by 10% after two consecutive epochs without improvement in the validation Dice score. A criterion based on the validation Dice score (excluding background) was used to perform early stopping. Additional hyperparameters are listed in the Supplementary Tables of ref. 64. The accuracy and precision of the resulting model are reported in Supplementary Table 7.

Ethics statement
For the StrokeRehab dataset, all subjects provided written informed consent in accordance with the Declaration of Helsinki. The study was approved by the Institutional Review Board at the New York University Grossman School of Medicine.
For the OAI-ZIB dataset, the National Institute of Arthritis and Musculoskeletal and Skin Diseases (NIAMS) at the National Institutes of Health (NIH) appointed an independent Observational Study Monitoring Board (OSMB) to oversee the Osteoarthritis Initiative (OAI) study from 2002 to 2014. The OSMB was disbanded upon study completion when monitoring obligations were fulfilled.

Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability
Links to all the data used in this study are available at https://github.com/fishneck/COBRA/tree/main/data.

Code availability
Code to reproduce all results is available at https://github.com/fishneck/COBRA.

Received: 6 November 2023; Accepted: 21 June 2024

References
1. Medsger, T. A. et al. Assessment of disease severity and prognosis. Clin. Exp. Rheumatol. 21, S42–S46 (2003).
2. Fugl-Meyer, A. R., Jääskö, L., Leyman, I., Olsson, S. & Steglind, S. A method for evaluation of physical performance. Scand. J. Rehabil. Med. 7, 13–31 (1975).
3. Raman, G. et al. Machine learning prediction for COVID-19 disease severity at hospital admission. BMC Med. Inform. Decis. Mak. 23, 1–6 (2023).
4. Hwangbo, S. et al. Machine learning models to predict the maximum severity of COVID-19 based on initial hospitalization record. Front. Public Health 10, 1007205 (2022).
5. Shamout, F. E. et al. An artificial intelligence system for predicting the deterioration of COVID-19 patients in the emergency department. NPJ Digital Med. 4, 80 (2021).
6. Cottrell, M. A., Galea, O. A., O'Leary, S. P., Hill, A. J. & Russell, T. G. Real-time telerehabilitation for the treatment of musculoskeletal conditions is effective and comparable to standard practice: a systematic review and meta-analysis. Clin. Rehab. 31, 625–638 (2017).
7. Laver, K. E. et al. Telerehabilitation services for stroke. Cochrane Database Syst. Rev. 1, CD010255 (2020).
8. Hamet, P. & Tremblay, J. Artificial intelligence in medicine. Metabolism 69, S36–S40 (2017).
9. Palanica, A., Docktor, M. J., Lieberman, M. & Fossat, Y. The need for artificial intelligence in digital therapeutics. Digital Biomark. 4, 21–25 (2020).
10. Ting, D. S., Lin, H., Ruamviboonsuk, P., Wong, T. Y. & Sim, D. A. Artificial intelligence, the internet of things, and virtual clinics: ophthalmology at the digital translation forefront. Lancet Digital Health 2, e8–e9 (2020).
11. Topol, E. J. High-performance medicine: the convergence of human and artificial intelligence. Nat. Med. 25, 44–56 (2019).
12. Barnes, R. & Zvarikova, K. Artificial intelligence-enabled wearable medical devices, clinical and diagnostic decision support systems, and internet of things-based healthcare applications in COVID-19 prevention, screening, and treatment. Am. J. Med. Res. 8, 9–22 (2021).
13. Jeddi, Z. & Bohr, A. Remote patient monitoring using artificial intelligence. Artificial Intelligence in Healthcare, 203–234 (2020).
14. Shaik, T. et al. Remote patient monitoring using artificial intelligence: current state, applications, and challenges. Wiley Interdiscip. Rev.: Data Min. Knowl. Discov. 13, e1485 (2023).
15. Sawyer, J. et al. Wearable internet of medical things sensor devices, artificial intelligence-driven smart healthcare services, and personalized clinical care in COVID-19 telemedicine. Am. J. Med. Res. 7, 71–77 (2020).
16. Akbilgic, O. et al. Machine learning to identify dialysis patients at high death risk. Kidney Int. Rep. 4, 1219–1229 (2019).
17. Chen, F., Kantagowit, P., Nopsopon, T., Chuklin, A. & Pongpirul, K. Prediction and diagnosis of chronic kidney disease development and progression using machine-learning: protocol for a systematic review and meta-analysis of reporting standards and model performance. PLoS One 18, e0278729 (2023).
18. Babenko, B. et al. Detection of signs of disease in external photographs of the eyes via deep learning. Nat. Biomed. Eng. 6, 1370–1383 (2022).
19. Shen, Y. et al. An interpretable classifier for high-resolution breast cancer screening images utilizing weakly supervised localization. Med. Image Anal. 68, 101908 (2021).
20. Beam, A. L. & Kohane, I. S. Big data and machine learning in health care. JAMA 319, 1317–1318 (2018).
21. Ching, T. et al. Opportunities and obstacles for deep learning in biology and medicine. J. R. Soc. Interface 15, 20170387 (2018).
22. Norori, N., Hu, Q., Aellen, F. M., Faraci, F. D. & Tzovara, A. Addressing bias in big data and AI for health care: a call for open science. Patterns 2, 100347 (2021).
23. Van Horn, J. D. et al. The functional magnetic resonance imaging data center (fMRIDC): the challenges and rewards of large-scale databasing of neuroimaging studies. Philos. Trans. R. Soc. Lond. Ser. B: Biol. Sci. 356, 1323–1339 (2001).
24. Langs, G., Hanbury, A., Menze, B. & Müller, H. VISCERAL: towards large data in medical imaging — challenges and directions. MCBR-CDS (2012).
25. Oakden-Rayner, L., Dunnmon, J., Carneiro, G. & Ré, C. Hidden stratification causes clinically meaningful failures in machine learning for medical imaging. Proc. ACM CHIL, 151–159 (2020).
26. Roy, S., Meena, T. & Lim, S.-J. Demystifying supervised learning in healthcare 4.0: a new reality of transforming diagnostic medicine. Diagnostics 12, 2549 (2022).
27. Jarrett, D., Stride, E., Vallis, K. & Gooding, M. J. Applications and limitations of machine learning in radiation oncology. Br. J. Radiol. 92, 20190001 (2019).
28. Varoquaux, G. & Cheplygina, V. Machine learning for medical imaging: methodological failures and recommendations for the future. NPJ Digital Med. 5, 48 (2022).
29. Chandola, V., Banerjee, A. & Kumar, V. Anomaly detection: a survey. ACM Comput. Surv. (CSUR) 41, 1–58 (2009).
30. Akcay, S., Atapour-Abarghouei, A. & Breckon, T. P. GANomaly: semi-supervised anomaly detection via adversarial training. ACCV, 622–637 (2018).
31. Deecke, L., Vandermeulen, R., Ruff, L., Mandt, S. & Kloft, M. Image anomaly detection with generative adversarial networks. Proc. ECML PKDD Part I 18, 3–17 (2018).
32. van Hespen, K. M. et al. An anomaly detection approach to identify chronic brain infarcts on MRI. Sci. Rep. 11, 7714 (2021).
33. Pinaya, W. H. et al. Using normative modelling to detect disease progression in mild cognitive impairment and Alzheimer's disease in a cross-sectional multi-cohort study. Sci. Rep. 11, 1–13 (2021).
34. Chamberland, M. et al. Detecting microstructural deviations in individuals with deep diffusion MRI tractometry. Nat. Computational Sci. 1, 598–606 (2021).
35. Kim, D.-Y. et al. Feasibility of anomaly score detected with deep learning in irradiated breast cancer patients with reconstruction. npj Digital Med. 5, 125 (2022).
36. Hendrycks, D. & Gimpel, K. A baseline for detecting misclassified and out-of-distribution examples in neural networks. ICLR (2017).
37. Chen, J., Li, Y., Wu, X., Liang, Y. & Jha, S. Robust out-of-distribution detection for neural networks. AAAI-22 AdvML Workshop (2022).
38. Hsu, Y.-C., Shen, Y., Jin, H. & Kira, Z. Generalized ODIN: detecting out-of-distribution image without learning from out-of-distribution data. Proc. IEEE/CVF CVPR, 10951–10960 (2020).
39. Vyas, A. et al. Out-of-distribution detection using an ensemble of self-supervised leave-out classifiers. Proc. ECCV, 550–564 (2018).
40. Mohseni, S., Pitale, M., Yadawa, J. & Wang, Z. Self-supervised learning for generalizable out-of-distribution detection. Proc. AAAI 34, 5216–5223 (2020).
41. DeVries, T. & Taylor, G. W. Learning confidence for out-of-distribution detection in neural networks. arXiv preprint arXiv:1802.04865 (2018).
42. Zhang, C., Bengio, S., Hardt, M., Recht, B. & Vinyals, O. Understanding deep learning (still) requires rethinking generalization. Commun. ACM 64, 107–115 (2021).
43. Marquand, A. F., Rezek, I., Buitelaar, J. & Beckmann, C. F. Understanding heterogeneity in clinical cohorts using normative models: beyond case-control studies. Biol. Psychiatry 80, 552–561 (2016).
44. Rutherford, S. et al. Evidence for embracing normative modeling. Elife 12, e85082 (2023).
45. Rutherford, S. et al. The normative modeling framework for computational psychiatry. Nat. Protoc. 17, 1711–1734 (2022).
46. Park, E., Lee, K., Han, T. & Nam, H. S. Automatic grading of stroke symptoms for rapid assessment using optimized machine learning and 4-limb kinematics: clinical validation study. J. Med. Internet Res. 22, e20641 (2020).
47. Kaku, A. et al. StrokeRehab: a benchmark dataset for sub-second action identification. Adv. Neural Inf. Process. Syst. 35, 1671–1684 (2022).
48. Parnandi, A. et al. Data-driven quantitation of movement abnormality after stroke. Bioengineering 10, 648 (2023).
49. Ambellan, F., Tack, A., Ehlke, M. & Zachow, S. Automated segmentation of knee bone and cartilage combining statistical shape knowledge and convolutional neural networks: data from the Osteoarthritis Initiative. Med. Image Anal. 52, 109–118 (2019).
50. Kohn, M. D., Sassoon, A. A. & Fernando, N. D. Classifications in brief: Kellgren-Lawrence classification of osteoarthritis. Clin. Orthop. Relat. Res. 474, 1886–1893 (2016).
51. Eckstein, F., Wirth, W. & Nevitt, M. C. Recent advances in osteoarthritis imaging — the Osteoarthritis Initiative. Nat. Rev. Rheumatol. 8, 622–630 (2012).
52. Hsu, H. & Siwiec, R. M. Knee osteoarthritis (2018).
53. Brody, L. T. Knee osteoarthritis: clinical connections to articular cartilage structure and function. Phys. Ther. Sport 16, 301–316 (2015).
54. Geirhos, R. et al. Shortcut learning in deep neural networks. Nat. Mach. Intell. 2, 665–673 (2020).
55. Guo, C., Pleiss, G., Sun, Y. & Weinberger, K. Q. On calibration of modern neural networks. Proc. ICML, PMLR 70, 1321–1330 (2017).
56. Liu, S. et al. Deep probability estimation. Proc. ICML, PMLR 162, 13746–13781 (2022).
57. Farha, Y. A. & Gall, J. MS-TCN: multi-stage temporal convolutional network for action segmentation. Proc. IEEE/CVF CVPR, 3575–3584 (2019).
58. Parnandi, A. et al. PrimSeq: a deep learning-based pipeline to quantitate rehabilitation training. PLOS Digital Health 1, e0000044 (2022).
59. Ishikawa, Y., Kasai, S., Aoki, Y. & Kataoka, H. Alleviating over-segmentation errors by detecting action boundaries. Proc. IEEE/CVF WACV, 2322–2331 (2021).
60. Kingma, D. P. & Ba, J. Adam: a method for stochastic optimization. ICLR (2015).
61. Feichtenhofer, C. X3D: expanding architectures for efficient video recognition. Proc. IEEE/CVF CVPR, 203–213 (2020).
62. Kay, W. et al. The Kinetics human action video dataset. arXiv preprint arXiv:1705.06950 (2017).
63. Kaku, A. et al. Towards data-driven stroke rehabilitation via wearable sensors and deep learning. MLHC, 143–171. PMLR (2020).
64. Perslev, M., Dam, E. B., Pai, A. & Igel, C. One network to segment them all: a general, lightweight system for accurate 3D medical image segmentation. Proc. MICCAI Part II 22, 30–38 (2019).
65. Çiçek, Ö., Abdulkadir, A., Lienkamp, S. S., Brox, T. & Ronneberger, O. 3D U-Net: learning dense volumetric segmentation from sparse annotation. Proc. MICCAI Part II 19, 424–432. Springer International Publishing (2016).
66. Ronneberger, O., Fischer, P. & Brox, T. U-Net: convolutional networks for biomedical image segmentation. Proc. MICCAI Part III 18, 234–241 (2015).

Acknowledgements
This work was supported by NIH grant R01 LM013316, Alzheimer's Association grant AARG-NTF-21-848627, NSF grant NRT-1922658, and NSF CAREER Award 2145542.

Author contributions
H.S. and C.F.G. conceived the project. B.Y., A.K., K.L. and A.P. designed, implemented and evaluated the methodology with guidance from R.R., H.S. and C.F.G. A.P., E.F., A.V., and N.P. quality-checked the data and their annotations. B.Y., A.K., K.L., A.P., R.R., H.S. and C.F.G. wrote the paper with input from all authors. All authors approved the completed version and are accountable for all aspects of the work.

Competing interests
The authors declare no competing interests.

Additional information
Supplementary information The online version contains supplementary material available at https://doi.org/10.1038/s41746-024-01173-x.

Correspondence and requests for materials should be addressed to Heidi Schambra or Carlos Fernandez-Granda.

Reprints and permissions information is available at http://www.nature.com/reprints

Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

© The Author(s) 2024