Early detection of autism spectrum
BMJ Health & Care Informatics: first published as 10.1136/bmjhci-2022-100544 on 8 September 2022. Downloaded from https://informatics.bmj.com on 4 November 2024 by guest. Protected by
disorder in young children with machine learning using medical claims data

Yu-Hsin Chen,1 Qiushi Chen,1 Lan Kong,2 Guodong Liu2,3,4,5
► Additional supplemental material is published online (http://dx.doi.org/10.1136/bmjhci-2022-100544).

Received 06 January 2022
Accepted 19 August 2022

ABSTRACT
Methods … our study cohort. We developed logistic regression (LR) with least absolute shrinkage and selection operator and random forest (RF) models for predicting ASD diagnosis at ages of 18–30 months, using demographics, medical diagnoses and healthcare service procedures extracted from individuals' medical claims during early years postbirth as predictor variables.
Results For predicting ASD diagnosis at age of 24 months, the LR and RF models achieved the area under the receiver operating characteristic curve (AUROC) of 0.758 and 0.775, respectively. Prediction accuracy further increased with age. With predictor variables separated by outpatient and inpatient visits, the RF model for prediction at age of 24 months achieved an AUROC of 0.834, with 96.4% specificity and 20.5% positive predictive value at 40% sensitivity, representing a promising improvement over the existing screening tool in practice.
Conclusions Our study demonstrates the feasibility of using machine learning models and health claims data to identify children with ASD at a very young age. It is a promising approach for monitoring ASD risk in the general children population and for early detection of high-risk children for targeted screening.

WHAT IS ALREADY KNOWN ON THIS TOPIC
⇒ …sively to assess the risk of ASD in young children.

WHAT THIS STUDY ADDS
⇒ This study demonstrated the feasibility of predicting ASD diagnosis with promising accuracy based on an individual's medical record from health claims data using machine learning models.
⇒ Our prediction models were clinically interpretable, systematically identifying key predictors in line with known risk factors and symptoms among children with ASD in the literature.

HOW THIS STUDY MIGHT AFFECT RESEARCH, PRACTICE OR POLICY
⇒ This study may serve as a basis for integrating predictive modelling into the health information system and the clinical workflow to enhance current ASD screening practice.

© Author(s) (or their employer(s)) 2022. Re-use permitted under CC BY-NC. No commercial re-use. See rights and permissions. Published by BMJ.
For numbered affiliations see end of article.
Correspondence to Dr Qiushi Chen; q.chen@psu.edu

INTRODUCTION
Autism spectrum disorder (ASD) is a developmental disorder that involves persistent challenges in social interaction, speech and nonverbal communication, and restricted and repetitive behaviours.1 In the USA, the prevalence of ASD has increased substantially in the past two decades, with an estimated 1 in every 44 children identified with ASD by age 8.2 Although there exist evidence-based interventions which improve core symptoms in children with ASD, many children with ASD still experience long-term challenges with daily life, education and employment.3
Early diagnosis is the key to early intervention for improving the long-term outcomes of children with ASD. However, although growing evidence shows that accurate and stable diagnoses can be made by 2 years of age,4 in real-world settings the median age of ASD diagnosis is 50 months.2 To improve early diagnosis, the American Academy of Pediatrics (AAP) has recommended universal screening among all children at 18-month and 24-month well-child visits in primary care settings using the Modified Checklist for Autism in Toddlers (M-CHAT),5 a questionnaire that assesses toddlers' behaviour.6 However, growing evidence has shown that using M-CHAT alone may
not yield sufficient accuracy in detecting ASD cases, with a sensitivity below 40% and a positive predictive value (PPV) under 20%.7 8
In addition to ASD-specific behavioural questionnaires, general clinical and healthcare records may also contain meaningful signals to differentiate ASD risks among very young children. Studies have found that children with ASD are oftentimes accompanied by certain symptoms and medical issues such as gastrointestinal problems,9 infections10 11 and feeding problems.12 This implies that past diagnosis and healthcare encounter information, commonly available from health insurance claims or electronic health records (EHRs), could potentially be used for ASD risk prediction. In fact, medical claims and EHR data have been widely used in the health informatics literature for identifying disease-specific early phenotypes even before the hallmark symptoms start to manifest, such as for chronic diseases like heart failure,13 diabetes14 and Alzheimer's disease.15 In the context of ASD, health record data have been used to identify ASD subtypes16 17 and to predict suicidal risk in adolescents with ASD18; however, their use for predicting ASD diagnosis in young children has remained limited. To fill this gap, the objective of this study is to examine the feasibility of using large-scale real-world medical claims data to develop a prediction model for ASD diagnosis in young children, which can be used to support effective ASD screening strategies and facilitate early detection.

METHODS
Data source
We used the deidentified individual-level longitudinal healthcare claims data from the IBM MarketScan Commercial Claims and Encounters Database from 2005 to 2016. This database includes over 273 million unique individuals for both privately and publicly insured people in the USA.19 The claims data include baseline demographics (eg, sex, birth year, postal region), service providers, insurance plans, medical diagnoses (in International Classification of Diseases (ICD)-9/10 codes) and procedures (in Healthcare Common Procedure Coding System (HCPCS) and Current Procedural Terminology-4 codes) at each encounter of healthcare services.

Study population
We constructed an initial cohort consisting of young children with and without ASD (figure 1). The inclusion criteria for the ASD cohort are as follows: (1) having at least 2 outpatient or 1 inpatient ASD diagnosis encounters (299 for ICD-9 and F84 for ICD-10) throughout the existing records20 21; and (2) having continuous enrolment from 4 months to 30 months to ensure the completeness of health records from the claims data that can be used for diagnosis prediction at up to 30 months (online supplemental figure S1). To create the non-ASD cohort, we first identified individuals without any ASD diagnosis throughout their health records, then downsampled 5% of the population to obtain a computationally manageable yet sufficiently large subset of samples. To ensure patients had adequate follow-up time to receive a confirmed ASD diagnosis in the database, we restricted our selection of non-ASD patients by requiring a full enrolment period from 4 months to 60 months (online supplemental table S1).

Figure 1 Overview of study design for the predictive analysis. ASD, autism spectrum disorder; AUROC, area under receiver operating characteristic curve; AUPRC, area under precision-recall curve; LASSO, least absolute shrinkage and selection operator; PPV, positive predictive value.

Predictor variables for ASD diagnosis
We examined all diagnosis and procedure codes of a child's medical encounters available from as early as within 4 months after birth up to the age for prediction of ASD. We applied the Clinical Classifications Software (CCS),22 a commonly used tool in health informatics research, to aggregate the large number of distinct diagnosis and procedure codes into clinically meaningful groups (figure 1). The single-level CCS maps the ICD-9/10 and HCPCS codes to a substantially smaller yet practical set of 285 diagnosis and 231 procedure categories.22 We further removed same-day duplications of CCS codes after the mapping by counting at most one encounter of a specific CCS category for each person on each day.
To predict the ASD diagnosis at the age of 24 months in our base case model, in line with the age when a diagnosis can possibly be made by an experienced professional,4 we defined the predictor variables as the total number of encounters for each CCS category up to the age for prediction of 24 months. We also included sex and the encounters of emergency department visits, which are well-known clinically relevant factors associated with the autism population.23 Variables that were present in <1% of both the ASD and non-ASD cohorts were excluded.24 A total of 170 input predictor variables were included for prediction at the age of 24 months. Considering that the course of clinical events may follow a different pattern after an encounter with an ASD diagnosis, we excluded from our analysis any children who had at least one encounter with an ASD diagnosis code prior to the age for prediction.

Prediction model development and validation
We employed two machine learning methods, logistic regression (LR) and random forest (RF), which have been widely used for developing risk prediction models in various clinical settings. LR assumes that the independent variables are linearly related to the log odds and that the effects of multiple variables are additive, whereas RF is particularly suitable for exploiting nonlinear interactive effects in high-dimensional data. For the LR model, we also applied the least absolute shrinkage and selection operator (LASSO) as a feature selection technique to enforce the coefficients of weak predictors to be zero. The RF model was limited to up to 100 decision trees in the base case setting (other choices of the maximum number of trees were tested in sensitivity analysis).
To train our model, we sampled 10 000 ASD and 10 000 non-ASD subjects (N=20 000) from the initial cohort to build a large balanced training sample for maximising the discriminatory power learnt by the prediction model. To evaluate the model prediction performance, we created an independent imbalanced testing set (N=16 201) comprised of ASD and non-ASD patients from the remaining cohort, mutually exclusive from the training set. The testing set resembled the real-world estimate of ASD prevalence of 2.3% (ie, 1 in every 44) in the general population.2
We measured the prediction performance with sensitivity (also known as true positive rate or recall), specificity (or true negative rate) and PPV (or precision)25 at various selected risk thresholds. The model's overall discrimination ability was measured using the area under the receiver operating characteristic curve (AUROC). We also calculated the area under the precision-recall curve (AUPRC), where the precision-recall curve represents the relationship between PPV and sensitivity, and the F1 score, defined as the harmonic mean of PPV and sensitivity; both are suited for evaluating prediction performance on the imbalanced testing sample.26 27 To assess the stability and uncertainty of prediction performance, we repeated the training and testing set sampling, model training, testing and performance evaluation with 50 independent replications. The 95% CIs of all performance measures were reported.

Predicting ASD diagnosis at different ages
In addition to the base case prediction model, where the risk of ASD diagnosis was assessed based on clinical information up to 24 months, we compared the accuracy of ASD prediction with varying lengths of available medical history at (1) a younger age, 18 months, considering that universal ASD screening is recommended for children at both 18 months and 24 months5; and (2) an older age, 30 months, which is still a critical time point for monitoring developmental delays and considering early intervention.28 We followed the same approach as in the base case to exclude predictor variables of low frequency (resulting in 150 and 180 predictor variables in total for prediction at 18 and 30 months, respectively) and children with an ASD diagnosis prior to the age for prediction.

Identifying key predictor variables
We further explored how many and which key predictive variables had the most impact on the prediction performance using the Gini importance index from the RF model. We added variables incrementally following the order of the Gini index (ie, starting with the most important variable) and evaluated how the prediction accuracy changed as more variables were included. Selected key predictive variables were then compared with those identified by alternative strategies using (1) the absolute value of coefficients from the LASSO LR model and (2) the prevalence of each variable in the identified ASD cohort.

Separating inpatient and outpatient visits
Considering that the underlying severity of symptoms could potentially differ between inpatient hospitalisations and outpatient visits,29 we split the number of encounters for each diagnosis and procedure by inpatient and outpatient visit separately and augmented the prediction model with these more detailed encounter variables. We compared the prediction performance of the models using the augmented variables with our base case models.

Sensitivity analysis
We performed sensitivity analysis on several modelling assumptions to assess the robustness of our prediction models. Specifically, we strengthened the inclusion criteria for non-ASD subjects by requiring one additional year of enrolment, that is, increased from 4–60 months to 4–72 months. Furthermore, we assessed the potential loss of information due to excluding variables with <1% prevalence, to verify that such variable prescreening did not lose predictive information.
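The modelling setup described in this section can be sketched with scikit-learn. The sketch below uses a small synthetic feature matrix as a stand-in for the per-child CCS encounter counts; the data, sample sizes and variable names are illustrative assumptions, not the authors' actual data or code.

```python
# Sketch of the paper's modelling setup (synthetic data; not the authors' code).
# Features: hypothetical per-child counts of CCS diagnosis/procedure categories.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_features = 170                      # predictor count reported for the 24-month model

# Balanced training sample (the paper used 10 000 ASD + 10 000 non-ASD children).
X_train = rng.poisson(0.3, size=(2000, n_features)).astype(float)
y_train = np.repeat([1, 0], 1000)     # 1 = ASD, 0 = non-ASD
X_train[y_train == 1] += rng.poisson(0.1, size=(1000, n_features))  # weak synthetic signal

# LASSO-penalised logistic regression: the L1 penalty shrinks weak coefficients to zero.
lasso_lr = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
lasso_lr.fit(X_train, y_train)

# Random forest capped at 100 trees, as in the paper's base case.
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)

n_kept = int(np.sum(lasso_lr.coef_ != 0))  # predictors surviving LASSO selection
print(f"LASSO kept {n_kept} of {n_features} predictors")
```

Evaluation on an imbalanced held-out set (as the paper does over 50 replications) would then score `predict_proba` outputs with AUROC and AUPRC rather than raw accuracy.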
Table 1 Performance of LASSO logistic regression and random forest models in prediction of autism spectrum disorder [numeric cell values not recoverable from the extraction; metrics were reported as % (95% CI) at sensitivity targets of 40%, 50% and 70%].
*The sensitivity threshold of 40% was selected to be comparable with the estimated sensitivity of 33%–39% for the existing autism-specific screening tools from real-world clinical settings.7 8
AUPRC, area under precision-recall curve; AUROC, area under receiver operating characteristic curve; LASSO, least absolute shrinkage and selection operator; PPV, positive predictive value.
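Operating points of the kind reported in Table 1 (specificity and PPV at a fixed sensitivity target) can be read off the ROC curve. The snippet below shows the mechanics on made-up scores with a roughly 2.3% prevalence; it is an illustration of the calculation, not a reproduction of the study's results.

```python
# Pick the risk threshold attaining a target sensitivity, then report specificity
# and PPV at that threshold (synthetic scores, not the study's data).
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(1)
y_true = rng.binomial(1, 0.023, size=20000)           # ~2.3% prevalence, as in the test set
scores = rng.normal(0, 1, size=20000) + 1.2 * y_true  # cases score higher on average

fpr, tpr, thresholds = roc_curve(y_true, scores)
target_sensitivity = 0.40
idx = int(np.argmax(tpr >= target_sensitivity))       # first operating point reaching target

threshold = thresholds[idx]
pred_pos = scores >= threshold
tp = int(np.sum(pred_pos & (y_true == 1)))
fp = int(np.sum(pred_pos & (y_true == 0)))
specificity = 1 - fpr[idx]
ppv = tp / (tp + fp)
print(f"threshold={threshold:.3f}  sensitivity={tpr[idx]:.3f}  "
      f"specificity={specificity:.3f}  PPV={ppv:.3f}")
```

At a fixed sensitivity, PPV is driven down sharply by low prevalence, which is why the paper also reports AUPRC alongside AUROC for the imbalanced testing set.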
RESULTS
Predicting ASD diagnosis at age of 24 months
We identified a study cohort consisting of 12 743 ASD subjects and 25 833 non-ASD subjects (more details in online supplemental table S1). When predicting the ASD diagnosis at the age of 24 months in independent testing samples, the LR and RF models achieved an AUROC of 0.758 and 0.775, respectively. […] A setting of up to 100 trees in the RF model was deemed sufficient to achieve stable performance; further increasing the model complexity did not translate to an improvement in prediction accuracy (online supplemental table S2).

Predicting ASD diagnosis at different ages
Comparing the prediction models at the ages of 18, 24 and 30 months, we found that the prediction performance increased substantially with age. Specifically for the […]
Figure 2 Receiver operating characteristic curves (A) and precision-recall (PR) curves (B) for prediction of autism spectrum disorder (ASD) diagnosis at age of 24 months. The prevalence stands for the baseline 2.27% (ie, 1 in 44) ASD prevalence in the general population. AUC, area under curve; LR, logistic regression; RF, random forest.

Figure 4 Comparison of area under the receiver operating characteristic curve (AUROC) with combined versus separated inpatient and outpatient encounters by LASSO logistic regression (LR) and random forest (RF) models, at the age of 18, 24 and 30 months, respectively. Error bars represent the 95% CIs based on results from 50 replications of independent runs. LASSO, least absolute shrinkage and selection operator.

[…] were also highly consistent with high-prevalence variables, sharing 47 of the 50 most common variables in the ASD cohort (online supplemental figure S4).
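The incremental key-predictor analysis described in the Methods (ranking variables by the RF Gini importance and re-evaluating accuracy as variables are added) can be sketched as follows. The data are synthetic, with only the first two features made informative by construction, so the sketch illustrates the procedure rather than the study's findings.

```python
# Rank predictors by random-forest Gini importance, then re-fit and score AUROC
# as variables are added in importance order (synthetic data; procedure only).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2)
n, p = 3000, 30
X = rng.poisson(0.5, size=(n, p)).astype(float)
logit = 1.5 * X[:, 0] + 1.0 * X[:, 1] - 2.0          # only features 0 and 1 are informative
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
order = np.argsort(rf.feature_importances_)[::-1]     # most important (Gini) first

for k in (1, 2, 5, p):
    cols = order[:k]
    rf_k = RandomForestClassifier(n_estimators=100, random_state=0).fit(X[:, cols], y)
    auc = roc_auc_score(y, rf_k.predict_proba(X[:, cols])[:, 1])
    print(f"top {k:2d} variables: in-sample AUROC = {auc:.3f}")
```

In the study this curve was computed on held-out test replications; the point of the procedure is that accuracy typically plateaus after a modest number of top-ranked variables.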
Prediction using separated inpatient and outpatient data
Separating inpatient and outpatient encounters further increased the AUROC for prediction at the age of 24 months to 0.766 (95% CI 0.762 to 0.769) in the LR model and 0.834 (95% CI 0.831 to 0.837) in the RF model. At the target sensitivity of 40%, the RF model achieved a higher specificity of 96.4% (95% CI 96.2% to 96.5%) with a PPV of 20.5% (95% CI 19.8% to 21.1%), outperforming the existing screening tool M-CHAT/F (with a sensitivity of 38.8%, specificity of 94.9% and PPV of 14.6%). We found that using claims data separated by inpatient and outpatient visits improved the prediction performance consistently across all ages (figure 4).

Robustness check and sensitivity analysis
With a more stringent inclusion criterion for non-ASD subjects, requiring a longer full enrolment period of up to 72 months (vs 60 months in our base case), we found that the prediction performance showed modest improvement (online supplemental table S3). This could be partially attributed to the fact that, with longer years to ascertain the non-ASD cohort, children would be less likely to be misclassified. We also verified that including the low-prevalence variables would not result in substantial differences but only marginal changes of AUROC within 0.01 across all model specifications.

Figure 3 Receiver operating characteristic curves (A) and precision-recall curves (B) for prediction of autism spectrum disorder at ages of 18, 24 and 30 months, respectively, by the random forest model. AUC, area under curve.

DISCUSSION
Early identification is vital for children with ASD to ensure their access to timely intervention and to optimise long-term outcomes. In this study, we demonstrated the feasibility of predicting ASD diagnosis at early ages using health claims data and machine learning models. We found that LASSO LR and RF models achieved an overall AUROC above 0.75 when predicting ASD diagnosis at age of 24 months. Our results also showed that prediction performance increased with age at the time of prediction. This is reasonable because more clinical information accumulated over a longer follow-up period since birth may contain more distinctive patterns to effectively differentiate children with ASD. The prediction models developed in our study are clinically interpretable. Key predictors, such as sex (male), developmental delays, gastrointestinal disorders, respiratory system infections and otitis media, showed strong predictive value for ASD diagnosis, in line with previous clinical studies that have found these symptoms to be associated with children with ASD. Finally, our study showed that separating inpatient and outpatient claims as predictors could further improve the prediction accuracy.
In our study, both LASSO LR and RF models showed promising accuracy in predicting ASD diagnosis based on an individual's medical claims data. This robust finding implies that there may exist distinct patterns in health conditions and health service needs among young children with ASD, well before the onset of most hallmark ASD behavioural symptoms. Such predictive signals can be easily extracted from electronic health records or medical claims administrative data, and used for the early identification of ASD cases. We also observed differences in performance between the two models. The RF model outperformed the LASSO LR model in general, likely because, with its tree-based model structure, the RF model is better at capturing complex interactive effects among the predictor variables to distinguish between ASD and non-ASD cases, whereas the LR model synthesises the effects of multiple variables additively. The advantage of the RF model became more salient when input variables were separated by inpatient and outpatient claims at a more granular level.
Our study has made an important contribution to applying health informatics in the field of ASD. Although there exists a plethora of literature identifying individual risk factors for ASD, using large healthcare service data and machine learning models to systematically predict ASD diagnosis has remained much less explored. Unlike existing clinical informatics studies that focused on detecting ASD subtypes,16 17 we aim to detect ASD cases among the general children population, that is, early detection. This could be particularly challenging due to the low prevalence of ASD in the general population (ie, a highly imbalanced dataset) and the scarcity of information available at such a young age. Nevertheless, our model showed promising prediction performance. The RF model with separated inpatient and outpatient encounters achieved a specificity of 96.4% at a sensitivity of 40% for ASD prediction at the age of 24 months, outperforming the accuracy of the existing ASD-specific screening tool (sensitivity: 38.8%; specificity: 94.9%) from a clinical observational study.7 It is worth noting that under a similar ASD prevalence (2.2%), our model showed a higher PPV (20.5% vs 14.6%).
Our prediction model for ASD diagnosis could have a significant impact on screening strategies for ASD in young children. Although the AAP guidelines recommend universal screening in all children, it has been debated that, without a perfect screening tool, universal screening may result in overburdened diagnostic services in the healthcare system, as these clinical resources are in extremely short supply.30 Our prediction models have demonstrated promising improvement over the existing ASD screening tool by using clinical information, and could potentially serve as a 'triaging tool' for identifying high-risk patients for diagnostic evaluation. Moreover, relying only on health claims data makes the models practically feasible to integrate into an EHR system or insurance claims database. This could further enable an automatic screening tool, which can continuously monitor an individual's risk as new diagnosis and procedure information emerges, and send reminders to patients or providers for a timely clinical assessment if necessary. On the other hand, it is possible that some diagnosis and procedure information appears only after a concern that the child may have autism already exists, such as following a positive screening event, which could alter the course of subsequent clinical events. As such, our prediction model is not designed to direct screening decisions, but rather to serve as a tool to enhance screening accuracy. If more detailed electronic health record data were available, the proposed risk prediction model could be further extended by incorporating screening results with clinical information, or by differentiating the clinical information before versus after the screening events, to further improve the accuracy of identifying high-risk ASD cases for further diagnostic evaluation.
Our study has several limitations. First, a diagnosis of ASD established only on the basis of existing diagnosis codes from claims data could sometimes be inaccurate and unreliable in practice. We followed a validated approach from the ASD health services research literature to identify the ASD cohort in our study.31 Second, the absence of ASD diagnosis codes in one's health record may not necessarily indicate that an individual does not have ASD, especially for children born in later years, due to limited follow-up time prior to the cut-off date in the database. Thus, we required full enrolment up to 60 months without ASD diagnoses to identify the non-ASD cohort, and verified the robustness of our base case results in a sensitivity analysis requiring full enrolment up to 72 months. Third, as autistic children are likely to have a wide range of comorbid conditions with various frequencies, our model may provide limited value for individuals who do not present comorbid conditions in their past healthcare encounter data. Our risk prediction model can be further augmented in future studies by information beyond the health claims database, such as ASD/developmental screening results and behaviour-related information from a more comprehensive EHR dataset. Lastly, the diagnosis and procedure codes in insurance claims data may be subject to variabilities and irregularities. Instead of the original detailed clinical codes, we used aggregated CCS categories for diagnoses and procedures as more robust clinical measures.

CONCLUSIONS
Using real-world health claims data and machine learning methods, we developed a prediction model that can successfully predict ASD diagnosis for children under 30 months with promising prediction accuracy. Our model also identified the important predictors for the diagnosis prediction, which showed meaningful clinical relevance and intuition. Our predictive modelling approach could potentially be generalised to broader clinical settings for predicting diseases that may show early signals in past healthcare service encounters in claims or EHR data. Future studies could explore the prediction of ASD diagnosis dynamically over time as new healthcare encounters occur, and investigate how validated risk prediction models could be integrated and used to inform ASD screening strategies.

Author affiliations
1The Harold and Inge Marcus Department of Industrial and Manufacturing Engineering, The Pennsylvania State University, University Park, Pennsylvania, USA
2Department of Public Health Sciences, The Pennsylvania State University College of Medicine, Hershey, Pennsylvania, USA
3… College of Medicine, Hershey, Pennsylvania, USA
4Department of Pediatrics, The Pennsylvania State University College of Medicine, Hershey, Pennsylvania, USA
5The Center for Applied Studies in Health Economics (CASHE), The Pennsylvania State University College of Medicine, Hershey, Pennsylvania, USA

Contributors GL, QC and Y-HC conceived of the presented idea, and developed it with support from GL and QC. Y-HC cleaned and preprocessed the data, developed prediction models, and performed model evaluations. All authors interpreted the model results. Y-HC and QC drafted the manuscript, which was critically revised by all authors. QC is the guarantor of the project.

Funding This work has been supported by a Penn State Social Science Research Institute Level 1 Seed Grant (QC, GL), a Penn State College of Engineering Multidisciplinary Research Seed Grant (Y-HC, QC, GL) and NIH R21 grant 1 R21 MH119480-01A1 (GL, LK).

Competing interests None declared.

Patient consent for publication Not applicable.

Provenance and peer review Not commissioned; externally peer reviewed.

Data availability statement Data may be obtained from a third party and are not publicly available.

Supplemental material This content has been supplied by the author(s). It has not been vetted by BMJ Publishing Group Limited (BMJ) and may not have been peer-reviewed. Any opinions or recommendations discussed are solely those of the author(s) and are not endorsed by BMJ. BMJ disclaims all liability and responsibility arising from any reliance placed on the content. Where the content includes any translated material, BMJ does not warrant the accuracy and reliability of the translations (including but not limited to local regulations, clinical guidelines, terminology, drug names and drug dosages), and is not responsible for any error and/or omissions arising from translation and adaptation or otherwise.

Open access This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/.

ORCID iDs
Yu-Hsin Chen http://orcid.org/0000-0002-3678-7517
Qiushi Chen http://orcid.org/0000-0003-4031-2669
Lan Kong http://orcid.org/0000-0001-6098-9445
Guodong Liu http://orcid.org/0000-0001-8683-0803

REFERENCES
1 American Psychiatric Association. Diagnostic and statistical manual of mental disorders (DSM-5®). American Psychiatric Pub, 2013.
2 Maenner MJ, Shaw KA, Bakian AV, et al. Prevalence and characteristics of autism spectrum disorder among children aged 8 years - Autism and Developmental Disabilities Monitoring Network, 11 sites, United States, 2018. MMWR Surveill Summ 2021;70:1–16.
3 McPheeters ML, Weitlauf A, Vehorn A, et al. Screening for autism spectrum disorder in young children: a systematic evidence review for the US Preventive Services Task Force. Rockville (MD): Agency for Healthcare Research and Quality (US), 2016.
4 Lord C, Risi S, DiLavore PS, et al. Autism from 2 to 9 years of age. Arch Gen Psychiatry 2006;63:694–701.
5 Lipkin PH, Macias MM, Council on Children with Disabilities, Section on Developmental and Behavioral Pediatrics. Promoting optimal development: identifying infants and young children with developmental disorders through developmental surveillance and screening. Pediatrics 2020;145. doi:10.1542/peds.2019-3449.
6 Robins DL, Fein D, Barton ML, et al. The modified checklist for autism in toddlers: an initial study investigating the early detection of autism and pervasive developmental disorders. J Autism Dev Disord 2001;31:131–44.
7 Guthrie W, Wallis K, Bennett A, et al. Accuracy of autism screening in a large pediatric network. Pediatrics 2019;144.
8 Carbone PS, Campbell K, Wilkes J, et al. Primary care autism screening and later autism diagnosis. Pediatrics 2020;146. doi:10.1542/peds.2019-2314.
9 Chaidez V, Hansen RL, Hertz-Picciotto I. Gastrointestinal problems in children with autism, developmental delays or typical development. J Autism Dev Disord 2014;44:1117–27.
10 Rosen NJ, Yoshida CK, Croen LA. Infection in the first 2 years of life and autism spectrum disorders. Pediatrics 2007;119:e61–9.
11 Adams DJ, Susi A, Erdie-Lalena CR, et al. Otitis media and related complications among children with autism spectrum disorders. J Autism Dev Disord 2016;46:1636–42.
12 Ledford JR, Gast DL. Feeding problems in children with autism spectrum disorders. Focus Autism Other Dev Disabl 2006;21:153–66.
13 Sideris C, Alshurafa N, Pourhomayoun M, et al. A data-driven feature extraction framework for predicting the severity of condition of congestive heart failure patients. Annu Int Conf IEEE Eng Med Biol Soc 2015;2015:2534–7.
14 Nguyen BP, Pham HN, Tran H, et al. Predicting the onset of type 2 diabetes using wide and deep learning with electronic health records. Comput Methods Programs Biomed 2019;182:105055.
15 Park JH, Cho HE, Kim JH, et al. Machine learning prediction of incidence of Alzheimer's disease using large-scale administrative health data. NPJ Digit Med 2020;3:46.
16 Lingren T, Chen P, Bochenek J, et al. Electronic health record based algorithm to identify patients with autism spectrum disorder. PLoS One 2016;11:e0159621.
17 Vargason T, Frye RE, McGuinness DL, et al. Clustering of co-occurring conditions in autism spectrum disorder during early childhood: a retrospective analysis of medical claims data. Autism Res 2019;12:1272–85.
18 Downs J, Velupillai S, George G, et al. Detection of suicidality in adolescents with autism spectrum disorders: developing a natural language processing approach for use in electronic health records. AMIA Annu Symp Proc 2017;2017:641–9.
19 IBM MarketScan research databases, 2020. Available: https://www.ibm.com/products/marketscan-research-databases
20 Burke JP, Jain A, Yang W, et al. Does a claims diagnosis of autism mean a true case? Autism 2014;18:321–30.
21 Coleman KJ, Lutsky MA, Yau V, et al. Validation of autism spectrum disorder diagnoses in large healthcare systems with electronic medical records. J Autism Dev Disord 2015;45:1989–96.
22 Agency for Healthcare Research and Quality, Rockville, MD. HCUP Clinical Classifications Software (CCS) for ICD-9-CM. Healthcare Cost and Utilization Project (HCUP) 2006-2009; 2020. www.hcup-us.ahrq.gov/toolssoftware/ccs/ccs.jsp
23 Loomes R, Hull L, Mandy WPL. What is the male-to-female ratio in autism spectrum disorder? A systematic review and meta-analysis. J Am Acad Child Adolesc Psychiatry 2017;56:466–74.
24 He D, Mathews SC, Kalloo AN, et al. Mining high-dimensional administrative claims data to predict early hospital readmissions. J Am Med Inform Assoc 2014;21:272–9.
25 Hunink MGM, Weinstein MC, Wittenberg E. Decision making in health and medicine: integrating evidence and values. 2nd ed. Cambridge University Press, 2014.
26 Jeni LA, Cohn JF, De La Torre F. Facing imbalanced data: recommendations for the use of performance metrics. Int Conf Affect Comput Intell Interact Workshops 2013;2013:245–51.
27 Ozenne B, Subtil F, Maucort-Boulch D. The precision-recall curve overcame the optimism of the receiver operating characteristic curve in rare diseases. J Clin Epidemiol 2015;68:855–9.
28 Hyman SL, Levy SE, Myers SM, et al. Identification, evaluation, and management of children with autism spectrum disorder. Pediatrics 2020;145:e20193447.
29 Pottick K, Hansell S, Gutterman E, et al. Factors associated with inpatient and outpatient treatment for children and adolescents with serious mental illness. J Am Acad Child Adolesc Psychiatry 1995;34:425–33.
30 Siu AL, Bibbins-Domingo K, et al, US Preventive Services Task Force (USPSTF). Screening for autism spectrum disorder in young children: US Preventive Services Task Force recommendation statement. JAMA 2016;315:691–6.
31 Liu G, Pearl AM, Kong L, et al. Risk factors for emergency department utilization among adolescents with autism spectrum disorder. J Autism Dev Disord 2019;49:4455–67.