Artificial Intelligence in Fracture Detection
Background: Patients with fractures are a common emergency presentation and may be misdiagnosed at radiologic imaging. An
increasing number of studies apply artificial intelligence (AI) techniques to fracture detection as an adjunct to clinician diagnosis.
Purpose: To perform a systematic review and meta-analysis comparing the diagnostic performance in fracture detection between AI
and clinicians in peer-reviewed publications and the gray literature (ie, articles published on preprint repositories).
Materials and Methods: A search of multiple electronic databases between January 2018 and July 2020 (updated June 2021) was performed. The search included any primary research studies that developed and/or validated AI for the purposes of fracture detection at any imaging modality and excluded studies that evaluated image segmentation algorithms. A meta-analysis with a hierarchical model was used to calculate pooled sensitivity and specificity. Risk of bias was assessed by using a modified Prediction Model Study Risk of Bias Assessment Tool, or PROBAST, checklist.
Results: Included for analysis were 42 studies, with 115 contingency tables extracted from 32 studies (55 061 images). Thirty-seven
studies identified fractures on radiographs and five studies identified fractures on CT images. For internal validation test sets, the
pooled sensitivity was 92% (95% CI: 88, 94) for AI and 91% (95% CI: 85, 95) for clinicians, and the pooled specificity was 91%
(95% CI: 88, 93) for AI and 92% (95% CI: 89, 95) for clinicians. For external validation test sets, the pooled sensitivity was 91%
(95% CI: 84, 95) for AI and 94% (95% CI: 90, 96) for clinicians, and the pooled specificity was 91% (95% CI: 81, 95) for AI and
94% (95% CI: 91, 95) for clinicians. There were no statistically significant differences between clinician and AI performance. Twenty-two of 42 studies (52%) were judged to have high risk of bias. Meta-regression identified multiple sources of heterogeneity
in the data, including risk of bias and fracture type.
Conclusion: Artificial intelligence (AI) and clinicians had comparable reported diagnostic performance in fracture detection,
suggesting that AI technology holds promise as a diagnostic adjunct in future clinical practice.
Figure 1: Preferred Reporting Items for Systematic Reviews and Meta-Analyses flowchart shows studies selected for review. ACM = Association for Computing Machinery, AI = artificial intelligence, CENTRAL = Central Register of Controlled Trials, CINAHL = Cumulative Index to Nursing and Allied Health Literature, IEEE = Institute of Electrical and Electronics Engineers and Institution of Engineering and Technology.
transfer learning as covariates. Statistical significance was indicated at a P value of .05. All calculations were performed by using statistical software (Stata version 14.2, Midas and Metandi modules; StataCorp) (18,19).
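For orientation only, the sketch below shows how pooled sensitivity and specificity can be derived from per-study contingency tables. It is a deliberately simplified, univariate DerSimonian-Laird random-effects pool on the logit scale, not the hierarchical model fitted with the Stata metandi/midas modules used in this review, and the study counts are invented for illustration.

# Simplified illustration of pooling sensitivity/specificity across studies.
# This is NOT the hierarchical (bivariate) model used in the review; it pools
# logit-transformed proportions with a DerSimonian-Laird random-effects estimate.
# The contingency tables below are invented purely for demonstration.
import math

studies = {  # study: (TP, FN, TN, FP)
    "study_A": (90, 10, 85, 15),
    "study_B": (45, 5, 40, 10),
    "study_C": (200, 20, 180, 30),
}

def dersimonian_laird_pool(events_totals):
    """Pool proportions on the logit scale with a DL random-effects model."""
    # 0.5 continuity correction keeps logits finite for extreme counts
    y = [math.log((e + 0.5) / (n - e + 0.5)) for e, n in events_totals]
    v = [1 / (e + 0.5) + 1 / (n - e + 0.5) for e, n in events_totals]
    w = [1 / vi for vi in v]
    y_fixed = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    q = sum(wi * (yi - y_fixed) ** 2 for wi, yi in zip(w, y))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (len(y) - 1)) / c)          # between-study variance
    w_re = [1 / (vi + tau2) for vi in v]
    y_re = sum(wi * yi for wi, yi in zip(w_re, y)) / sum(w_re)
    return 1 / (1 + math.exp(-y_re))                  # back-transform to a proportion

sens = dersimonian_laird_pool([(tp, tp + fn) for tp, fn, tn, fp in studies.values()])
spec = dersimonian_laird_pool([(tn, tn + fp) for tp, fn, tn, fp in studies.values()])
print(f"pooled sensitivity ~ {sens:.1%}, pooled specificity ~ {spec:.1%}")

A bivariate or hierarchical summary receiver operating characteristic model additionally accounts for the correlation between sensitivity and specificity across studies, which is why the Stata modules cited above were used for the actual analysis.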
Quality Assessment
We assessed studies for adherence to reporting guidelines by using the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD) checklist, which is a 22-item list of recommendations to aid transparent reporting of studies that develop and/or validate prediction models (20). We used a modified version of TRIPOD (Appendix E2, Table E3 [online]), as we considered that not all items on the checklist were informative for deep learning studies; for example, reporting follow-up time is irrelevant for diagnostic accuracy studies. The checklist therefore is limited in granular discrimination between studies, but instead acts as a general indicator of reporting standards.

We used the Prediction Model Study Risk of Bias Assessment Tool, or PROBAST, checklist to assess papers for bias and applicability (Appendix E2, Table E4 [online]) (21). This tool uses signaling questions in four domains (participants, predictors, outcomes, and analysis) to provide both an overall and a granular assessment. We considered both the images used to develop algorithms, and the patient population or populations the models were tested on, to assess bias and applicability in the first domain. We did not include an assessment of bias or applicability for predictors. The diagnostic performance of both AI and clinicians at internal and external validation was examined separately in studies assessed at low risk of bias.
Publication Bias
We minimized the effect of publication bias by searching preprint servers and hand-searching the reference lists of included studies. We performed a formal assessment of publication bias through a regression analysis by using diagnostic log odds ratios and testing for asymmetry (22).
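Reference 22 describes the Deeks test, which regresses the diagnostic log odds ratio against the inverse square root of the effective sample size and reads funnel plot asymmetry off the slope. A minimal sketch of that regression follows; the contingency tables are invented for illustration, and the review's actual analysis was run in Stata.

# Deeks funnel plot asymmetry test (sketch): regress the diagnostic log odds ratio
# on 1/sqrt(effective sample size), weighting by the effective sample size.
# A slope far from zero (small P value) suggests small-study effects / publication bias.
# Contingency tables are invented for illustration only.
import math

tables = [  # (TP, FN, TN, FP) per study
    (90, 10, 85, 15), (45, 5, 40, 10), (200, 20, 180, 30), (30, 8, 25, 9),
]

x, y, w = [], [], []
for tp, fn, tn, fp in tables:
    tp, fn, tn, fp = (v + 0.5 for v in (tp, fn, tn, fp))   # continuity correction
    ln_dor = math.log((tp * tn) / (fp * fn))                # diagnostic log odds ratio
    n_dis, n_nodis = tp + fn, tn + fp
    ess = 4 * n_dis * n_nodis / (n_dis + n_nodis)           # effective sample size
    x.append(1 / math.sqrt(ess)); y.append(ln_dor); w.append(ess)

# Weighted least squares for y = intercept + slope * x
sw = sum(w)
swx = sum(wi * xi for wi, xi in zip(w, x))
swy = sum(wi * yi for wi, yi in zip(w, y))
swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
swxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
slope = (sw * swxy - swx * swy) / (sw * swxx - swx ** 2)
intercept = (swy - slope * swx) / sw
resid = sum(wi * (yi - intercept - slope * xi) ** 2 for wi, xi, yi in zip(w, x, y))
se_slope = math.sqrt((resid / (len(x) - 2)) * sw / (sw * swxx - swx ** 2))
print(f"slope = {slope:.2f} (SE {se_slope:.2f})")  # compare slope/SE to t with k-2 df

Weighting by the effective sample size and regressing against its inverse square root follows the approach recommended by Deeks et al (22).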
Results
Study Selection and Characteristics
We identified 8783 peer-reviewed studies, of which 1981 were duplicates. A further 149 studies were identified through preprint servers and citation searching. After full-text screening, 42 studies were included in the review, of which 35 were peer-reviewed publications and seven were preprint publications (Fig 1). Thirty-seven studies identified fractures on radiographs, of which 18 focused on lower limb, 15 on upper limb, and four on other fractures (Tables 1–3) (23–58). Five studies identified fractures on CT images (59–63). All studies performed their analyses with a computer, with retrospectively collected data, by using a supervised learning approach; and one study also performed a prospective nonrandomized clinical trial (34). Thirty-six studies developed and internally validated an algorithm, and nine of these studies also externally validated their algorithm (23,24,26–30,32,33,35–55,57–59,61–63). Six studies externally validated or made an incremental change to a previously developed algorithm (25,31,34,56,60,63). Twenty-three studies restricted their analysis to a single radiologic view (25,27,29–35,37,38,40–43,46–48,50–53,58).

Sixteen studies compared the performance of AI with expert clinicians, seven compared AI to experts and nonexperts, and one compared AI to nonexperts only (24,26–31,33–39,42,43,46,50,51,53–55,58,59). Six studies included clinician performance with and without AI assistance as a comparison group (29–31,34,54,59). The size of comparison groups ranged from three to 58 (median, six groups; interquartile range, 4–15). Three studies compared their algorithm against other algorithms, and 16 studies did not include a comparison group (23,25,32,40,41,44–49,52,56,57,60–64).
Figure 2: Summary of study adherence to Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD) reporting guidelines.
To generate a reference standard for image labeling, 18 studies used expert consensus, six relied on the opinion of a single expert reader, seven used pre-existing radiologic reports or other imaging modalities, five studies used mixed methods, and one defined their reference standard as surgically confirmed fractures (23,24,26–32,34–39,41–63). Three studies did not report how their reference standard was generated (25,33,40).
Study Participants
The number of participants represented by each data set ranged widely (median, 1169; range, 65–21,456; interquartile range, 425–2417; Appendix E3, Table E5 [online]). The proportion of participants with a fracture in each data set also ranged widely (median, 50%; interquartile range, 40%–64%). Seventeen studies did not include the proportion of study participants who were men or women, and 15 studies did not include information about participant age (23,25,27,34–37,41,46,54,56–58,60–63).
Algorithm Development and Model Output
The size of training (median, 1898; interquartile range, 784–7646), tuning (median, 739; interquartile range, 142–980), and test (median, 306; interquartile range, 233–1111) data sets at the patient level varied widely (Tables 1–3). Five of 33 (15.2%) studies that developed an algorithm did not report the size of each data set separately (24,28,32,33,40). In studies that performed external validation of an algorithm, the median size of the data set was 511 (range, 100–1696). Thirty studies used data augmentation, and 30 studies used transfer learning (23–30,32,33,35–44,46,48–59,61,62). Twenty-six studies used random split sample validation as a method of internal validation, five used stratified split sampling, and four used a resampling method (Table E6 [online]) (23–31,33,35–55,57,58,61,62).
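For readers less familiar with the validation strategies named above, the short snippet below contrasts a random split sample with a stratified split at the patient level. It is a generic scikit-learn sketch, not code from any of the included studies, and the patient identifiers and fracture labels are invented.

# Random vs stratified split-sample internal validation at the patient level.
# Stratifying on the label keeps fracture prevalence similar in the training
# and test partitions, which matters when prevalence is unbalanced.
# Patient IDs and labels below are invented for illustration.
from sklearn.model_selection import train_test_split

patient_ids = list(range(1000))
labels = [1 if i % 3 == 0 else 0 for i in patient_ids]   # ~33% "fracture" prevalence

# Random split: prevalence in each partition can drift by chance
train_r, test_r = train_test_split(patient_ids, test_size=0.2, random_state=0)

# Stratified split: prevalence is preserved in both partitions
train_s, test_s, y_train, y_test = train_test_split(
    patient_ids, labels, test_size=0.2, stratify=labels, random_state=0
)
print(f"stratified test prevalence: {sum(y_test) / len(y_test):.2%}")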
Twenty-two studies included localization of fractures in model output to improve end-user interpretability (23,24,30–39,41,42,44,48,49,52,54,57,60,64). Metrics used to evaluate model performance varied widely, including sensitivity and specificity (38 studies); area under the receiver operating characteristic curve and Youden index (23 studies); accuracy (22 studies); positive and negative predictive values (nine studies); and F1, precision, and recall (nine studies).
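All of the threshold-based metrics listed above can be derived from a single 2x2 contingency table; a compact reference calculation follows, with counts invented for illustration.

# Common diagnostic accuracy metrics derived from one 2x2 contingency table.
# TP/FN/FP/TN values are invented; replace them with counts from a study.
def diagnostic_metrics(tp: int, fn: int, fp: int, tn: int) -> dict:
    sensitivity = tp / (tp + fn)            # recall: detected fractures / all fractures
    specificity = tn / (tn + fp)            # true negatives / all non-fractures
    precision = tp / (tp + fp)              # positive predictive value
    npv = tn / (tn + fn)                    # negative predictive value
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    youden = sensitivity + specificity - 1  # Youden index J
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "sensitivity": sensitivity, "specificity": specificity,
        "PPV": precision, "NPV": npv, "accuracy": accuracy,
        "Youden J": youden, "F1": f1,
    }

print(diagnostic_metrics(tp=90, fn=10, fp=15, tn=85))

Area under the receiver operating characteristic curve, by contrast, requires continuous model scores rather than a single thresholded contingency table.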
Quality Assessment
Adherence to TRIPOD reporting standards was variable (Fig 2). Four items were poorly reported (<50% adherence): clarity of study title and abstract (19% and 17% adherence, respectively), sample size calculation (2.4%), discussion and attempt to improve model interpretability (43%), and a statement about supplementary code or data availability (19%).

The Prediction Model Study Risk of Bias Assessment Tool, or PROBAST, led to an overall rating of 22 (52%) and 21 (50%) studies as high risk of bias and concerns regarding applicability, respectively (Fig 3). The main contributing factors to this assessment were studies that did not perform external validation, or internally validated models with small sample sizes. Fifteen (36%) studies were judged to be at high risk of bias and 18 (43%) at high concern for applicability in participant selection.
Figure 3: Summary of Prediction Model Study Risk of Bias Assessment Tool (PROBAST) risk of bias and concern about generalizability scores.
Figure 4: Hierarchical summary receiver operating characteristic (HSROC) curves for (A) fracture detection algorithms and (B)
clinicians with internal validation test sets. The 95% prediction region is a visual representation of between-study heterogeneity.
Figure 5: Hierarchical summary receiver operating characteristic (HSROC) curves for (A) fracture detection algorithms and (B)
clinicians with external validation test sets. The 95% prediction region is a visual representation of between-study heterogeneity.
Table 4: Pooled Sensitivities, Specificities, and Areas Under the Curve for Artificial Intelligence Algorithms and Clinicians
Contingency tables were extracted from four studies for human performance with AI assistance (29–31,34).

Hierarchical summary receiver operating characteristic curves from the studies evaluating AI and clinician performance on internal validation test sets are included in Figure 4. The pooled sensitivity was 92% (95% CI: 88, 94) for AI and 91% (95% CI: 85, 95) for clinicians. The pooled specificity was 91% (95% CI: 88, 93) for AI and 92% (95% CI: 89, 95) for clinicians. At external validation, the pooled sensitivity was 91% (95% CI: 84, 95) for AI and 94% (95% CI: 90, 96) for clinicians on matched test sets (Fig 5). The pooled specificity was 91% (95% CI: 82, 96) for AI and 94% (95% CI: 91, 95) for clinicians. When clinicians were provided with AI assistance, the pooled sensitivity and specificity were 97% (95% CI: 83, 99) and 92% (95% CI: 88, 95), respectively.

Meta-regression of all studies showed that lower model specificity was associated with lower risk of bias (89%; 95% CI: 87, 91; P < .01), use of data augmentation (92%; 95% CI: 90, 93; P < .01), and transfer learning (91%; 95% CI: 90, 93; P < .01). Higher model sensitivity was associated with algorithms focusing on lower limb fractures (95%; 95% CI: 93, 97; P < .01) and use of resampling methods (97%; 95% CI: 94, 100; P < .01). We performed a sensitivity analysis, separately evaluating studies with low risk of bias. We found that all performance metrics were lower, although only the reduction in area under the curve in studies assessing the performance of algorithms at external validation reached statistical significance (96%; 95% CI: 94, 98; P < .01; Table 4, Fig 6). We report findings of sensitivity analyses for other covariates in Figure E1, Appendix E4, and Tables E9–E13 (online).

Publication Bias
We assessed publication bias by using a regression analysis to quantify funnel plot asymmetry (Fig E2 [online]) (22).
Figure 6: Summary of pooled sensitivity, specificity, and area under the curve (AUC) of algorithms and clinicians comparing all studies versus
low-bias studies with 95% CIs.
We found that the slope coefficient was -5.4 (95% CI: -13.7, 2.77; P = .19), suggesting a low risk of publication bias.

Discussion
An increasing number of studies are investigating the potential for artificial intelligence (AI) as a diagnostic adjunct in fracture diagnosis. We performed a systematic review of the methods, results, reporting standards, and quality of studies assessing deep learning in fracture detection tasks. We performed a meta-analysis of diagnostic performance, grouped into internal and external validation results, and compared with clinician performance. Our review highlighted four principal findings. First, AI had high reported diagnostic accuracy, with a pooled sensitivity of 91% (95% CI: 84, 95) and specificity of 91% (95% CI: 81, 95). Second, AI and clinicians had comparable performance (pooled sensitivity, 94% [95% CI: 90, 96]; and specificity, 94% [95% CI: 91, 95]) at external validation. The addition of AI assistance improved clinician performance further (pooled sensitivity, 97% [95% CI: 83, 99]; and specificity, 92% [95% CI: 88, 95]), and one study found that clinicians reached a diagnosis in a shorter time with AI assistance (29–31,34). Third, there were significant flaws in study methods that may limit the real-world applicability of study findings. For example, it is likely that clinician performance was underestimated: only one study provided clinicians with background clinical information. Half of the studies that had a clinician comparison group used small groups (ie, fewer than five clinicians) at high risk of interrater variation. All studies performed experiments on a computer or via computer simulation, and only one evaluated human-algorithm performance in a prospective clinical trial. Fourth, there was high heterogeneity across studies, partly attributable to variations in study methods. Heterogeneity in sensitivity and specificity was higher when methodologic choices, such as internal validation methods or reference standards, were used. There was a wide range of study sample sizes, but only one study (63) performed a sample size calculation.
Previous narrative reviews have reported a wide range of AI accuracy (11–13). However, the use of accuracy as an outcome metric in image classification tasks can be misleading (65). For example, in a data set consisting of 82% fracture and 18% unfractured radiographs, an AI that always predicts a fracture will have a reported accuracy of 82%, despite being deeply flawed (30). A meta-analysis of nine studies by Yang et al (14) reported a pooled sensitivity and specificity of 87% (95% CI: 78, 93) and 91% (95% CI: 85, 95), respectively. This is consistent with the findings of our meta-analysis of 32 studies. We provided further granularity of results, reporting pooled sensitivity and specificity separately for internal (sensitivity, 92% [95% CI: 88, 94]; and specificity, 91% [95% CI: 88, 93]) and external (sensitivity, 91% [95% CI: 84, 95]; and specificity, 91% [95% CI: 81, 95]) validation.
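To make the 82% example above concrete, the short calculation below scores a classifier that labels every radiograph as fractured against a class-imbalanced test set (counts invented): accuracy looks respectable while specificity and the Youden index collapse to zero.

# Why raw accuracy misleads on imbalanced data: a classifier that always
# predicts "fracture" on a set that is 82% fractures scores 82% accuracy
# while detecting no negatives at all. Counts are invented for illustration.
n_fracture, n_normal = 820, 180          # 82% / 18% class split
tp, fn = n_fracture, 0                    # every fracture labeled positive
fp, tn = n_normal, 0                      # every normal image also labeled positive

accuracy = (tp + tn) / (n_fracture + n_normal)
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
youden = sensitivity + specificity - 1
print(f"accuracy={accuracy:.0%} sensitivity={sensitivity:.0%} "
      f"specificity={specificity:.0%} Youden J={youden:.2f}")
# accuracy=82% sensitivity=100% specificity=0% Youden J=0.00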
Our study had limitations. First, we only included studies in the English language that were published after 2018, excluding other potentially eligible studies. Second, we were only able to extract contingency tables from 32 studies. Third, many studies had methodologic flaws and half were classified as high concern for bias and applicability, limiting the conclusions that could be drawn from the meta-analysis because studies with high risk of bias consistently overestimated algorithm performance. Fourth, although adherence to TRIPOD items was generally fair, many manuscripts omitted vital information such as the size of training, tuning, and test sets.

The results from this meta-analysis cautiously suggest that AI is noninferior to clinicians in terms of diagnostic performance in fracture detection, showing promise as a useful diagnostic tool. Many studies have limited real-world applicability because of flawed methods or unrepresentative data sets. Future research must prioritize pragmatic algorithm development. For example, imaging views may be concatenated, and databases should mirror the target population (eg, in fracture prevalence, and age and sex of patients). It is crucial that studies include an objective assessment of sample size adequacy as a guide to readers (66). Data and code sharing across centers may spread the burden of generating large and precisely labeled data sets, and this is encouraged to improve research reproducibility and transparency (67,68). Transparency of study methods and clear presentation of results are necessary for accurate critical appraisal. Machine learning extensions to TRIPOD, or TRIPOD-ML, and Standards for Reporting of Diagnostic Accuracy Studies, or STARD-AI, guidelines are currently being developed and may improve conduct and reporting of deep learning studies (69–71).
Future research should seek to externally validate algorithms in prospective clinical settings and provide a fair comparison with relevant clinicians: for example, providing clinicians with routine clinical detail. External validation and evaluation of algorithms in prospective randomized clinical trials is a necessary next step toward clinical deployment. Current artificial intelligence (AI) is designed as a diagnostic adjunct and may improve workflow through screening or prioritizing images on worklists and highlighting regions of interest for a reporting radiologist. AI may also improve diagnostic certainty through acting as a “second reader” for clinicians or as an interim report prior to radiologist interpretation. However, it is not a replacement for the clinical workflow, and clinicians must understand AI performance and exercise judgement in interpreting algorithm output. We advocate for transparent reporting of study methods and results as crucial to AI integration. By addressing these areas for development, deep learning has potential to streamline fracture diagnosis in a way that is safe and sustainable for patients and health care systems.

Acknowledgment: We thank Eli Harriss, BA, MS, Bodleian Libraries Outreach Librarian for the Bodleian Health Care Libraries, who formulated the search strategies and ran the database searches.

Author contributions: Guarantors of integrity of entire study, R.Y.L.K., D.F.; study concepts/study design or data acquisition or data analysis/interpretation, all authors; manuscript drafting or manuscript revision for important intellectual content, all authors; approval of final version of submitted manuscript, all authors; agrees to ensure any questions related to the work are appropriately resolved, all authors; literature research, R.Y.L.K., C.H., T.A.C., B.J., A.F., D.C., D.F.; statistical analysis, R.Y.L.K., D.C., G.S.C.; and manuscript editing, R.Y.L.K., C.H., B.J., A.F., D.C., M.S., G.S.C., D.F.

Data sharing: All data generated or analyzed during the study are included in the published paper.

Disclosures of conflicts of interest: R.Y.L.K. No relevant relationships. C.H. No relevant relationships. T.A.C. No relevant relationships. B.J. No relevant relationships. A.F. No relevant relationships. D.C. No relevant relationships. M.S. No relevant relationships. G.S.C. No relevant relationships. D.F. Chair, British Society for Surgery of the Hand Research Committee; member, British Association of Plastic, Reconstructive, and Aesthetic Surgeons Research Committee; member, British Lymphology Society Research Committee; chair, Scientific Advisory Committee Restore Research; Trustee, British Dupuytren Society.

References
1. Bergh C, Wennergren D, Möller M, Brisby H. Fracture incidence in adults in relation to age and gender: A study of 27,169 fractures in the Swedish Fracture Register in a well-defined catchment area. PLoS One 2020;15(12):e0244291.
2. Amin S, Achenbach SJ, Atkinson EJ, Khosla S, Melton LJ 3rd. Trends in fracture incidence: a population-based study over 20 years. J Bone Miner Res 2014;29(3):581–589.
3. Curtis EM, van der Velde R, Moon RJ, et al. Epidemiology of fractures in the United Kingdom 1988-2012: Variation with age, sex, geography, ethnicity and socioeconomic status. Bone 2016;87:19–26.
4. UK NHS Annual Report. Hospital accident & emergency activity 2019-20. https://digital.nhs.uk/data-and-information/publications/statistical/hospital-accident–emergency-activity/2019-20. Accessed December 21, 2021.
5. Wei CJ, Tsai WC, Tiu CM, Wu HT, Chiou HJ, Chang CY. Systematic analysis of missed extremity fractures in emergency radiology. Acta Radiol 2006;47(7):710–717.
6. Williams SM, Connelly DJ, Wadsworth S, Wilson DJ. Radiological review of accident and emergency radiographs: a 1-year audit. Clin Radiol 2000;55(11):861–865.
7. Hallas P, Ellingsen T. Errors in fracture diagnoses in the emergency department--characteristics of patients and diurnal variation. BMC Emerg Med 2006;6(1):4.
8. Zha N, Patlas MN, Duszak R Jr. Radiologist burnout is not just isolated to the United States: Perspectives from Canada. J Am Coll Radiol 2019;16(1):121–123.
9. Bender CE, Bansal S, Wolfman D, Parikh JR. 2018 ACR commission on human resources workforce survey. J Am Coll Radiol 2019;16(4 Pt A):508–512.
10. McKinney SM, Sieniek M, Godbole V, et al. International evaluation of an AI system for breast cancer screening. Nature 2020;577(7788):89–94 [Published correction appears in Nature 2020;586(7829):E19.].
11. Smets J, Shevroja E, Hügle T, Leslie WD, Hans D. Machine learning solutions for osteoporosis—A review. J Bone Miner Res 2021;36(5):833–851.
12. Langerhuizen DWG, Janssen SJ, Mallee WH, et al. What are the applications and limitations of artificial intelligence for fracture detection and classification in orthopaedic trauma imaging? A systematic review. Clin Orthop Relat Res 2019;477(11):2482–2491.
13. Kalmet PHS, Sanduleanu S, Primakov S, et al. Deep learning in fracture detection: a narrative review. Acta Orthop 2020;91(2):215–220.
14. Yang S, Yin B, Cao W, Feng C, Fan G, He S. Diagnostic accuracy of deep learning in orthopaedic fractures: A systematic review and meta-analysis. Clin Radiol 2020;75(9):713.e17–713.e28.
15. Falconer J. Removing duplicates from an EndNote library. http://blogs.lshtm.ac.uk/library/2018/12/07/removing-duplicates-from-an-endnote-library/. Accessed May 6, 2021.
16. Jackson D, Turner R. Power analysis for random-effects meta-analysis. Res Synth Methods 2017;8(3):290–302.
17. Macaskill P, Gatsonis C, Deeks J, Harbord R, Takwoingi Y. Cochrane handbook for systematic reviews of diagnostic test accuracy. Version 0.9.0. London, England: The Cochrane Collaboration, 2010; 83.
18. Harbord RM, Whiting P. Metandi: Meta-analysis of diagnostic accuracy using hierarchical logistic regression. Stata J 2009;9(2):211–229.
19. Dwamena B. MIDAS: Stata module for meta-analytical integration of diagnostic test accuracy studies. https://ideas.repec.org/c/boc/bocode/s456880.html. Published 2009. Accessed January 2, 2022.
20. Collins GS, Reitsma JB, Altman DG, Moons KG; TRIPOD Group. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): the TRIPOD statement. The TRIPOD Group. Circulation 2015;131(2):211–219.
21. Wolff RF, Moons KGM, Riley RD, et al. PROBAST: A tool to assess the risk of bias and applicability of prediction model studies. Ann Intern Med 2019;170(1):51–58.
22. Deeks JJ, Macaskill P, Irwig L. The performance of tests of publication bias and other sample size effects in systematic reviews of diagnostic test accuracy was assessed. J Clin Epidemiol 2005;58(9):882–893.
23. Yoon AP, Lee YL, Kane RL, Kuo CF, Lin C, Chung KC. Development and validation of a deep learning model using convolutional neural networks to identify scaphoid fractures in radiographs. JAMA Netw Open 2021;4(5):e216096.
24. Raisuddin AM, Vaattovaara E, Nevalainen M, et al. Critical evaluation of deep neural networks for wrist fracture detection. Sci Rep 2021;11(1):6006.
25. Uysal F, Hardalaç F, Peker O, Tolunay T, Tokgöz N. Classification of shoulder X-ray images with deep learning ensemble models. Appl Sci (Basel) 2021;11(6):2723.
26. Yamada Y, Maki S, Kishida S, et al. Automated classification of hip fractures using deep convolutional neural networks with orthopedic surgeon-level accuracy: ensemble decision-making with antero-posterior and lateral radiographs. Acta Orthop 2020;91(6):699–704.
27. Ozkaya E, Topal FE, Bulut T, Gursoy M, Ozuysal M, Karakaya Z. Evaluation of an artificial intelligence system for diagnosing scaphoid fracture on direct radiography. Eur J Trauma Emerg Surg 2020. 10.1007/s00068-020-01468-0. Published online August 30, 2020.
28. Murata K, Endo K, Aihara T, et al. Artificial intelligence for the detection of vertebral fractures on plain spinal radiography. Sci Rep 2020;10(1):20031.
29. Mawatari T, Hayashida Y, Katsuragawa S, et al. The effect of deep convolutional neural networks on radiologists’ performance in the detection of hip fractures on digital pelvic radiographs. Eur J Radiol 2020;130:109188.
30. Krogue JD, Cheng KV, Hwang KM, et al. Automatic hip fracture identification and functional subclassification with deep learning. Radiol Artif Intell 2020;2(2):e190023.
31. Duron L, Ducarouge A, Gillibert A, et al. Assessment of an AI aid in detection of adult appendicular skeletal fractures by emergency physicians and radiologists: A multicenter cross-sectional diagnostic study. Radiology 2021;300(1):120–129.
32. Choi J, Hui JZ, Spain D, Su YS, Cheng CT, Liao CH. Practical computer vision application to detect hip fractures on pelvic X-rays: a bi-institutional study. Trauma Surg Acute Care Open 2021;6(1):e000705.
33. Cheng CT, Wang Y, Chen HW, et al. A scalable physician-level deep learning algorithm detects universal trauma on pelvic radiographs. Nat Commun 2021;12(1):1066.
34. Cheng CT, Chen CC, Cheng FJ, et al. A human-algorithm integration system for hip fracture detection on plain radiography: System development and validation study. JMIR Med Inform 2020;8(11):e19416.
35. Chen HY, Hsu BW, Yin YK, et al. Application of deep learning algorithm to detect and visualize vertebral fractures on plain frontal radiographs. PLoS One 2021;16(1):e0245992.
36. Choi JW, Cho YJ, Lee S, et al. Using a dual-input convolutional neural network for automated detection of pediatric supracondylar fracture on conventional radiography. Invest Radiol 2020;55(2):101–110.
37. Wang Y, Lu L, Cheng C, et al. Weakly supervised universal fracture detection in pelvic x-rays. In: Shen D, Liu T, Peters TM, et al, eds. Medical Image Computing and Computer Assisted Intervention – MICCAI 2019. MICCAI 2019. Lecture Notes in Computer Science, vol 11769. Cham, Switzerland: Springer, 2019; 459–467.
38. Cheng CT, Ho TY, Lee TY, et al. Application of a deep learning algorithm for detection and visualization of hip fractures on plain pelvic radiographs. Eur Radiol 2019;29(10):5469–5477.
39. Blüthgen C, Becker AS, Vittoria de Martini I, Meier A, Martini K, Frauenfelder T. Detection and localization of distal radius fractures: Deep learning system versus radiologists. Eur J Radiol 2020;126:108925.
40. Beyaz S, Açıcı K, Sümer E. Femoral neck fracture detection in X-ray images using deep learning and genetic algorithm approaches. Jt Dis Relat Surg 2020;31(2):175–183.
41. Yahalomi E, Chernofsky M, Werman M. Detection of distal radius fractures trained by a small set of X-ray images and faster R-CNN. arXiv preprint arXiv:1812.09025. https://arxiv.org/abs/1812.09025. Posted December 21, 2018. Accessed May 6, 2021.
42. Yu JS, Yu SM, Erdal BS, et al. Detection and localisation of hip fractures on anteroposterior radiographs with artificial intelligence: proof of concept. Clin Radiol 2020;75(3):237.e1–237.e9.
43. Urakawa T, Tanaka Y, Goto S, Matsuzawa H, Watanabe K, Endo N. Detecting intertrochanteric hip fractures with orthopedist-level accuracy using a deep convolutional neural network. Skeletal Radiol 2019;48(2):239–244.
44. Thian YL, Li Y, Jagmohan P, Sia D, Chan VEY, Tan RT. Convolutional neural networks for automated fracture detection and localization on wrist radiographs. Radiol Artif Intell 2019;1(1):e180001.
45. Rayan JC, Reddy N, Kan JH, Zhang W, Annapragada A. Binomial classification of pediatric elbow fractures using a deep learning multiview approach emulating radiologist decision making. Radiol Artif Intell 2019;1(1):e180015.
46. Adams M, Chen W, Holcdorf D, McCusker MW, Howe PD, Gaillard F. Computer vs human: Deep learning versus perceptual training for the detection of neck of femur fractures. J Med Imaging Radiat Oncol 2019;63(1):27–32.
47. Mehta SD, Sebro R. Computer-aided detection of incidental lumbar spine fractures from routine dual-energy X-ray absorptiometry (DEXA) studies using a support vector machine (SVM) classifier. J Digit Imaging 2020;33(1):204–210.
48. Mutasa S, Varada S, Goel A, Wong TT, Rasiej MJ. Advanced deep learning techniques applied to automated femoral neck fracture detection and classification. J Digit Imaging 2020;33(5):1209–1217.
49. Starosolski ZA, Kan H, Annapragada AV. CNN-based radiographic acute tibial fracture detection in the setting of open growth plates. bioRxiv preprint bioRxiv:506154. https://www.biorxiv.org/content/10.1101/506154. Posted January 3, 2019. Accessed May 6, 2021.
50. Jiménez-Sánchez A, Kazi A, Albarqouni S, et al. Precise proximal femur fracture classification for interactive training and surgical planning. Int J CARS 2020;15(5):847–857.
51. Gan K, Xu D, Lin Y, et al. Artificial intelligence detection of distal radius fractures: a comparison between the convolutional neural network and professional assessments. Acta Orthop 2019;90(4):394–400.
52. Derkatch S, Kirby C, Kimelman D, Jozani MJ, Davidson JM, Leslie WD. Identification of vertebral fractures by convolutional neural networks to predict nonvertebral and hip fractures: A registry-based cohort study of dual X-ray absorptiometry. Radiology 2019;293(2):405–411.
53. Chung SW, Han SS, Lee JW, et al. Automated detection and classification of the proximal humerus fracture by using deep learning algorithm. Acta Orthop 2018;89(4):468–473.
54. Lindsey R, Daluiski A, Chopra S, et al. Deep neural network improves fracture detection by clinicians. Proc Natl Acad Sci U S A 2018;115(45):11591–11596.
55. Langerhuizen DWG, Bulstra AEJ, Janssen SJ, et al. Is deep learning on par with human observers for detection of radiographically visible and occult fractures of the scaphoid? Clin Orthop Relat Res 2020;478(11):2653–2659.
56. Kitamura G, Chung CY, Moore BE 2nd. Ankle fracture detection utilizing a convolutional neural network ensemble implemented with a small sample, de novo training, and multiview incorporation. J Digit Imaging 2019;32(4):672–677.
57. Grauhan NF, Niehues SM, Gaudin RA, et al. Deep learning for accurately recognizing common causes of shoulder pain on radiographs. Skeletal Radiol 2022;51(2):355–362.
58. Kim DH, MacKinnon T. Artificial intelligence in fracture detection: transfer learning from deep convolutional neural networks. Clin Radiol 2018;73(5):439–445.
59. Zhou QQ, Wang J, Tang W, et al. Automatic detection and classification of rib fractures on thoracic CT using convolutional neural network: Accuracy and feasibility. Korean J Radiol 2020;21(7):869–879.
60. Weikert T, Noordtzij LA, Bremerich J, et al. Assessment of a deep learning algorithm for the detection of rib fractures on whole-body trauma computed tomography. Korean J Radiol 2020;21(7):891–899.
61. Raghavendra U, Bhat NS, Gudigar A, Acharya UR. Automated system for the detection of thoracolumbar fractures using a CNN architecture. Future Gener Comput Syst 2018;85:184–189.
62. Pranata YD, Wang KC, Wang JC, et al. Deep learning and SURF for automated classification and detection of calcaneus fractures in CT images. Comput Methods Programs Biomed 2019;171:27–37.
63. Kolanu N, Silverstone E, Pham H, et al. Utility of computer-aided vertebral fracture detection software. JOURNAL 2020;31(Suppl 1):S179.
64. Sato Y, Takegami Y, Asamoto T, et al. A computer-aided diagnosis system using artificial intelligence for hip fractures -multi-institutional joint development research-. arXiv preprint arXiv:2003.12443. https://arxiv.org/abs/2003.12443. Posted March 11, 2020. Accessed May 6, 2021.
65. Kuo RYL, Harrison CJ, Jones BE, Geoghegan L, Furniss D. Perspectives: A surgeon’s guide to machine learning. Int J Surg 2021;94:106133.
66. Balki I, Amirabadi A, Levman J, et al. Sample-size determination methodologies for machine learning in medical imaging research: A systematic review. Can Assoc Radiol J 2019;70(4):344–353.
67. Liu X, Rivera SC, Moher D, Calvert MJ, Denniston AK; SPIRIT-AI and CONSORT-AI Working Group. Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: the CONSORT-AI Extension. BMJ 2020;370:m3164.
68. Rivera SC, Liu X, Chan AW, Denniston AK, Calvert MJ; SPIRIT-AI and CONSORT-AI Working Group. Guidelines for clinical trial protocols for interventions involving artificial intelligence: the SPIRIT-AI Extension. BMJ 2020;370:m3210.
69. Turner L, Shamseer L, Altman DG, Schulz KF, Moher D. Does use of the CONSORT Statement impact the completeness of reporting of randomised controlled trials published in medical journals? A Cochrane review. Syst Rev 2012;1(1):60.
70. Collins GS, Moons KGM. Reporting of artificial intelligence prediction models. Lancet 2019;393(10181):1577–1579.
71. Sounderajah V, Ashrafian H, Aggarwal R, et al. Developing specific reporting guidelines for diagnostic accuracy studies assessing AI interventions: The STARD-AI Steering Group. Nat Med 2020;26(6):807–808.