
ORIGINAL RESEARCH • EVIDENCE-BASED PRACTICE

Artificial Intelligence in Fracture Detection: A Systematic Review and Meta-Analysis
Rachel Y. L. Kuo, MB BChir, MA, MRCS • Conrad Harrison, BSc, MBBS, MRCS •
Terry-Ann Curran, MB BCh BAO, MD • Benjamin Jones, BMBCh, BA •
Alexander Freethy, BSc, MBBS, MSc, MRCS • David Cussons, BSc, MBBS • Max Stewart, MB BChir, BA •
Gary S. Collins, BSc, PhD • Dominic Furniss, DM, MA, MBBCh, FRCS (Plast)
From the Nuffield Department of Orthopedics, Rheumatology and Musculoskeletal Sciences, Botnar Research Centre, Old Road Headington, Oxford OX3 7LD, UK
(R.Y.L.K., C.H., M.S., G.S.C., D.F.); Department of Plastic Surgery, John Radcliffe Hospital, Oxford, UK (T.A.C., A.F.); Department of Vascular Surgery, Royal Berkshire
Hospital, Reading, UK (B.J.); Department of Plastic Surgery, Stoke Mandeville Hospital, Aylesbury, Buckinghamshire UK (D.C.); and UK EQUATOR Center, Nuffield
Department of Orthopaedics, Rheumatology and Musculoskeletal Sciences, University of Oxford Centre for Statistics in Medicine, Oxford UK (G.S.C.). Received July
14, 2021; revision requested August 16; revision received January 15, 2022; accepted January 21. Address correspondence to R.Y.K. (e-mail: [email protected]).
R.K. supported by a National Institute for Health Research (NIHR) Academic Clinical Fellowship. C.H. supported by a NIHR Doctoral Research Fellowship
(NIHR300684). D.F. supported by the Oxford NIHR Biomedical Research Centre. This publication presents independent research funded by the NIHR.

Conflicts of interest are listed at the end of this article.


See also the editorial by Cohen and McInnes in this issue.

Radiology 2022; 304:50–62 • https://doi.org/10.1148/radiol.211785

Background: Patients with fractures are a common emergency presentation and may be misdiagnosed at radiologic imaging. An
increasing number of studies apply artificial intelligence (AI) techniques to fracture detection as an adjunct to clinician diagnosis.

Purpose: To perform a systematic review and meta-analysis comparing the diagnostic performance in fracture detection between AI
and clinicians in peer-reviewed publications and the gray literature (ie, articles published on preprint repositories).

Materials and Methods: A search of multiple electronic databases between January 2018 and July 2020 (updated June 2021) was performed that included any primary research studies that developed and/or validated AI for the purposes of fracture detection at any
imaging modality and excluded studies that evaluated image segmentation algorithms. Meta-analysis with a hierarchical model to
calculate pooled sensitivity and specificity was used. Risk of bias was assessed by using a modified Prediction Model Study Risk of
Bias Assessment Tool, or PROBAST, checklist.

Results: Included for analysis were 42 studies, with 115 contingency tables extracted from 32 studies (55 061 images). Thirty-seven
studies identified fractures on radiographs and five studies identified fractures on CT images. For internal validation test sets, the
pooled sensitivity was 92% (95% CI: 88, 93) for AI and 91% (95% CI: 85, 95) for clinicians, and the pooled specificity was 91%
(95% CI: 88, 93) for AI and 92% (95% CI: 89, 92) for clinicians. For external validation test sets, the pooled sensitivity was 91%
(95% CI: 84, 95) for AI and 94% (95% CI: 90, 96) for clinicians, and the pooled specificity was 91% (95% CI: 81, 95) for AI and
94% (95% CI: 91, 95) for clinicians. There were no statistically significant differences between clinician and AI performance. There
were 22 of 42 (52%) studies that were judged to have high risk of bias. Meta-regression identified multiple sources of heterogeneity
in the data, including risk of bias and fracture type.

Conclusion: Artificial intelligence (AI) and clinicians had comparable reported diagnostic performance in fracture detection,
suggesting that AI technology holds promise as a diagnostic adjunct in future clinical practice.

Clinical trial registration no. CRD42020186641


© RSNA, 2022

Online supplemental material is available for this article.

Fractures have an incidence of between 733 and 4017 per 100 000 patient-years (1–3). In the financial year April 2019 to April 2020, 1.2 million patients presented to an emergency department in the United Kingdom with an acute fracture or dislocation, an increase of 23% from the year before (4). Missed or delayed diagnosis of fractures on radiographs is a common diagnostic error, ranging from 3% to 10% (5–7). There is an inverse relationship between clinician experience and rate of fracture misdiagnosis, but timely access to expert opinion is not widely available (6). Growth in imaging volumes continues to outpace radiologist recruitment: a Canadian study (8) from 2019 found an increase in radiologist workloads of 26% over 12 years, whereas a study from the American College of Radiology found a 30% increase in job openings from 2017 to 2018 (9). Strategies (6,10) to reduce rates of fracture misdiagnosis and to streamline patient pathways are crucial to maintain high standards of patient care.

Artificial intelligence (AI) is a branch of computer science in which algorithms perform tasks traditionally assigned to humans. Machine learning is a term that refers to a group of techniques in the field of AI that allow algorithms to learn from data, iteratively improving their own performance without the need for explicit programming. Deep learning is a term often used interchangeably with machine learning but refers to algorithms that use multiple processing layers to extract high-level information from any input.
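For illustration, most of the models reviewed below are convolutional neural networks that are fine-tuned from a pretrained backbone (transfer learning) and output a probability of fracture for each radiograph. The following is a minimal sketch of that pattern, assuming PyTorch and torchvision are available; the backbone choice, data handling, and hyperparameters are placeholders rather than details taken from any reviewed study.

```python
# Illustrative sketch, not code from any reviewed study: a binary fracture classifier
# built by transfer learning from an ImageNet-pretrained backbone, producing a
# probability of fracture per radiograph. Data loading and the training schedule are
# placeholders.
import torch
import torch.nn as nn
from torchvision import models

device = "cuda" if torch.cuda.is_available() else "cpu"

# Pretrained ResNet-18 with its final layer replaced by a single fracture logit.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, 1)
model = model.to(device)

criterion = nn.BCEWithLogitsLoss()                        # binary cross-entropy on logits
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_step(images, labels):
    """One optimization step on a batch of radiographs (N, 3, H, W) and labels (N,)."""
    model.train()
    images, labels = images.to(device), labels.float().to(device)
    optimizer.zero_grad()
    loss = criterion(model(images).squeeze(1), labels)
    loss.backward()
    optimizer.step()
    return loss.item()

def fracture_probability(images):
    """Predicted probability of fracture for a batch of radiographs."""
    model.eval()
    with torch.no_grad():
        return torch.sigmoid(model(images.to(device)).squeeze(1))
```

In practice the reviewed studies varied widely in architecture, preprocessing, and output format (binary label, probability of fracture, saliency map, or bounding box), as summarized in Tables 1–3.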

Abbreviations
AI = artificial intelligence, TRIPOD = Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis

Summary
Artificial intelligence is noninferior to clinicians in terms of diagnostic performance in fracture detection, showing promise as a useful diagnostic tool.

Key Results
• In a systematic review and meta-analysis of 42 studies (37 studies with radiography and five studies with CT), the pooled diagnostic performance from the use of artificial intelligence (AI) to detect fractures had a sensitivity of 92% and 91% and a specificity of 91% and 91%, on internal and external validation, respectively.
• Clinicians had comparable performance to AI in fracture detection (sensitivity, 91% and 94%; specificity, 92% and 94%, on internal and external validation, respectively).
• Only 13 studies externally validated results, and only one study evaluated AI performance in a prospective clinical trial.

Health care, and in particular, radiologic image classification, has been identified as a key sector in which AI could streamline pathways, acting as a triage or screening service, as a decision aid, or as second-reader support for radiologists (10).

Recent narrative reviews have reported high accuracy for deep learning in fracture detection and classification. Smets et al (11) summarized 32 studies, finding a wide range of accuracy (78%–99%), similar to Langerhuizen et al (12), who found a range of 77%–90% accuracy across 10 studies. Recent studies reported higher accuracy estimates (93%–99%) (12–14). Yang et al (14) performed a meta-analysis of nine studies with a pooled sensitivity and specificity of 87% and 91%, respectively.

Our study is a systematic review and meta-analysis of 42 studies comparing the diagnostic performance in fracture detection between AI and clinicians in peer-reviewed publications and in the gray literature (ie, articles published on preprint repositories) on radiographs or CT images. We described study methods and adherence to reporting guidelines, and we performed a detailed assessment of risk of bias and study applicability.

Materials and Methods

Protocol and Registration
This systematic review was prospectively registered with PROSPERO (CRD42020186641). Our study was prepared by using guidelines from the Preferred Reporting Items for a Systematic Review and Meta-analysis of Diagnostic Test Accuracy Studies (15,16). All stages of the review (title and abstract screening, full-text screening, data extraction, assessment of adherence to reporting guidelines, bias, and applicability) were performed in duplicate by two independent reviewers (R.Y.L.K. and either C.H., T.A.C., B.J., A.F., D.C., or M.S.), and disagreements were resolved by discussion with a third independent reviewer (G.S.C. or D.F.).

Figure 1: Preferred Reporting Items for Systematic Reviews and Meta-Analyses flowchart shows studies selected for review. ACM = Association for Computing Ma-
chinery, AI = artificial intelligence, CENTRAL = Central Register of Controlled Trials, CINAHL = Cumulative Index to Nursing and Allied Health Literature, IEEE = Institute of
Electrical and Electronics Engineers and Institution of Engineering and Technology.



Table 1: Characteristics of Studies Developing and Internally Validating Algorithms
(Each entry lists: first author (reference), year; imaging modality; target condition; view; comparison group; no. of images per set, training/tuning/testing; reference standard; model output; peer-review status.)

With a comparison group:
Adams (46), 2018; radiography; proximal femur fracture; AP view; comparison of two algorithms, nonexpert clinicians; 643/…/161; surgical confirmation; NR; peer reviewed: yes.
Chen (35), 2021; radiography; vertebral fractures; frontal view; expert clinicians; 1045/N/261; expert consensus; binary classification and saliency map; peer reviewed: yes.
Chung* (53), 2018; radiography; upper humerus fracture; AP view; expert clinicians; NR/…/…; expert consensus; NR; peer reviewed: yes.
Gan (51), 2019; radiography; distal radius fracture; AP view; expert clinicians; 5202/918/300; expert consensus; binary classification; peer reviewed: yes.
Jimenez-Sanchez (50), 2020; radiography; proximal femur fractures; AP view; expert clinicians; 943/135/269; expert consensus; NR; peer reviewed: yes.
Kim (58), 2018; radiography; distal radius or ulna fracture; lateral view; expert clinicians; 8890/1111/1111; single expert opinion; probability of fracture; peer reviewed: yes.
Krogue (30), 2020; radiography; proximal femur fractures; AP view; expert clinicians, with and without algorithm assistance; 1849/739/438; nonexpert consensus, with reference to other imaging in cases of uncertainty; probability of fracture and saliency map; peer reviewed: yes.
Langerhuizen (55), 2020; radiography; scaphoid fracture; scaphoid series; expert clinicians; 180/20/100; MRI report; probability of fracture; peer reviewed: yes.
Mawatari (29), 2020; radiography; proximal femur fractures; AP view; expert and nonexpert clinicians, with and without algorithm assistance; 550/N/50; expert consensus, using CT/MRI for reference; probability of fracture; peer reviewed: yes.
Murata† (28), 2020; radiography; vertebral fractures; AP and lateral view; expert and nonexpert clinicians; NR/…/…; MRI report; NR; peer reviewed: yes.
Ozkaya (27), 2020; radiography; scaphoid fracture; AP view; expert and nonexpert clinicians; 203/87/100; CT report and single expert opinion; NR; peer reviewed: yes.
Pranata (62), 2019; CT; calcaneal fractures; NR; comparison of multiple algorithms; 1550/N/381; radiological report; NR; peer reviewed: yes.
Raisuddin‡ (24), 2020; radiography; distal radius fracture; concatenated AP and lateral view; expert and nonexpert clinicians; 1946/N/N; expert consensus, with CT verification in “Test set 2”; probability of fracture and saliency map; peer reviewed: no.
Urakawa (43), 2018; radiography; intertrochanteric proximal femur fractures; AP view; expert clinicians; 2678/334/334; single expert opinion; binary classification; peer reviewed: yes.
Yamada (26), 2020; radiography; proximal femur fractures; separate and combined AP/lateral view; expert clinicians; 2632/N/300; expert consensus and CT/MRI results; NR; peer reviewed: yes.
Yu§ (42), 2020; radiography; proximal femur fracture; AP view; expert clinicians; 637/212/212; radiological report; binary classification with bounding box; peer reviewed: yes.

Without a comparison group:
Beyaz|| (40), 2020; radiography; proximal femur fracture; AP view; none; NR/…/…; NR; NR; peer reviewed: yes.
Derkatch (52), 2019; radiography; vertebral fractures; lateral view; none; 7646/1274/3822; expert consensus; binary classification and saliency map; peer reviewed: yes.
Grauhan (57), 2021; radiography; proximal humerus fracture; unspecified views; none; 2700/675/269; single expert opinion, with expert consensus for test set; probability of fracture and saliency map; peer reviewed: yes.
Mehta (47), 2019; radiography; L1–4 vertebral fractures; AP view; none; 246/N/61; expert consensus; NR; peer reviewed: yes.
Mutasa (48), 2020; radiography; proximal femur fractures; AP view; none; 7250/N/1813; single expert opinion; probability of fracture and saliency map; peer reviewed: yes.
Raghavendra (61), 2018; CT; T11–L1 vertebral fractures; sagittal view; none; 783/N/336; single expert opinion; NR; peer reviewed: yes.
Rayan (45), 2019; radiography; supracondylar or lateral condyle elbow fracture; AP or lateral view; none; 20 350/N/3096; radiological report, single expert opinion in test set; probability of fracture; peer reviewed: yes.
Sato (64), 2020; radiography; proximal femur fractures; AP view; none; 8484/1000/1000; expert consensus; probability of fracture and saliency map; peer reviewed: no.
Starosolski (49), 2019; radiography; tibial fracture; AP or lateral view; none; 784/98/98; radiological report; probability of fracture and saliency map; peer reviewed: no.
Yahalomi (41), 2018; radiography; distal radius fracture; AP view; none; 3583/N/893; single expert opinion; binary classification with bounding box; peer reviewed: no.
Yoon (23), 2021; radiography; scaphoid fracture; AP and ulnar deviated views; none; 8356/1177/2305; expert consensus; probability of fracture and saliency map; peer reviewed: yes.

Note.—AP = anteroposterior, N = no tuning set, NR = not reported.
* Ten-fold cross-validation (n = 189).
† Fivefold cross-validation (n = 300).
‡ Test set 1, 207 images; test set 2, 105 images.
§ Twenty-fold cross-validation.
|| Fivefold cross-validation (n = 2106).

Search Strategy and Study Selection
A search was performed to identify studies that developed and/or validated an AI algorithm for the purposes of fracture detection. A search strategy was developed with an information specialist, including variations of the terms artificial intelligence and diagnostic imaging. The full search strategy is included in Appendix E1 (online) and Tables E1 and E2 (online). We searched the following electronic databases for English-language peer-reviewed and gray literature between January 2018 and July 2020 (updated in June 2021): Ovid Medline, Ovid Embase, EBSCO Cumulative Index to Nursing and Allied Health Literature, Web of Science, Cochrane Central, Institute of Electrical and Electronics Engineers and Institution of Engineering and Technology Xplore, Association for Computing Machinery Digital Library, arXiv, medRxiv, and bioRxiv. The reference lists of all included articles were screened to identify relevant publications that were missed from our search.

We included all articles that fulfilled the following inclusion criteria: primary research studies that developed and/or validated a deep learning algorithm for fracture detection or classification in any user-independent imaging modality, English language, and human subjects. We applied the following exclusion criteria to our search: conference abstracts, letters to the editor, review articles, and studies that performed purely segmentation tasks or radiomics analysis. We excluded duplicates by using EndNote X9, following the method described by Falconer (15). We did not place any limits on the target population, study setting, or comparator group.

Data Extraction
Titles and abstracts were screened before full-text screening. Data were extracted by using a predefined data extraction sheet. A list of excluded studies, including the reason for exclusion, was recorded in a Preferred Reporting Items for Systematic Reviews and Meta-Analyses flow diagram. Any further papers identified through reference lists underwent the same process of screening and data extraction in duplicate.

We extracted information from each study, including peer-review status, study design, target condition, sample size, comparator groups, and results. Where possible, we extracted diagnostic performance information to construct contingency tables for each model and used them to calculate sensitivity and specificity. When studies included more than one contingency table, all of them were included in the analysis.

Statistical Analysis
We estimated the diagnostic performance of the deep learning algorithms and clinicians by carrying out a meta-analysis of studies providing contingency tables at both internal and external validation. We planned to perform a meta-analysis if at least five studies were eligible for inclusion, the minimum recommended for random-effects meta-analysis (16). We used the contingency tables to construct hierarchical summary receiver operating characteristic curves and to calculate pooled sensitivities and specificities, anticipating a high level of heterogeneity (17). We constructed a visual representation of between-study heterogeneity by using a 95% prediction region in the hierarchical summary receiver operating characteristic curves. We performed a meta-regression analysis to identify sources of between-studies heterogeneity by introducing level of bias; study and fracture type; the reference standard; peer-review status; and whether the algorithm used single or multiple radiologic views, data augmentation, or transfer learning as covariates. Statistical significance was indicated by a P value less than .05. All calculations were performed by using statistical software (Stata version 14.2, Midas and Metandi modules; StataCorp) (18,19).
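The pooling itself was performed with the hierarchical bivariate model implemented in the Stata metandi and midas modules, as stated above. Purely as a rough illustration of the underlying bookkeeping, the sketch below computes per-study sensitivity from 2 × 2 contingency tables and pools the logit-transformed estimates with a simple univariate DerSimonian–Laird random-effects model; the study counts are hypothetical, and this is a simplification of the hierarchical approach actually used in the article.

```python
# Illustrative sketch only: pooling sensitivity from 2x2 contingency tables with a
# univariate DerSimonian-Laird random-effects model. The article used a hierarchical
# bivariate model (Stata 14.2, metandi/midas); the tables below are hypothetical.
import math

# Each study: (true positives, false negatives, false positives, true negatives)
tables = [(90, 10, 8, 92), (45, 5, 6, 44), (180, 20, 15, 185)]

def logit_and_variance(events, non_events):
    """Logit of a proportion and its approximate variance (0.5 continuity correction)."""
    a, b = events + 0.5, non_events + 0.5
    return math.log(a / b), 1.0 / a + 1.0 / b

def inverse_logit(x):
    return 1.0 / (1.0 + math.exp(-x))

def pooled_sensitivity(tables):
    estimates = [logit_and_variance(tp, fn) for tp, fn, _, _ in tables]
    weights = [1.0 / var for _, var in estimates]
    fixed = sum(w * y for w, (y, _) in zip(weights, estimates)) / sum(weights)
    # Cochran's Q and the DerSimonian-Laird between-study variance tau^2.
    q = sum(w * (y - fixed) ** 2 for w, (y, _) in zip(weights, estimates))
    c = sum(weights) - sum(w ** 2 for w in weights) / sum(weights)
    tau2 = max(0.0, (q - (len(tables) - 1)) / c)
    # Random-effects weights, pooled logit, and back-transformed 95% CI.
    re_weights = [1.0 / (var + tau2) for _, var in estimates]
    pooled = sum(w * y for w, (y, _) in zip(re_weights, estimates)) / sum(re_weights)
    se = math.sqrt(1.0 / sum(re_weights))
    return inverse_logit(pooled), (inverse_logit(pooled - 1.96 * se),
                                   inverse_logit(pooled + 1.96 * se))

print(pooled_sensitivity(tables))  # pooled sensitivity and an approximate 95% CI
```

The same bookkeeping applies to specificity (using false positives and true negatives); a bivariate hierarchical model additionally accounts for the correlation between the two.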



Table 2: Characteristics of Studies Developing, Internally and Externally Validating Algorithms
(Fields as in Table 1.)

Bluthgen (39), 2019; radiography; distal radius fracture; concatenated AP and lateral views; expert clinicians; 524/N/300; expert consensus; probability of fracture and saliency map; peer reviewed: yes.
Cheng* (33), 2021; radiography; proximal femur and pelvic fractures; AP view; expert clinicians; NR/…/…; NR; probability of fracture, saliency map, and point annotation; peer reviewed: yes.
Cheng† (38), 2019; radiography; proximal femur fracture; AP view; expert clinicians; 23 288/N/5822; single expert opinion; probability of fracture and saliency map; peer reviewed: yes.
Choi‡ (32), 2021; radiography; proximal femur fracture; AP view; none; NR/…/…; CT report; probability of fracture and saliency map; peer reviewed: no.
Choi§ (36), 2020; radiography; upper humerus fracture; AP or lateral view; expert clinicians; 1012/254/N; expert consensus; probability of fracture and saliency map; peer reviewed: yes.
Lindsey|| (54), 2018; radiography; any wrist fracture; any view; expert and nonexpert clinicians, with and without algorithm assistance; 28 341/3149/3500; expert consensus; binary classification and segmentation prediction; peer reviewed: yes.
Thian (44), 2019; radiography; distal radius fracture; AP or lateral view; none; 13 153/N/1461; expert consensus; saliency map; peer reviewed: yes.
Wang (37), 2019; radiography; proximal femur or pelvic fracture; AP view; expert clinicians; 3087/882/441; expert consensus; binary classification with bounding box; peer reviewed: no.
Zhou (59), 2020; CT; rib fractures; NR; expert clinicians, with and without algorithm assistance; 876/98/105; expert consensus; binary classification; peer reviewed: yes.

Note.—AP = anteroposterior, N = no tuning set, NR = not reported.
* Fivefold cross-validation (n = 5204); external test set, 1888 images.
† External test set, 100 images.
‡ n = 4235; external test set, 500 images.
§ External test set 1, 258 images; external test set 2, 95 images.
|| External set, 1400 images.

Quality Assessment
We assessed studies for adherence to reporting guidelines by using the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD) checklist, which is a 22-item list of recommendations to aid transparent reporting of studies that develop and/or validate prediction models (20). We used a modified version of TRIPOD (Appendix E2, Table E3 [online]), as we considered that not all items on the checklist were informative for deep learning studies; for example, reporting follow-up time is irrelevant for diagnostic accuracy studies. The checklist therefore is limited in granular discrimination between studies, but instead acts as a general indicator of reporting standards.



Table 3: Characteristics of Studies Making Incremental Changes, or Externally Validating Algorithms
(Fields as in Table 1.)

Cheng* (34), 2020; radiography; proximal femur fracture; AP view; expert and nonexpert clinicians, with and without algorithm assistance; …/…/…; expert consensus; probability of fracture and saliency map; peer reviewed: yes.
Duron† (31), 2021; radiography; any appendicular fracture; unspecified views; expert and nonexpert clinicians, with and without algorithm assistance; …/…/…; expert consensus; binary classification and bounding box; peer reviewed: yes.
Kitamura (56), 2019; radiography; ankle fracture; AP or lateral view; comparison of multiple algorithms; 1441/N/240; expert consensus; NR; peer reviewed: yes.
Kolanu‡ (63), 2020; CT; vertebral compression fractures; NR; none; …/…/…; expert consensus; NR; peer reviewed: yes.
Uysal§ (25), 2021; radiography; any shoulder fracture; views not specified; comparison of multiple algorithms; 8379/N/563; NR; binary classification; peer reviewed: no.

Note.—AP = anteroposterior, N = no tuning set, NR = not reported.
* External test set, 100 images; prospective clinical trial, 632 images.
† External test set, 600 images.
‡ External validation, 1696 images.
§ External validation, 150 images.

We used the Prediction Model Study Risk of Bias Assessment Tool, or PROBAST, checklist to assess papers for bias and applicability (Appendix E2, Table E4 [online]) (21). This tool uses signaling questions in four domains (participants, predictors, outcomes, and analysis) to provide both an overall and a granular assessment. We considered both the images used to develop algorithms and the patient population or populations the models were tested on to assess bias and applicability in the first domain. We did not include an assessment of bias or applicability for predictors. The diagnostic performance of both AI and clinicians at internal and external validation was examined separately in studies assessed at low risk of bias.

Publication Bias
We minimized the effect of publication bias by searching preprint servers and hand-searching the reference lists of included studies. We performed a formal assessment of publication bias through a regression analysis by using diagnostic log odds ratios and testing for asymmetry (22).

Results

Study Selection and Characteristics
We identified 8783 peer-reviewed studies, of which 1981 were duplicates. A further 149 studies were identified through preprint servers and citation searching. After full-text screening, 42 studies were included in the review, of which 35 were peer-reviewed publications and seven were preprint publications (Fig 1). Thirty-seven studies identified fractures on radiographs, of which 18 focused on lower limb, 15 on upper limb, and four on other fractures (Tables 1–3) (23–58). Five studies identified fractures on CT images (59–63). All studies performed their analyses with a computer, with retrospectively collected data, by using a supervised learning approach; and one study also performed a prospective nonrandomized clinical trial (34). Thirty-six studies developed and internally validated an algorithm, and nine of these studies also externally validated their algorithm (23,24,26–30,32,33,35–55,57–59,61–63). Six studies externally validated or made an incremental change to a previously developed algorithm (25,31,34,56,60,63). Twenty-three studies restricted their analysis to a single radiologic view (25,27,29–35,37,38,40–43,46–48,50–53,58).

Sixteen studies compared the performance of AI with expert clinicians, seven compared AI to experts and nonexperts, and one compared AI to nonexperts only (24,26–31,33–39,42,43,46,50,51,53–55,58,59). Six studies included clinician performance with and without AI assistance as a comparison group (29–31,34,54,59). The size of comparison groups ranged from three to 58 (median, six groups; interquartile range, 4–15). Three studies compared their algorithm against other algorithms, and 16 studies did not include a comparison group (23,25,32,40,41,44–49,52,56,57,60–64).




Figure 2: Summary of study adherence to Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD) reporting guidelines.

To generate a reference standard for image labeling, 18 studies used expert consensus, six relied on the opinion of a single expert reader, seven used pre-existing radiologic reports or other imaging modalities, five studies used mixed methods, and one defined its reference standard as surgically confirmed fractures (23,24,26–32,34–39,41–63). Three studies did not report how their reference standard was generated (25,33,40).

Study Participants
The number of participants represented by each data set ranged widely (median, 1169; range, 65–21,456; interquartile range, 425–2417; Appendix E3, Table E5 [online]). The proportion of participants with a fracture in each data set also ranged widely (median, 50%; interquartile range, 40%–64%). Seventeen studies did not include the proportion of study participants who were men or women, and 15 studies did not include information about participant age (23,25,27,34–37,41,46,54,56,57,58,60–63).

Algorithm Development and Model Output
The size of training (median, 1898; interquartile range, 784–7646), tuning (median, 739; interquartile range, 142–980), and test (median, 306; interquartile range, 233–1111) data sets at the patient level varied widely (Tables 1–3). Five of 33 (15.2%) studies that developed an algorithm did not report the size of each data set separately (24,28,32,33,40). In studies that performed external validation of an algorithm, the median size of the data set was 511 (range, 100–1696). Thirty studies used data augmentation, and 30 studies used transfer learning (23–30,32,33,35–44,46,48–59,61,62). Twenty-six studies used random split-sample validation as a method of internal validation, five used stratified split sampling, and four used a resampling method (Table E6 [online]) (23–31,33,35–55,57,58,61,62).

Twenty-two studies included localization of fractures in model output to improve end-user interpretability (23,24,30–39,41,42,44,48,49,52,54,57,60,64). Metrics used to evaluate model performance varied widely, including sensitivity and specificity (38 studies); area under the receiver operating characteristic curve and Youden index (23 studies); accuracy (22 studies); positive and negative predictive values (nine studies); and F1, precision, and recall (nine studies).
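As an illustration of the internal validation strategies described above, the sketch below shows a stratified split-sample partition into training, tuning, and test sets using scikit-learn; the arrays, split proportions, and random seeds are hypothetical and not drawn from any reviewed study.

```python
# Illustrative sketch of stratified split-sample internal validation into training,
# tuning, and test sets, one of the strategies described above. The arrays, split
# fractions, and seeds are hypothetical placeholders.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
images = rng.normal(size=(1000, 224, 224))   # placeholder radiograph array
labels = rng.integers(0, 2, size=1000)       # 1 = fracture, 0 = no fracture

# Hold out a test set first, stratifying on the label so that fracture prevalence
# is preserved in every partition.
x_trainval, x_test, y_trainval, y_test = train_test_split(
    images, labels, test_size=0.15, stratify=labels, random_state=42)

# Split the remainder into training and tuning (validation) sets.
x_train, x_tune, y_train, y_tune = train_test_split(
    x_trainval, y_trainval, test_size=0.15, stratify=y_trainval, random_state=42)

for name, y in [("train", y_train), ("tune", y_tune), ("test", y_test)]:
    print(name, len(y), f"fracture prevalence = {y.mean():.2f}")
```

A purely random split omits the `stratify` argument; resampling methods (eg, k-fold cross-validation) repeat this partitioning several times and average performance across folds.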
Quality Assessment
Adherence to TRIPOD reporting standards was variable (Fig 2). Four items were poorly reported (<50% adherence): clarity of study title and abstract (19% and 17% adherence, respectively), sample size calculation (2.4%), discussion and attempt to improve model interpretability (43%), and a statement about supplementary code or data availability (19%).

The Prediction Model Study Risk of Bias Assessment Tool, or PROBAST, led to an overall rating of 22 (52%) and 21 (50%) studies as high risk of bias and high concern regarding applicability, respectively (Fig 3). The main contributing factors to this assessment were studies that did not perform external validation or that internally validated models with small sample sizes. Fifteen (36%) studies were judged to be at high risk of bias and 18 (43%) at high concern for applicability in participant selection because of inclusion and exclusion criteria.




Figure 3: Summary of Prediction Model Study Risk of Bias Assessment Tool (PROBAST) risk of bias and concern about generalizability scores.

Figure 4: Hierarchical summary receiver operating characteristic (HSROC) curves for (A) fracture detection algorithms and (B)
clinicians with internal validation test sets. The 95% prediction region is a visual representation of between-study heterogeneity.

In general, studies were at low concern for bias (six studies; 14% high concern) and applicability (nine studies; 21% high concern) in specifying outcomes, and in analysis (nine studies; 21% high concern).

Meta-Analysis
We extracted 115 contingency tables from 32 studies (55 061 images) that provided sufficient information to calculate contingency tables for binary fracture detection (Tables E7, E8 [online]) (23–25,27–40,42,43,45,47–49,51,53,55–58,60,61,63). Thirty-seven contingency tables from 26 studies were extracted for reported algorithm performance on internal validation, and 15 were extracted from seven studies on external validation. Thirty-six contingency tables from 12 studies were extracted for human performance on the same internal validation test sets, and 23 contingency tables from seven studies were extracted for performance on the same external validation test sets (24,27,30,31,33–39,42,43,51,53,55). Four contingency tables were extracted from four studies for human performance with AI assistance (29–31,34).




Figure 5: Hierarchical summary receiver operating characteristic (HSROC) curves for (A) fracture detection algorithms and (B)
clinicians with external validation test sets. The 95% prediction region is a visual representation of between-study heterogeneity.

Table 4: Pooled Sensitivities, Specificities, and Areas Under the Curve for Artificial Intelligence Algorithms and Clinicians

Parameter Sensitivity (%) Specificity (%) AUC No. of Contingency Tables


Algorithms, internal validation, all studies 92 (88, 94) 91 (88, 93) 0.97 (0.95, 0.98) 37
Studies with low bias 90 (86, 93) 89 (85, 92) 0.95 (0.93, 0.97) 21
Clinicians, internal validation, all studies 91 (85, 95) 92 (89, 95) 0.97 (0.95, 0.98) 36
Studies with low bias 89 (76, 95) 86 (80, 90) 0.93 (0.90, 0.95) 13
Algorithms, external validation, all studies 91 (84, 95) 91 (81, 95) 0.96 (0.94, 0.98) 15
Studies with low bias 89 (76, 95) 80 (74, 85) 0.87 (0.84, 0.90) 10
Clinicians, external validation, all studies 94 (90, 96) 94 (91, 95) 0.98 (0.96, 0.99) 23
Studies with low bias 93 (87, 96) 93 (89, 95) 0.97 (0.95, 0.98) 16
Clinicians with AI assistance, all studies 97 (83, 99) 92 (88, 95) 0.95 (0.92, 0.96) 4
Studies with low bias 97 (83, 99) 92 (88, 95) 0.95 (0.92, 0.96) 4
Note.—Data in parentheses are 95% CIs. Results of all studies and studies with low bias are compared. AI = artificial intelligence, AUC =
area under the receiver operating characteristic curve.

Hierarchical summary receiver operating characteristic curves from the studies evaluating AI and clinician performance on internal validation test sets are included in Figure 4. The pooled sensitivity was 92% (95% CI: 88, 94) for AI and 91% (95% CI: 85, 95) for clinicians. The pooled specificity was 91% (95% CI: 88, 93) for AI and 92% (95% CI: 89, 95) for clinicians. At external validation, the pooled sensitivity was 91% (95% CI: 84, 95) for AI and 94% (95% CI: 90, 96) for clinicians on matched test sets (Fig 5). The pooled specificity was 91% (95% CI: 82, 96) for AI and 94% (95% CI: 91, 95) for clinicians. When clinicians were provided with AI assistance, the pooled sensitivity and specificity were 97% (95% CI: 83, 99) and 92% (95% CI: 88, 95), respectively.

Meta-regression of all studies showed that lower model specificity was associated with lower risk of bias (89%; 95% CI: 87, 91; P < .01), use of data augmentation (92%; 95% CI: 90, 93; P < .01), and transfer learning (91%; 95% CI: 90, 93; P < .01). Higher model sensitivity was associated with algorithms focusing on lower limb fractures (95%; 95% CI: 93, 97; P < .01) and use of resampling methods (97%; 95% CI: 94, 100; P < .01). We performed a sensitivity analysis, separately evaluating studies with low risk of bias. We found that all performance metrics were lower, although only the reduction in area under the curve in studies assessing the performance of algorithms at external validation reached statistical significance (96%; 95% CI: 94, 98; P < .01; Table 4, Fig 6). We report findings of sensitivity analyses for other covariates in Figure E1, Appendix E4, and Tables E9–E13 (online).

Publication Bias
We assessed publication bias by using a regression analysis to quantify funnel plot asymmetry (Fig E2 [online]) (22). We found that the slope coefficient was −5.4 (95% CI: −13.7, 2.77; P = .19), suggesting a low risk of publication bias.
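For readers unfamiliar with this approach, the sketch below shows one way such a funnel plot asymmetry (Deeks-type) test can be set up: each study's diagnostic log odds ratio is regressed against the inverse square root of its effective sample size, and the slope is tested against zero (22). The contingency tables are hypothetical, and the original analysis was run in Stata rather than Python.

```python
# Rough sketch of a funnel plot asymmetry (Deeks-type) test: weighted regression of
# each study's diagnostic log odds ratio on 1/sqrt(effective sample size), testing
# whether the slope differs from zero. Counts are hypothetical; the article ran this
# analysis in Stata, not Python.
import numpy as np
import statsmodels.api as sm

# (TP, FN, FP, TN) for each hypothetical study
tables = np.array([(90, 10, 8, 92), (45, 5, 6, 44), (180, 20, 15, 185), (60, 12, 9, 70)])
tp, fn, fp, tn = (tables[:, i] + 0.5 for i in range(4))   # 0.5 continuity correction

ln_dor = np.log((tp * tn) / (fp * fn))        # diagnostic log odds ratio per study
n_pos, n_neg = tp + fn, fp + tn
ess = 4 * n_pos * n_neg / (n_pos + n_neg)     # effective sample size

design = sm.add_constant(1 / np.sqrt(ess))
fit = sm.WLS(ln_dor, design, weights=ess).fit()
print(f"slope = {fit.params[1]:.2f}, P = {fit.pvalues[1]:.2f}")  # slope near 0: little asymmetry
```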




Figure 6: Summary of pooled sensitivity, specificity, and area under the curve (AUC) of algorithms and clinicians comparing all studies versus
low-bias studies with 95% CIs.

Discussion
An increasing number of studies are investigating the potential for artificial intelligence (AI) as a diagnostic adjunct in fracture diagnosis. We performed a systematic review of the methods, results, reporting standards, and quality of studies assessing deep learning in fracture detection tasks. We performed a meta-analysis of diagnostic performance, grouped into internal and external validation results, and compared with clinician performance. Our review highlighted four principal findings. First, AI had high reported diagnostic accuracy, with a pooled sensitivity of 91% (95% CI: 84, 95) and specificity of 91% (95% CI: 81, 95). Second, AI and clinicians had comparable performance (pooled sensitivity, 94% [95% CI: 90, 96]; and specificity, 94% [95% CI: 91, 95]) at external validation. The addition of AI assistance improved clinician performance further (pooled sensitivity, 97% [95% CI: 83, 99]; and specificity, 92% [95% CI: 88, 95]), and one study found that clinicians reached a diagnosis in a shorter time with AI assistance (29–31,34). Third, there were significant flaws in study methods that may limit the real-world applicability of study findings. For example, it is likely that clinician performance was underestimated: only one study provided clinicians with background clinical information. Half of the studies that had a clinician comparison group used small groups (ie, fewer than five clinicians) at high risk of interrater variation. All studies performed experiments on a computer or via computer simulation, and only one evaluated human-algorithm performance in a prospective clinical trial. Fourth, there was high heterogeneity across studies, partly attributable to variations in study methods. Heterogeneity in sensitivity and specificity was higher when different methodologic choices, such as internal validation methods or reference standards, were used. There was a wide range of study sample sizes, but only one study (63) performed a sample size calculation.

Previous narrative reviews have reported a wide range of AI accuracy (11–13). However, the use of accuracy as an outcome metric in image classification tasks can be misleading (65). For example, in a data set consisting of 82% fracture and 18% unfractured radiographs, an AI model that always predicts a fracture will have a reported accuracy of 82%, despite being deeply flawed (30) (see the brief worked example after this paragraph). A meta-analysis of nine studies by Yang et al (14) reported a pooled sensitivity and specificity of 87% (95% CI: 78, 93) and 91% (95% CI: 85, 95), respectively. This is consistent with the findings of our meta-analysis of 32 studies. We provided further granularity of results, reporting pooled sensitivity and specificity separately for internal (sensitivity, 92% [95% CI: 88, 94]; specificity, 91% [95% CI: 88, 93]) and external (sensitivity, 91% [95% CI: 84, 95]; specificity, 91% [95% CI: 81, 95]) validation.
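The prevalence argument above can be made concrete with a few lines of arithmetic; the counts below are illustrative only.

```python
# Worked version of the prevalence example above: with 82% fracture prevalence, a
# degenerate model that always predicts "fracture" reports 82% accuracy yet has 0%
# specificity. Counts are illustrative only.
n_total = 1000
n_fracture = 820                 # 82% of radiographs show a fracture
n_normal = n_total - n_fracture  # 18% are unfractured

tp, fn = n_fracture, 0           # every fracture is (trivially) flagged
fp, tn = n_normal, 0             # every normal radiograph is also flagged

accuracy = (tp + tn) / n_total
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
print(accuracy, sensitivity, specificity)   # 0.82 1.0 0.0
```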
Our study had limitations. First, we only included studies in the English language that were published after 2018, excluding other potentially eligible studies. Second, we were only able to extract contingency tables from 32 studies. Third, many studies had methodologic flaws, and half were classified as high concern for bias and applicability, limiting the conclusions that could be drawn from the meta-analysis because studies with high risk of bias consistently overestimated algorithm performance. Fourth, although adherence to TRIPOD items was generally fair, many manuscripts omitted vital information such as the size of training, tuning, and test sets.

The results from this meta-analysis cautiously suggest that AI is noninferior to clinicians in terms of diagnostic performance in fracture detection, showing promise as a useful diagnostic tool. Many studies have limited real-world applicability because of flawed methods or unrepresentative data sets. Future research must prioritize pragmatic algorithm development. For example, imaging views may be concatenated, and databases should mirror the target population (eg, in fracture prevalence and in the age and sex of patients). It is crucial that studies include an objective assessment of sample size adequacy as a guide to readers (66). Data and code sharing across centers may spread the burden of generating large and precisely labeled data sets and is encouraged to improve research reproducibility and transparency (67,68). Transparency of study methods and clear presentation of results is necessary for accurate critical appraisal. Machine learning extensions to TRIPOD, or TRIPOD-ML, and Standards for Reporting of Diagnostic Accuracy Studies, or STARD-AI, guidelines are currently being developed and may improve conduct and reporting of deep learning studies (69–71).




Future research should seek to externally validate algorithms in prospective clinical settings and provide a fair comparison with relevant clinicians: for example, by providing clinicians with routine clinical detail. External validation and evaluation of algorithms in prospective randomized clinical trials is a necessary next step toward clinical deployment. Current artificial intelligence (AI) is designed as a diagnostic adjunct and may improve workflow through screening or prioritizing images on worklists and highlighting regions of interest for a reporting radiologist. AI may also improve diagnostic certainty by acting as a “second reader” for clinicians or as an interim report prior to radiologist interpretation. However, it is not a replacement for the clinical workflow, and clinicians must understand AI performance and exercise judgement in interpreting algorithm output. We advocate for transparent reporting of study methods and results as crucial to AI integration. By addressing these areas for development, deep learning has the potential to streamline fracture diagnosis in a way that is safe and sustainable for patients and health care systems.

Acknowledgment: We thank Eli Harriss, BA, MS, Bodleian Libraries Outreach Librarian for the Bodleian Health Care Libraries, who formulated the search strategies and ran the database searches.

Author contributions: Guarantors of integrity of entire study, R.Y.L.K., D.F.; study concepts/study design or data acquisition or data analysis/interpretation, all authors; manuscript drafting or manuscript revision for important intellectual content, all authors; approval of final version of submitted manuscript, all authors; agrees to ensure any questions related to the work are appropriately resolved, all authors; literature research, R.Y.L.K., C.H., T.A.C., B.J., A.F., D.C., D.F.; statistical analysis, R.Y.L.K., D.C., G.S.C.; and manuscript editing, R.Y.L.K., C.H., B.J., A.F., D.C., M.S., G.S.C., D.F.

Data sharing: All data generated or analyzed during the study are included in the published paper.

Disclosures of conflicts of interest: R.Y.L.K. No relevant relationships. C.H. No relevant relationships. T.A.C. No relevant relationships. B.J. No relevant relationships. A.F. No relevant relationships. D.C. No relevant relationships. M.S. No relevant relationships. G.S.C. No relevant relationships. D.F. Chair, British Society for Surgery of the Hand Research Committee; member, British Association of Plastic, Reconstructive, and Aesthetic Surgeons Research Committee; member, British Lymphology Society Research Committee; chair, Scientific Advisory Committee Restore Research; Trustee, British Dupuytren Society.

References
1. Bergh C, Wennergren D, Möller M, Brisby H. Fracture incidence in adults in relation to age and gender: A study of 27,169 fractures in the Swedish Fracture Register in a well-defined catchment area. PLoS One 2020;15(12):e0244291.
2. Amin S, Achenbach SJ, Atkinson EJ, Khosla S, Melton LJ 3rd. Trends in fracture incidence: a population-based study over 20 years. J Bone Miner Res 2014;29(3):581–589.
3. Curtis EM, van der Velde R, Moon RJ, et al. Epidemiology of fractures in the United Kingdom 1988-2012: Variation with age, sex, geography, ethnicity and socioeconomic status. Bone 2016;87:19–26.
4. UK NHS Annual Report. Hospital accident & emergency activity 2019-20. https://digital.nhs.uk/data-and-information/publications/statistical/hospital-accident--emergency-activity/2019-20. Accessed December 21, 2021.
5. Wei CJ, Tsai WC, Tiu CM, Wu HT, Chiou HJ, Chang CY. Systematic analysis of missed extremity fractures in emergency radiology. Acta Radiol 2006;47(7):710–717.
6. Williams SM, Connelly DJ, Wadsworth S, Wilson DJ. Radiological review of accident and emergency radiographs: a 1-year audit. Clin Radiol 2000;55(11):861–865.
7. Hallas P, Ellingsen T. Errors in fracture diagnoses in the emergency department--characteristics of patients and diurnal variation. BMC Emerg Med 2006;6(1):4.
8. Zha N, Patlas MN, Duszak R Jr. Radiologist burnout is not just isolated to the United States: Perspectives from Canada. J Am Coll Radiol 2019;16(1):121–123.
9. Bender CE, Bansal S, Wolfman D, Parikh JR. 2018 ACR commission on human resources workforce survey. J Am Coll Radiol 2019;16(4 Pt A):508–512.
10. McKinney SM, Sieniek M, Godbole V, et al. International evaluation of an AI system for breast cancer screening. Nature 2020;577(7788):89–94 [Published correction appears in Nature 2020;586(7829):E19.].
11. Smets J, Shevroja E, Hügle T, Leslie WD, Hans D. Machine learning solutions for osteoporosis—A review. J Bone Miner Res 2021;36(5):833–851.
12. Langerhuizen DWG, Janssen SJ, Mallee WH, et al. What are the applications and limitations of artificial intelligence for fracture detection and classification in orthopaedic trauma imaging? A systematic review. Clin Orthop Relat Res 2019;477(11):2482–2491.
13. Kalmet PHS, Sanduleanu S, Primakov S, et al. Deep learning in fracture detection: a narrative review. Acta Orthop 2020;91(2):215–220.
14. Yang S, Yin B, Cao W, Feng C, Fan G, He S. Diagnostic accuracy of deep learning in orthopaedic fractures: A systematic review and meta-analysis. Clin Radiol 2020;75(9):713.e17–713.e28.
15. Falconer J. Removing duplicates from an EndNote library. http://blogs.lshtm.ac.uk/library/2018/12/07/removing-duplicates-from-an-endnote-library/. Accessed May 6, 2021.
16. Jackson D, Turner R. Power analysis for random-effects meta-analysis. Res Synth Methods 2017;8(3):290–302.
17. Macaskill P, Gatsonis C, Deeks J, Harbord R, Takwoingi Y. Cochrane handbook for systematic reviews of diagnostic test accuracy. Version 0.9.0. London, England: The Cochrane Collaboration, 2010; 83.
18. Harbord RM, Whiting P. Metandi: Meta-analysis of diagnostic accuracy using hierarchical logistic regression. Stata J 2009;9(2):211–229.
19. Dwamena B. MIDAS: Stata module for meta-analytical integration of diagnostic test accuracy studies. https://ideas.repec.org/c/boc/bocode/s456880.html. Published 2009. Accessed January 2, 2022.
20. Collins GS, Reitsma JB, Altman DG, Moons KG; TRIPOD Group. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): the TRIPOD statement. The TRIPOD Group. Circulation 2015;131(2):211–219.
21. Wolff RF, Moons KGM, Riley RD, et al. PROBAST: A tool to assess the risk of bias and applicability of prediction model studies. Ann Intern Med 2019;170(1):51–58.
22. Deeks JJ, Macaskill P, Irwig L. The performance of tests of publication bias and other sample size effects in systematic reviews of diagnostic test accuracy was assessed. J Clin Epidemiol 2005;58(9):882–893.
23. Yoon AP, Lee YL, Kane RL, Kuo CF, Lin C, Chung KC. Development and validation of a deep learning model using convolutional neural networks to identify scaphoid fractures in radiographs. JAMA Netw Open 2021;4(5):e216096.
24. Raisuddin AM, Vaattovaara E, Nevalainen M, et al. Critical evaluation of deep neural networks for wrist fracture detection. Sci Rep 2021;11(1):6006.
25. Uysal F, Hardalaç F, Peker O, Tolunay T, Tokgöz N. Classification of shoulder X-ray images with deep learning ensemble models. Appl Sci (Basel) 2021;11(6):2723.
26. Yamada Y, Maki S, Kishida S, et al. Automated classification of hip fractures using deep convolutional neural networks with orthopedic surgeon-level accuracy: ensemble decision-making with antero-posterior and lateral radiographs. Acta Orthop 2020;91(6):699–704.
27. Ozkaya E, Topal FE, Bulut T, Gursoy M, Ozuysal M, Karakaya Z. Evaluation of an artificial intelligence system for diagnosing scaphoid fracture on direct radiography. Eur J Trauma Emerg Surg 2020. doi:10.1007/s00068-020-01468-0. Published online August 30, 2020.
28. Murata K, Endo K, Aihara T, et al. Artificial intelligence for the detection of vertebral fractures on plain spinal radiography. Sci Rep 2020;10(1):20031.
29. Mawatari T, Hayashida Y, Katsuragawa S, et al. The effect of deep convolutional neural networks on radiologists' performance in the detection of hip fractures on digital pelvic radiographs. Eur J Radiol 2020;130:109188.
30. Krogue JD, Cheng KV, Hwang KM, et al. Automatic hip fracture identification and functional subclassification with deep learning. Radiol Artif Intell 2020;2(2):e190023.
31. Duron L, Ducarouge A, Gillibert A, et al. Assessment of an AI aid in detection of adult appendicular skeletal fractures by emergency physicians and radiologists: A multicenter cross-sectional diagnostic study. Radiology 2021;300(1):120–129.
32. Choi J, Hui JZ, Spain D, Su YS, Cheng CT, Liao CH. Practical computer vision application to detect hip fractures on pelvic X-rays: a bi-institutional study. Trauma Surg Acute Care Open 2021;6(1):e000705.
33. Cheng CT, Wang Y, Chen HW, et al. A scalable physician-level deep learning algorithm detects universal trauma on pelvic radiographs. Nat Commun 2021;12(1):1066.
34. Cheng CT, Chen CC, Cheng FJ, et al. A human-algorithm integration system for hip fracture detection on plain radiography: System development and validation study. JMIR Med Inform 2020;8(11):e19416.
35. Chen HY, Hsu BW, Yin YK, et al. Application of deep learning algorithm to detect and visualize vertebral fractures on plain frontal radiographs. PLoS One 2021;16(1):e0245992.
36. Choi JW, Cho YJ, Lee S, et al. Using a dual-input convolutional neural network for automated detection of pediatric supracondylar fracture on conventional radiography. Invest Radiol 2020;55(2):101–110.
37. Wang Y, Lu L, Cheng C, et al. Weakly supervised universal fracture detection in pelvic x-rays. In: Shen D, Liu T, Peters TM, et al, eds. Medical Image Computing and Computer Assisted Intervention – MICCAI 2019. Lecture Notes in Computer Science, vol 11769. Cham, Switzerland: Springer, 2019; 459–467.
38. Cheng CT, Ho TY, Lee TY, et al. Application of a deep learning algorithm for detection and visualization of hip fractures on plain pelvic radiographs. Eur Radiol 2019;29(10):5469–5477.
39. Blüthgen C, Becker AS, Vittoria de Martini I, Meier A, Martini K, Frauenfelder T. Detection and localization of distal radius fractures: Deep learning system versus radiologists. Eur J Radiol 2020;126:108925.
40. Beyaz S, Açıcı K, Sümer E. Femoral neck fracture detection in X-ray images using deep learning and genetic algorithm approaches. Jt Dis Relat Surg 2020;31(2):175–183.
41. Yahalomi E, Chernofsky M, Werman M. Detection of distal radius fractures trained by a small set of X-ray images and faster R-CNN. arXiv preprint arXiv:1812.09025. https://arxiv.org/abs/1812.09025. Posted December 21, 2018. Accessed May 6, 2021.
42. Yu JS, Yu SM, Erdal BS, et al. Detection and localisation of hip fractures on anteroposterior radiographs with artificial intelligence: proof of concept. Clin Radiol 2020;75(3):237.e1–237.e9.
43. Urakawa T, Tanaka Y, Goto S, Matsuzawa H, Watanabe K, Endo N. Detecting intertrochanteric hip fractures with orthopedist-level accuracy using a deep convolutional neural network. Skeletal Radiol 2019;48(2):239–244.
44. Thian YL, Li Y, Jagmohan P, Sia D, Chan VEY, Tan RT. Convolutional neural networks for automated fracture detection and localization on wrist radiographs. Radiol Artif Intell 2019;1(1):e180001.
45. Rayan JC, Reddy N, Kan JH, Zhang W, Annapragada A. Binomial classification of pediatric elbow fractures using a deep learning multiview approach emulating radiologist decision making. Radiol Artif Intell 2019;1(1):e180015.
46. Adams M, Chen W, Holcdorf D, McCusker MW, Howe PD, Gaillard F. Computer vs human: Deep learning versus perceptual training for the detection of neck of femur fractures. J Med Imaging Radiat Oncol 2019;63(1):27–32.
47. Mehta SD, Sebro R. Computer-aided detection of incidental lumbar spine fractures from routine dual-energy X-ray absorptiometry (DEXA) studies using a support vector machine (SVM) classifier. J Digit Imaging 2020;33(1):204–210.
48. Mutasa S, Varada S, Goel A, Wong TT, Rasiej MJ. Advanced deep learning techniques applied to automated femoral neck fracture detection and classification. J Digit Imaging 2020;33(5):1209–1217.
49. Starosolski ZA, Kan H, Annapragada AV. CNN-based radiographic acute tibial fracture detection in the setting of open growth plates. bioRxiv preprint bioRxiv:506154. https://www.biorxiv.org/content/10.1101/506154. Posted January 3, 2019. Accessed May 6, 2021.
50. Jiménez-Sánchez A, Kazi A, Albarqouni S, et al. Precise proximal femur fracture classification for interactive training and surgical planning. Int J CARS 2020;15(5):847–857.
51. Gan K, Xu D, Lin Y, et al. Artificial intelligence detection of distal radius fractures: a comparison between the convolutional neural network and professional assessments. Acta Orthop 2019;90(4):394–400.
52. Derkatch S, Kirby C, Kimelman D, Jozani MJ, Davidson JM, Leslie WD. Identification of vertebral fractures by convolutional neural networks to predict nonvertebral and hip fractures: A registry-based cohort study of dual X-ray absorptiometry. Radiology 2019;293(2):405–411.
53. Chung SW, Han SS, Lee JW, et al. Automated detection and classification of the proximal humerus fracture by using deep learning algorithm. Acta Orthop 2018;89(4):468–473.
54. Lindsey R, Daluiski A, Chopra S, et al. Deep neural network improves fracture detection by clinicians. Proc Natl Acad Sci U S A 2018;115(45):11591–11596.
55. Langerhuizen DWG, Bulstra AEJ, Janssen SJ, et al. Is deep learning on par with human observers for detection of radiographically visible and occult fractures of the scaphoid? Clin Orthop Relat Res 2020;478(11):2653–2659.
56. Kitamura G, Chung CY, Moore BE 2nd. Ankle fracture detection utilizing a convolutional neural network ensemble implemented with a small sample, de novo training, and multiview incorporation. J Digit Imaging 2019;32(4):672–677.
57. Grauhan NF, Niehues SM, Gaudin RA, et al. Deep learning for accurately recognizing common causes of shoulder pain on radiographs. Skeletal Radiol 2022;51(2):355–362.
58. Kim DH, MacKinnon T. Artificial intelligence in fracture detection: transfer learning from deep convolutional neural networks. Clin Radiol 2018;73(5):439–445.
59. Zhou QQ, Wang J, Tang W, et al. Automatic detection and classification of rib fractures on thoracic CT using convolutional neural network: Accuracy and feasibility. Korean J Radiol 2020;21(7):869–879.
60. Weikert T, Noordtzij LA, Bremerich J, et al. Assessment of a deep learning algorithm for the detection of rib fractures on whole-body trauma computed tomography. Korean J Radiol 2020;21(7):891–899.
61. Raghavendra U, Bhat NS, Gudigar A, Acharya UR. Automated system for the detection of thoracolumbar fractures using a CNN architecture. Future Gener Comput Syst 2018;85:184–189.
62. Pranata YD, Wang KC, Wang JC, et al. Deep learning and SURF for automated classification and detection of calcaneus fractures in CT images. Comput Methods Programs Biomed 2019;171:27–37.
63. Kolanu N, Silverstone E, Pham H, et al. Utility of computer-aided vertebral fracture detection software. JOURNAL 2020;31(Suppl 1):S179.
64. Sato Y, Takegami Y, Asamoto T, et al. A computer-aided diagnosis system using artificial intelligence for hip fractures: multi-institutional joint development research. arXiv preprint arXiv:2003.12443. https://arxiv.org/abs/2003.12443. Posted March 11, 2020. Accessed May 6, 2021.
65. Kuo RYL, Harrison CJ, Jones BE, Geoghegan L, Furniss D. Perspectives: A surgeon's guide to machine learning. Int J Surg 2021;94:106133.
66. Balki I, Amirabadi A, Levman J, et al. Sample-size determination methodologies for machine learning in medical imaging research: A systematic review. Can Assoc Radiol J 2019;70(4):344–353.
67. Liu X, Rivera SC, Moher D, Calvert MJ, Denniston AK; SPIRIT-AI and CONSORT-AI Working Group. Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: the CONSORT-AI Extension. BMJ 2020;370:m3164.
68. Rivera SC, Liu X, Chan AW, Denniston AK, Calvert MJ; SPIRIT-AI and CONSORT-AI Working Group. Guidelines for clinical trial protocols for interventions involving artificial intelligence: the SPIRIT-AI Extension. BMJ 2020;370:m3210.
69. Turner L, Shamseer L, Altman DG, Schulz KF, Moher D. Does use of the CONSORT Statement impact the completeness of reporting of randomised controlled trials published in medical journals? A Cochrane review. Syst Rev 2012;1(1):60.
70. Collins GS, Moons KGM. Reporting of artificial intelligence prediction models. Lancet 2019;393(10181):1577–1579.
71. Sounderajah V, Ashrafian H, Aggarwal R, et al. Developing specific reporting guidelines for diagnostic accuracy studies assessing AI interventions: The STARD-AI Steering Group. Nat Med 2020;26(6):807–808.
