
RESEARCH LETTER

ChatGPT and Generating a Differential Diagnosis Early in an Emergency Department Presentation

0196-0644/$-see front matter
Copyright © 2023 by the American College of Emergency Physicians.

INTRODUCTION
Rapid diagnosis of patients at the emergency department (ED) is crucial for improving patient outcomes and reducing length of stay.1 To make accurate diagnoses, physicians rely on various sources of information, including medical history, physical examination, medication usage, and further diagnostic evaluations. However, the volume and complexity of data inherent in modern medicine, coupled with the rapid pace of medical advancements, can pose challenges to medical practitioners.

Artificial intelligence (AI) could play an important role in addressing this issue, and it is gaining importance as a diagnostic and prognostic tool in the health care environment.2-4 ChatGPT (OpenAI) is a large language model with the potential to aid in generating ideas, extracting important information from text, writing documents, education, and clinical practice.5,6 In this retrospective study, we aimed to investigate the ability of ChatGPT to generate accurate differential diagnoses for undifferentiated patients based on physician notes recorded at initial ED presentation.

METHODS
Undifferentiated patients (n=30) who presented at the ED of the Jeroen Bosch Hospital (a single-center, nonacademic teaching hospital) in March 2022 with a single proven diagnosis were retrospectively included. The written informed consent requirement was waived by the Medical Ethics Committee (NW2023-18). Physician notes made directly at ED presentation, containing medical history, physical examination, and medication, were anonymized and translated. The notes contained exclusively information immediately available on patient presentation; the treating physicians' impressions and any additional tests (such as imaging) were excluded, except for the physical examination. Each patient presenting at our ED receives an identical set of laboratory tests, ensuring uniform, unbiased diagnostic procedures. However, D-dimer and troponin I are not part of these routine examinations and were measured only in some patients, according to the specific complaints and clinical presentation. Definitive diagnoses were derived from the discharge letter sent to the patient's general practitioner.

A differential diagnosis was created based on the patient's primary complaint and consisted of 5 possible disease-specific causes (additional information is provided in Appendix E1 [available at http://www.annemergmed.com]). To mimic daily practice, each case was evaluated by an emergency medicine or internal medicine resident and a supervising specialist. Each team formulated an unranked differential diagnosis list for each case, with one diagnosis identified as the most likely (without laboratory data available). The teams first formulated a differential diagnosis and a leading diagnosis without laboratory tests; they then adjusted the differential and leading diagnoses, if necessary, after evaluating the laboratory results. Each case was entered into ChatGPT in Dutch and in English, in triplicate (additional information is provided in Appendix E1). All queries and ChatGPT results can be found in Appendix E1.
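The letter does not state whether cases were entered through the ChatGPT web interface or programmatically. As a minimal sketch of how such triplicate, bilingual submissions could be scripted, assuming the OpenAI Python SDK and a placeholder model name and prompt wording (not the authors' actual queries, which are documented in Appendix E1):

# Illustrative sketch only: prompt wording, model name, and note text are
# placeholders; the study's actual queries are documented in Appendix E1.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (  # assumed wording, not the authors' query
    "Based on the following ED physician note, give the 5 most likely "
    "diagnoses and indicate which one is most likely.\n\n{note}"
)

def query_case(note: str, model: str = "gpt-4", n_runs: int = 3) -> list[str]:
    """Submit one anonymized case note n_runs times and collect the replies."""
    replies = []
    for _ in range(n_runs):
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": PROMPT.format(note=note)}],
        )
        replies.append(resp.choices[0].message.content)
    return replies

# Each case would be run in both languages, e.g.:
# dutch_runs, english_runs = query_case(note_nl), query_case(note_en)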
Confidence intervals were calculated using bootstrapping. Groups were compared using 1-way ANOVA. P-values <.05 were considered statistically significant.
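The letter names the statistical procedures but not the software used. A minimal sketch of both, assuming Python with NumPy and SciPy, with invented 0/1 per-case accuracy outcomes standing in for the study data:

# Minimal sketch of the named statistics; the 0/1 accuracy arrays below
# are invented for illustration, not study data.
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(0)

def bootstrap_ci(hits: np.ndarray, n_boot: int = 10_000, alpha: float = 0.05):
    """95% CI for an accuracy: resample the per-case 0/1 outcomes."""
    means = np.array([rng.choice(hits, size=hits.size, replace=True).mean()
                      for _ in range(n_boot)])
    return np.quantile(means, [alpha / 2, 1 - alpha / 2])

physicians = rng.binomial(1, 0.83, size=30)   # placeholder outcomes
gpt35 = rng.binomial(1, 0.77, size=30)
gpt40 = rng.binomial(1, 0.87, size=30)

print("physicians 95% CI:", bootstrap_ci(physicians))
f_stat, p_value = f_oneway(physicians, gpt35, gpt40)
print(f"1-way ANOVA: F={f_stat:.2f}, p={p_value:.3f}")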
RESULTS
On retrospective review of the medical history and physical examination, physicians correctly included the definitive diagnosis in their top 5 differential diagnoses in 83% of cases, similar to 77% for ChatGPT v3.5 and 87% for v4.0 (Figure 1A, Supplementary Table E1 [available at http://www.annemergmed.com]). With laboratory data, physicians' accuracy increased to 87%, ChatGPT v3.5 increased to 97%, and v4.0 remained at 87% (Figure 1B, Supplementary Table E1).

Physicians chose the correct leading diagnosis in 60% of cases, compared with 37% for ChatGPT v3.5 and 53% for v4.0 (Figure 1C, Supplementary Table E1). With laboratory results, physicians chose the correct leading diagnosis in 53% of cases, comparable to the accuracy of ChatGPT v3.5 (60%) and v4.0 (53%) (Figure 1D, Supplementary Table E1).

There was approximately 60% overlap between the physicians' and ChatGPT's differential diagnoses for both versions (Figure 1E), and 50% overlap in leading diagnoses (Figure 1F). Dutch and English queries showed similar diagnostic accuracy (Figure 1A-F). Notably, in some cases ChatGPT generated incorrect answers or explanations (Supplementary Table E2).

Submitting the same query to ChatGPT can generate varied responses. Leading diagnoses were identical across 3 consecutive submissions in 55% of Dutch and 60% of English queries (Figure 2A). On average, there was 70% overlap between the initial query's differential diagnosis and the differential diagnoses of the 2 subsequent submissions (Figure 2B).
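The letter does not give explicit formulas for overlap or consistency. One plausible reading, sketched below with invented diagnosis lists, scores overlap as the shared fraction of the 5-item differentials and consistency as whether the leading diagnosis repeats across the 3 submissions:

# Plausible reconstruction of the overlap and consistency metrics; the
# paper does not spell out the formulas, and the diagnoses are made up.

def ddx_overlap(list_a: list[str], list_b: list[str]) -> float:
    """Fraction of the 5-item differential shared by two lists."""
    return len(set(list_a) & set(list_b)) / 5

def leading_consistent(runs: list[str]) -> bool:
    """True if all submissions returned the same leading diagnosis."""
    return len(set(runs)) == 1

run1 = ["pulmonary embolism", "pneumonia", "ACS", "pericarditis", "pneumothorax"]
run2 = ["pneumonia", "pulmonary embolism", "ACS", "pleuritis", "heart failure"]

print(ddx_overlap(run1, run2))                 # 0.6
print(leading_consistent(["PE", "PE", "PE"]))  # True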


[Figure 1: six bar charts (panels A-F) comparing physicians, ChatGPT 3.5, and ChatGPT 4.0, with Dutch and English queries, with and without laboratory results; y-axes: correct diagnosis in top 5 DDx (%), correct leading diagnosis (%), and average overlap with physicians (%).]

Figure 1. Performance of ChatGPT in predicting diagnoses. A-B, The percentage of cases in which the correct diagnosis is in the top 5 differential diagnoses. C-D, The percentage of cases in which the correct leading diagnosis was chosen. E-F, The average percentage of overlap in differential diagnosis and leading diagnosis between ChatGPT and the physicians. There was no significant difference between any of the groups.

DISCUSSION
This study presents one of the first comprehensive investigations of the potential of large language models to assist in the diagnostic workup.7 In this study, we demonstrated that ChatGPT's performance in generating differential diagnoses is comparable to that of medical experts retrospectively assessing the same cases.


The study has some limitations. Unlike the highly complex patients often seen in routine care at the ED, who typically present with multiple concurrent medical issues and diagnoses, the cases chosen for this study featured a single primary complaint and diagnosis. The efficacy of ChatGPT in providing multiple distinct diagnoses for patients with complex or rare diseases remains unverified. Additionally, ChatGPT's rationale was at times medically implausible or inconsistent, which can lead to misinformation or incorrect diagnoses, with significant implications. Furthermore, ethical and legal requirements regarding large language models in a medical setting should be carefully considered, as ChatGPT is not a medical device. Lastly, the definitive diagnosis was based on the discharge letter, which could be inaccurate.

The study also has several notable strengths. First, the protocol for creating differential diagnoses accurately mimics daily practice. Second, we systematically assessed each case in triplicate, in both Dutch and English, with and without routine laboratory results. By submitting each case to ChatGPT in triplicate, we illustrated the degree of inconsistency present in responses to identical queries. This observed inconsistency in ChatGPT's outputs emphasizes the inherent unpredictability of large language models and underscores that these are merely tools that can aid, but not replace, physicians' judgment.

ChatGPT holds promise as an aid to medical professionals in generating differential diagnoses in clinical practice. It is important to stress that the purpose of AI like ChatGPT is not to replace the judgment of a clinician, but rather to serve as an aid to the decision-making process. This proof-of-concept study demonstrates that large language models possess the capacity to augment current medical practice by providing valuable support to health care professionals.

DeepL (DeepL SE) was used for translation of the physicians' notes from Dutch to English. ChatGPT (v3.5) was used for checking grammar, structure, and spelling.
[Figure 2: two bar charts (panels A and B) comparing Dutch and English queries; y-axes: cases with the same leading diagnosis across three queries (%) and overlap in DDx between the separate queries (%).]

Figure 2. The consistency of ChatGPT in its query results. A, The percentage of cases in which the same leading diagnosis was generated three consecutive times. B, The average percentage of overlap in the 5 differential diagnosis options of the second and third queries, compared with the first query. There was no significant difference between any of the groups.

Hidde ten Berg
Bram van Bakel
Lieke van de Wouw
Kim E. Jie
Anoeska Schipper
Henry Jansen
Rory D. O'Connor
Bram van Ginneken
Steef Kurstjens, PhD

https://doi.org/10.1016/j.annemergmed.2023.08.003

Supervising editors: Stephen Schenkel, MD, MPP; David L. Schriger, MD, MPH.

Author affiliations: From the Department of Emergency Medicine (ten Berg, van de Wouw, Jie, O'Connor), Department of Internal Medicine (van Bakel, Jansen), Laboratory of Clinical Chemistry and Hematology (Schipper, Kurstjens), and the Content Support Team (Schipper), Jeroen Bosch Hospital, 's Hertogenbosch, the Netherlands; and Diagnostic Image Analysis Group (Schipper, van Ginneken), Radboudumc, Nijmegen, the Netherlands.

Corresponding Author email: [email protected]

Author contributions: HtB, BvG, KJ, and SK designed the study. HtB and SK collected the data. SK, HtB, and AS performed data analysis. BvB, LvdW, HJ, and RO assessed the cases. SK wrote the manuscript. All authors provided input for the manuscript and read and approved the final version. SK takes responsibility for the paper as a whole.

Data sharing statement: Information and data are available from the corresponding author upon reasonable request. All English ChatGPT responses can be found in Appendix E1.

All authors attest to meeting the four ICMJE.org authorship criteria: (1) Substantial contributions to the conception or design of the work; or the acquisition, analysis, or interpretation of data for the work; AND (2) Drafting the work or revising it critically for important intellectual content; AND (3) Final approval of the version to be published; AND (4) Agreement to be accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.


Funding and support: The authors received no financial support for the research, authorship, and/or publication of this article. All authors have read the journal's policy on disclosure of potential conflicts of interest and have none to declare.

Publication dates: Received for publication June 6, 2023. Revision received July 27, 2023. Accepted for publication August 2, 2023.

REFERENCES
1. Singer AJ, Thode HC Jr, Viccellio P, et al. The association between length of emergency department boarding and mortality. Acad Emerg Med. 2011;18:1324-1329.
2. Kurstjens S, van der Horst A, Herpers R, et al. Rapid identification of SARS-CoV-2-infected patients at the emergency department using routine testing. Clin Chem Lab Med. 2020;58:1587-1593.
3. van Leeuwen KG, Schalekamp S, Rutten MJCM, et al. Artificial intelligence in radiology: 100 commercially available products and their scientific evidence. Eur Radiol. 2021;31:3797-3804.
4. van Doorn WPTM, Stassen PM, Borggreve HF, et al. A comparison of machine learning models versus clinical evaluation for mortality prediction in patients with sepsis. PLoS One. 2021;16(1):e0245157.
5. Lee P, Bubeck S, Petro J. Benefits, limits, and risks of GPT-4 as an AI chatbot for medicine. N Engl J Med. 2023;388:1233-1239.
6. Gates B. Will ChatGPT transform healthcare? Nat Med. 2023;29:505-506.
7. Hirosawa T, Harada Y, Yokose M, et al. Diagnostic accuracy of differential-diagnosis lists generated by generative pretrained transformer 3 chatbot for clinical vignettes with common chief complaints: a pilot study. Int J Environ Res Public Health. 2023;20:3378.
