Abstract
Pre-clinical studies suggest that large language models (e.g., ChatGPT) could be used in the diagnostic process to distinguish inflammatory rheumatic diseases (IRD) from other diseases. We therefore aimed to assess the diagnostic accuracy of ChatGPT-4 in comparison to rheumatologists. For the analysis, the data set of Gräf et al. (2022) was used. Previous patient assessments were analyzed using ChatGPT-4 and compared to the rheumatologists' assessments. ChatGPT-4 listed the correct diagnosis as the top diagnosis comparably often to the rheumatologists (35% vs 39%, p = 0.30), as well as among the top 3 diagnoses (60% vs 55%, p = 0.38). In IRD-positive cases, ChatGPT-4 provided the correct top diagnosis in 71% vs 62% in the rheumatologists' analysis; the correct diagnosis was among the top 3 in 86% (ChatGPT-4) vs 74% (rheumatologists). In non-IRD cases, ChatGPT-4 provided the correct top diagnosis in 15% vs 27% in the rheumatologists' analysis; the correct diagnosis was among the top 3 in 46% of the ChatGPT-4 group vs 45% of the rheumatologist group. If only the first diagnostic suggestion was considered, ChatGPT-4 correctly classified 58% of cases as IRD compared to 56% for the rheumatologists (p = 0.52). ChatGPT-4 showed a slightly higher accuracy for the top 3 overall diagnoses compared to the rheumatologists' assessment. ChatGPT-4 was able to provide the correct differential diagnosis in a relevant number of cases and achieved better sensitivity for detecting IRDs than the rheumatologists, at the cost of lower specificity. The pilot results highlight the potential of this new technology as a triage tool for the diagnosis of IRD.
Introduction
Recent diagnostic and therapeutic advances in rheumatology are still counterbalanced by a shortage of specialists [1], resulting in significant diagnostic delays [2]. An early and correct diagnosis is, however, essential to prevent persistent joint damage.
In this context, artificial intelligence applications including patient-facing symptom checkers represent a field of interest and could facilitate patient triage and accelerate diagnosis [3, 4]. In 2022, we were able to show that the symptom-checker Ada had a significantly higher diagnostic accuracy than physicians in the evaluation of rheumatological case vignettes [5].
Currently, the introduction of large language models (LLM) such as ChatGPT has raised expectations for their use in medicine [6]. The impact of ChatGPT arises from its ability to engage in conversations and its performance, which is close to or on par with human capabilities in various cognitive tasks [7]. For instance, ChatGPT has achieved satisfactory scores on the United States Medical Licensing Examination [8], and some authors suggest that LLM applications might be suitable for clinical, educational, or research environments [9, 10].
Interestingly, pre-clinical studies suggest that this technology could also be used in the diagnostic process [11, 12] to distinguish inflammatory rheumatic from other diseases.
We therefore aimed to assess the diagnostic accuracy of ChatGPT-4 in comparison to a previous analysis including physicians and symptom checkers regarding rheumatic and musculoskeletal diseases (RMDs).
Methods
For the analysis, the data set of Gräf et al. [5] was used, with minor updates to the grouping of diagnoses in the disease classification. The symptom-checker assessments were analyzed using ChatGPT-4 and compared with the previous results of Ada and the diagnostic ranking of the blinded rheumatologists. ChatGPT-4 was instructed to name the top five differential diagnoses based on the information available from the Ada assessment (see Supplement 1).
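To make this setup concrete, the following minimal sketch shows how a top-five differential-diagnosis request to a GPT-4 class model might be issued with the OpenAI Python client. The exact prompt used in the study is given in Supplement 1; the model identifier, prompt wording, and helper function below are illustrative assumptions, not the authors' actual code.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def top_five_differentials(ada_assessment_text: str) -> str:
    # Hypothetical helper: the study prompt (Supplement 1) may differ in wording.
    response = client.chat.completions.create(
        model="gpt-4",  # assumed model identifier
        messages=[
            {"role": "system",
             "content": "You assist with rheumatological differential diagnosis."},
            {"role": "user",
             "content": ("Based on the following symptom-checker assessment, "
                         "list the five most likely differential diagnoses, "
                         "ranked from most to least likely:\n\n"
                         + ada_assessment_text)},
        ],
    )
    return response.choices[0].message.content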
All diagnostic suggestions were manually reviewed. If an inflammatory rheumatic disease (IRD) was among the top three (D3) or, for ChatGPT-4, the top five suggestions (D5), the D3 or D5 assessment was classified as IRD-positive, even if non-IRD diagnoses were also among the suggestions. Proportions of correctly classified patients were compared between the groups using McNemar's test. Classification of IRD status was additionally assessed.
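As a minimal sketch of the statistical comparison, the following Python fragment applies McNemar's test to paired correct/incorrect classifications; the data arrays are hypothetical placeholders, not the study data.

import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

# Hypothetical paired outcomes: one entry per case vignette,
# True if the correct diagnosis was listed by that rater.
chatgpt_correct = np.array([1, 0, 1, 1, 0, 1, 0, 1], dtype=bool)
rheumatologist_correct = np.array([1, 1, 0, 1, 0, 1, 1, 1], dtype=bool)

# 2x2 table of paired outcomes (rows: ChatGPT-4 correct/incorrect,
# columns: rheumatologists correct/incorrect).
table = np.array([
    [np.sum(chatgpt_correct & rheumatologist_correct),
     np.sum(chatgpt_correct & ~rheumatologist_correct)],
    [np.sum(~chatgpt_correct & rheumatologist_correct),
     np.sum(~chatgpt_correct & ~rheumatologist_correct)],
])

# Exact binomial version of McNemar's test, suitable for small paired samples.
result = mcnemar(table, exact=True)
print(f"McNemar p = {result.pvalue:.3f}")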
Results
ChatGPT-4 listed the correct diagnosis as the top diagnosis comparably often to the physicians (35% vs 39%, p = 0.30), as well as among the top 3 diagnoses (60% vs 55%, p = 0.38). In IRD-positive cases, ChatGPT-4 provided the correct top diagnosis in 71% vs 62% in the physician analysis. The correct diagnosis was among the top 3 in 86% (ChatGPT-4) vs 74% (physicians). In non-IRD cases, ChatGPT-4 provided the correct top diagnosis in 15% vs 27% in the physician analysis. The correct diagnosis was among the top 3 in non-IRD cases in 46% of the ChatGPT-4 group vs 45% of the physician group (Fig. 1).
If only the first diagnostic suggestion was considered, ChatGPT-4 correctly classified 58% of cases as IRD compared to 56% for the rheumatologists (p = 0.52). If the top 3 diagnoses were considered, ChatGPT-4 classified 36% of the cases correctly as IRD vs 52% for the rheumatologists (p = 0.01) (see Fig. 1). ChatGPT-4 suggested at least one inflammatory diagnosis for every non-IRD case.
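For illustration of how the classification metrics behind these proportions can be derived, the sketch below computes sensitivity, specificity, and overall accuracy from per-case IRD labels; the vectors are hypothetical and do not reproduce the study data.

import numpy as np

# Hypothetical per-case labels: gold-standard IRD status and the
# IRD call derived from the top suggestion (D1).
true_ird = np.array([1, 1, 1, 0, 0, 0, 1, 0], dtype=bool)
predicted_ird = np.array([1, 1, 0, 1, 0, 1, 1, 0], dtype=bool)

tp = np.sum(predicted_ird & true_ird)    # IRD cases flagged as IRD
tn = np.sum(~predicted_ird & ~true_ird)  # non-IRD cases flagged as non-IRD
fp = np.sum(predicted_ird & ~true_ird)
fn = np.sum(~predicted_ird & true_ird)

sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
accuracy = (tp + tn) / len(true_ird)
print(f"Sensitivity {sensitivity:.2f}, specificity {specificity:.2f}, accuracy {accuracy:.2f}")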
Discussion
ChatGPT-4 showed a slightly higher accuracy (60% vs 55%) for the top 3 overall diagnoses compared to the rheumatologists' assessment. It had a higher sensitivity for determining the correct IRD status than the rheumatologists, but considerably worse specificity, suggesting that ChatGPT-4 may be particularly useful for detecting IRD patients, where timely diagnosis and treatment initiation are critical. It could therefore potentially be used as a triage tool for digital pre-screening and facilitate quicker referrals of patients with suspected IRDs.
Our results are in line with those of Kanjee et al. [12] who demonstrated an accuracy of 64% for ChatGPT-4 evaluating the top 5 differential diagnoses of the New England Journal of Medicine clinicopathological conferences.
Interestingly, in the cross-sectional study of Ayers et al. [13], the authors found that chatbot responses to patient questions posted to a public social media forum were preferred over physician responses and rated significantly higher for both quality and empathy, highlighting the potential of this technology as a first point of contact and source of information for patients. In summary, ChatGPT-4 was able to provide the correct differential diagnosis in a relevant number of cases and achieved better sensitivity for detecting IRDs than the rheumatologists, at the cost of lower specificity.
Although this analysis has some shortcomings, i.e., the small sample size and the limited information (only access to the Ada assessments without further clinical data), it highlights the potential of this new technology as a triage tool that could support or even speed up the diagnosis of RMDs.
As digital self-assessment and remote care options are difficult for some patients due to limited digital health competencies [14], up-to-date studies should be conducted on how accurately patients can express their symptoms and complaints using AI and symptom-checker applications, so that we can benefit from these technologies more effectively.
Until satisfactory results are obtained, artificial intelligence could be used by GPs to support effective referral rather than diagnosis, and larger prospective studies are recommended to further evaluate the technology. Furthermore, issues such as ethics, patient consent, and data privacy in the context of artificial intelligence in medical decision-making are crucial; clear guidelines for the application of LLM technologies such as ChatGPT are needed [15].
Data availability
The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.
References
Rheumadocs und Arbeitskreis Junge Rheumatologie (AGJR), Krusche M, Sewerin P, Kleyer A, Mucke J, Vossen D et al (2019) Facharztweiterbildung quo vadis? Z Rheumatol 78(8):692–697
Miloslavsky EM, Marston B (2022) The challenge of addressing the rheumatology workforce shortage. J Rheumatol 49(6):555–557
Fuchs F, Morf H, Mohn J, Mühlensiepen F, Ignatyev Y, Bohr D (2023) Diagnostic delay stages and pre-diagnostic treatment in patients with suspected rheumatic diseases before special care consultation: results of a multicenter-based study. Rheumatol Int 43(3):495–502
Knitza J, Mohn J, Bergmann C, Kampylafka E, Hagen M, Bohr D (2021) Accuracy, patient-perceived usability, and acceptance of two symptom checkers (Ada and Rheport) in rheumatology: interim results from a randomized controlled crossover trial. Arthritis Res Ther 23(1):112
Gräf M, Knitza J, Leipe J, Krusche M, Welcker M, Kuhn S (2022) Comparison of physician and artificial intelligence-based symptom checker diagnostic accuracy. Rheumatol Int 42(12):2167–2176
Hügle T (2023) The wide range of opportunities for large language models such as ChatGPT in rheumatology. RMD Open 9(2):e003105
Thirunavukarasu AJ, Ting DSJ, Elangovan K, Gutierrez L, Tan TF, Ting DSW (2023) Large language models in medicine. Nat Med 29(8):1930–1940
Kung TH, Cheatham M, Medenilla A, Sillos C, De Leon L, Elepaño C (2023) Performance of ChatGPT on USMLE: potential for AI-assisted medical education using large language models. PLOS Digit Health. 2(2):e0000198
Thirunavukarasu AJ, Hassan R, Mahmood S, Sanghera R, Barzangi K, El Mukashfi M (2023) Trialling a large language model (ChatGPT) in general practice with the applied knowledge test: observational study demonstrating opportunities and limitations in primary care. JMIR Med Educ 9:e46599
Verhoeven F, Wendling D, Prati C (2023) ChatGPT: when artificial intelligence replaces the rheumatologist in medical writing. Ann Rheum Dis 82(8):1015–1017
Ueda D, Mitsuyama Y, Takita H, Horiuchi D, Walston SL, Tatekawa H (2023) ChatGPT’s diagnostic performance from patient history and imaging findings on the diagnosis please quizzes. Radiology 308(1):e231040
Kanjee Z, Crowe B, Rodman A (2023) Accuracy of a generative artificial intelligence model in a complex diagnostic challenge. JAMA 330(1):78
Ayers JW, Poliak A, Dredze M, Leas EC, Zhu Z, Kelley JB (2023) Comparing physician and artificial intelligence chatbot responses to patient questions posted to a public social media forum. JAMA Intern Med 183(6):589–596
de Thurah A, Bosch P, Marques A, Meissner Y, Mukhtyar CB, Knitza J (2022) EULAR points to consider for remote care in rheumatic and musculoskeletal diseases. Ann Rheum Dis 81(8):1065–1071
Tools such as ChatGPT threaten transparent science; here are our ground rules for their use (2023) Nature 613(7945):612
Funding
Open Access funding enabled and organized by Projekt DEAL. MK: Speaker fee from Ada; scientific funding: Ada. JC: Speaker fees from Janssen-Cilag, Pfizer, and Idorsia, all unrelated to this work.
Author information
Authors and Affiliations
Contributions
Conceptualization: MK and NR; data curation: all authors, formal analysis: MK and JC, and funding acquisition: not applicable. Investigation: all authors. Methodology: MK, JC, and JK; software: MK; validation: all authors; visualization: MK and JC; writing—original draft: MK and NR; writing—review and editing: all authors.
Corresponding author
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Below is the link to the electronic supplementary material.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Krusche, M., Callhoff, J., Knitza, J. et al. Diagnostic accuracy of a large language model in rheumatology: comparison of physician and ChatGPT-4. Rheumatol Int 44, 303–306 (2024). https://doi.org/10.1007/s00296-023-05464-6