Is ChatGPT 'ready' to be a learning tool for medical undergraduates and will it perform equally in different subjects? Comparative study of ChatGPT performance in tutorial and case-based learning questions in physiology and biochemistry

W A Nathasha V Luke; Lee Seow Chong; Kenneth H Ban; Amanda H Wong; Chen Zhi Xiong; Lee Shuh Shing; Reshma Taneja; Dujeepa D Samarasekera; Celestial T Yap

doi:10.1080/0142159X.2024.2308779

Med Teach. 2024 Nov;46(11):1441-1447. doi: 10.1080/0142159X.2024.2308779. Epub 2024 Jan 31.

Authors

W A Nathasha V Luke¹, Lee Seow Chong², Kenneth H Ban², Amanda H Wong¹, Chen Zhi Xiong¹,³, Lee Shuh Shing³, Reshma Taneja¹, Dujeepa D Samarasekera³, Celestial T Yap¹

Affiliations

¹ Department of Physiology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore.
² Department of Biochemistry, Yong Loo Lin School of Medicine, National University of Singapore, Singapore.
³ Centre for Medical Education, Yong Loo Lin School of Medicine, National University of Singapore, Singapore.

PMID: 38295769
DOI: 10.1080/0142159X.2024.2308779

Abstract

Purpose: Generative AI will become an integral part of education in the future. The potential of this technology in different disciplines should be identified to promote effective adoption. This study evaluated the performance of ChatGPT in tutorial and case-based learning questions in physiology and biochemistry for medical undergraduates. The study focused mainly on the performance of GPT-3.5, while a subset of questions was used to compare GPT-3.5 and GPT-4.

Materials and methods: Answers were generated in GPT-3.5 for 44 modified essay questions (MEQs) in physiology and 43 MEQs in biochemistry. Each answer was graded by two independent examiners. Subsequently, a subset of 15 questions from each subject was selected to represent different score categories of the GPT-3.5 answers; responses to these questions were generated in GPT-4 and graded.

Results: The mean score for physiology answers was 74.7 (SD 25.96). GPT-3.5 performed significantly better (p = .009) on lower-order questions of Bloom's taxonomy than on higher-order questions. Deficiencies in applying physiological principles in a clinical context were noted as a drawback. Scores in biochemistry were lower, with a mean of 59.3 (SD 26.9) for GPT-3.5. There was no statistically significant difference between the scores for higher-order and lower-order questions of Bloom's taxonomy. The deficiencies highlighted were a lack of in-depth explanation and of precision. In the subset of questions where GPT-4 and GPT-3.5 were compared, GPT-4 responses showed better overall performance in both subjects. This difference between GPT-3.5 and GPT-4 was statistically significant in biochemistry but not in physiology.

Conclusions: The differences in performance between the two versions, GPT-3.5 and GPT-4, and across the two disciplines are noteworthy. Educators and students should understand the strengths and limitations of this technology in different fields in order to integrate it effectively into teaching and learning.
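
The scoring and group comparison summarised in the abstract can be illustrated with a short sketch. It is not the authors' analysis: the abstract does not say how the two examiners' marks were combined or which significance test produced the reported p-values. The sketch assumes the two marks are averaged per question, and it uses an exact permutation test on the difference in means as a stand-in. All function names and all marks below are hypothetical and chosen only for illustration.

```python
from itertools import combinations
from statistics import mean, stdev


def combine_examiners(marks_a, marks_b):
    """Average two examiners' marks per question (an assumed combination rule)."""
    return [(a + b) / 2 for a, b in zip(marks_a, marks_b)]


def summarize(scores):
    """Return (mean, sample SD) rounded the way the abstract reports them."""
    return round(mean(scores), 1), round(stdev(scores), 2)


def exact_permutation_p(group_x, group_y):
    """Two-sided exact permutation p-value for a difference in group means.

    Enumerates every way of splitting the pooled scores into groups of the
    original sizes, so it is only practical for small samples.
    """
    pooled = list(group_x) + list(group_y)
    n_x, n_y = len(group_x), len(group_y)
    total = sum(pooled)
    observed = abs(mean(group_x) - mean(group_y))
    extreme = 0
    splits = 0
    for idx in combinations(range(len(pooled)), n_x):
        sum_x = sum(pooled[i] for i in idx)
        diff = abs(sum_x / n_x - (total - sum_x) / n_y)
        if diff >= observed - 1e-12:  # tolerance for floating-point ties
            extreme += 1
        splits += 1
    return extreme / splits


# Hypothetical marks (percent) for three questions from two examiners.
scores = combine_examiners([80, 70, 90], [90, 70, 80])
print(summarize(scores))

# Hypothetical scores for lower-order vs higher-order Bloom's questions.
lower_order = [90, 85, 95, 80]
higher_order = [60, 55, 70, 50]
print(exact_permutation_p(lower_order, higher_order))
```

With real data, the per-question scores for each Bloom's category (or for each model version) would replace the hypothetical lists, and a test matching the study design would replace the permutation test.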

Keywords: ChatGPT; GPT-3.5; GPT-4; generative AI (artificial intelligence); LLM (large language model); physiology; biochemistry.

Publication types

Comparative Study
Research Support, Non-U.S. Gov't

MeSH terms

Biochemistry* / education
Computer-Assisted Instruction / methods
Education, Medical, Undergraduate* / methods
Educational Measurement* / methods
Humans
Learning
Physiology* / education
Problem-Based Learning / methods
Students, Medical / psychology