Please use this identifier to cite or link to this item: https://www.um.edu.mt/library/oar/handle/123456789/139845
Title: Poor performance of large language models based on the diabetes and endocrinology specialty certificate examination of the United Kingdom
Authors: Fan, Ka Siu
Gan, Jeffrey
Zou, Isabelle X.
Kaladjiska, Maja
Borg Inguanez, Monique
Garden, Gillian L.
Keywords: Artificial intelligence -- Medical applications
Modeling languages (Computer science)
Medical education -- United Kingdom
Endocrinology -- Examinations, questions, etc.
Diabetes -- Examinations, questions, etc.
Educational tests and measurements -- Evaluation
Issue Date: 2025
Publisher: Springer Nature
Citation: Fan, K. S., Gan, J., Zou, I. X., Kaladjiska, M., Borg Inguanez, M., & Garden, G. L. (2025). Poor performance of large language models based on the diabetes and endocrinology specialty certificate examination of the United Kingdom. Cureus, 17(10), e93960, 1-11.
Abstract: Introduction: The medical knowledge of large language models (LLMs) has been tested using several postgraduate medical examinations, but it has rarely been examined in diabetes and endocrinology. This study aimed to evaluate the performance of LLMs in answering multiple-choice questions from the Diabetes and Endocrinology Specialty Certificate Examination (SCE) of the United Kingdom.
Methods: The official Diabetes and Endocrinology SCE sample questions were used to assess seven freely accessible and subscription-based commercial LLMs: ChatGPT-o1 Preview (OpenAI, USA), ChatGPT-4o (OpenAI, USA), Gemini (Google, USA), Claude 3.5 Sonnet (Anthropic, USA), Copilot (Microsoft, USA), Perplexity AI (Perplexity, USA), and Meta AI (Meta, USA). The accuracy of each LLM was calculated by comparing its outputs against the sample answers. Readability metrics, including the Flesch Reading Ease Score (FRES) and the Flesch-Kincaid Grade Level (FKGL), were calculated for each response. Eighty-three questions, three of which included photographs, were entered into the LLMs without any prompt engineering.
Results: A total of 581 responses were generated and captured between August and October 2024. Performance differed significantly between models, with ChatGPT-o1 Preview achieving the highest accuracy (73%). None of the other LLMs reached the historical pass mark of 65%, and Gemini achieved the lowest accuracy (33%). Readability metrics also differed significantly between LLMs (p=0.004), and the LLMs performed better on questions without reference ranges (p<0.001).
Conclusions: The performance of LLMs on the diabetes and endocrinology examination was generally inadequate. Of the models tested, ChatGPT-o1 Preview achieved the highest score and is likely the most useful for supporting medical education, possibly because it is an advanced reasoning model with a greater ability to solve complex problems. Nonetheless, continued research is needed to keep pace with advances in LLMs.
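For reference, the two readability metrics named in the abstract have standard definitions. The formulas below are a sketch of the conventional Flesch scores; whether the study used exactly these constants is an assumption, as the abstract does not state them:

FRES = 206.835 - 1.015 \times \frac{\text{total words}}{\text{total sentences}} - 84.6 \times \frac{\text{total syllables}}{\text{total words}}

FKGL = 0.39 \times \frac{\text{total words}}{\text{total sentences}} + 11.8 \times \frac{\text{total syllables}}{\text{total words}} - 15.59

A higher FRES indicates easier reading, while the FKGL expresses readability as an approximate US school grade level.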
URI: https://www.um.edu.mt/library/oar/handle/123456789/139845
Appears in Collections: Scholarly Works - FacSciSOR