Oncology

GPT-4 achieves superior scores on medical oncology examination questions

By Simon Pan, Alex Chan

December 17, 2024

1. The large language model ChatGPT-4 answered 85.0% of examination-style multiple-choice questions on medical oncology correctly, a performance superior to all other large language models tested and comparable with that of medical oncology trainees.

2. Approximately 82% of incorrect answers were rated by clinicians as having a medium or high likelihood of causing moderate or severe harm if acted upon in clinical practice.

Evidence Rating Level: 2 (Good)

Study Rundown: Large language models (LLMs) may have extraordinary utility across various healthcare settings; in oncology, potential applications range from assistance with administrative tasks to clinical decision-making. This cross-sectional study therefore sought to evaluate the medical oncology knowledge of ChatGPT-3.5 (proprietary LLM 1), ChatGPT-4 (proprietary LLM 2), and several open-source LLMs. The models were evaluated on 147 medical oncology examination questions drawn from ASCO’s Oncology Self-Assessment Series, ESMO’s Examination Trial Questions, and a set of unseen original questions. Proprietary LLM 2 achieved the highest performance among all LLMs, answering 85.0% of questions correctly. However, among its incorrect answers, roughly 64% were judged to have a medium likelihood and roughly 18% a high likelihood of causing patient harm if acted upon in practice; in total, approximately 82% of incorrect answers carried a medium or high likelihood of causing moderate or severe harm. Overall, this study found that LLMs can perform well on examination-style multiple-choice medical oncology questions, but it raised safety concerns about the possible consequences of incorrect answers. As such, LLMs in medical oncology may be best applied in low-risk settings or under close human supervision, with guidelines in place to ensure their safe application in clinical practice.

Click to read the study in JAMA Network Open

Relevant Reading: Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models

In-Depth [cross-sectional study]: In recent years, the potential utility of LLMs in healthcare settings has been an important topic of investigation. LLMs have already been shown capable of passing the United States Medical Licensing Examination while demonstrating remarkable knowledge recall and reasoning abilities. However, LLM performance varies widely across medical subspecialty examinations, and performance on medical oncology examinations had not yet been characterized. This cross-sectional study therefore investigated the medical oncology knowledge of LLMs using examination-style multiple-choice questions. Proprietary LLM 1 and proprietary LLM 2 were assessed on 52 questions from ASCO, 75 questions from ESMO, and 20 original questions. Proprietary LLM 2 achieved the highest accuracy of all LLMs assessed at 85.0% (95% CI = 78.2% to 90.4%; P < 0.001 vs random answering), with similar performance across the question sets (80.8%, 95% CI = 67.5% to 90.4%, P < 0.001; 88.0%, 95% CI = 78.4% to 94.4%, P < 0.001; and 85.0%, 95% CI = 62.1% to 96.8%, P < 0.001 for the ASCO, ESMO, and original questions, respectively). Proprietary LLM 1 achieved an accuracy of 60.5% (95% CI = 50.0% to 66.4%; P < 0.001 vs random answering). Incorrect answers by proprietary LLM 2 were more common on questions requiring knowledge from recent publications (Wilcoxon test, P = 0.02), and 63.6% of its incorrect answers stemmed from incorrect knowledge recall. Among the incorrect answers of proprietary LLM 2, the likelihood of causing patient harm if the error were applied in practice was rated medium for 63.6% of incorrect answers (95% CI = 43.0% to 85.4%) and high for 18.2% (95% CI = 5.2% to 40.3%). The extent of possible harm was rated moderate for 63.6% of incorrect answers (95% CI = 43.0% to 85.4%) and severe or likely to lead to death for 18.2% (95% CI = 5.2% to 40.3%).
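For readers who want to sanity-check the headline statistic, the short sketch below shows how an accuracy of 85.0% on 147 questions yields roughly the reported confidence interval and a P value against random answering. It is a minimal illustration, not the authors' analysis code: it assumes 125 of 147 answers correct (125/147 ≈ 85.0%), four answer options per question (chance rate 0.25), and an exact binomial (Clopper-Pearson) interval, any of which may differ from the study's actual methods.

```python
# Minimal sketch (not the authors' code): sanity-check the headline
# accuracy with an exact binomial test and confidence interval.
# Assumptions not stated in the summary: 125 of 147 answers correct
# (125/147 ~= 85.0%) and four answer options per question (chance = 0.25).
from scipy.stats import binomtest

n_questions = 147   # 52 ASCO + 75 ESMO + 20 original questions
n_correct = 125     # assumed count consistent with 85.0% accuracy
chance_rate = 0.25  # assumed chance rate for 4-option multiple choice

# Two-sided exact binomial test of observed accuracy vs random answering
result = binomtest(n_correct, n_questions, p=chance_rate)
# Clopper-Pearson ("exact") 95% CI for the accuracy proportion
ci = result.proportion_ci(confidence_level=0.95, method="exact")

print(f"accuracy: {n_correct / n_questions:.1%}")     # 85.0%
print(f"95% CI:   {ci.low:.1%} to {ci.high:.1%}")     # roughly 78% to 91%
print(f"P vs random answering: {result.pvalue:.1e}")  # far below 0.001
```

The interval this produces is close to the reported 78.2% to 90.4%; any residual difference would reflect a different interval method (for example, a Wilson score interval) or a slightly different correct-answer count than the one assumed here.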

Image: PD

©2024 2 Minute Medicine, Inc. All rights reserved. No works may be reproduced without expressed written consent from 2 Minute Medicine, Inc. Inquire about licensing here. No article should be construed as medical advice and is not intended as such by the authors or by 2 Minute Medicine, Inc.