In a recent study published in PLOS Digital Health, researchers evaluated how well ChatGPT, an artificial intelligence (AI) model, performs clinical reasoning on the United States Medical Licensing Exam (USMLE).
The USMLE comprises three standardized exams that students must pass to obtain medical licensure in the US.
Background
Artificial intelligence (AI) and deep learning have advanced considerably over the past decade. These technologies are now applied across several industries, from manufacturing and finance to consumer goods. However, their applications in clinical care, especially in healthcare information technology (IT) systems, remain limited.
One of the main reasons for this is the shortage of domain-specific training data. Large general-domain models are now enabling image-based AI in clinical imaging. This has led to the development of Inception-V3, a top medical imaging model that spans domains from ophthalmology and pathology to dermatology.
In the last few weeks, ChatGPT, a general (not domain-specific) Large Language Model (LLM) developed by OpenAI, has garnered attention for its exceptional potential to perform a suite of natural language tasks. It uses a novel AI algorithm that predicts a word sequence based on the context of the words that precede it.
Thus, having been trained on very large amounts of text data, it can generate plausible word sequences that resemble natural human language. People who have used ChatGPT find it capable of deductive reasoning and of developing a chain of thought.
The researchers chose the USMLE as a substrate for testing ChatGPT because they found it linguistically and conceptually rich. The test contains multifaceted clinical data (e.g., physical examination and laboratory test results) used to generate ambiguous medical scenarios with differential diagnoses.
About the study
In the present study, researchers encoded USMLE exam items in three formats. First, they encoded them as open-ended questions with variable lead-in prompts. Second, they encoded them as multiple-choice single-answer questions with no forced justification (MC-NJ). Finally, they encoded them as multiple-choice single-answer questions with a forced justification of positive and negative selections (MC-J). In this way, they assessed ChatGPT accuracy for all three USMLE steps: Step 1, Step 2CK, and Step 3.
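The three input formats can be illustrated with a minimal Python sketch. It is not the researchers' code: the `ExamItem` class, the function names, the lead-in phrase, the justification wording, and the sample item are all hypothetical and chosen only to show how one exam item might be rendered in each format.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class ExamItem:
    """Hypothetical container for one exam item (not from the study)."""
    stem: str                 # clinical vignette
    choices: Dict[str, str]   # answer letter -> answer text


def format_choices(item: ExamItem) -> str:
    # Render the answer choices one per line, e.g. "A. Asthma".
    return "\n".join(f"{letter}. {text}" for letter, text in item.choices.items())


def encode_open_ended(item: ExamItem, lead_in: str) -> str:
    # Open-ended format: answer choices are omitted and a lead-in prompt is added.
    return f"{item.stem}\n{lead_in}"


def encode_mc_nj(item: ExamItem) -> str:
    # MC-NJ: multiple choice, single answer, no forced justification.
    return f"{item.stem}\n{format_choices(item)}"


def encode_mc_j(item: ExamItem) -> str:
    # MC-J: multiple choice with a forced justification of the selected
    # answer and of each rejected answer (instruction wording is assumed).
    instruction = (
        "Select one answer. Explain why it is correct "
        "and why each of the other choices is incorrect."
    )
    return f"{item.stem}\n{format_choices(item)}\n{instruction}"


# Invented sample item, for illustration only.
item = ExamItem(
    stem="A 60-year-old smoker presents with chronic cough and dyspnea.",
    choices={"A": "Asthma", "B": "COPD", "C": "Pneumonia"},
)
open_prompt = encode_open_ended(item, "What is the most likely diagnosis?")
mc_nj_prompt = encode_mc_nj(item)
mc_j_prompt = encode_mc_j(item)
```

Keeping the stem fixed and varying only what follows it makes it easy to compare a model's answers to the same item across formats.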
Next, two physician reviewers independently arbitrated the concordance of ChatGPT across all questions and input formats. Further, they assessed its potential to enhance human learning in medical education. Two physician reviewers also examined AI-generated explanation content for novelty, nonobviousness, and validity from the perspective of medical students.
Furthermore, the researchers assessed the prevalence of insight within AI-generated explanations to quantify the density of insight (DOI). A DOI above 0.6 indicated that unique, novel, nonobvious, and valid insights were provided for more than three out of five answer choices. The high frequency of insight and moderate DOI (>0.6) indicated that a medical student might gain some knowledge from the AI output, especially when answering incorrectly.
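One reading of the DOI threshold above can be sketched in Python. This is an assumption based only on the "three out of five answer choices" description, not the study's exact formula: the function name and the per-choice boolean flags are hypothetical, and DOI is taken here as the fraction of answer choices whose explanation contains an insight.

```python
def density_of_insight(insight_flags):
    """Fraction of answer choices whose explanation contains an insight.

    insight_flags: one boolean per answer choice, as a reviewer might
    record it (an assumed data layout, not the study's).
    """
    if not insight_flags:
        raise ValueError("at least one answer choice is required")
    return sum(1 for flag in insight_flags if flag) / len(insight_flags)


# Insights found for four of five answer choices.
doi = density_of_insight([True, True, False, True, True])
exceeds_threshold = doi > 0.6  # the article's moderate-DOI cutoff
```

Under this reading, exactly three insightful choices out of five gives 0.6, which does not exceed the cutoff; at least four are needed.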
Results
ChatGPT performed at over 50% accuracy across all three USMLE examinations, exceeding the 60% USMLE pass threshold in some analyses. It is an extraordinary feat because no prior model had reached this benchmark; merely months earlier, prior models performed at 36.7% accuracy. GPT-3, an earlier iteration of the model, achieved 46% accuracy with no prompting or training, suggesting that further model tuning could yield more accurate results. AI performance will likely continue to advance as LLMs mature.
In addition, ChatGPT performed better than PubMedGPT, a similar LLM trained exclusively on biomedical literature (accuracies ~60% vs. 50.3%). ChatGPT, trained on general non-domain-specific content, may have had an advantage because it was also exposed to broader clinical content, e.g., patient-facing disease primers, which tend to be more conclusive and consistent.
Another reason why the performance of ChatGPT was more impressive is that prior models had most likely ingested many of the test inputs during training, whereas ChatGPT had not. Note that the researchers tested ChatGPT against more contemporary USMLE exams that became publicly available only in 2022. In contrast, other domain-specific language models, e.g., PubMedGPT and BioBERT, had been trained on the MedQA-USMLE dataset, publicly available since 2009.
Intriguingly, the accuracy of ChatGPT tended to increase sequentially, being lowest for Step 1 and highest for Step 3, reflecting the perception of real-world human users, who also find Step 1 subject matter difficult. This finding suggests that AI performance may be tied to human ability.
Furthermore, the researchers noted that inaccurate ChatGPT responses were driven by missing information, which led to poorer insights and indecision in the AI. Yet, it did not show an inclination towards the incorrect answer choice. In this regard, they could try to improve ChatGPT performance by merging it with other models trained on abundant and highly validated resources in the clinical domain (e.g., UpToDate).
In ~90% of outputs, ChatGPT-generated responses also offered significant insight that would be valuable to medical students. It showed a partial ability to extract nonobvious and novel concepts that might provide qualitative gains for human medical education. ChatGPT responses were also highly concordant, which the researchers treated as a proxy for usefulness in the human learning process. Thus, these outputs could help students understand the language, logic, and course of relationships encompassed within the explanation text.
Conclusions
The study provided new and surprising evidence that ChatGPT could perform several intricate tasks relevant to handling complex medical and clinical information. Although the study findings provide a preliminary protocol for arbitrating AI-generated responses with respect to insight, concordance, and accuracy, the advent of AI in medical education would require an open science research infrastructure. Such infrastructure would help standardize experimental methods and describe and quantify human-AI interactions.
AI could soon become pervasive in clinical practice, with varied applications in nearly all medical disciplines, e.g., clinical decision support and patient communication. The remarkable performance of ChatGPT has also inspired clinicians to experiment with it.
At AnsibleHealth, a chronic pulmonary disease clinic, clinicians are using ChatGPT to assist with challenging tasks, such as simplifying radiology reports to facilitate patient comprehension. More importantly, they use ChatGPT for brainstorming when facing diagnostically difficult cases.
The demand for new examination formats continues to increase. Thus, future studies should explore whether AI could help offload the human effort of writing medical tests (e.g., USMLE) by helping with the question-explanation process or, if feasible, writing whole test items autonomously.