In a recent study posted to the medRxiv* preprint server, researchers in the United States assessed the performance of three general-purpose large language models (LLMs), ChatGPT (GPT-3.5), GPT-4, and Google Bard, on higher-order questions representative of the American Board of Neurological Surgery (ABNS) oral board examination. They also examined how the models' performance and accuracy varied with question characteristics.

Study: Performance of ChatGPT, GPT-4, and Google Bard on a Neurosurgery Oral Boards Preparation Question Bank.

*Important notice: medRxiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as conclusive, used to guide clinical practice or health-related behavior, or treated as established information.

Background

All three LLMs assessed in this study have shown the capability to pass medical board exams with multiple-choice questions. However, no previous studies have tested or compared the performance of multiple LLMs on predominantly higher-order questions from a high-stakes medical subspecialty domain, e.g., neurosurgery.

A prior study showed that ChatGPT passed a 500-question module imitating the neurosurgery written board exams with a score of 73.4%. Its updated model, GPT-4, became available for public use on March 14, 2023, and similarly attained passing scores in >25 standardized exams. Studies documented that GPT-4 showed >20% performance improvements on the United States Medical Licensing Exam (USMLE).

Another artificial intelligence (AI)-based chatbot, Google Bard, had real-time web crawling capabilities and could therefore offer more contextually relevant information while generating responses for standardized exams in the fields of medicine, business, and law. The ABNS neurosurgery oral board examination, considered a more rigorous assessment than its written counterpart, is taken by doctors two to three years after residency graduation. It comprises three sessions of 45 minutes each, and its pass rate has not exceeded 90% since 2018.

About the study

In the present study, researchers assessed the performance of GPT-3.5, GPT-4, and Google Bard on a 149-question module imitating the neurosurgery oral board exam.

The Self-Assessment Neurosurgery Exam (SANS) indications exam covers relatively difficult topics, such as neurosurgical indications and interventional decision-making. The team posed the questions in a single-best-answer multiple-choice format. Since none of the three LLMs currently accepts multimodal input, the researchers tracked ‘hallucinations’, scenarios where an LLM asserts inaccurate facts it falsely believes are correct, in responses to questions involving medical imaging data. In all, 51 questions incorporated imaging into the question stem.

Furthermore, the team used linear regression to assess correlations between performance on different question categories. They assessed variations in performance using chi-squared tests, Fisher’s exact tests, and univariable logistic regression, where p<0.05 was considered statistically significant.

Study findings

On a 149-question bank of mainly higher-order diagnostic and management multiple-choice questions designed for neurosurgery oral board exams, GPT-4 attained a score of 82.6% and outperformed ChatGPT’s score of 62.4%. Additionally, GPT-4 demonstrated a markedly better performance than ChatGPT in the Spine subspecialty (90.5% vs. 64.3%).
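As a rough illustration of the kind of comparison reported above, the sketch below runs a chi-squared test on a 2x2 table of correct and incorrect answers. The counts 123/149 and 93/149 are assumptions back-calculated from the reported 82.6% and 62.4%, the function name `chi_squared_2x2` is hypothetical, and the article does not state which test the authors applied to this particular comparison.

```python
from math import erfc, sqrt


def chi_squared_2x2(a, b, c, d):
    """Pearson chi-squared test (no continuity correction) for [[a, b], [c, d]].

    Returns the test statistic and the p-value for 1 degree of freedom.
    """
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With 1 degree of freedom, P(X > chi2) equals erfc(sqrt(chi2 / 2)).
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value


TOTAL = 149
# Assumed counts, back-calculated from the reported percentages.
gpt4_correct = 123     # 123/149 rounds to 82.6%
chatgpt_correct = 93   # 93/149 rounds to 62.4%

chi2, p_value = chi_squared_2x2(
    gpt4_correct, TOTAL - gpt4_correct,
    chatgpt_correct, TOTAL - chatgpt_correct,
)
print(f"chi2 = {chi2:.2f}, p = {p_value:.2g}")
```

Under these assumed counts, the p-value falls well below the 0.05 threshold used in the study, which is consistent with GPT-4 outperforming ChatGPT on this question bank.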

Google Bard generated correct responses for 44.2% (66/149) of questions. While it generated incorrect responses to 45% (67/149) of questions, it declined to answer 10.7% (16/149) of questions. GPT-3.5 and GPT-4 never declined to answer a text-based question, whereas Bard declined to answer 14 text-based questions. In fact, GPT-4 outperformed Google Bard in all categories and demonstrated improved performance in question categories for which ChatGPT showed lower accuracy. Interestingly, while GPT-4 performed better on imaging-related questions than ChatGPT (68.6% vs. 47.1%), its performance was comparable to that of Google Bard (68.6% vs. 66.7%).
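The imaging-question comparison can be sketched with Fisher's exact test, one of the tests the study used. This is a minimal pure-Python version, not the authors' code: the function name `fisher_exact_two_sided` is hypothetical, and the counts 35/51, 34/51, and 24/51 are assumptions back-calculated from the reported 68.6%, 66.7%, and 47.1% on the 51 imaging questions.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_obs = prob(a)
    # Sum the probabilities of every table no more likely than the observed one.
    total = sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_obs * (1 + 1e-9)
    )
    return min(1.0, total)


IMAGING_TOTAL = 51
# Assumed counts of correct answers, back-calculated from the percentages.
gpt4, bard, chatgpt = 35, 34, 24

p_gpt4_vs_bard = fisher_exact_two_sided(
    gpt4, IMAGING_TOTAL - gpt4, bard, IMAGING_TOTAL - bard
)
p_gpt4_vs_chatgpt = fisher_exact_two_sided(
    gpt4, IMAGING_TOTAL - gpt4, chatgpt, IMAGING_TOTAL - chatgpt
)
print(f"GPT-4 vs. Bard:    p = {p_gpt4_vs_bard:.3f}")
print(f"GPT-4 vs. ChatGPT: p = {p_gpt4_vs_chatgpt:.3f}")
```

With only 51 imaging questions, a one-question gap such as the assumed 35 vs. 34 yields a large p-value, which matches the description of GPT-4 and Bard as comparable on this subset.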

Notably, GPT-4 showed reduced rates of hallucination and the ability to navigate challenging concepts like declaring medical futility. However, it struggled in other scenarios, such as factoring in patient-level characteristics, e.g., frailty.

Conclusions

There is an urgent need to develop more trust in LLM systems; thus, rigorous validation of their performance on increasingly higher-order and open-ended scenarios should continue. Such validation would help ensure the safe and effective integration of these LLMs into clinical decision-making processes.

Methods to quantify and understand hallucinations remain vital, and eventually, only those LLMs that minimize and recognize hallucinations would be incorporated into clinical practice. Further, the study findings underscore the urgent need for neurosurgeons to stay informed on emerging LLMs and their varying performance levels for potential clinical applications.

Multiple-choice examination patterns might become obsolete in medical education, while verbal assessments will gain more importance. With advancements in the AI domain, neurosurgical trainees might use and depend on LLMs for board preparation. For instance, LLM-generated responses might provide new clinical insights. LLMs could also serve as a conversational aid to rehearse various clinical scenarios on challenging topics for the boards.

Journal reference:

Preliminary scientific report.
Performance of ChatGPT, GPT-4, and Google Bard on a Neurosurgery Oral Boards Preparation Question Bank, Rohaid Ali, Oliver Y. Tang, Ian D. Connolly, Jared S. Fridley, John H. Shin, Patricia L. Zadnik Sullivan, Deus Cielo, Adetokunbo A. Oyelese, Curtis E. Doberstein, Albert E. Telfeian, Ziya L. Gokaslan, Wael F. Asaad, medRxiv preprint 2023.04.06.23288265; DOI: https://doi.org/10.1101/2023.04.06.23288265, https://www.medrxiv.org/content/10.1101/2023.04.06.23288265v1