In a recent study published in the journal Scientific Reports, researchers evaluated the performance of Generative Pre-trained Transformer-4 (GPT-4) and ChatGPT on soft-skill questions from the United States (US) Medical Licensing Examination (USMLE).

Artificial intelligence (AI) is increasingly used in medical practice. Large language models (LLMs), such as GPT-4 and ChatGPT, have drawn considerable scientific attention, with multiple studies assessing their performance in medicine. Although LLMs have proven proficient in various tasks, their performance in areas that require human judgment and empathy has yet to be investigated.

The USMLE measures cognitive acuity, medical knowledge, the ability to navigate complex scenarios, patient safety, and professional, ethical, and legal judgment. The USMLE Step 2 Clinical Skills, the standard test for evaluating interpersonal and communication skills, was discontinued due to the coronavirus disease 2019 (COVID-19) pandemic. Nevertheless, the core clinical communication elements have been integrated into other steps of the USMLE.

The USMLE Step 2 Clinical Knowledge (CK) scores predict performance across domains such as communication, professionalism, teamwork, and patient care. Artificial cognitive empathy is an emerging field of interest. Understanding the capacity of AI to accurately perceive and respond to patients’ emotional states will be particularly relevant in patient-centered care and telemedicine.

Study: Comparing ChatGPT and GPT-4 performance in USMLE soft skill assessments. Image Credit: Tex vector / Shutterstock

About the study

In the present study, researchers assessed GPT-4 and ChatGPT performance on USMLE questions involving human judgment, empathy, and other soft skills. They used 80 questions designed to meet USMLE requirements, compiled from two sources. The first source was the set of USMLE sample questions for Step 1, Step 2 CK, and Step 3, available on the official USMLE website.

The sample test questions were screened, and 21 were selected because they tested professionalism, interpersonal and communication skills, cultural competence, leadership, organizational behavior, or legal and ethical issues. Questions that required medical or scientific knowledge were excluded.

Fifty-nine Step 1-, Step 2 CK-, and Step 3-type questions were identified from the second source, AMBOSS, a question bank for students and medical practitioners. The AI models were tasked with answering all questions. The prompt structure comprised the question text and multiple-choice answers.

After each response, the models were asked a follow-up question, “Are you sure?”, to test the stability and consistency of the model and to trigger a potential re-evaluation of its initial answer. A revised answer might indicate some uncertainty. The performance of the AI models and humans was compared using AMBOSS user performance statistics.
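The question-and-follow-up procedure described above can be sketched in Python as follows. This is a minimal illustration, not the researchers' code: the `Question` and `Result` types, the `build_prompt` and `run_question` helpers, and the `ask` callable that stands in for a model are all hypothetical names introduced here. Only the prompt contents (question text plus multiple-choice answers) and the follow-up wording come from the study description.

```python
from dataclasses import dataclass
from typing import Callable, List

FOLLOW_UP = "Are you sure?"  # follow-up wording reported in the study


@dataclass
class Question:
    text: str
    choices: List[str]  # e.g. ["A. ...", "B. ..."]
    correct: str        # letter of the correct choice


@dataclass
class Result:
    initial: str  # answer given to the question itself
    final: str    # answer given after the follow-up

    @property
    def revised(self) -> bool:
        # A changed answer is read as a possible sign of uncertainty.
        return self.initial != self.final


def build_prompt(q: Question) -> str:
    # Prompt = question text followed by the multiple-choice answers.
    return q.text + "\n" + "\n".join(q.choices)


def run_question(ask: Callable[[List[str]], str], q: Question) -> Result:
    # `ask` is a hypothetical stand-in for a model call: it receives the
    # conversation so far and returns an answer letter.
    messages = [build_prompt(q)]
    initial = ask(messages)
    messages += [initial, FOLLOW_UP]
    final = ask(messages)
    return Result(initial=initial, final=final)


if __name__ == "__main__":
    # Stub model that never changes its answer, for demonstration only.
    demo = Question("Example question?", ["A. first", "B. second"], "A")
    outcome = run_question(lambda messages: "A", demo)
    print(outcome.revised)
```

Swapping the stub for a real model client would require parsing the answer letter out of free-text output, which this sketch deliberately leaves out.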

Findings

The overall accuracy of ChatGPT was 62.5%. It was 66.6% accurate for the USMLE sample test and 61% for AMBOSS questions. GPT-4 showed superior performance, achieving an overall accuracy of 90%. GPT-4 answered the USMLE sample test with 100% accuracy; however, its accuracy for AMBOSS questions was 86.4%. Regardless of whether the initial response was correct, GPT-4 never changed its response when prompted to re-evaluate its initial answer.
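The overall figures follow from pooling the two question sets. The short Python check below makes the arithmetic explicit. The per-source counts of correct answers (14, 36, 21, and 51) are not stated in the article; they are back-calculated here from the reported percentages and the set sizes of 21 and 59 questions, so treat them as an assumption. Note that 14 of 21 rounds to 66.7%, while the article reports 66.6%, which appears to be truncated.

```python
# Correct-answer counts are ASSUMED: inferred from the reported percentages
# and the set sizes (21 USMLE sample questions, 59 AMBOSS questions).
counts = {
    "ChatGPT": {"USMLE sample": (14, 21), "AMBOSS": (36, 59)},
    "GPT-4": {"USMLE sample": (21, 21), "AMBOSS": (51, 59)},
}


def accuracy(correct: int, total: int) -> float:
    """Return accuracy as a percentage."""
    return 100 * correct / total


def overall(model: str) -> float:
    """Pool both question sets and return the overall accuracy in percent."""
    correct = sum(c for c, _ in counts[model].values())
    total = sum(t for _, t in counts[model].values())
    return accuracy(correct, total)


if __name__ == "__main__":
    for model, sources in counts.items():
        for source, (correct, total) in sources.items():
            print(f"{model} {source}: {accuracy(correct, total):.1f}%")
        print(f"{model} overall: {overall(model):.1f}%")
    # Overall values: ChatGPT 50/80 = 62.5%, GPT-4 72/80 = 90.0%
```

Because the AMBOSS set is almost three times larger than the USMLE sample set, the overall accuracy sits closer to the AMBOSS figure for both models.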

ChatGPT revised its initial responses for 82.5% of the questions when prompted. When ChatGPT revised an initially incorrect response, the new answer was correct 53.8% of the time. AMBOSS user statistics showed that the mean rate of correct human responses was 78% for the exact questions used in this study. On these AMBOSS questions, ChatGPT (61%) performed below the human average, whereas GPT-4 (86.4%) exceeded it.

Conclusions

In sum, the researchers tested the performance of two AI models, GPT-4 and ChatGPT, on USMLE soft-skill questions covering judgment, ethics, and empathy. Both models correctly answered most questions. However, GPT-4’s performance was superior to that of ChatGPT, as it accurately answered 90% of the questions compared to 62.5% for ChatGPT. Unlike ChatGPT, GPT-4 showed confidence in its answers and never revised its original response.

ChatGPT, on the other hand, kept its original answer for only 17.5% of the questions. The findings show that LLMs produce impressive results on questions testing the soft skills required of physicians. They indicate that GPT-4 is more capable of effectively tackling questions requiring professionalism, ethical judgment, and empathy. The inclination of ChatGPT to revise its initial responses might suggest a design emphasis on flexibility and adaptability, favoring diverse interactions.

By contrast, the consistency of GPT-4 could indicate a robust sampling mechanism or training that predisposes it to stability. GPT-4 also surpassed human performance. Notably, the re-evaluation mechanism applied in this study may not reflect the human cognitive understanding of uncertainty, because AI models operate according to calculated probabilities rather than human-like confidence.