In a recent article published in JAMA Oncology, researchers evaluate whether artificial intelligence (AI) chatbots powered by large language models (LLMs) can provide accurate and reliable cancer treatment recommendations.

Study: Use of Artificial Intelligence Chatbots for Cancer Treatment Information.

Background

LLMs have shown promise in encoding clinical data and making diagnostic recommendations, and some of these systems have recently passed the United States Medical Licensing Examination (USMLE). Likewise, the OpenAI application ChatGPT, which is part of the generative pre-trained transformer (GPT) family of models, has been used to identify potential research topics, as well as to update physicians, nurses, and other healthcare professionals on recent developments in their respective fields.

LLMs can also mimic human dialogue and provide prompt, detailed, and coherent responses to queries. However, in some cases, LLMs might provide less reliable information, which could misguide people who often use AI for self-education. Even when these systems are trained on reliable and high-quality data, AI is still vulnerable to biases, which limits its applicability in medicine.

Researchers predict that general users might turn to an LLM chatbot for cancer-related medical guidance. Thus, a chatbot that gives a seemingly correct but wrong or less accurate response about cancer diagnosis or treatment might misguide the person and generate and amplify misinformation.

About the study

In the present study, researchers evaluate the performance of an LLM chatbot in providing prostate, lung, and breast cancer treatment recommendations in agreement with National Comprehensive Cancer Network (NCCN) guidelines.

Since the knowledge cutoff of the LLM chatbot was September 2021, the researchers used the 2021 NCCN guidelines as the reference for treatment recommendations.

Four zero-shot prompt templates were developed and used to create four variations of each of 26 cancer diagnosis descriptions, for a final total of 104 prompts. These prompts were subsequently provided as input to GPT-3.5 through the ChatGPT interface.
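The prompt set described above is a simple cross product of templates and diagnosis descriptions. The following is a minimal sketch of that construction. The template and diagnosis strings are hypothetical placeholders, since the article does not quote the wording used in the study.

```python
from itertools import product

# Hypothetical placeholders: the article does not quote the study's actual
# prompt templates or cancer diagnosis descriptions.
templates = [f"Template {i}: {{diagnosis}}" for i in range(1, 5)]   # 4 templates
diagnoses = [f"diagnosis description {i}" for i in range(1, 27)]    # 26 descriptions

# Pair every template with every diagnosis description.
prompts = [t.format(diagnosis=d) for t, d in product(templates, diagnoses)]

print(len(prompts))  # 104
```

Each of the 26 descriptions therefore appears in four differently worded prompts, which is what allows the scores to be compared across phrasings.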

The study team comprised four board-certified oncologists, three of whom assessed the concordance of the chatbot output with the 2021 NCCN guidelines based on five scoring criteria developed by the researchers. The majority rule was used to determine the final score.

The fourth oncologist helped the other three resolve disagreements, which primarily arose when the LLM chatbot output was unclear. For example, the LLM did not always specify which treatments to combine for a specific type of cancer.
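The scoring procedure described above can be sketched as a majority vote with an adjudicator as fallback. This is an illustrative sketch only: the function name, the label strings, and the rule that the fourth oncologist's label is used when no two annotators agree are assumptions, not details reported in the article.

```python
from collections import Counter

def majority_score(scores, adjudicator=None):
    """Return the label chosen by at least two of the three annotators.

    If no label reaches a majority, fall back to the adjudicator's label
    (an assumed rule; the article only says a fourth oncologist helped
    resolve disagreements).
    """
    label, count = Counter(scores).most_common(1)[0]
    if count >= 2:
        return label
    return adjudicator

# Hypothetical labels for a single scoring criterion.
print(majority_score(["concordant", "concordant", "non-concordant"]))
```

With binary labels, three annotators always produce a majority, so the fallback matters only if a criterion allows more than two possible scores.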

Study findings

A total of 104 unique prompts scored on five criteria yielded 520 scores, of which all three annotators agreed on 322 (61.9%). Furthermore, the LLM chatbot provided at least one recommendation for 98% of prompts.

All responses with a treatment recommendation included at least one NCCN-concordant treatment. However, 35 of the 102 outputs also recommended one or more non-concordant treatments. In 34.6% of cancer diagnosis descriptions, all four prompt templates were given the same scores on all five scoring criteria.
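The counts reported above can be checked with simple arithmetic. The short sketch below uses only figures stated in the article; the variable names are illustrative.

```python
# Figures as reported in the article.
n_prompts = 4 * 26                 # 4 templates x 26 diagnosis descriptions
total_scores = n_prompts * 5       # 5 scoring criteria per prompt

agreement_rate = 322 / total_scores       # scores with full annotator agreement
recommendation_rate = 102 / n_prompts     # outputs with at least one recommendation
non_concordant_rate = 35 / 102            # outputs with a non-concordant treatment

print(total_scores)                   # 520
print(f"{agreement_rate:.1%}")        # 61.9%
print(f"{recommendation_rate:.1%}")   # 98.1%
print(f"{non_concordant_rate:.1%}")   # 34.3%
```

The last figure shows that roughly one-third of the outputs containing a recommendation included at least one treatment that was not concordant with the guidelines.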

Over 12% of chatbot responses included treatments that were not NCCN-recommended. These responses, which were described as ‘hallucinations’ by the researchers, were primarily immunotherapy, localized treatment of advanced disease, or other targeted therapies.

LLM chatbot recommendations also varied with the way the researchers phrased their questions. In some cases, the chatbot yielded unclear output, which led to disagreements among the three annotators.

Other disagreements arose due to varying interpretations of NCCN guidelines. Nevertheless, these disagreements highlighted the difficulty of reliably interpreting LLM output, especially descriptive output.

Conclusions

The LLM chatbot evaluated in this study mixed incorrect cancer treatment recommendations with correct ones, which made the errors difficult to detect, even for experts. Overall, about one-third (35 of 102, or 34.3%) of the responses that included a treatment recommendation were at least partially non-concordant with NCCN guidelines.

The study findings demonstrate that the LLM chatbot did not perform well in providing reliable and precise cancer treatment recommendations.

Due to the increasingly widespread use of AI, it is crucial for healthcare providers to appropriately educate their patients about the potential misinformation that this technology can provide. These findings also emphasize the importance of federal regulations for AI and other technologies that have the potential to cause harm to the general public due to their inherent limitations and inappropriate use.

Journal reference:

Chen, S., Kann, B. H., Foote, M. B., et al. (2023). Use of Artificial Intelligence Chatbots for Cancer Treatment Information. JAMA Oncology. doi:10.1001/jamaoncol.2023.2954