A new study found that 34% of the chatbot’s answers included 1 or more recommendations that did not align with clinical guidelines.
Artificial intelligence chatbots built on large language models, such as ChatGPT, do not reliably provide patients with accurate cancer treatment recommendations, according to a new research letter published in JAMA Oncology.1
Large language models are a type of artificial intelligence trained on large amounts of text data that can mimic human-like responses when prompted with questions.2 Many patients now use the internet to find information about cancer treatments, and some are bound to turn to chatbots for answers, which could expose them to misinformation.
“Patients should feel empowered to educate themselves about their medical conditions, but they should always discuss with a clinician, and resources on the Internet should not be consulted in isolation,” Danielle Bitterman, MD, a corresponding author on the study, said in a release.3 “ChatGPT responses can sound a lot like a human and can be quite convincing. But, when it comes to clinical decision-making, there are so many subtleties for every patient’s unique situation.”
A team of investigators from Brigham and Women’s Hospital conducted an observational study to assess how well a large language model chatbot provided prostate, lung, and breast cancer treatment recommendations that aligned with the National Comprehensive Cancer Network’s guidelines.
The researchers created 4 prompt templates for asking the chatbot about cancer treatment recommendations. Each template was applied to 26 diagnosis descriptions, yielding 104 prompts in total. To assess concordance with the guidelines, 5 scoring criteria were developed.
The chatbot’s answers were each assessed by 3 board-certified oncologists, with the majority opinion taken as the final score.
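The design described above, 4 templates crossed with 26 diagnosis descriptions and majority-rule scoring by 3 reviewers, can be sketched in a few lines of Python. The template and diagnosis names below are placeholders, not the study’s actual prompts or criteria:

```python
from itertools import product
from collections import Counter

# Placeholder templates and diagnoses standing in for the study's materials:
# 4 prompt templates x 26 diagnosis descriptions = 104 prompts.
templates = [f"template_{i}" for i in range(1, 5)]
diagnoses = [f"diagnosis_{j}" for j in range(1, 27)]
prompts = list(product(templates, diagnoses))
print(len(prompts))  # 104

def majority_score(scores):
    """Return the score assigned by the majority of the 3 reviewers."""
    return Counter(scores).most_common(1)[0][0]

# 3 oncologists score a response; the majority opinion is the final score.
print(majority_score(["concordant", "concordant", "nonconcordant"]))  # concordant
```

This mirrors the arithmetic in the text: 4 variations per diagnosis across 26 diagnoses gives the 104 prompts the investigators report.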
Investigators found that the chatbot gave at least 1 recommendation for 98% of prompts, and all of those responses included at least 1 recommendation that aligned with the National Comprehensive Cancer Network’s guidelines. However, 34% also included 1 or more recommendations that did not align with the guidelines.
Additionally, 12.5% of the chatbot’s answers were hallucinated, meaning they were not part of any recommended treatment. The hallucinations mostly involved localized treatment of advanced disease, targeted therapy, or immunotherapy.
“It is an open research question as to the extent large language models provide consistent logical responses as oftentimes ‘hallucinations’ are observed,” Shan Chen, MS, first author on the study, said in a release. “Users are likely to seek answers from the large language models to educate themselves on health-related topics—similarly to how Google searches have been used. At the same time, we need to raise awareness that large language models are not the equivalent of trained medical professionals.”