AI chatbots provide poor answers to medical questions half the time, study finds

A study published in BMJ Open suggests that half of the answers provided by five publicly available artificial intelligence (AI)–driven chatbots in response to medically related questions are inaccurate and incomplete.

Led by a researcher from the University of California, Los Angeles, the study audited the chatbots Gemini (Google), DeepSeek (High-Flyer), Meta AI (Meta), ChatGPT (OpenAI), and Grok (xAI).

Rapid adoption despite many flaws

In February 2025, the team asked each chatbot 10 questions in each of five categories: cancer, vaccines, stem cells, nutrition, and athletic performance, for 250 responses in all. The open- and closed-ended questions were designed to resemble common information-seeking medical and health questions and information tropes found online and in academic discussion, and the researchers also prompted the chatbots to produce scientific references.

The probes were also developed to point models toward misinformation or advice counter to medical standards, a method increasingly used to “stress test” AI chatbots and detect behavioral vulnerabilities. Closed-ended questions required pre-defined responses, often with only one correct answer that agreed with scientific consensus, while open-ended questions usually required the chatbots to generate several responses in list form.
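The overall shape of such an audit is simple to sketch in code. The following is a hypothetical reconstruction of the design described above, not the study’s actual harness; ask_chatbot() stands in for whatever web or API interface was used, and the prompt wording is illustrative only.

import itertools

CHATBOTS = ["Gemini", "DeepSeek", "Meta AI", "ChatGPT", "Grok"]
CATEGORIES = ["cancer", "vaccines", "stem cells", "nutrition",
              "athletic performance"]

def run_audit(questions, ask_chatbot):
    """Pose all 10 questions in each category to each chatbot
    (5 x 5 x 10 = 250 responses) and collect answers for expert scoring."""
    responses = []
    for bot, category in itertools.product(CHATBOTS, CATEGORIES):
        for question in questions[category]:  # 10 open- and closed-ended items
            answer = ask_chatbot(bot, question + " Please cite scientific references.")
            responses.append({"chatbot": bot, "category": category,
                              "question": question, "answer": answer})
    return responses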

Two experts in each category rated the chatbot responses as nonproblematic, somewhat problematic, or highly problematic or potentially harmful. Citations were scored for accuracy and completeness, and each response was given a Flesch Reading Ease score.
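The Flesch Reading Ease score itself is a fixed formula over word, sentence, and syllable counts; higher scores mean easier text, and scores of 30 to 50 are conventionally rated “difficult” (college level). A minimal sketch, with the counts assumed to be supplied by the caller, since syllable counting itself varies by implementation:

def flesch_reading_ease(words, sentences, syllables):
    """Flesch Reading Ease:
    206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words).
    Higher is easier to read; 30-50 reads as 'difficult' (college level)."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)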

The chatbots “have been rapidly adopted across research, education, business, marketing and medicine,” the authors wrote. “Most interactions, however, come from non-experts using chatbots like search engines, including for everyday health and medical queries.”

Reference quality poor, incomplete

About half (49.6%) of responses were problematic, with 30% considered somewhat problematic and 19.6% deemed highly problematic. Response quality didn’t differ significantly by chatbot, but Grok generated significantly more highly problematic responses than would be expected under a random distribution. Gemini, on the other hand, produced the fewest highly problematic responses and the most non-problematic ones.

Chatbot performance was strongest on questions about vaccines (mean z-score, –2.57) and cancer (–2.12) and weakest on stem cells (+1.25), athletic performance (+3.74), and nutrition (+4.35).
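(A z-score standardizes an observed count against its expected value, z = (x − μ) / σ, where μ and σ are the expected mean and standard deviation; as used here, negative values mean fewer problematic responses than expected and positive values mean more.)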

Chatbot responses were consistently given with confidence and certainty, with few caveats or disclaimers; of 250 total questions, only two (0.8%), on anabolic steroids and non-traditional cancer therapies, were met with refusals to answer, both from Meta AI. Reference quality was poor, with a median completeness score of 40%.

Open-ended prompts generated 40 highly problematic responses—significantly more than expected—and 51 non-problematic responses—significantly fewer than expected. The opposite was true of closed-ended prompts. 

Chatbots rely on limited scientific content

Hallucinations and made-up citations prevented every chatbot from providing a fully accurate reference list. Response readability was scored as “difficult,” meaning complex enough that readers would need at least some college education to understand it.

“By default, chatbots do not access real-time data but instead generate outputs by inferring statistical patterns from their training data and predicting likely word sequences,” the authors noted. “They do not reason or weigh evidence, nor are they able to make ethical or value-based judgments.”
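To illustrate that point, here is a generic sketch of next-token sampling, not the internals of any audited system: a language model repeatedly converts raw scores (logits) into a probability distribution and samples the next word piece, so a fluent answer is a chain of statistically likely continuations rather than a weighed judgment.

import math
import random

def sample_next_token(logits, temperature=1.0):
    """Turn a model's raw scores into probabilities (softmax) and sample one
    token index; generation is just this step repeated until a stop token."""
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return random.choices(range(len(probs)), weights=probs, k=1)[0]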

Chatbots also base their responses in part on Q&A forums and social media while limiting scientific content to publicly available studies, which make up only 30% to 50% of published research. “While this enhances conversational fluency, it may come at the cost of scientific accuracy,” the researchers wrote.

Study limitations include the audit of only five chatbots, which limits the findings’ generalizability in a rapidly evolving field. Also, real-world chatbot queries aren’t all adversarial, so the study’s approach may have overestimated the prevalence of problematic content.

“The audited chatbots performed poorly when answering questions in misinformation-prone health and medical fields,” the researchers concluded. “Continued deployment without public education and oversight risks amplifying misinformation.”

Creator: Center for Infectious Disease Research and Policy (CIDRAP EU)
