Chatbots Often Exaggerate Science Findings: Study

Utrecht University

When summarizing scientific studies, large language models (LLMs) like ChatGPT and DeepSeek produce inaccurate conclusions in up to 73% of cases, according to a new study by Uwe Peters (Utrecht University) and Benjamin Chin-Yee (Western University, Canada/University of Cambridge, UK). The researchers tested the most prominent LLMs and analyzed thousands of chatbot-generated science summaries, revealing that most models consistently produced broader conclusions than those in the summarized texts. Surprisingly, prompts for accuracy increased the problem and newer LLMs performed worse than older ones.

Almost 5,000 LLM-generated summaries analyzed

The study evaluated how accurately ten leading LLMs, including ChatGPT, DeepSeek, Claude, and LLaMA, summarize abstracts and full-length articles from top science and medical journals (e.g., Nature, Science, and Lancet). Testing LLMs over one year, the researchers collected 4,900 LLM-generated summaries. Six of ten models systematically exaggerated claims found in the original texts often in subtle but impactful ways, for instance, changing cautious, past-tense claims like "The treatment was effective in this study" to a more sweeping, present-tense version like "The treatment is effective." These changes can mislead readers into believing that findings apply much more broadly than they actually do.

Accuracy prompts backfired

Strikingly, when the models where explicitly prompted to avoid inaccuracies, they were nearly twice as likely to produce overgeneralized conclusions than when given a simple summary request. "This effect is concerning," Peters said: "Students, researchers, and policymakers may assume that if they ask ChatGPT to avoid inaccuracies, they'll get a more reliable summary. Our findings prove the opposite."

Do humans do better?

Peters and Chin-Yee also directly compared chatbot-generated to human-written summaries of the same articles. Unexpectedly, chatbots were nearly five times more likely to produce broad generalizations than their human counterparts. "Worryingly", said Peters, "newer AI models, like ChatGPT-4o and DeepSeek, performed worse than older ones."

Why are these exaggerations happening? "Previous studies found that overgeneralizations are common in science writing, so it's not surprising that models trained on these texts reproduce that pattern", Chin-Yee noted. Additionally, since human users likely often prefer LMM responses that sound helpful and widely applicable, through interactions, the models may learn to favor fluency and generality over precision, Peters suggested.

Reducing the risks

The researchers recommend using LLMs such as Claude, which had the highest generalization accuracy, setting chatbots to lower 'temperature' (the parameter fixing a chatbot's 'creativity'), and using prompts that enforce indirect, past-tense reporting in science summaries. Finally, "If we want AI to support science literacy rather than undermine it," Peters said, "we need more vigilance and testing of these systems in science communication contexts."

/Public Release. This material from the originating organization/author(s) might be of the point-in-time nature, and edited for clarity, style and length. Mirage.News does not take institutional positions or sides, and all views, positions, and conclusions expressed herein are solely those of the author(s).View in full here.