AI Remains Lacking In Clinical Reasoning Abilities, According To Study Of 21 Large Language Models

Mass General Brigham

Despite the increasing use of artificial intelligence (AI) in health care, a new study led by Mass General Brigham researchers from the MESH Incubator shows that generative AI models continue to fall short in their clinical reasoning capabilities.

By asking 21 different large language models (LLMs) to play doctor in a series of clinical scenarios, the researchers showed that LLMs often fail at navigating diagnostic workups and coming up with a testable list of potential or "differential" diagnoses. Though all tested LLMs arrived at a correct final diagnosis more than 90% of the time when provided with all pertinent information in a patient case, they consistently performed poorly at the earlier, reasoning-driven steps of the diagnostic process, according to the results published in JAMA Network Open.

"Despite continued improvements, off-the-shelf large language models are not ready for unsupervised clinical-grade deployment," said corresponding author Marc Succi, MD, executive director of the MESH Incubator at Mass General Brigham. "Differential diagnoses are central to clinical reasoning and underlie the 'art of medicine' that AI cannot currently replicate. The promise of AI in clinical medicine continues to lie in its potential to augment, not replace, physician reasoning, provided all the relevant data is available, which is not always the case."

This new research is a follow-up to previous work led by Succi's MESH group in which researchers evaluated ChatGPT-3.5's ability to accurately diagnose a series of clinical vignettes.

In the new study, the researchers developed a novel and more holistic measure of LLM performance that looks beyond accuracy, called PrIME-LLM, which evaluates a model's competency across different stages of clinical reasoning: coming up with potential diagnoses, conducting appropriate tests, arriving at a final diagnosis, and managing treatment. When models perform well in one area but poorly in another, this imbalance is reflected in the PrIME-LLM score, as opposed to averaging competency across tasks, which may mask areas of weakness, according to the researchers.
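To illustrate the difference between a plain average and an imbalance-sensitive score, here is a minimal sketch in Python. The actual PrIME-LLM formula is not described in this release; the stage names, the example scores, and the spread-based penalty below are all illustrative assumptions, not the study's method.

```python
# Hypothetical sketch: why averaging can mask a weak reasoning stage,
# and one possible way an imbalance-sensitive score could differ.
# NOTE: this is NOT the actual PrIME-LLM formula, which is not
# described in this press release.

def mean_score(stage_scores):
    """Plain average: a weak stage can hide behind strong ones."""
    return sum(stage_scores.values()) / len(stage_scores)

def imbalance_sensitive_score(stage_scores):
    """Illustrative alternative: subtract the spread between the
    best and worst stage from the average (floored at zero), so
    uneven performance lowers the overall score."""
    scores = list(stage_scores.values())
    spread = max(scores) - min(scores)
    return max(0.0, mean_score(stage_scores) - spread)

# Hypothetical stage scores mirroring the pattern in the study:
# strong final diagnosis, weak differential diagnosis.
stages = {
    "differential_diagnosis": 0.40,
    "diagnostic_workup": 0.85,
    "final_diagnosis": 0.95,
    "treatment_management": 0.90,
}

print(round(mean_score(stages), 3))                 # 0.775
print(round(imbalance_sensitive_score(stages), 3))  # 0.225
```

A model that aces the final diagnosis but stumbles at the differential step looks fine under the average (0.775) but poor under the imbalance-sensitive variant (0.225), which is the general property the researchers attribute to PrIME-LLM.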

The study compared 21 general-purpose LLMs, including the latest models of ChatGPT, DeepSeek, Claude, Gemini, and Grok at the time of submission. The researchers tested the models' ability to work through 29 published clinical cases. To simulate the way that clinical cases unfold, the researchers gradually fed the models information, beginning with basics like a patient's age, gender and symptoms before adding physical examination findings and laboratory results. The LLMs' performance at each stage was assessed by medical student evaluators, and these evaluations were used to calculate the models' overall PrIME-LLM scores.

In line with their previous study, the researchers found that the LLMs were good at producing accurate final diagnoses. However, all of the models failed to produce an appropriate differential diagnosis more than 80% of the time. In the real world, a differential diagnosis is critical, but in this study, the models were given more information so that they could proceed to the next stage of the clinical workup even if they failed at the differential diagnosis step.

"By evaluating LLMs in a stepwise fashion, we move past treating them like test-takers and put them in the position of a doctor," said Arya Rao, lead author, MESH researcher, and MD-PhD student at Harvard Medical School. "These models are great at naming a final diagnosis once the data is complete, but they struggle at the open-ended start of a case, when there isn't much information."

Most of the LLMs showed improved accuracy when provided with laboratory results and imaging in addition to text. More recently released models generally outperformed older models, showing that LLMs are improving incrementally. The models' PrIME-LLM scores ranged from 64% for Gemini 1.5 Flash to 78% for Grok 4 and GPT-5.

According to Succi, PrIME-LLM represents a standardized way to evaluate AI's clinical competency that could be used by AI developers and hospital leaders to benchmark new technologies as they are released.

"We want to help separate the hype from the reality of these tools as they apply to health care," he said. "Our results reinforce that large language models in healthcare continue to require a 'human in the loop' and very close oversight."

Authorship: In addition to Succi, Mass General Brigham authors include Arya S. Rao, Kaiz P. Esmail, Richard S. Lee, Sharon Jiang, Bianca Arraiza Carlo, Jasleen Gill, Praneet Khanna, Ezra Kalmowitz, Basile Montagnese, Kimia Heydari, Qiao Jiao, Ethan Bott, Dan Nguyen, Grace Wang, Michael Hood, Adam B. Landman.

Disclosures: Landman is a consultant on the Abbott Medical Device Cybersecurity Council (unrelated to the current work).

Funding: Rao is supported in part by award Number T32GM144273 from the National Institute of General Medical Sciences. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institute of General Medical Sciences or the National Institutes of Health. The funding organization was not involved in the design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the manuscript; and decision to submit the manuscript for publication.

Paper cited: Rao A, et al. "Large Language Model Performance and Clinical Reasoning Tasks." JAMA Network Open. DOI: 10.1001/jamanetworkopen.2026.4003
