Traditional Systems Beat Generative AI in Diagnoses

Mass General Brigham

Medical professionals have been using artificial intelligence (AI) to streamline diagnoses for decades through what are called diagnostic decision support systems (DDSSs). Computer scientists at Massachusetts General Hospital (MGH), a founding member of the Mass General Brigham healthcare system, developed the hospital's own DDSS, DXplain, in 1984. The system draws on thousands of disease profiles, clinical findings, and data points to generate and rank potential diagnoses for use by clinicians. With the popularization and increased accessibility of generative AI and large language models (LLMs) in medicine, investigators at MGH's Laboratory of Computer Science (LCS) set out to compare the diagnostic capabilities of DXplain, which has evolved over the past four decades, with those of popular LLMs.

Their new research compares ChatGPT, Gemini, and DXplain at diagnosing patient cases, revealing that DXplain performed somewhat better, but the LLMs also performed well. The investigators envision pairing DXplain with an LLM as the optimal way forward, as it would improve both systems and enhance their clinical efficacy. The results are published in JAMA Network Open.

"Amid all the interest in large language models, it's easy to forget that the first AI systems used successfully in medicine were expert systems like DXplain," said co-author Edward Hoffer, MD, of the LCS at MGH.

"These systems can enhance and expand clinicians' diagnoses, recalling information that physicians may forget in the heat of the moment and isn't biased by common flaws in human reasoning. And now, we think combining the powerful explanatory capabilities of existing diagnostic systems with the linguistic capabilities of large language models will enable better automated diagnostic decision support and patient outcomes," said corresponding author Mitchell Feldman, MD, also of MGH's LCS.

The investigators tested the diagnostic capabilities of DXplain, ChatGPT, and Gemini on 36 patient cases spanning racial, ethnic, age, and gender categories. For each case, each system was asked to suggest potential diagnoses both with and without laboratory data. With lab data, all three systems listed the correct diagnosis most of the time: in 72% of cases for DXplain, 64% for ChatGPT, and 58% for Gemini. Without lab data, DXplain listed the correct diagnosis in 56% of cases, outperforming ChatGPT (42%) and Gemini (39%), though the differences were not statistically significant.
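To see why gaps of this size can fall short of significance in a 36-case sample, consider a quick back-of-the-envelope check. The counts below are reconstructed from the reported percentages, and the Fisher exact test is only an illustrative choice, not necessarily the analysis the study's authors used.

```python
# Illustrative only: counts reconstructed from the reported percentages
# (72% and 58% of 36 cases); the paper's actual analysis may differ.
from scipy.stats import fisher_exact

table = [[26, 36 - 26],   # DXplain: correct vs. missed, with lab data
         [21, 36 - 21]]   # Gemini:  correct vs. missed, with lab data
odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(f"odds ratio = {odds_ratio:.2f}, p = {p_value:.2f}")
# p is roughly 0.3, well above the conventional 0.05 threshold
```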

The researchers observed that each system caught certain diagnoses the others missed, suggesting there may be promise in combining the approaches. Preliminary work building on these findings suggests that LLMs could be used to extract clinical findings from narrative text, which could then be fed into DDSSs, in turn improving both systems and their diagnostic conclusions.
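In outline, such a hybrid pipeline has two stages: the LLM converts a free-text note into controlled finding terms, and the DDSS scores candidate diagnoses against its disease profiles. The sketch below is a minimal illustration of that flow, assuming a toy knowledge base and a keyword stand-in for the LLM step; none of the names reflect DXplain's actual interface or the study's implementation.

```python
# A minimal sketch of the hybrid LLM + DDSS idea. The knowledge base and
# all names here are hypothetical, not DXplain's real data or API.

# Toy disease profiles (finding -> weight), standing in for a DDSS knowledge base.
DISEASE_PROFILES = {
    "pneumonia": {"fever": 2.0, "cough": 2.0, "dyspnea": 1.0},
    "pulmonary embolism": {"dyspnea": 2.0, "tachycardia": 1.5, "pleuritic pain": 1.5},
}

def extract_findings(note: str) -> set[str]:
    """Stand-in for the LLM step: map narrative text onto controlled finding
    terms. A real pipeline would prompt an LLM to do this extraction."""
    vocabulary = {"fever", "cough", "dyspnea", "tachycardia", "pleuritic pain"}
    return {term for term in vocabulary if term in note.lower()}

def rank_diagnoses(findings: set[str]) -> list[tuple[str, float]]:
    """DDSS-style scoring: sum the weights of matched findings per disease
    and rank candidates from most to least supported."""
    scores = {
        disease: sum(w for f, w in profile.items() if f in findings)
        for disease, profile in DISEASE_PROFILES.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

note = "58-year-old presenting with fever, productive cough, and mild dyspnea."
print(rank_diagnoses(extract_findings(note)))
# [('pneumonia', 5.0), ('pulmonary embolism', 2.0)]
```

The design point is the division of labor: the LLM handles language, turning unstructured notes into structured findings, while the DDSS contributes the curated knowledge base and transparent, explainable ranking.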

Authorship: Additional Mass General Brigham authors include Jared Conley, Jaime Chang, Jeanhee Chung, Michael Jernigan, William Lester, Zachary Strasser, and Henry Chueh.

Disclosures: None.

Funding: This work was supported by the National Center for Advancing Translational Sciences of the NIH (UM1TR004408), awarded through Harvard Catalyst.

Paper cited: Feldman M, et al. "Dedicated AI Expert System vs Generative AI With Large Language Model for Clinical Diagnoses." JAMA Network Open. DOI: 10.1001/jamanetworkopen.2025.12994
