AI Benchmark Set for Everyday Patient Care

Mass General Brigham

Researchers at Mass General Brigham developed BRIDGE, a multilingual benchmark that evaluates how well large language models (LLMs) understand clinical patient-care text, including language used in electronic health records (EHRs), across nine languages. The benchmarking tool could help clinicians evaluate and compare LLMs for use in specific contexts. Results are published in Nature Biomedical Engineering.

"Unlike many existing medical AI benchmarks, BRIDGE focuses on real-world clinical data sources that better reflect the complexity of real-world care," said senior author Jie Yang, PhD, FACMI, FAMIA, of the Division of Pharmacoepidemiology and Pharmacoeconomics in the Mass General Brigham Department of Medicine. "BRIDGE can help clinicians select the right AI tools while guiding developers in improving model performance."

Medical LLMs have traditionally been assessed using licensing exam questions, which rely on standardized language and medical knowledge that may not fully reflect the complexity of real-world clinical interactions. The developers of BRIDGE created a framework for assessing LLMs using clinical text from EHRs, clinical case reports, and patient-doctor consultations. While the highest-performing LLM scored as high as 92 on standardized medical exams, it earned only 44.8% on BRIDGE, highlighting the LLM's gaps in understanding the nuanced clinical language used in health care settings.

Yang and colleagues, including co-senior author Joshua Lin, MD, MPH, ScD, and co-first authors Jiageng Wu and Bowen Gu, used BRIDGE to systematically evaluate the performance of 95 LLMs from 59 clinical sources on real-world clinical tasks spanning the patient care continuum. The tasks covered 14 clinical specialties and included triage, information extraction, diagnosis, prognosis, and billing coding. The team also created a public, continuously updated leaderboard (which now includes 107 LLMs), enabling clinicians and AI developers to compare LLM performance across clinical tasks.

BRIDGE also revealed that AI performance varies across medical specialties. Because the benchmark includes clinical data in nine languages, it enables researchers to identify LLM performance gaps and support the development of more accurate and equitable AI tools for non-English-speaking patients.

Authorship: In addition to Yang and Lin, Mass General Brigham authors include Jiageng Wu, Bowen Gu, Richard Wyss, Rishi J Desai, and Sebastian Schneeweiss. Additional authors include Ren Zhou, Kevin Xie, Doug Snyder, Yixing Jiang, Valentina Carducci, Emily Alsentzer, Leo Anthony Celi, Adam Rodman, Jonathan H. Chen, and Santiago Romero-Brufau.

Disclosures: Lin has received research grants from Takeda, AbbVie, and UCB for projects unrelated to this study. Alsentzer reports consultant fees from Fourier Health. Schneeweiss is participating in investigator-initiated grants to the Brigham and Women's Hospital from Boehringer Ingelheim, Takeda, and UCB unrelated to the topic of this study. He is an advisor to Aetion Inc., a software manufacturer, and to Temedica GmbH, a patient-oriented data generation company. His interests were declared, reviewed, and approved by the Brigham and Women's Hospital in accordance with their institutional compliance policies. Chen reports cofounding Reaction Explorer, which develops and licenses organic chemistry education software, and receiving medical expert witness fees from Sutton Pierce, Younker Hyde MacFarlane, Sykes McAllister, and Elite Expert; consulting fees from ISHI Health; and honoraria or travel expenses for invited presentations by insitro, General Reinsurance Corporation, Cozeva, and other industry conferences, academic institutions, and health systems.

Funding: This study was partially funded by PCORI (ME-2022C1-25646), the Goldberg Scholarship and Brigham Research Institute, the National Institute on Aging (RF1AG090405), and the National Library of Medicine (R01LM014667). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Paper cited: Wu, J. et al. "BRIDGE: Benchmarking Large Language Models for Understanding Real-world Clinical Practice Text" Nature Biomedical Engineering DOI: 10.1038/s41551-026-01719-2
