AI may ace multiple-choice medical exams, but it still stumbles when faced with changing clinical information, according to new research in the New England Journal of Medicine.
University of Alberta neurology resident Liam McCoy evaluated how well large language models perform clinical reasoning — the ability to sort through symptoms, order the right tests, evaluate new information and come to the correct conclusion about what's wrong with a patient.
He found that advanced AI models struggle to update their judgment in response to new and uncertain information, and often fail to recognize when information is completely irrelevant. In fact, some recent improvements designed to enhance AI reasoning have actually made this tendency toward overconfidence worse.
It all means that while AI may do really well on medical licensing exams, there's a lot more to being a good doctor than instantly recalling facts, says McCoy.
"Large language models have superhuman performance on multiple choice questions, but we're still not at a stage where a patient can safely walk into a room, turn on their language model assistant, and have that do the entire visit," says McCoy, who is also a research affiliate at the Massachusetts Institute of Technology and a research student intern with Harvard's Beth Israel Deaconess Medical Center.
The use of AI in medicine has grown by leaps and bounds in the past five years — from writing up doctors' notes to looking for patterns in disease data and advising physicians on what to look for in medical images — but the technology is not yet ready to take over from doctors when it comes to making a diagnosis, he says.
McCoy and colleagues from Harvard, MIT and elsewhere took a page from medical education to develop a benchmark that measures this flexibility in clinical reasoning in AI models. Their tool, called concor.dance, is based on script concordance testing, a common method of assessing the skills of medical and nursing students.
"As a clinician you develop a script of how an illness looks and how it goes, what to do next. The simplest example would be if somebody has chest pain to think, 'OK, the heart might possibly be involved so we should do an ECG and some blood work to look for the markers of a heart attack,'" McCoy explains.
As medical learners gain experience, their diagnostic "scripts" become more sophisticated and they are better able to sort through which symptoms are most relevant and come up with a proper diagnosis.
"As you get more advanced, you might say, 'But it's also possible this chest pain could be due to pneumonia or a hole in the lung lining,'" he explains. "You develop more complex scripts and become nimble enough to switch between different scripts based on what is happening to your patient."
In medical education, script concordance testing awards students points based on how closely their reasoning matches that of the most experienced experts in each field.
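To give a sense of the mechanics, the standard aggregate scoring rule used in script concordance testing can be sketched in a few lines of Python. This is an illustrative sketch of the general method only, not the concor.dance implementation described in the study; the expert panel and ratings below are invented for the example.

```python
from collections import Counter

def sct_score(panel_ratings, candidate_rating):
    """Aggregate scoring used in script concordance testing (illustrative sketch).

    Each rating is a point on a Likert scale (e.g., -2 to +2) indicating how a
    new piece of information changes the likelihood of a diagnostic hypothesis.
    The answer chosen by the most panel experts earns full credit; other answers
    earn partial credit in proportion to how many experts chose them.
    """
    counts = Counter(panel_ratings)
    modal_count = max(counts.values())
    return counts.get(candidate_rating, 0) / modal_count

# Hypothetical example: 10 experts rate how a new ECG finding affects the
# likelihood that chest pain is cardiac (-2 = much less likely, +2 = much more likely).
panel = [2, 2, 2, 2, 2, 1, 1, 1, 0, -1]

print(sct_score(panel, 2))   # 1.0 -> matches the modal expert answer
print(sct_score(panel, 1))   # 0.6 -> partial credit
print(sct_score(panel, -2))  # 0.0 -> no expert chose this answer
```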
McCoy's test for AI models used med school scripts for surgery, pediatrics, obstetrics, psychiatry, emergency medicine, neurology and internal medicine from Canada, the United States, Singapore and Australia.
McCoy tested 10 of the most popular AI models from Google, OpenAI, DeepSeek, Anthropic and others. While the models generally performed similarly to first- or second-year medical students, they often failed to reach the standard set by senior residents or attending physicians.