A cutting-edge large language model (LLM) outperformed human doctors on common clinical reasoning tasks, including making emergency room decisions, identifying likely diagnoses, and choosing next steps in management, according to a new study that used real emergency department data. The authors of the study, one of the largest to date to compare artificial intelligence and physicians across a wide array of clinical reasoning tasks, are clear that their results do not mean AI systems are ready to practice medicine on their own, or that doctors can be removed from the diagnostic process. The results do, however, raise urgent questions about how artificial intelligence (AI) tools should be evaluated and implemented in clinical care.

For more than 65 years, difficult clinical diagnostic cases have been the gold standard for evaluating medical computing systems, and LLMs have recently surpassed earlier computational approaches on these complex cases. Despite this progress, however, most medical studies of LLMs have examined narrow or highly controlled scenarios and have often lacked direct comparison to the performance of human physicians on real-world clinical reasoning tasks. The rapid advancement of LLM-based medical tools now demands more rigorous evaluation.
Here, Peter Brodeur and colleagues comprehensively evaluated the diagnostic and treatment-planning abilities of an advanced LLM, the OpenAI o1 series, by comparing its performance to that of hundreds of physicians and earlier AI systems across a range of clinical reasoning tasks. These included both standardized clinical cases and a real-world study involving randomly selected emergency room patients at a major emergency medical center in Massachusetts.

Brodeur et al. found that, across all six experiments, the LLM consistently matched or exceeded human performance in diagnostic and management reasoning. Notably, its advantage was most pronounced in early-stage emergency department triage, where clinicians must make rapid decisions with minimal information. While both humans and the AI improved as more clinical data became available, the model demonstrated a distinct strength under conditions of uncertainty, making effective use of even fragmented, unstructured health record data.

According to the authors, LLMs are rapidly approaching, and in some areas surpassing, human-level clinical reasoning. Although AI-assisted decision-making is often viewed as risky, the findings suggest that such tools, when used in collaboration with physicians' assessments, could reduce diagnostic errors, delays, and disparities in access to care. However, the authors also note several important limitations of the study. For example, its focus was confined to text-based reasoning, whereas clinical practice depends heavily on visual and auditory cues, areas where current AI remains less capable.

"Accuracy on a defined task is only one dimension of deployment readiness. Clinical AI must also deliver equitable, cost-effective, and safe outcomes, supported by accountability, transparency, and ongoing monitoring," write Ashley Hopkins and Erik Cornelisse in a related Perspective. "Without robust demonstrated effectiveness, equity, and safety, many AI systems will remain insufficient for clinical use."