A council of five AI models working together, discussing their answers through an iterative process, achieved 97%, 93%, and 94% accuracy on 325 medical exam questions spanning the three stages of the U.S. Medical Licensing Examination (USMLE), according to a new study published October 9th in the open-access journal PLOS Digital Health by researcher Yahya Shaikh of Baltimore, USA, and colleagues.
Over the past several years, many studies have evaluated the performance of large language models (LLMs) on medical knowledge and licensing exams. While scores have improved across LLMs, performance varies when the same question is posed to an LLM multiple times: a variety of responses are generated, some of which are incorrect or hallucinated.
In the new study, researchers developed a method to create a council of AI agents, composed of multiple instances of OpenAI's GPT-4, that undergo coordinated, iterative exchanges designed to arrive at a consensus response. When responses diverge, a facilitator algorithm manages the deliberation: it summarizes the reasoning behind each response and asks the council to deliberate and re-answer the original question.
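To make the coordination concrete, here is a minimal Python sketch of such a facilitator loop. This is not the authors' implementation: the `ask` callable, the council size, the round limit, and the plurality fallback are all assumptions, and the paper's facilitator summarizes each member's reasoning rather than merely tallying final answers.

```python
from collections import Counter
from typing import Callable

def council_deliberate(
    ask: Callable[[str], str],  # hypothetical: sends a prompt to one GPT-4 instance
    question: str,
    n_members: int = 5,
    max_rounds: int = 3,
) -> str:
    """Sketch of a facilitator loop: poll the council, and if answers
    diverge, feed a summary of the split back and re-ask the question."""
    prompt = question
    for _ in range(max_rounds):
        answers = [ask(prompt) for _ in range(n_members)]
        tally = Counter(answers)
        answer, votes = tally.most_common(1)[0]
        if votes == n_members:  # unanimous: consensus reached
            return answer
        # Facilitator step (simplified): summarize the divergent responses
        # and ask the council to reconsider the original question.
        summary = "; ".join(f"{a} ({c} votes)" for a, c in tally.items())
        prompt = (
            f"{question}\n\nThe council previously answered: {summary}. "
            "Consider the reasoning behind each answer and respond again."
        )
    # No unanimity within the round limit: fall back to the plurality answer.
    return answer
```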
When the council was given 325 publicly available USMLE questions, covering foundational biomedical sciences as well as clinical diagnosis and management, the system achieved consensus responses that were correct 97%, 93%, and 94% of the time for Step 1, Step 2 CK, and Step 3, respectively, outperforming single-instance GPT-4 models. In cases where there was no initial unanimous response, the council's deliberations reached a consensus that was correct 83% of the time. For questions that required deliberation, the council corrected over half (53%) of the responses that a simple majority vote had gotten wrong.
The authors suggest that collective decision-making among AIs can enhance accuracy and lead to more trustworthy tools for healthcare, where accuracy is critical. However, they note that the paradigm has not yet been tested in real clinical scenarios.
"By demonstrating that diverse AI perspectives can refine answers, we challenge the notion that consistency alone defines a 'good' AI," say the authors. "Instead, embracing variability through teamwork might unlock new possibilities for AI in medicine and beyond."
Yahya Shaikh says, "Our study shows that when multiple AIs deliberate together, they achieve the highest-ever performance on medical licensing exams, scoring 97%, 93%, and 94% across Steps 1–3, without any special training on or access to medical data. This demonstrates the power of collaboration and dialogue between AI systems to reach more accurate and reliable answers. Our work provides the first clear evidence that AI systems can self-correct through structured dialogue, with the performance of the collective better than the performance of any single AI."
Zishan Siddiqui notes, "This study isn't about evaluating AI's USMLE test-taking prowess, the kind that would make its mama proud, its papa brag, and grab headlines. Instead, we describe a method that improves accuracy by treating AI's natural response variability as a strength. It allows the system to take a few tries, compare notes, and self-correct, and it should be built into future tools for education and, where appropriate, clinical care."
Zainab Asiyah notes, "Semantic entropy didn't just measure data; it told a story. It shows a struggle, ups and downs, and a resolution, much like a human journey. It revealed a human side to LLMs. The numbers show how LLMs could actually convince each other to take on viewpoints and converse to change each other's minds…even if it was the wrong answer."
In your coverage, please use this URL to provide access to the freely available paper in PLOS Digital Health: https://plos.io/4n7eY5L
Citation: Shaikh Y, Jeelani-Shaikh ZA, Jeelani MM, Javaid A, Mahmud T, Gaglani S, et al. (2025) Collaborative intelligence in AI: Evaluating the performance of a council of AIs on the USMLE. PLOS Digit Health 4(10): e0000787. https://doi.org/10.1371/journal.pdig.0000787
Author countries: United States, Malaysia, Pakistan
Funding: The author(s) received no specific funding for this work.