The largest user study of large language models (LLMs) for assisting the general public with medical decisions has found that they pose risks to people seeking medical advice because of their tendency to provide inaccurate and inconsistent information. The results have been published in Nature Medicine.
The new study, led by the Oxford Internet Institute and the Nuffield Department of Primary Care Health Sciences at the University of Oxford and carried out in partnership with MLCommons and other institutions, reveals a major gap between the promise of LLMs and their usefulness for people seeking medical advice. While these models now excel at standardised tests of medical knowledge, they pose risks to real users seeking help with their own medical symptoms.
In the study, participants used LLMs to identify health conditions and decide on an appropriate course of action, such as seeing a GP or going to hospital, based on information provided in a series of specific medical scenarios developed by doctors.
A key finding was that LLMs offered no advantage over traditional methods: participants who used them did not make better decisions than those who relied on online searches or their own judgment.
The study also revealed a two-way communication breakdown. Participants often didn't know what information the LLMs needed to offer accurate advice, and the responses they received frequently combined good and poor recommendations, making it difficult to identify the best course of action.
In addition, existing tests fall short: current evaluation methods for LLMs do not reflect the complexity of interacting with human users. Much as new medications must pass clinical trials, LLM systems should be tested in the real world before being deployed.
'These findings highlight the difficulty of building AI systems that can genuinely support people in sensitive, high-stakes areas like health,' said Dr Rebecca Payne, GP, lead medical practitioner on the study (Nuffield Department of Primary Care Health Sciences, University of Oxford, and Bangor University).
'Despite all the hype, AI just isn't ready to take on the role of the physician. Patients need to be aware that asking a large language model about their symptoms can be dangerous, giving wrong diagnoses and failing to recognise when urgent help is needed.'
In the study, researchers conducted a randomised trial involving nearly 1,300 online participants who were asked to identify potential health conditions and recommended courses of action, based on personal medical scenarios. The detailed scenarios, developed by doctors, ranged from a young man developing a severe headache after a night out with friends to a new mother feeling constantly out of breath and exhausted.
One group used an LLM to assist their decision-making, while a control group used traditional sources of information. The researchers then evaluated how accurately participants identified the likely medical issues and the most appropriate next step, such as visiting a GP or going to A&E. They also compared these outcomes with the results of standard LLM testing strategies, which do not involve real human users. The contrast was striking: models that performed well on benchmark tests faltered when interacting with people.
They found evidence of three types of challenge:
- Users often didn't know what information they should provide to the LLM;
- LLMs provided very different answers based on slight variations in the questions asked;
- LLMs often provided a mix of good and bad information which users struggled to distinguish.
Lead author Andrew Bean, a DPhil student at the Oxford Internet Institute, said: 'Designing robust testing for large language models is key to understanding how we can make use of this new technology. In this study, we show that interacting with humans poses a challenge even for top LLMs. We hope this work will contribute to the development of safer and more useful AI systems.'
The study 'Clinical knowledge in LLMs does not translate to human interactions' has been published in Nature Medicine.