AI Doctor: GPT's Healthcare Answers 76% Accurate

Pennsylvania State University

Artificial intelligence (AI)-powered chatbots respond to everyday health-related questions from general users with nearly 76% accuracy, which raises concerns about their trustworthiness in real-world client-facing applications, according to a new study led by Penn State researchers.

The researchers wanted to understand how the average person uses AI for health-related concerns and how accurately AI responds to everyday medical queries. They found that when it comes to healthcare, especially specialized areas like neurology and dermatology, AI tools may work best in the hands of trained physicians rather than patients. The team will present their findings at the 2026 Association for Computing Machinery Fairness, Accountability and Transparency (FAccT) conference in Montreal, Canada, June 25-28.

"Our work focuses explicitly on healthcare scenarios that the average internet user might ask AI, which is a perspective that prior research into large language models (LLMs) and healthcare hasn't covered," said study co-author Amulya Yadav, associate professor of informatics and intelligent systems in Penn State's College of Information Sciences and Technology (IST). "We wanted to understand that if people are using LLMs like ChatGPT as a symptom health checker, like historically we've used Google, how accurate is the LLM in answering those queries, and how harmful could those responses be?"

To understand how accurate or harmful health-related LLM responses could be for the average internet user, the researchers held an AI competition called a Diagnose-a-thon at Penn State. A total of 34 participants - comprising faculty, staff and undergraduate and graduate students - submitted 212 prompts and AI-generated responses to real and imaginary health concerns written from both patient and doctor perspectives. Participants were allowed to choose one of four LLMs to use for the contest: ChatGPT-4o, ChatGPT-3.5, Gemini-1.5 Pro and Llama3-8b.

"One of the strengths of our study is we're essentially trying to replicate real-world usage of LLMs by telling participants to choose the LLM of their choice and use it as they would on a normal day," said Bonam Mingole, lead author of the study and doctoral candidate in information sciences and technology. "This type of participatory research is so important for understanding how the public uses AI in their daily life."

The researchers then asked nine board-certified physicians to rate both the accuracy of the AI-generated responses and their potential for harm on a six-point scale ranging from very low to very high. A competition committee awarded prizes to the eight submissions that generated the most medically accurate information, plus a prize to the submission that generated the response most likely to cause harm.
