When people worry that they're getting sick, they are increasingly turning to generative artificial intelligence tools such as ChatGPT for a diagnosis. But how accurate are the answers that AI gives out?
Research recently published in the journal iScience puts ChatGPT and the large language models behind it to the test, with a few surprising conclusions.
Ahmed Abdeen Hamed - a research fellow for the Thomas J. Watson College of Engineering and Applied Science's School of Systems Science and Industrial Engineering at Binghamton University - led the study, with collaborators from AGH University of Krakow, Poland; Howard University; and the University of Vermont.
As part of George J. Klir Professor of Systems Science Luis M. Rocha's Complex Adaptive Systems and Computational Intelligence Lab, Hamed developed a machine-learning algorithm last year that he calls xFakeSci. It can detect up to 94% of bogus scientific papers - nearly twice as successfully as more common data-mining techniques. He sees this new research as the next step in verifying the biomedical generative capabilities of large language models.
"People talk to ChatGPT all the time these days, and they say: 'I have these symptoms. Do I have cancer? Do I have cardiac arrest? Should I be getting treatment?'" Hamed said. "It can be a very dangerous business, so we wanted to see what would happen if we asked these questions, what sort of answers we got and how these answers could be verified from the biomedical literature."
The researchers tested how well ChatGPT identifies disease terms and three types of associated information: drug names, genetic information and symptoms. The AI showed high accuracy in identifying disease terms (88-97%), drug names (90-91%) and genetic information (88-98%). Hamed admitted he had expected "at most 25% accuracy."
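The paper's exact querying protocol isn't described here, but the general shape of such a test can be sketched in a few lines of Python. In the sketch below, the use of the OpenAI chat API, the model name, the prompt wording and the tiny hand-labeled term list are all illustrative assumptions rather than the study's actual setup; the point is simply how per-category term-recognition accuracy could be scored.

```python
# Minimal sketch: score an LLM on biomedical term recognition against a tiny
# hand-labeled reference list. Assumes the OpenAI Python client
# (pip install openai) and an API key in OPENAI_API_KEY; the prompt, model
# name and example terms below are illustrative, not the study's protocol.
from openai import OpenAI

client = OpenAI()

# Reference terms with their expected category.
REFERENCE = {
    "disease": ["hypertension", "cancer"],
    "drug": ["remdesivir", "metformin"],
    "gene": ["BRCA1", "TP53"],
    "symptom": ["fever", "fatigue"],
}

def classify(term: str) -> str:
    """Ask the model to label a term as disease, drug, gene, or symptom."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model choice
        messages=[{
            "role": "user",
            "content": (
                "Answer with exactly one word (disease, drug, gene, or "
                f"symptom): what kind of biomedical term is '{term}'?"
            ),
        }],
    )
    return response.choices[0].message.content.strip().lower()

if __name__ == "__main__":
    for category, terms in REFERENCE.items():
        correct = sum(classify(term) == category for term in terms)
        print(f"{category}: {correct}/{len(terms)} identified correctly")
```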
"The exciting result was ChatGPT said cancer is a disease, hypertension is a disease, fever is a symptom, Remdesivir is a drug and BRCA is a gene related to breast cancer," he said. "Incredible, absolutely incredible!"
Symptom identification, however, scored lower (49-61%), and the reason may be how the large language models are trained. Doctors and researchers use biomedical ontologies to define and organize terms and relationships for consistent data representation and knowledge-sharing, but ChatGPT users tend to describe their symptoms in more informal language.
"ChatGPT uses more of a friendly and social language, because it's supposed to be communicating with average people. In medical literature, people use proper names," Hamed said. "The LLM is apparently trying to simplify the definition of these symptoms, because there is a lot of traffic asking such questions, so it started to minimize the formalities of medical language to appeal to those users."
One puzzling result stood out. The National Institutes of Health maintains a database called GenBank, which gives an accession number to every identified DNA sequence. It's usually a combination of letters and numbers. For example, the designation for the Breast Cancer 1 gene (BRCA1) is NM_007294.4.
When asked for these numbers as part of the genetic information testing, ChatGPT just made them up - a phenomenon known as "hallucinating." Hamed sees this as a major failing amid so many other positive results.
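One way to guard against such fabricated identifiers is to check each one against GenBank itself. The sketch below does this through NCBI's public E-utilities service (a real endpoint); the FASTA-based check and the sample accession numbers are illustrative assumptions, not the verification pipeline used in the study.

```python
# Rough sketch: flag GenBank accession numbers that NCBI cannot resolve.
# The E-utilities endpoint is real; the lookup heuristic and the sample
# accessions are illustrative assumptions.
import urllib.error
import urllib.parse
import urllib.request

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def accession_exists(accession: str) -> bool:
    """Return True if NCBI's nucleotide database returns a record for this accession."""
    params = urllib.parse.urlencode({
        "db": "nuccore",       # nucleotide database, where NM_ records live
        "id": accession,
        "rettype": "fasta",
        "retmode": "text",
    })
    try:
        with urllib.request.urlopen(f"{EFETCH}?{params}", timeout=10) as resp:
            head = resp.read(200).decode("utf-8", errors="replace")
    except urllib.error.HTTPError:
        return False           # NCBI rejects accessions it doesn't recognize
    return head.startswith(">")  # real records come back in FASTA format

if __name__ == "__main__":
    # NM_007294.4 is the genuine BRCA1 record mentioned above; the second
    # accession is deliberately invented to mimic an LLM hallucination.
    for acc in ("NM_007294.4", "NM_000000000.1"):
        status = "found" if accession_exists(acc) else "not found -- possible hallucination"
        print(f"{acc}: {status}")
```

Any identifier that fails such a lookup can be flagged rather than trusted, which is the kind of literature-grounded verification the researchers argue for.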
"Maybe there is an opportunity here that we can start introducing these biomedical ontologies to the LLMs to provide much higher accuracy, get rid of all the hallucinations and make these tools into something amazing," he said.
Hamed's interest in LLMs began in 2023, when he discovered ChatGPT and heard about the fact-checking issues surrounding it. His goal is to expose those flaws so data scientists can adjust the models as needed and make them better.
"If I am analyzing knowledge, I want to make sure that I remove anything that may seem fishy before I build my theories and make something that is not accurate," he said.