Testing Large Language Models On Scientific Literature

To stay up to date and push their fields forward, scientists must keep thousands of published studies at their fingertips and in their minds. Large language models (LLMs) show promise as a tool for exploring the vast scientific literature, but are they trustworthy when it comes to providing full and scientifically accurate answers to complex questions in specialized fields?

To find out, Cornell physicists and Google researchers engaged a panel of 12 human experts to test the ability of six LLM systems - ChatGPT, Claude and others - to understand scientific literature at the level of a specialist, using the field of high-temperature cuprates, a class of superconducting materials, as an example. Some systems performed better than others, they found. The study also revealed gaps in current LLM capability and produced a wish list of improvements for AI developers to target in future models.

"This study is about testing out LLMs' ability to read the literature the way an expert would read," said Eun-Ah Kim, the Hans A. Bethe Professor of physics in the College of Arts and Sciences (A&S), corresponding author of the study. "This paper is important now because everyone is very curious about what LLMs can and cannot do, especially in the context of artificial general intelligence (AGI). There are critical gaps in what LLMs can do right now, which is clearly showing that they are not at AGI."

"Expert Evaluation of LLM World Models: A High-Tc Superconductivity Case Study" was published in the Proceedings of the National Academy of Sciences on March 10. The lead author is Haoyu Guo, Bethe/KIC postdoctoral fellow with Cornell's Laboratory of Atomic and Solid State Physics (LAASP).

As a graduate student, Guo worked on cuprate high-Tc superconductors, the example field of the current study. "The challenge was the huge number of experimental results accumulated over decades," he said. "I am curious to see whether an LLM could help young students or researchers going into a new field - in general, not just cuprates."

The researchers created a database of 1,726 scientific papers curated by human experts that cover the history of the field of high-temperature cuprates and a set of 67 questions, written by a larger group of experts, that probe deep understanding of the literature.

With these assets, they examined four LLMs - ChatGPT-4, Claude 3.5, Perplexity and Gemini Advanced Pro 1.5 - as well as NotebookLM, a Google product that answers a user's questions based on provided documents. They also added to the mix a custom retrieval-augmented generation (RAG) system capable of retrieving relevant images as well as text from the curated documents.
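The general idea behind a retrieval-augmented generation (RAG) system can be sketched in a few lines: embed the curated documents, retrieve the passages most similar to a question, and prepend them as context for the model. The toy corpus, bag-of-words embedding, and function names below are illustrative assumptions for exposition, not details of the study's actual system, which used neural embeddings and also retrieved images.

```python
# Minimal RAG sketch: rank passages by similarity to a question
# and build a grounded prompt. All names and data here are
# illustrative stand-ins, not the study's implementation.
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words embedding (a real system would use a neural encoder)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(question, corpus, k=2):
    """Return the k passages most similar to the question."""
    q = embed(question)
    return sorted(corpus, key=lambda p: cosine(q, embed(p)), reverse=True)[:k]

def build_prompt(question, corpus):
    """Prepend retrieved passages as context for an LLM call."""
    context = "\n".join(retrieve(question, corpus))
    return f"Context:\n{context}\n\nQuestion: {question}"

corpus = [
    "Cuprate superconductors show a pseudogap phase above Tc.",
    "The pairing symmetry in cuprates is d-wave.",
    "Graphene is a two-dimensional carbon allotrope.",
]
prompt = build_prompt("What is the pairing symmetry in cuprates?", corpus)
```

Grounding the model on a curated corpus rather than open web search is what the study found made NotebookLM and the custom system more reliable.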

Experts manually graded the answers given by each of the systems without knowing which system they were evaluating.

The systems that featured curated information - Google's product and the custom RAG system - did the best.

"LLMs operating on trusted data sources - papers we collected ourselves, not from the LLM searching the Internet - tend to perform better," Guo said. "Among these, NotebookLM performs better when I have a set of papers that I want to understand better."

All the LLMs were surprisingly good at pulling out text-based information, Kim said, but "totally incapable" of engaging with data visualization. This is a serious drawback; she teaches students to look at data visualization critically, as an essential part of a paper.

The custom model, with its unique ability to retrieve images from curated documents, was significantly better at data visualization.

On the wish list for AI developers, Guo said, are more accurate attributions for LLMs' claims (they sometimes make up references); a better ability to synthesize the many facets of a problem and to reflect its complexities; and improved comprehension of plots and figures.

"It has been about a year since we performed the benchmark and we have seen improvements of models in many aspects," Guo said. "But visual reasoning is still underdeveloped."

Using trusted LLM systems to explore scientific literature could give a leg up to young researchers who have creative ideas, Kim said. "Knowing the facts used to be brandished as a ticket to the table. Holding a fact in your head should not be the ticket. The ticket should be: Do you know how to think in a creative way? Can you approach problems from a creative angle?"

This is the first study out of the Cornell-led National Science Foundation AI-Materials Institute, which Kim directs.

Kate Blackwood is a writer for the College of Arts and Sciences.
