A University of Exeter expert is working to ensure AI can better help scientists safely make rapid advances in vital fields such as drug discovery at a fraction of the current costs.
Stephan Guttinger has secured a Research Leadership Award from the Leverhulme Trust to explore the reasoning abilities of "AI Scientists" – AI-based systems that can autonomously perform research tasks and solve scientific problems.
It is hoped that the deployment of these systems will support research. However, to work autonomously, these new tools need to be able to handle the most common problem in everyday research practice: identifying experimental errors and suggesting possible solutions.
Dr Guttinger, a lecturer in Philosophy of Data at the Department of Social and Political Sciences, Philosophy and Anthropology, has been awarded the Leverhulme funding to assemble an interdisciplinary team that will explore the error-reasoning ability of AI Scientists.
The four-year project will bring together experts in philosophy, the natural sciences, and computer science. At the heart of the project is the development of a benchmark that can systematically probe how well current or future AI models can deal with error in scientific practice. The benchmark is needed as it is not clear how good existing AI systems are at scientific error-reasoning.
Dr Guttinger said: "Scientific error-reasoning has not been widely or deeply datafied: scientists work through errors in weekly laboratory meetings, on whiteboards, or in the hallways of a conference venue. These discussions rarely find their way into published materials and are thus underrepresented in the data on which AI models are trained.
"To address this uncertainty, we need benchmarks that allow us to assess the extent to which AI models can reason about scientific error. However, even our most sophisticated benchmarks for AI agents don't currently test for this type of reasoning."
A key challenge for building an error-reasoning benchmark is the lack of a well-developed theory of error.
Dr Guttinger said: "Developing effective benchmarks requires a good understanding of error-reasoning in science: what are the types of errors scientists encounter and what are the strategies they usually deploy to address them? Unfortunately, we still lack a systematic and comprehensive theory of error in science."
The first goal of the project will therefore be to build a detailed error theory for science, which the team will then use to assemble a systematic database of error types and strategies in science.
This database will be used to develop two benchmarks: a traditional benchmark containing more than 500 question-answer pairs to test the error-reasoning ability of isolated AI agents, and a second benchmark designed to assess human-AI teams. Together, these will allow the team to assess different aspects of the error-reasoning process in science.
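The article does not specify how the question-answer benchmark will be formatted, but a minimal sketch of what one entry and a scoring routine could look like is shown below. All field names, the example scenario, and the exact-match scoring rule are hypothetical illustrations, not the project's actual design:

```python
from dataclasses import dataclass

@dataclass
class ErrorReasoningItem:
    """One hypothetical benchmark entry pairing an experimental
    scenario with an error category and a reference answer."""
    scenario: str          # lab-notebook-style account of an anomalous result
    error_type: str        # category drawn from an error taxonomy
    reference_answer: str  # diagnosis or strategy a scientist might give

def score(item: ErrorReasoningItem, model_answer: str) -> bool:
    # Naive case-insensitive exact match; a real benchmark would
    # likely use graded or rubric-based evaluation instead.
    return model_answer.strip().lower() == item.reference_answer.strip().lower()

item = ErrorReasoningItem(
    scenario="Western blot shows bands in the negative control lane.",
    error_type="contamination",
    reference_answer="Check for antibody cross-reactivity or sample carryover.",
)
print(score(item, "check for antibody cross-reactivity or sample carryover."))  # True
```

Exact-match scoring is used here only to keep the sketch self-contained; evaluating open-ended error diagnoses would in practice require more flexible comparison.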