Researchers from The University of Manchester have contributed to a new global benchmark designed to measure the limits of today's most advanced artificial intelligence (AI) systems.
As large language models such as ChatGPT and Gemini have rapidly improved in recent years, many widely used benchmarks have become less informative. In 2023, leading models were found to pass the Turing test and, separately, in 2025, reached gold medal-level performance on International Mathematical Olympiad questions, scoring over 80% accuracy.
Now, two Manchester mathematicians, Dr Cesare Giulio Ardito and Dr Igor Chernyavsky, have joined nearly 1,000 expert contributors worldwide to create a multidisciplinary academic test called "Humanity's Last Exam" (HLE), which sets AI systems a fresh challenge.
The test consists of 2,500 rigorously reviewed questions spanning dozens of disciplines, from mathematics and the natural sciences to the humanities. Questions are deliberately precise, closed-ended and resistant to simple internet search or memorisation, with some combining text and images.
Every question in HLE was tested against leading AI models before inclusion. If an AI system could answer a question correctly at the time the benchmark was designed, it was rejected.
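This vetting step can be pictured as a simple adversarial filter: a candidate question is kept only if every frontier model fails to answer it. The sketch below is illustrative only; the model interface, grading function and data structures are hypothetical stand-ins, not the authors' actual pipeline.

```python
# Illustrative sketch of adversarial question filtering (not the HLE pipeline).
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Candidate:
    question: str
    reference_answer: str


def passes_adversarial_filter(
    candidate: Candidate,
    models: List[Callable[[str], str]],
    is_correct: Callable[[str, str], bool],
) -> bool:
    """Keep a question only if every frontier model fails it."""
    for model in models:
        prediction = model(candidate.question)
        if is_correct(prediction, candidate.reference_answer):
            return False  # at least one model already solves it: reject
    return True


if __name__ == "__main__":
    # Toy usage with dummy "models" that always answer incorrectly.
    dummy_models = [lambda q: "I don't know", lambda q: "42"]
    exact_match = lambda pred, ref: pred.strip().lower() == ref.strip().lower()

    pool = [Candidate("What is the stable homotopy group pi_3(S^2)?", "Z")]
    kept = [c for c in pool if passes_adversarial_filter(c, dummy_models, exact_match)]
    print(f"{len(kept)} of {len(pool)} candidate questions retained")
```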
The study, now published in Nature, found that leading AI models answered fewer than 10% of the HLE questions correctly when the dataset was first released in early 2025, despite scoring above 80% on more conventional benchmarks.
Although the rapid pace of AI development has enabled some systems to improve their scores significantly in less than a year, the top-ranked models still score just below 40%. The results also show that many AI systems still frequently express high confidence in incorrect answers to the HLE questions, although their ability to recognise their own knowledge gaps has gradually improved.
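The mismatch between a model's stated confidence and its actual accuracy is commonly quantified with a calibration metric such as expected calibration error. The sketch below is a minimal illustration with made-up numbers; it is not the metric or the data reported in the paper.

```python
# Minimal calibration sketch with entirely hypothetical data.
# A well-calibrated model's stated confidence should match its actual accuracy;
# on HLE, models often state high confidence on answers that turn out wrong.

def expected_calibration_error(confidences, correct, n_bins=10):
    """Simple binned expected calibration error (ECE)."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        idx = [i for i, c in enumerate(confidences)
               if lo < c <= hi or (b == 0 and c == 0.0)]
        if not idx:
            continue
        avg_conf = sum(confidences[i] for i in idx) / len(idx)
        accuracy = sum(correct[i] for i in idx) / len(idx)
        ece += (len(idx) / n) * abs(avg_conf - accuracy)
    return ece


if __name__ == "__main__":
    # Hypothetical example: high stated confidence, low actual accuracy.
    stated_confidence = [0.95, 0.90, 0.85, 0.99, 0.80, 0.92]
    answered_correctly = [0, 1, 0, 0, 0, 1]
    print(f"ECE: {expected_calibration_error(stated_confidence, answered_correctly):.2f}")
```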
Dr Cesare Giulio Ardito said: "I'm happy that the University of Manchester is represented among contributors from all over the world. This was a human team effort and, so far, we appear to still have an edge."
Although this new AI benchmark only measures performance on closed-ended, expert-level questions at the frontier of current knowledge, the authors hope it will help identify remaining limitations and potentially capture emerging generalist research capabilities.
This research was published in the journal Nature.
Full title: A benchmark of expert-level academic questions to assess AI capabilities
DOI: https://doi.org/10.1038/s41586-025-09962-4