Scientists Craft Toughest AI Test, Results Surprise

Texas A&M University

As artificial intelligence systems began scoring extremely high on long-used academic benchmarks, researchers noticed a growing problem: the tests that once challenged machines were no longer difficult enough. Well-known evaluations such as the Massive Multitask Language Understanding (MMLU) exam, once considered demanding, no longer properly measure the capabilities of today's advanced AI models.

To solve this problem, a worldwide group of nearly 1,000 researchers, including a professor from Texas A&M University, developed a new type of test. Their goal was to build an exam that is broad, difficult, and grounded in expert human knowledge in ways that current AI systems still struggle to handle.

The result is "Humanity's Last Exam" (HLE), a 2,500 question assessment covering mathematics, humanities, natural sciences, ancient languages, and a wide range of highly specialized academic fields. Details of the project appear in a paper published in Nature, and additional information about the exam is available at lastexam.ai .

Among the many contributors is Dr. Tung Nguyen, instructional associate professor in the Department of Computer Science and Engineering at Texas A&M. Nguyen helped write and refine many of the exam questions.

"When AI systems start performing extremely well on human benchmarks, it's tempting to think they're approaching human-level understanding," Nguyen said. "But HLE reminds us that intelligence isn't just about pattern recognition -- it's about depth, context and specialized expertise."

The purpose of the exam was not to trick or defeat human test takers. Instead, the goal was to carefully identify areas where AI systems still fall short.

A Global Effort to Measure AI's Limits

Specialists from around the world wrote and reviewed the questions included in Humanity's Last Exam. Each problem was carefully designed so it has one clear, verifiable answer. The questions were also crafted to prevent quick solutions through simple internet searches.

The topics come from advanced academic challenges. Some tasks involve translating ancient Palmyrene inscriptions, while others require identifying tiny anatomical structures in birds or analyzing detailed features of Biblical Hebrew pronunciation.

Researchers tested every question against leading AI systems. If any model was able to answer a question correctly, that question was removed from the final exam. This process ensured the test remained just beyond what current AI systems can reliably solve.
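In rough terms, that filtering step amounts to a loop that discards any candidate question a tested model answers correctly. The sketch below illustrates the idea in Python; the model list and the ask_model and is_correct helpers are hypothetical placeholders, not the project's actual tooling, which relied on multiple frontier systems and expert review.

```python
# Minimal sketch of the adversarial-filtering idea described above.
# Model names, ask_model(), and is_correct() are hypothetical placeholders.

FRONTIER_MODELS = ["model-a", "model-b", "model-c"]  # stand-ins for leading AI systems

def ask_model(model_name: str, question: str) -> str:
    """Query a model and return its answer (placeholder implementation)."""
    raise NotImplementedError("wire this up to a real model API")

def is_correct(answer: str, reference: str) -> bool:
    """Compare a model's answer against the verified reference answer."""
    return answer.strip().lower() == reference.strip().lower()

def filter_questions(candidates: list[dict]) -> list[dict]:
    """Keep only questions that no tested model answers correctly."""
    kept = []
    for item in candidates:
        solved = any(
            is_correct(ask_model(m, item["question"]), item["answer"])
            for m in FRONTIER_MODELS
        )
        if not solved:
            kept.append(item)  # question survives into the final exam
    return kept
```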

Early testing confirmed that the strategy worked. Even powerful AI models struggled with the exam. GPT-4o achieved a score of 2.7 percent, while Claude 3.5 Sonnet reached 4.1 percent. OpenAI's o1 model performed somewhat better with 8 percent. The most capable systems so far, including Gemini 3.1 Pro and Claude Opus 4.6, have reached accuracy levels between about 40 percent and 50 percent.

Why New AI Benchmarks Are Needed

Nguyen explained that the problem of AI surpassing older tests is more than a technical concern. He contributed 73 of HLE's 2,500 publicly available questions, the second-highest number among contributors, and wrote the most questions in mathematics and computer science.

"Without accurate assessment tools, policymakers, developers and users risk misinterpreting what AI systems can actually do," he said. "Benchmarks provide the foundation for measuring progress and identifying risks."

According to the research team, high scores on tests originally designed for humans do not necessarily indicate genuine intelligence. Those benchmarks mainly measure how well AI can complete specific tasks created for human learners, rather than capturing deeper understanding.

Not a Threat, but a Tool

Despite the dramatic name, Humanity's Last Exam is not meant to suggest that humans are becoming obsolete. Instead, it highlights the large amount of knowledge and expertise that still remains uniquely human.

"This isn't a race against AI," Nguyen said. "It's a method for understanding where these systems are strong and where they struggle. That understanding helps us build safer, more reliable technologies. And, importantly, it reminds us why human expertise still matters."

Building a Long-Term AI Benchmark

Humanity's Last Exam is designed to serve as a durable and transparent benchmark for future AI systems. To support that goal, the researchers have released the 2,500 questions publicly while keeping an additional held-out set private so that AI models cannot simply memorize the answers.

"For now, Humanity's Last Exam stands as one of the clearest assessments of the gap between AI and human intelligence," Nguyen said, "and despite rapid technological advances, it remains wide."

A Massive International Research Effort

Nguyen emphasized that the scale of the project demonstrates the value of collaboration across disciplines and countries.

"What made this project extraordinary was the scale," he said. "Experts from nearly every discipline contributed. It wasn't just computer scientists; it was historians, physicists, linguists, medical researchers. That diversity is exactly what exposes the gaps in today's AI systems -- perhaps ironically, it's humans working together."
