AI Falls Short in Grading University Essays

University of Cambridge

Researchers have used top generative AI models to grade hundreds of undergraduate essays and found that AI matched the human-awarded degree classification only around half the time, often failing to accurately assess the best and worst submissions.

A University of Cambridge-led team of psychologists and AI experts tested three "frontier" systems, including the latest versions (as of April 2026) of Claude and ChatGPT, on more than 750 essays submitted by psychology degree students at three UK universities.

While the accuracy of AI in grading the essays, from coursework to exam answers, was "not uniformly high", say researchers, it did manage to match the broad grading bands – a first, 2:1, 2:2 and so on – given out by human examiners between 35% and 65% of the time.
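As a rough illustration of what "matching the grading band" means, the sketch below maps percentage marks to bands and counts how often two markers land in the same one. It assumes the conventional UK thresholds (70 for a first, 60 for a 2:1, 50 for a 2:2, 40 for a third). The report's own method is not described here, and the function names and sample marks are hypothetical.

```python
def classify(mark: float) -> str:
    """Map a percentage mark to a UK degree band (conventional thresholds, assumed)."""
    if mark >= 70:
        return "first"
    if mark >= 60:
        return "2:1"
    if mark >= 50:
        return "2:2"
    if mark >= 40:
        return "third"
    return "fail"


def band_agreement(human_marks: list[float], ai_marks: list[float]) -> float:
    """Return the fraction of essays where both markers fall in the same band."""
    if not human_marks or len(human_marks) != len(ai_marks):
        raise ValueError("mark lists must be non-empty and of equal length")
    matches = sum(classify(h) == classify(a) for h, a in zip(human_marks, ai_marks))
    return matches / len(human_marks)


# Made-up marks for four essays, for illustration only.
human = [72, 65, 55, 38]
ai = [68, 62, 58, 45]
print(band_agreement(human, ai))  # 0.5: the two middle essays match, the top and bottom do not
```

In this made-up sample the disagreements fall on the highest and lowest essays, the pattern the researchers describe.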

However, major stumbling blocks for AI include routinely undervaluing work awarded top marks by humans and overvaluing essays ranked among the lowest.

Unlike human examiners, all the AI systems were "oversensitive to linguistic features": giving out higher marks based on essay length, vocabulary range, and sentence complexity, regardless of the academic quality of the essay.

In the latest report, researchers suggest that AI could be valuable for aspects of student assessment such as error detection and consistency checks – a "second pair of eyes" – as well as triaging feedback for students.

For example, large discrepancies between AI and human marks could help flag assignments requiring further review by a human assessor.
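A minimal sketch of how such a flag might work, assuming each essay has one human mark and one AI mark on the same percentage scale. The 10-point threshold, the function name and the sample marks are hypothetical choices for illustration, not figures from the report.

```python
def flag_for_review(marks: dict[str, tuple[float, float]], threshold: float = 10.0) -> list[str]:
    """Return IDs of essays whose human and AI marks differ by at least `threshold` points.

    `marks` maps an essay ID to a (human_mark, ai_mark) pair.
    The threshold is an assumed value that an institution would need to set itself.
    """
    return [
        essay_id
        for essay_id, (human_mark, ai_mark) in marks.items()
        if abs(human_mark - ai_mark) >= threshold
    ]


# Made-up marks, for illustration only.
sample = {
    "essay_1": (72, 58),  # 14-point gap: flagged
    "essay_2": (64, 61),  # 3-point gap: not flagged
    "essay_3": (41, 55),  # 14-point gap: flagged
}
print(flag_for_review(sample))  # ['essay_1', 'essay_3']
```

The flag only queues an essay for a second look. In line with the researchers' advice, the final mark in this sketch stays with the human assessor.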

However, the team cautions that AI alone is far too shallow and inconsistent to grade undergraduate work, and a human should always determine the final mark.

"Universities are under huge pressure to reduce staff workload and improve efficiency, all while meeting rising student expectations, and some may start to lean on AI for assessment," said Dr Deborah Talmi, the Cambridge psychologist who leads the OpRaise project behind the new report.

"AI could perhaps automate some of the labour-intensive aspects of marking, freeing academics up for direct student engagement."

"We find that leaning heavily on the best current AI models would see student grading that is homogenised, underestimates brilliance, and favours linguistic style over the substance of sound academic judgement," said Talmi.

"Assessment is not just a system for distributing marks. It is part of how educational meaning is made, so students feel seen, standards are upheld, and trust is maintained. Use of AI in assessment poses a risk to these values."

