Are language models such as ChatGPT suitable as independent teaching assistants in the natural sciences? A research team at the University of Würzburg has investigated this question.

Artificial intelligence has become an integral part of many people's everyday lives. Large language models (LLMs) such as ChatGPT, Gemini or Copilot write letters and term papers, suggest holiday excursions, and answer questions on every conceivable topic.
The use of artificial intelligence has also long been routine in many areas of university life. But to what extent can large language models support students in the natural sciences as unsupervised tutors? A research team at Julius-Maximilians-Universität Würzburg (JMU) has now investigated this question. Their results have been published as a preprint on arXiv.
A Freely Accessible Evaluation Tool
A research group from the Department of Physical Chemistry, whose work has so far focused mainly on the spectroscopy of nanomaterials, has now developed a tool that tests the thermodynamic understanding of modern LLMs - in particular, whether their skills go beyond mere factual knowledge. The tool, called UTQA (Undergraduate Thermodynamics Question Answering), is freely accessible and is intended to help teachers and researchers evaluate LLMs in a fair, subject-specific way - and to make progress measurable.
"Our wish is that AI will one day be able to support us as an unsupervised partner in teaching - for example in the form of competent chatbots that respond individually to the needs of each student in the preparation and follow-up of lectures. We're clearly not there yet, but the progress is breathtaking," says project manager Professor Tobias Hertel. "With UTQA, we show where current language models are already convincing and where they systematically fail - this is exactly what lecturers need in order to be able to plan their use in teaching responsibly."
Born out of Teaching
Since the winter semester of 2023, Hertel's team has been using LLMs for weekly knowledge checks in its thermodynamics lecture, which is attended by over 150 students. Models such as GPT-3.5 and GPT-4 showed strengths, but also clear weaknesses.
This led to the desire for a subject-specific benchmark: "UTQA therefore comprises 50 challenging single-choice tasks from the introductory thermodynamics lecture - two thirds text-based, one third with diagrams and sketches, as is typical for didactic exercises," explains Hertel. The aim was not only to test factual knowledge and definitions, but also the language models' ability to combine different boundary conditions in a targeted way and to follow complex process sequences.
Results: Solid, but Not (Yet) Reliable Enough
According to Hertel, testing the best-performing models of 2025 paints a clear picture: on UTQA, no model reached the 95 per cent success rate that the research group requires for unsupervised assistance as an AI tutor. Even OpenAI's o3, a leading model in many benchmarks, achieved only 82 per cent overall accuracy.
"Two weaknesses were noticeable: Firstly, the models consistently had difficulties with so-called irreversible processes, where the speed of the state change influences the outcome. Secondly, there were clear deficits in tasks that required image interpretation," says the scientist.
A look at history shows that the first finding is not surprising: around 100 years ago, the French physicist Pierre Duhem already described reversibility as one of the most difficult concepts in thermodynamics. That LLMs struggle to interpret diagrams is equally unsurprising, as the perception and processing of visual content is one of the outstanding cognitive strengths of humans - and one that machines have yet to match.
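A concrete example makes the first weakness tangible. The following sketch (our illustration, not a task taken from UTQA) computes the work for an isothermal ideal-gas expansion between the same initial and final states along a reversible and an irreversible path. Because work is a process variable, the two answers differ - and a model that merely recalls the reversible formula without checking how the process is carried out will get such tasks wrong.

```python
# Illustrative example (ours, not a UTQA task): isothermal expansion of one
# mole of an ideal gas from 10 L to 20 L at 298 K. The initial and final
# states are identical in both scenarios, yet the work exchanged differs,
# because work depends on the path, not just on the end states.
import math

R = 8.314              # gas constant, J/(mol*K)
n, T = 1.0, 298.0      # amount of gas (mol) and temperature (K)
V1, V2 = 0.010, 0.020  # initial and final volume (m^3)

# Reversible path: quasi-static expansion against p = nRT/V at every step.
w_rev = -n * R * T * math.log(V2 / V1)

# Irreversible path: sudden expansion against a constant external pressure
# equal to the final pressure p2 = nRT/V2.
p_ext = n * R * T / V2
w_irr = -p_ext * (V2 - V1)

print(f"Work on the gas, reversible path:   {w_rev:8.1f} J")  # ~ -1717 J
print(f"Work on the gas, irreversible path: {w_irr:8.1f} J")  # ~ -1239 J
```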
Not Yet Good Enough for Unsupervised Use
"In practice, this means that LLMs can already be very useful in teaching with or without supervision - but not yet enough to be used as unsupervised tutors," says Hertel. "At the same time, we have seen enormous progress in the last two years. We are therefore confident that - provided development does not suddenly come to a standstill - the expertise required for teaching assistants in our discipline can soon be achieved."
Tobias Hertel is particularly pleased that two student teachers played a major part in the research project, contributing their specialised didactic perspective. Luca-Sophie Bien created an initial German version of many of the tasks; Anna Geißler translated and expanded the collection for international use.
Why Thermodynamics?
According to Hertel, thermodynamics is ideal for testing the models' understanding and reasoning ability: "It is fundamental to our understanding of nature and has compact basic laws, but its application requires a precise distinction between state and process variables, heat and work, and reversible and irreversible processes. This is precisely where reasoning ability is separated from mere memorisation," says the physical chemist.
As a next step, the team plans to expand the tool to include real gases, mixtures, phase diagrams and standard cycles - further concepts that are central to teaching. "The better models can handle multimodal binding, i.e. the combination of text and images, as well as irreversible regimes, the closer we get to reliable, subject-sensitive AI tutoring," says Hertel.
Publication & Data
From Canonical to Complex: Benchmarking LLM Capabilities in Undergraduate Thermodynamics, Anna Geißler, Luca-Sophie Bien, Friedrich Schöppler, and Tobias Hertel, published as a preprint here: https://arxiv.org/abs/2508.21452
The dataset can be found here: UTQA (herteltm/UTQA) on Hugging Face.
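For anyone who wants to run the benchmark themselves, a minimal evaluation loop might look like the sketch below. The split and column names ("train", "question", "options", "answer") are illustrative assumptions, not confirmed by the article, so check the dataset card first; the model call is left as a stub, and since a third of the tasks include diagrams, a real harness would need a vision-capable model.

```python
# Minimal sketch of scoring a model on UTQA against the group's 95% bar.
# ASSUMPTIONS: the split name and the column names "question", "options"
# and "answer" are illustrative guesses; consult the dataset card at
# https://huggingface.co/datasets/herteltm/UTQA for the actual schema.
from datasets import load_dataset

ds = load_dataset("herteltm/UTQA", split="train")

def ask_model(question: str, options: list[str]) -> str:
    # Stub: send the single-choice task to the LLM under test and return
    # the letter of the option it picks (e.g. "B"). Replace with a real
    # API call; diagram tasks additionally need the image passed along.
    return "A"

correct = sum(
    ask_model(item["question"], item["options"]) == item["answer"]
    for item in ds
)
accuracy = correct / len(ds)
print(f"Overall accuracy: {accuracy:.1%}")
print("Meets the 95% bar for unsupervised tutoring:", accuracy >= 0.95)
```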