AI Solves Temporal Errors, Boosts Medical, Legal Trust

Korea Advanced Institute of Science and Technology

What if ChatGPT answered with the name of a minister from a year ago when asked, "Who was the minister inaugurated last month?" This is a prime example of the limitations of AI that fails to properly reflect the latest information. Our university's research team has developed a new evaluation technology that automatically reflects changing real-world information while catching "temporal errors" that may appear correct on the surface. This is expected to drastically improve AI reliability.

KAIST announced on April 14th that a research team led by Professor Steven Euijong Whang from the School of Electrical Engineering, in joint research with Microsoft Research, has developed a system that automatically evaluates and diagnoses the temporal reasoning capabilities of Large Language Models (LLMs) using temporal database technology.

For AI to earn user trust, the ability to accurately understand real-world information that changes moment by moment is essential. However, existing evaluation methods either checked only whether the final answer matched or did not sufficiently reflect complex temporal relationships, making it difficult to evaluate the variety of question scenarios that arise in real environments.

To solve this, the research team introduced "Temporal Database" design theory, which has been verified over the past 40 years, into AI evaluation for the first time. The core of the technology is that it uses the temporal flow and relational structure of the data to automatically generate 13 types of complex time-based problems from the database itself, without humans having to write evaluation questions manually.

In particular, the technology is regarded as a major innovation because it replaces manual problem writing with evaluation questions generated automatically from data. Furthermore, because the entire process, from problem generation to answer derivation and verification, is automated on top of the database, questions no longer need to be modified by hand, which drastically reduces the maintenance burden.

When real-world information changes, the evaluation questions, answers, and verification criteria are automatically updated simply by updating the corresponding content in the database. While the input of the latest information itself is handled by external data or administrators, this technology is structured to perform the overall evaluation automatically after such data is updated.
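As a rough illustration of this idea, the Python sketch below stores facts with validity periods and derives both questions and answers from the same rows, so that changing a row changes the evaluation set. It is only a minimal sketch under stated assumptions, not the research team's system: the `Row` schema, the two question templates, the helper names, and the placeholder data ("Ministry X", "Person A", "Person B") are all hypothetical, and the 13 question types mentioned above are not reproduced here.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Row:
    """One fact with a validity period (hypothetical schema)."""
    org: str
    role: str
    holder: str
    start: date           # first day the fact is valid (inclusive)
    end: Optional[date]   # first day it is no longer valid; None = still valid


def holder_on(rows, org, role, day):
    """Return who held `role` at `org` on `day`, or None if nobody did."""
    for r in rows:
        if (r.org == org and r.role == role
                and r.start <= day and (r.end is None or day < r.end)):
            return r.holder
    return None


def generate_questions(rows, today):
    """Derive (question, answer) pairs directly from the stored rows."""
    qa = []
    for r in rows:
        # Point-in-time question anchored at the start of each validity period.
        qa.append((f"Who was the {r.role} of {r.org} on {r.start.isoformat()}?",
                   r.holder))
    for org, role in sorted({(r.org, r.role) for r in rows}):
        # "Current" question whose answer depends on the evaluation date.
        qa.append((f"Who is the current {role} of {org}?",
                   holder_on(rows, org, role, today)))
    return qa


def replace_holder(rows, org, role, new_holder, day):
    """Close the open row for (org, role) and add the new holder from `day`."""
    updated = []
    for r in rows:
        if r.org == org and r.role == role and r.end is None:
            r = Row(r.org, r.role, r.holder, r.start, day)
        updated.append(r)
    updated.append(Row(org, role, new_holder, day, None))
    return updated


# Placeholder data, not real people or institutions.
rows = [Row("Ministry X", "minister", "Person A", date(2024, 1, 10), None)]
before = dict(generate_questions(rows, today=date(2025, 3, 1)))

# A new minister takes office: only the database is updated.
rows = replace_holder(rows, "Ministry X", "minister", "Person B", date(2025, 2, 15))
after = dict(generate_questions(rows, today=date(2025, 3, 1)))
```

In this sketch nobody edits a question or an answer key by hand. After the update, the regenerated set contains a new point-in-time question for the second validity period, and the expected answer to the "current minister" question changes with the data.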

Additionally, moving beyond simply judging whether the final answer is correct or incorrect, the research team introduced a new metric that verifies the logical validity of the dates or periods presented in the answer. With it, they detected "Temporal Hallucination," where an answer appears correct but rests on the wrong temporal basis, an average of 21.7% more accurately than before.
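To make the distinction concrete, the following Python sketch grades an answer on two axes: whether the named person is right, and whether the date the answer cites falls inside the validity period stored in the database. The article does not spell out how the actual metric is defined, so the `Fact` record, the `grade` function, its three labels, and the placeholder data are all assumptions made for illustration.

```python
from datetime import date
from typing import NamedTuple, Optional


class Fact(NamedTuple):
    """Ground-truth record with a validity period (hypothetical schema)."""
    holder: str
    start: date           # first valid day (inclusive)
    end: Optional[date]   # first day no longer valid; None = still valid


def grade(fact: Fact, answer_holder: str, cited_date: date) -> str:
    """Check the final answer and the date it relies on separately."""
    answer_ok = answer_holder == fact.holder
    time_ok = fact.start <= cited_date and (fact.end is None or cited_date < fact.end)
    if answer_ok and time_ok:
        return "correct"
    if answer_ok:
        # Right name, but justified with a date outside the validity period.
        return "temporal_hallucination"
    return "wrong_answer"


# Placeholder ground truth, not a real person.
fact = Fact("Person B", date(2025, 2, 15), None)

well_grounded = grade(fact, "Person B", date(2025, 2, 20))
wrong_basis = grade(fact, "Person B", date(2023, 5, 1))
wrong_name = grade(fact, "Person A", date(2025, 2, 20))
```

A check that compares only the final answer would accept the first two responses alike. Here the second one is flagged because its stated date lies outside the period in which the fact was valid.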

Applying this technology can significantly reduce evaluation maintenance costs, since only the database needs to be updated when information changes. It also reduced the amount of input data by an average of 51% compared to previous methods.

Professor Steven Euijong Whang stated, "This research is an example showing that classical database design theory can play a crucial role in solving the reliability issues of the latest AI. By converting vast amounts of professional data into evaluation resources, we expect this to become a practical foundation for verifying AI performance in various fields such as medicine and law in the future."

Soyeon Kim, a PhD student at KAIST, participated as the lead author of this study, and Jindong Wang (Microsoft Research, currently at William & Mary) and Xing Xie (Microsoft Research) participated as co-authors. The research results will be presented this April at ICLR 2026, the most prestigious academic conference in the field of artificial intelligence.

Paper Title: Harnessing Temporal Databases for Systematic Evaluation of Factual Time-Sensitive Question-Answering in Large Language Models
Paper Link: https://arxiv.org/abs/2508.02045

Meanwhile, this research was conducted with support from Microsoft Research, the National Research Foundation of Korea, and the Institute for Information & Communications Technology Planning & Evaluation (IITP) Global AI Frontier Lab projects (RS-2024-00469482, RS-2024-00509258).
