
< Professor Junmo Kim and Ph.D. candidate Minchan Kwon, School of Electrical Engineering >
No matter how much data they are trained on, why do Artificial Intelligence (AI) models so often miss the mark on human intent? Conventional "comparison learning," designed to help AI understand human preferences, has frequently produced confusion rather than clarity. A KAIST research team has now presented a new learning method that allows AI to accurately learn human preferences even from limited data by assigning it a "private tutor."
On December 17th, a research team led by Professor Junmo Kim of the KAIST School of Electrical Engineering announced the development of "TVKD" (Teacher Value-based Knowledge Distillation), a reinforcement learning framework that substantially improves data efficiency and training stability while effectively reflecting human preferences.
Existing AI training methods typically rely on collecting massive amounts of "preference comparison" data—simple pairwise judgments such as "A is better than B." This approach requires vast datasets and often confuses the AI in ambiguous situations where the distinction between responses is unclear.
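The pairwise "A is better than B" setup described above is commonly trained with a Bradley-Terry style loss. The sketch below is illustrative only (function names and scores are assumptions, not the paper's implementation); it shows why ambiguous pairs carry almost no learning signal: when the two responses score nearly the same, the loss sits near log 2 regardless of which one was labeled better.

```python
import math

def bradley_terry_loss(score_chosen, score_rejected):
    """Negative log-likelihood that the chosen response beats the rejected
    one under the Bradley-Terry model: P(A > B) = sigmoid(s_A - s_B).
    (Illustrative sketch, not the paper's code.)"""
    margin = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A clearly separated pair gives a small, informative loss...
clear_loss = bradley_terry_loss(2.0, -2.0)
# ...while an ambiguous pair hovers near log 2 ~ 0.693, telling the
# model almost nothing about which response is genuinely better.
ambiguous_loss = bradley_terry_loss(0.1, 0.0)
```

Scaling this up therefore demands very large datasets, since each ambiguous comparison contributes little usable gradient.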
To solve this problem, the research team proposed a method in which a 'teacher model' that has first learned human preferences in depth distills only the essential information to a 'student model.' The approach is analogous to a private tutor who organizes complex material before teaching it, and the team named it 'Preference Distillation.'
The key feature of this technique is that, instead of having the student simply imitate labels of 'good' or 'bad,' the teacher model learns a 'value function' that numerically scores how valuable each situation is and passes these scores to the student model. This lets the AI make a comprehensive judgment about why one choice is better than another, even in ambiguous situations, rather than relying on fragmentary comparisons.
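One standard way to inject a value function into training, consistent with the "Shaping term" shown in the paper's visualizations, is potential-based reward shaping: the teacher's value estimates act as a potential added to the per-step reward. The sketch below is a minimal, assumed illustration of that general mechanism (names and the exact formulation are not taken from the paper):

```python
def shaped_rewards(base_rewards, teacher_values, gamma=1.0):
    """Potential-based reward shaping with the teacher's value estimates
    as the potential:  r'_t = r_t + gamma * V(s_{t+1}) - V(s_t).
    `teacher_values` holds one value per step plus a terminal value.
    (Illustrative sketch; the paper's exact shaping term may differ.)"""
    shaped = []
    for t, r in enumerate(base_rewards):
        shaped.append(r + gamma * teacher_values[t + 1] - teacher_values[t])
    return shaped

# Sparse base reward (only the final step is scored) becomes a dense,
# step-by-step signal once the teacher's value differences are added.
base = [0.0, 0.0, 1.0]
values = [0.5, 0.8, 0.9, 0.0]  # terminal potential set to 0
dense = shaped_rewards(base, values)
```

A known property of this scheme is that with a terminal potential of zero, the shaped return differs from the original return only by a constant (the initial value), so the shaping redistributes credit across steps without changing which behavior is optimal.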

< Conceptual diagram of TVKD: the teacher model is first trained on the human preference dataset, after which the student model is trained using both the teacher's information and the dataset >
The technology has two core components. First, value judgments that take the entire context into account are distilled into the student model, enabling it to understand the overall flow of a response rather than fragmentary answers. Second, a technique was introduced that adjusts each sample's learning weight according to the reliability of its preference label: clear data is reflected strongly in training, while the influence of ambiguous or noisy data is reduced, allowing the AI to learn stably even in realistic environments.
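The second component, reliability-based weighting, can be sketched as scaling each pair's loss by a confidence weight. In the assumed sketch below, confidence is derived from how decisively the teacher separates the pair (all names and the weighting form are illustrative; the paper's actual scheme may differ):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def weighted_pair_loss(student_margin, teacher_margin, beta=1.0):
    """Scale the pairwise NLL by a confidence weight in [0, 1) derived
    from the teacher's margin: pairs the teacher separates decisively
    keep (almost) full weight, while ambiguous or noisy pairs are
    down-weighted toward zero. (Illustrative; not the paper's exact
    weighting scheme.)"""
    confidence = math.tanh(beta * abs(teacher_margin))
    nll = -math.log(sigmoid(student_margin))
    return confidence * nll
```

The effect is that a mislabeled or genuinely ambiguous comparison, which the teacher also finds hard to separate, contributes little to the gradient, which is one plausible route to the training stability the article describes.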
When the research team applied the technique to various AI models, it achieved more accurate and stable performance than methods previously regarded as state of the art, consistently outperforming them on major benchmarks such as MT-Bench and AlpacaEval.
Professor Junmo Kim said, "In reality, human preference data is not always sufficient or perfect," adding, "This technology allows AI to learn consistently even under such constraints, making it highly practical across a wide range of fields."

< Performance comparison results for each MT-Bench task. The proposed TVKD framework records consistently higher scores than existing methods. >

< Visualization results of the shaping term. The top tokens (rendered as words) that the teacher model judged most important within the response are shown in red, intuitively indicating which tokens exert the greatest influence during value-based alignment. >
Ph.D. candidate Minchan Kwon of the KAIST School of Electrical Engineering participated as first author, and the results were accepted at NeurIPS 2025, one of the most prestigious international conferences in artificial intelligence. The work was presented in a poster session on December 3, 2025 (US Pacific Time).
※ Paper Title: Preference Distillation via Value based Reinforcement Learning, DOI: https://doi.org/10.48550/arXiv.2509.16965
Meanwhile, this research was supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP), funded by the Korean government (Ministry of Science and ICT) in 2024 (No. RS-2024-00439020, Development of Sustainable Real-time Multimodal Interactive Generative AI, SW Star Lab).