AI Tool Grades Messy Handwritten Math Equations

Abstract

Automatically assessing handwritten mathematical solutions is an important problem in educational technology with practical applications, but it remains a significant challenge due to the diverse formats, unstructured layouts, and symbolic complexity of student work. To address this challenge, we introduce VEHME, a Vision-Language Model for Evaluating Handwritten Mathematics Expressions, designed to assess open-form handwritten math responses with high accuracy and interpretable reasoning traces. VEHME integrates a two-phase training pipeline: (i) supervised fine-tuning using structured reasoning data, and (ii) reinforcement learning that aligns model outputs with multi-dimensional grading objectives, including correctness, reasoning depth, and error localization. To enhance spatial understanding, we propose an Expression-Aware Visual Prompting Module, trained on our synthesized dataset of multi-line math expressions to robustly guide attention across visually heterogeneous inputs. Evaluated on the AIHub and FERMAT datasets, VEHME achieves state-of-the-art performance among open-source models and approaches the accuracy of proprietary systems, demonstrating its potential as a scalable and accessible tool for automated math assessment. Our training and experiment code is publicly available at our GitHub repository.

A research team affiliated with UNIST has unveiled a novel AI system capable of grading and providing detailed feedback on even the most untidy handwritten math answers, much like a human instructor.

Led by Professor Taehwan Kim of the UNIST Graduate School of Artificial Intelligence and Professor Sungahn Ko of POSTECH, the team announced the development of VEHME (Vision-Language Model for Evaluating Handwritten Mathematics Expressions), an AI model designed specifically to evaluate complex handwritten mathematical expressions.

Grading open-ended math problems has traditionally been a labor-intensive and time-consuming process in classrooms. The challenge arises from the wide variety of answer formats, including equations, graphs, and diagrams, as well as differences in handwriting styles and in how students organize their answers. Developing an AI capable of accurately understanding and evaluating such unstructured, diverse content has remained a major obstacle.

VEHME approaches this problem by mimicking how a human grader works: carefully analyzing the position and meaning of each element within the problem and solution to identify mistakes.

Figure 1. A schematic overview of VEHME.

In tests covering a broad spectrum, from calculus to elementary arithmetic, VEHME achieved accuracy levels comparable to large proprietary models such as GPT-4o and Gemini 2.0 Flash, despite being a more lightweight and efficient model. Remarkably, VEHME even outperformed the commercial models in challenging cases involving heavily rotated or poorly written answers, demonstrating more reliable error detection. While models like GPT-4o and Gemini contain hundreds of billions of parameters, VEHME operates with just 7 billion, illustrating that high performance is attainable without enormous computational resources.

This advancement was made possible through a specialized visual prompting technique called the Expression-Aware Visual Prompting Module (EVPM), combined with a two-stage training process of supervised fine-tuning followed by reinforcement learning. EVPM enables VEHME to understand complex, multi-line expressions by virtually "boxing" them, ensuring the model maintains awareness of the problem's layout, as illustrated in the sketch below. The second training stage not only helps VEHME recognize correct answers but also equips it to explain where and why errors occur during problem-solving.
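To make the "boxing" idea concrete, the following is a minimal sketch assuming the trained localization module supplies one (x0, y0, x1, y1) box per expression line; the `draw_visual_prompts` helper and the sample coordinates are illustrative assumptions, not VEHME's actual interface.

```python
# A minimal sketch of expression-aware visual prompting: boxes are drawn
# directly onto the page image before it is passed to the vision-language
# model, so the model stays aware of the multi-line layout.
from PIL import Image, ImageDraw


def draw_visual_prompts(image, boxes):
    """Overlay a rectangle around each detected expression line.
    `boxes` is a list of (x0, y0, x1, y1) tuples (an assumed format)."""
    prompted = image.convert("RGB")  # returns a fresh copy to draw on
    draw = ImageDraw.Draw(prompted)
    for x0, y0, x1, y1 in boxes:
        draw.rectangle((x0, y0, x1, y1), outline=(255, 0, 0), width=3)
    return prompted


# Hypothetical usage: a blank page stands in for a scanned student answer,
# and the boxes stand in for the localization module's predictions.
page = Image.new("RGB", (800, 600), "white")
line_boxes = [(40, 50, 760, 110), (40, 130, 700, 190), (40, 210, 520, 270)]
draw_visual_prompts(page, line_boxes).save("prompted_answer.png")
```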

Given the scarcity of high-quality handwritten datasets with detailed error annotations, the team generated synthetic training data using a large language model (QwQ-32B), which significantly improved VEHME's learning and evaluation capabilities.
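As a rough illustration of how such synthetic annotations might be produced, the sketch below asks a chat-style LLM to inject a single plausible mistake into a correct solution and annotate it. The prompt wording, the JSON schema, and the `chat` callable (a stand-in for a QwQ-32B client) are hypothetical, not the team's actual pipeline.

```python
# A minimal sketch of synthetic error-annotation generation with an LLM.
import json

PROMPT_TEMPLATE = """You are a math teacher. Given the problem and its correct
solution below, write a student solution containing exactly one plausible
mistake, then identify the erroneous step and explain the error.

Problem: {problem}
Correct solution: {solution}

Respond as JSON with keys "student_solution", "error_step", "explanation".
"""


def make_training_example(problem, solution, chat):
    """Ask the LLM for a flawed solution plus a localized error annotation."""
    reply = chat(PROMPT_TEMPLATE.format(problem=problem, solution=solution))
    record = json.loads(reply)  # assumes the model returned valid JSON
    record["problem"] = problem
    return record
```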

Most importantly, VEHME is open-source and freely accessible, making it a practical tool for schools and researchers interested in adopting or enhancing the system.

Professor Taehwan Kim said, "Grading handwritten math answers is one of the most challenging tasks in educational AI, requiring a nuanced understanding of both images and language. VEHME's ability to follow complex solution steps and accurately identify mistakes marks an important step toward practical classroom applications." He added, "Our EVPM technology can automatically interpret complex visual information, which has potential uses beyond education-such as document processing, technical drawing analysis, and digital archiving of handwritten records."

This research was supported by the Ministry of Science and ICT (MSIT), the National Research Foundation of Korea (NRF), and the Institute for Information & Communications Technology Planning & Evaluation (IITP). The findings were accepted for presentation at the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP 2025), a leading international conference in natural language processing, held from November 5 to 9 in Suzhou, China.

Journal Reference

Thu Phuong Nguyen, Duc M. Nguyen, Hyotaek Jeon, et al., "VEHME: A Vision-Language Model for Evaluating Handwritten Mathematics Expressions," in Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP 2025), 2025.
