AI Framework Enhances Self-Driving with Human Reasoning

Tsinghua University Press

Autonomous driving has advanced rapidly, transitioning from rule-based systems to deep neural networks. Yet end-to-end models still face major limitations: they often lack world knowledge, struggle in rare or ambiguous scenarios, and offer little insight into their decision-making. Large language models (LLMs), by contrast, excel at reasoning, contextual understanding, and interpreting complex instructions. However, LLM outputs are linguistic rather than executable, making them difficult to integrate with real vehicle control. Closing this gap calls for frameworks that combine multi-modal perception with structured, actionable decision outputs grounded in established driving logic, and for research that aligns multi-modal reasoning with the planners used in autonomous vehicles.

A research team from Shanghai Jiao Tong University, Shanghai AI Laboratory, Tsinghua University, and collaborating institutions has developed DriveMLM, a multi-modal large language model framework for closed-loop autonomous driving. The findings were published (DOI: 10.1007/s44267-025-00095-w) on 26 November 2025 in Visual Intelligence. DriveMLM integrates multi-view camera images, LiDAR point clouds, system messages, and user instructions to produce aligned behavioral planning states. These states plug directly into existing motion-planning modules, enabling real-time driving control while generating natural-language explanations of each decision.

DriveMLM tackles a core challenge in LLM-based driving: converting linguistic reasoning into reliable control behavior. The framework aligns LLM outputs with the behavioral planning states used in modular systems such as Apollo, covering both speed decisions (KEEP, ACCELERATE, DECELERATE, STOP) and path decisions (FOLLOW, LEFT_CHANGE, RIGHT_CHANGE, and others).
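For a concrete picture of this alignment, the minimal sketch below (not the authors' released code) represents the categorical decision states as Python enums and parses a structured LLM response into planner-readable values; the response format and the function name are assumptions made for illustration only.

```python
# Minimal illustrative sketch of decision-state alignment (not DriveMLM's code).
from enum import Enum, auto


class SpeedDecision(Enum):
    KEEP = auto()
    ACCELERATE = auto()
    DECELERATE = auto()
    STOP = auto()


class PathDecision(Enum):
    # The paper lists further path states; only those named in the text appear here.
    FOLLOW = auto()
    LEFT_CHANGE = auto()
    RIGHT_CHANGE = auto()


def parse_decision(llm_output: str) -> tuple[SpeedDecision, PathDecision]:
    """Map a structured LLM response such as 'DECELERATE, FOLLOW' onto
    planner-readable states (this textual format is an assumption)."""
    speed_str, path_str = (tok.strip() for tok in llm_output.split(","))
    return SpeedDecision[speed_str], PathDecision[path_str]


if __name__ == "__main__":
    speed, path = parse_decision("DECELERATE, FOLLOW")
    print(speed.name, path.name)  # DECELERATE FOLLOW
```

Keeping the output categorical in this way is what allows the language model's decision to feed an existing motion planner without rewriting the control stack.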

A specialized multi-modal tokenizer processes multi-view temporal images, LiDAR data, traffic rules, and user instructions into unified token embeddings. A multi-modal LLM then predicts the appropriate decision state and produces an accompanying explanation, ensuring interpretability.
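The following sketch outlines, purely as an assumption-laden illustration, how such a tokenizer-plus-LLM interface could look in code; the class names, tensor shapes, and stubbed outputs are placeholders rather than the actual DriveMLM implementation.

```python
# Illustrative sketch only: class names, tensor shapes, and the stubbed
# outputs below are assumptions, not the DriveMLM implementation.
from dataclasses import dataclass

import torch
from torch import nn


@dataclass
class DrivingDecision:
    speed_state: str   # e.g. "DECELERATE"
    path_state: str    # e.g. "FOLLOW"
    explanation: str   # natural-language rationale for the decision


class MultiModalTokenizer(nn.Module):
    """Stands in for the module that fuses multi-view temporal images,
    LiDAR points, system messages, and user instructions into one
    sequence of token embeddings."""

    def forward(self, images, lidar_points, system_msg, user_instruction):
        # Placeholder: a real tokenizer would run modality-specific encoders
        # and project their features into the LLM embedding space.
        batch, seq_len, dim = 1, 256, 4096
        return torch.zeros(batch, seq_len, dim)


class DriveDecisionModel(nn.Module):
    """Wraps a multi-modal LLM that emits an aligned decision state
    together with a textual explanation."""

    def __init__(self, tokenizer: MultiModalTokenizer, llm: nn.Module):
        super().__init__()
        self.tokenizer = tokenizer
        self.llm = llm

    def forward(self, images, lidar_points, system_msg, user_instruction) -> DrivingDecision:
        tokens = self.tokenizer(images, lidar_points, system_msg, user_instruction)
        # In practice the LLM would autoregressively generate a structured
        # response conditioned on the tokens; a fixed decision is returned
        # here purely for illustration.
        _ = tokens, self.llm
        return DrivingDecision("DECELERATE", "FOLLOW",
                               "A pedestrian is crossing ahead; slowing keeps a safe gap.")


# Hypothetical usage with a trivial stand-in for the LLM backbone:
model = DriveDecisionModel(MultiModalTokenizer(), nn.Identity())
decision = model(images=None, lidar_points=None,
                 system_msg="obey traffic rules", user_instruction="overtake if safe")
print(decision.speed_state, decision.path_state)
```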

To support training, the team created a large-scale data engine that generated 280 hours of driving data across eight CARLA maps and 30 challenging scenarios, including rare safety-critical events. The pipeline automatically labels speed and path decisions and uses human refinements and GPT-based augmentation to produce rich explanatory annotations.
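As a hypothetical example of what one such annotated sample could contain (the field names and values below are illustrative, not the released data schema):

```python
# Hypothetical, illustrative training record; not the published data format.
sample_annotation = {
    "scenario": "pedestrian_emerging_from_occlusion",  # one of the challenging CARLA scenarios
    "town": "Town05",
    "sensors": {
        "cameras": ["front", "front_left", "front_right", "back"],
        "lidar": "roof_mounted",
    },
    "speed_decision": "DECELERATE",
    "path_decision": "FOLLOW",
    "explanation": (
        "A pedestrian is stepping out from behind a parked van; "
        "slowing down keeps a safe stopping distance."
    ),
}
```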

In closed-loop evaluation on the CARLA Town05 Long benchmark, DriveMLM achieved a Driving Score of 76.1, outperforming the Apollo baseline by 4.7 points, and recorded the highest miles per intervention (0.96) among all compared systems. DriveMLM also demonstrated strong open-loop decision accuracy, improved explanation quality, and robust performance under natural-language guidance—such as yielding to emergency vehicles or interpreting user commands like "overtake" under varying traffic conditions.

"Our study shows that LLMs, once aligned with structured decision states, can serve as powerful behavioral planners for autonomous vehicles," the research team noted. "DriveMLM goes beyond rule-following. It understands complex scenes, reasons about motion, and explains its decisions in natural language—capabilities essential for safety and public trust. By combining perception, planning, and human instruction within a unified framework, DriveMLM offers a promising direction for next-generation autonomous driving systems."

DriveMLM demonstrates how multi-modal LLMs can enhance transparency, flexibility, and safety in autonomous driving. Its plug-and-play design allows seamless integration into established systems such as Apollo or Autopilot, enabling improved decision-making without major architectural changes. The ability to interpret natural-language instructions expands possibilities for interactive driving assistance and personalized in-vehicle AI copilots. More broadly, DriveMLM highlights a path toward reasoning-driven autonomous systems capable of understanding complex environments, anticipating risks, and justifying their actions—key capabilities for deploying trustworthy AI in real transportation networks.

Funding information

The work is supported by the National Key R&D Program of China (No. 2022ZD0161300) and the National Natural Science Foundation of China (Nos. U24A20325, 62321005 and 62376134).


About Visual Intelligence

Visual Intelligence is an international, peer-reviewed, open-access journal devoted to the theory and practice of visual intelligence. This journal is the official publication of the China Society of Image and Graphics (CSIG), with Article Processing Charges fully covered by the Society. It focuses on the foundations of visual computing, the methodologies employed in the field, and the applications of visual intelligence, while particularly encouraging submissions that address rapidly advancing areas of visual intelligence research.

About the Authors

Dr. Jifeng Dai is an Associate Professor in the Department of Electronic Engineering at Tsinghua University. His current research focuses on learning intelligent models from visual data to understand the complex world. Previously, he was an Executive Research Director at SenseTime Research, headed by Professor Xiaogang Wang, from 2019 to 2022, and a Principal Research Manager in the Visual Computing Group at Microsoft Research Asia (MSRA), headed by Dr. Jian Sun and Dr. Baining Guo, from 2014 to 2019.

Dr. Wenhai Wang is a Postdoctoral Researcher at The Chinese University of Hong Kong. His research interests include computer vision, machine learning and large language models (LLMs) toward artificial general intelligence (AGI).
