AI Boosts Self-Driving Cars' Safe City Navigation

Tsinghua University Press

This study presents KEPT, an AI system that helps self-driving cars predict their own short-term path more safely by combining video understanding with a memory of similar past scenes. Tested on the public nuScenes benchmark, KEPT cuts prediction errors and potential collisions compared with existing planning methods, while using a fast, lightweight retrieval module that is practical for real-time driving.

The team published their study in Communications in Transportation Research (https://doi.org/10.26599/COMMTR.2026.9640012).

Researchers from Tongji University and their international collaborators have developed a new AI system that helps self-driving cars "remember" similar past situations before choosing how to move next. The method, called KEPT (Knowledge-Enhanced Prediction of Trajectories), allows a vision-language model to predict the car's short-term path directly from front-view camera video while consulting a large library of previous real-world driving clips.

"Short-horizon trajectory prediction is where many autonomous driving systems still struggle, especially in complex, busy scenes," said first author Yujin Wang from the School of Automotive Studies at Tongji University. "Our idea was to let a vision-language model not only look at the current frames, but also recall how similar scenes have unfolded before, and then plan a safe, feasible motion based on that prior experience."

To make this possible, the team first designed a new video encoder that turns short clips of consecutive driving frames into compact vectors that capture both spatial layout and motion cues. The encoder, called a temporal frequency–spatial fusion (TFSF) module, combines a fast-Fourier-transform-based frequency attention block with multi-scale features from a Swin Transformer and a lightweight temporal transformer over seven frames sampled at 2 Hz. This design helps the model focus on subtle motion changes and fine-grained scene structure that matter for near-term planning.
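The frequency-attention idea can be illustrated with a toy sketch. The function below is not the paper's TFSF module (which additionally fuses multi-scale Swin features and a temporal transformer); it only shows the core operation of re-weighting each frame's 2-D spectrum, plus how "seven frames sampled at 2 Hz" maps onto a standard 30-fps camera stream. The function name and the grayscale, single-channel shapes are illustrative assumptions.

```python
import numpy as np

def frequency_attention(frames, gate):
    """Toy frequency attention: re-weight each frame's 2-D spectrum.

    frames: (T, H, W) grayscale clip; gate: (H, W//2 + 1) non-negative
    weights over the rFFT bins. In the real TFSF module the gate would be
    learned; here it is just an input.
    """
    spec = np.fft.rfft2(frames, axes=(1, 2))          # per-frame spectrum
    spec = spec * gate                                # emphasize chosen bands
    return np.fft.irfft2(spec, s=frames.shape[1:], axes=(1, 2))

# Sampling 7 frames at 2 Hz from a 30-fps stream: every 15th frame.
fps, hz, n_frames = 30, 2, 7
indices = [i * fps // hz for i in range(n_frames)]    # [0, 15, 30, ...]
```

With an all-ones gate the transform is an identity round trip, which makes the sketch easy to sanity-check; a learned gate would instead amplify motion-relevant frequency bands.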

The TFSF encoder is trained in a self-supervised way, without manual labels. It learns to bring visually and dynamically similar clips closer in its embedding space and push dissimilar clips apart, using a contrastive loss with a memory queue and hard-negative mining. This produces robust clip-level embeddings that can be used directly for retrieval.

On top of these embeddings, the researchers built a scalable retrieval pipeline. All clips in a large driving corpus are encoded and stored in a vector database. At run time, the current 3-second camera sequence is embedded and first routed to a nearby cluster by k-means, then matched to its closest neighbors using a hierarchical navigable small-world (HNSW) index. The system retrieves a handful of most similar scenes and their ground-truth trajectories, which act as strong priors for the planner while keeping retrieval latency low.
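The two-stage lookup can be sketched as follows. For clarity this toy version replaces the HNSW index with a brute-force scan inside the routed cluster; the function and variable names are illustrative, not the paper's API.

```python
import numpy as np

def route_and_retrieve(query, centroids, db, db_cluster, k=3):
    """Two-stage retrieval: route the query embedding to its nearest k-means
    centroid, then search only that cluster's clips.

    query: (d,) embedding of the current 3-second clip
    centroids: (C, d) k-means centroids; db: (N, d) stored clip embeddings
    db_cluster: (N,) cluster assignment of each stored clip
    Returns the indices of the k most similar clips in the routed cluster.
    """
    c = np.argmin(np.linalg.norm(centroids - query, axis=1))  # route
    members = np.where(db_cluster == c)[0]                    # candidate set
    sims = db[members] @ query                                # cosine-like score
    return members[np.argsort(-sims)[:k]]
```

Restricting the neighbor search to one cluster is what keeps retrieval latency low: the HNSW index then only has to cover a fraction of the corpus for each query.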

These retrieved trajectories are not used in a black-box way. Instead, KEPT injects them into a chain-of-thought style prompt for a vision-language model, alongside the current video frames and explicit safety and kinematic constraints. The model is guided to compare the new scene with the retrieved examples, reason about similarities and differences, and then output a 3-second ego trajectory that respects speed limits, smoothness, and collision avoidance requirements.
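The structure of such a prompt can be sketched as below. The exact wording KEPT uses is not reproduced in this release, so the template, parameter names, and default limits here are all assumptions; only the overall shape (retrieved exemplars, explicit constraints, a compare-and-reason instruction) follows the description above.

```python
def build_prompt(retrieved, v_max=15.0, horizon_s=3.0):
    """Assemble a chain-of-thought planning prompt from retrieved exemplars.

    retrieved: list of (description, waypoints) pairs for the most similar
    past scenes. v_max and horizon_s are hypothetical constraint values.
    """
    lines = ["You are planning the ego vehicle's trajectory for the next "
             f"{horizon_s:.0f} seconds from front-view camera frames."]
    for i, (desc, traj) in enumerate(retrieved, 1):
        lines.append(f"Exemplar {i} ({desc}): ground-truth waypoints {traj}")
    lines.append(f"Constraints: speed <= {v_max} m/s, smooth curvature, "
                 "no collisions.")
    lines.append("Compare the current scene with the exemplars, reason about "
                 "similarities and differences, then output the waypoints.")
    return "\n".join(lines)
```

Keeping the constraints explicit in the prompt, rather than implicit in training data alone, is what lets the planner treat the exemplars as priors instead of answers to copy.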

To make a general-purpose vision-language backbone suitable for this task, the team introduced a triple-stage fine-tuning scheme. In the first stage, the model is fine-tuned on visual question-answering tasks about object category, size, and distance to strengthen spatial grounding. In the second stage, it learns to regress future trajectories from surround-view images and basic kinematics, with losses that penalize unsafe curvature and abrupt maneuvers. In the final stage, it predicts the full trajectory from consecutive front-view frames alone, learning to align its language head with short-term temporal structure. All three stages use lightweight LoRA adapters to keep adaptation efficient.
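The LoRA mechanism that makes all three stages cheap can be sketched in a few lines: the base weight stays frozen, and only a low-rank update is learned. This is a generic numpy illustration of standard LoRA, not KEPT's implementation; rank, scaling, and initialization values are the usual defaults, assumed here.

```python
import numpy as np

class LoRALinear:
    """Minimal LoRA adapter: freeze the base weight W, learn a low-rank
    update (alpha / r) * B @ A. B starts at zero, so the adapted layer
    initially behaves exactly like the frozen base model."""

    def __init__(self, W, r=4, alpha=8, seed=0):
        rng = np.random.default_rng(seed)
        self.W = W                                    # frozen base weight
        self.A = rng.normal(0.0, 0.01, (r, W.shape[1]))
        self.B = np.zeros((W.shape[0], r))            # zero-init: no drift
        self.scale = alpha / r

    def __call__(self, x):
        return self.W @ x + self.scale * (self.B @ (self.A @ x))
```

Because only A and B (a tiny fraction of the backbone's parameters) receive gradients, each of the three fine-tuning stages can adapt the same frozen model without the cost of full fine-tuning.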

The researchers evaluated KEPT on the widely used nuScenes benchmark, comparing it against both traditional end-to-end planning baselines and more recent vision-language-based planners. Across standard open-loop metrics, KEPT achieved the best overall performance, reducing prediction error while maintaining competitive or lower collision indicators. Ablation experiments further showed that each component—self-supervised TFSF pre-training, the clustered retrieval stack, the triple-stage fine-tuning, and using multiple retrieved exemplars—contributes measurably to the final accuracy and safety profile.

"Vision-language models are powerful reasoners, but in driving they can easily hallucinate or ignore physical constraints if we just ask them to 'draw a path'," said corresponding author Prof. Bingzhao Gao. "By grounding the model in a bank of real trajectories and training it on metrics that directly reflect motion feasibility and collision risk, KEPT turns this reasoning ability into something much closer to an engineerable planning module."

Beyond benchmark scores, the work points to a broader design pattern for autonomous driving: instead of treating large models as end-to-end black boxes, they can be wrapped with retrieval, structured prompts, and physically meaningful objectives to provide more transparent, data-efficient, and safety-aware behavior. The authors note that KEPT currently focuses on short-horizon, open-loop evaluation on a single dataset and camera configuration, and that closed-loop testing, richer sensor inputs, and more diverse driving regions are key directions for future research.

The team envisions that similar knowledge-enhanced planners could eventually support not only automated vehicles, but also advanced driver-assistance systems that explain their recommendations to human drivers in everyday language. By combining retrieval, vision, and language, KEPT offers a concrete step toward autonomous systems that can both drive and justify how they drive.

About Communications in Transportation Research

Communications in Transportation Research was launched in 2021, with academic support provided by Tsinghua University and the China Intelligent Transportation Systems Association. The Editors-in-Chief are Professor Xiaobo Qu, a member of the Academia Europaea, from Tsinghua University, and Professor Xiaopeng (Shaw) Li from the University of Wisconsin–Madison. The journal mainly publishes high-quality, original research and review articles of significant importance to emerging transportation systems. It aims to serve as an international platform for showcasing and exchanging innovative achievements in transportation and related fields, fostering academic exchange and development between China and the global community.

It has been indexed in SCIE, SSCI, Ei Compendex, Scopus, CSTPCD, CSCD, OAJ, DOAJ, TRID, and other databases, and was selected as a Q1 Top Journal in the Engineering and Technology category of the Chinese Academy of Sciences (CAS) Journal Ranking List. In 2022, it was selected as a High-Starting-Point new journal project of the "China Science and Technology Journal Excellence Action Plan". In 2024, it was selected for the "High-Level International Scientific and Technological Journals" development support project, and was also chosen as an English Journal Tier Project of the "China Science and Technology Journal Excellence Action Plan Phase II". That year, it received its first impact factor (2023 IF) of 12.5, ranking first (1/58, Q1) among all journals in the "Transportation" category. In 2025, its 2024 IF was announced as 14.5, maintaining the top position (1/62, Q1) in the same category.

From Volume 6 (2026), Communications in Transportation Research will be published by Tsinghua University Press on the SciOpen platform, with the official journal website at https://www.sciopen.com/journal/2097-5023. We kindly request that all new manuscript submissions be made through the journal's submission system at https://mc03.manuscriptcentral.com/commtr.

/Public Release. This material from the originating organization/author(s) may be of a point-in-time nature and has been edited for clarity, style, and length. Mirage.News does not take institutional positions or sides, and all views, positions, and conclusions expressed herein are solely those of the author(s).