
Scientists at EPFL's VITA Lab are teaching AI to correct its mistakes in video production. © 2026 VITA Lab
A team of EPFL researchers has taken a major step towards resolving the problem of drift in generative video, the phenomenon that causes AI-generated sequences to become incoherent after a handful of seconds. Their breakthrough paves the way for AI videos with no time constraints.
Today, anyone can create realistic images in just a few clicks with the help of AI. Generating videos, however, is a much more complicated task. Existing AI models can only produce videos that remain coherent for less than 30 seconds before degrading into randomness, with incoherent shapes, colors and logic. The problem is called drift, and computer scientists have been working on it for years. At EPFL, researchers at the Visual Intelligence for Transportation (VITA) laboratory have taken a novel approach - working with the errors instead of circumventing or ignoring them - and developed a video-generation method that essentially eliminates drift. Their method is based on recycling errors back into the AI model so that it learns from its own mistakes.
Teaching machines to mess up
Drift causes videos to become increasingly unrealistic as they progress. It occurs because generative video programs typically use the image they just created as the starting point for the next one. That means any errors in that image - a blurred face or a slightly deformed object, for example - are carried into the next image and keep compounding as the sequence continues. "The problem is that models are trained only on perfect data sets, but when used in real-world situations, they need to know how to handle input containing their own errors," says Prof. Alexandre Alahi, head of the VITA Lab.
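To make that mechanism concrete, here is a minimal, purely illustrative Python sketch (not the lab's code, and with made-up numbers) of how a small per-frame error compounds when each frame is generated from the previous one:

```python
# Illustration only: why frame-by-frame video generation drifts.
# Each frame is generated from the previous one, so a small per-frame error
# accumulates over the sequence instead of averaging out.

def simulate_drift(num_frames: int, error_per_frame: float, growth: float = 1.1) -> list[float]:
    """Track a toy 'accumulated error' as frames are generated one after another."""
    errors = [0.0]
    for _ in range(num_frames - 1):
        # The new frame inherits the previous frame's error and adds its own,
        # slightly amplified because it was conditioned on an imperfect input.
        errors.append(errors[-1] * growth + error_per_frame)
    return errors

if __name__ == "__main__":
    trajectory = simulate_drift(num_frames=240, error_per_frame=0.01)  # roughly 10 s at 24 fps
    print(f"error after ~1 s: {trajectory[24]:.2f}, after ~10 s: {trajectory[-1]:.2f}")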
The new method invented at EPFL, called retraining by error recycling, has proved successful at eliminating drift. The researchers start by having a model generate a video, then measure the errors in that video - that is, the difference between the images produced and the images that should have been produced - according to various metrics. These errors are stored in memory. The next time the model is trained, they are intentionally fed back into the system so that the model is forced to operate under real-world conditions. As a result, the model gradually learns how to get back on track after seeing imperfect data, returning to images that are clear and that follow a logical sequence for humans - even if the starting image was deformed. Trained in this way, the model becomes more robust and learns to stabilize videos after flawed images are produced. "Unlike humans, generative AI rarely knows how to recover from its mistakes, which leads to drift," says Wuyang Li, a postdoctoral researcher at the laboratory. "So we teach the models how to do this and how to remain stable despite imperfection."
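The published paper is the authoritative reference; as a rough sketch only, the following Python code shows the general shape of such an error-recycling training loop. The toy next-frame predictor, the simple error buffer and all names here are hypothetical stand-ins for the real model and metrics:

```python
# A minimal sketch of "error recycling" as described above, not the authors' SVI code.
# ToyFramePredictor and train_with_error_recycling are hypothetical stand-ins.
import torch
import torch.nn as nn

class ToyFramePredictor(nn.Module):
    """Hypothetical next-frame predictor: maps the previous frame to the next one."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(), nn.Linear(128, dim))

    def forward(self, prev_frame: torch.Tensor) -> torch.Tensor:
        return self.net(prev_frame)

def train_with_error_recycling(model: nn.Module, frames: torch.Tensor, epochs: int = 3, lr: float = 1e-3) -> None:
    """frames: (T, dim) tensor of ground-truth frames from one clip."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    error_bank: list[torch.Tensor] = []  # stored generation errors, replayed during training

    for _ in range(epochs):
        for t in range(frames.shape[0] - 1):
            clean_input = frames[t]
            # Recycle a previously measured error: train on the kind of imperfect
            # input the model will actually see at generation time.
            if error_bank:
                idx = torch.randint(len(error_bank), (1,)).item()
                model_input = clean_input + error_bank[idx]
            else:
                model_input = clean_input

            pred = model(model_input)
            loss = nn.functional.mse_loss(pred, frames[t + 1])
            opt.zero_grad()
            loss.backward()
            opt.step()

            # Measure the model's own error on this step and store it for reuse.
            with torch.no_grad():
                error_bank.append((model(clean_input) - frames[t + 1]).detach())

if __name__ == "__main__":
    clip = torch.randn(16, 64)  # stand-in for a 16-frame clip
    train_with_error_recycling(ToyFramePredictor(), clip)
```

The key design point the article describes is visible in the loop: the model is no longer trained only on perfect inputs, but on inputs perturbed by the very errors it produced earlier.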
"Our method involves making adjustments that don't require a lot of processing power or huge data sets and that make the output of AI programs more stable," says Alahi. "It's a little like training a pilot in turbulent weather rather than in a clear blue sky." The method has been integrated into a system called Stable Video Infinity (SVI), which can generate quality videos lasting several minutes or longer.
SVI, available in open source, has been tested by comparing numerous videos it produced with the same sequences generated by another AI system. It will be presented at the 2026 International Conference on Learning Representations (ICLR 2026) in April. Experts from various fields, including audiovisual production, animation and video games, have taken an interest in the technology. "We have concrete figures that attest to the effectiveness of our AI system," says Li. "Our work was featured by one of the largest YouTubers in the AI community and received more than 150,000 views and 6,000 upvotes within a few weeks. In addition, our open-source repository has gained more than 1,900 stars on GitHub, a code hosting site, demonstrating its impact within the community." The new method will also help VITA Lab researchers engineer autonomous systems that are safer, more effective and able to interact seamlessly with humans.
Multimodal AI combining video, images and sound
The experts at VITA Lab have also used their error recycling approach to develop another method, called LayerSync, which they will also present at ICLR. With this method, the AI model recycles not only its visible errors but also its internal logic. "Some parts of the model are better able to understand the meaning behind the images," says Alahi. "LayerSync enables these more 'expert' parts to guide the other parts when the model is being trained, as if the model were correcting itself from within. As a result, the model learns faster because it uses its own signals to supervise the process, with no additional data or external models required. That generates better-quality content, whether for video, images or sound."
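As a loose illustration of that idea (not the authors' LayerSync implementation), the sketch below adds a hypothetical auxiliary loss that pulls a shallow layer's features toward a deeper, more "expert" layer's features during training, so the extra supervision signal comes from within the model itself rather than from additional data:

```python
# A minimal sketch of internal self-supervision in the spirit of LayerSync,
# not the authors' code. ToyBackbone, training_step and sync_weight are hypothetical.
import torch
import torch.nn as nn

class ToyBackbone(nn.Module):
    def __init__(self, dim: int = 64):
        super().__init__()
        self.shallow = nn.Linear(dim, dim)
        self.deep = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.head = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor):
        h_shallow = self.shallow(x)
        h_deep = self.deep(h_shallow)
        return self.head(h_deep), h_shallow, h_deep

def training_step(model: ToyBackbone, batch: torch.Tensor, target: torch.Tensor,
                  opt: torch.optim.Optimizer, sync_weight: float = 0.1) -> float:
    out, h_shallow, h_deep = model(batch)
    task_loss = nn.functional.mse_loss(out, target)
    # Internal guidance: pull the shallow features toward the deeper layer's
    # (detached) features, so the model "corrects itself from within"
    # without extra data or an external teacher model.
    sync_loss = nn.functional.mse_loss(h_shallow, h_deep.detach())
    loss = task_loss + sync_weight * sync_loss
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

if __name__ == "__main__":
    model = ToyBackbone()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    x, y = torch.randn(8, 64), torch.randn(8, 64)
    print(training_step(model, x, y, opt))
```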