
<(From left) Hyun-Bin Oh, Yuhta Takida, Toshimitsu Uesaka, Tae-Hyun Oh, Yuki Mitsufuji>
When people watch the scene in the film Jurassic Park where a giant dinosaur walks toward them, they naturally imagine a heavy, rumbling sound, as if the ground were shaking. This is because humans predict sound by considering not only the shape of an object, but also physical properties such as its size, weight, and speed of movement. However, existing video-to-audio generation AI produces sound mainly from the category of objects or scene information in the video, and has not adequately reflected physical characteristics that vary with an object's weight or speed.
KAIST (President Kwang Hyung Lee) announced on May 26 that a joint research team including Professor Tae-Hyun Oh of the KAIST School of Computing and researchers from POSTECH (President Sung Keun Kim) and Sony AI has developed "PAVAS (Physics-Aware Video-to-Audio Synthesis)," an artificial intelligence (AI) technology that understands the physical situation in a video and generates more realistic sound.

The key feature of this technology is that it is designed so that AI can infer invisible physical information such as the mass and velocity of objects in a video on its own. Ordinary videos do not provide exact numerical values for an object's weight or speed, but the research team enabled AI to estimate them by analyzing the surrounding environment and movement context, and to reflect the results in the sound generation process.
In other words, the AI was designed to go beyond simply recognizing "what is visible" and to understand the physical cause of "why this sound should occur."
In technical validation, the research team's AI generated sounds closely resembling those of real-world environments in scenes involving physical interactions such as collisions or impacts between objects. In particular, it produced more realistic audio in which loudness and tone changed naturally as the mass and velocity of objects varied.
Recently, generative AI technologies that simultaneously generate video and audio have been advancing rapidly. Representative examples include Google's "Veo 3" and ByteDance's "Seedance 2.0." However, in actual film, advertising, and game production, there is far greater demand for post-production work that adds sound effects suited to existing video scenes or supplements their audio than for generating entirely new videos.
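As a rough illustration of why mass and velocity matter for loudness, the toy Python sketch below compares two impacts by their kinetic energy. It is not the PAVAS model and makes no claim about how PAVAS works. It assumes, purely for illustration, that a fixed fraction of an object's kinetic energy is radiated as sound, so that the level difference between two impacts follows the energy ratio in decibels. The function names and the reference values are hypothetical.

```python
import math


def kinetic_energy(mass_kg: float, speed_m_s: float) -> float:
    """Kinetic energy in joules: E = 1/2 * m * v^2."""
    return 0.5 * mass_kg * speed_m_s ** 2


def relative_level_db(mass_kg: float, speed_m_s: float,
                      ref_mass_kg: float = 1.0,
                      ref_speed_m_s: float = 1.0) -> float:
    """Level of an impact relative to a reference impact, in decibels.

    Toy assumption (not from the PAVAS paper): radiated sound energy is
    a fixed fraction of kinetic energy, so the level difference is
    10 * log10(E / E_ref).
    """
    if min(mass_kg, speed_m_s, ref_mass_kg, ref_speed_m_s) <= 0:
        raise ValueError("mass and speed must be positive")
    energy = kinetic_energy(mass_kg, speed_m_s)
    ref_energy = kinetic_energy(ref_mass_kg, ref_speed_m_s)
    return 10.0 * math.log10(energy / ref_energy)


if __name__ == "__main__":
    # Compare a reference impact with a heavier one and a faster one.
    for mass, speed in [(1.0, 1.0), (10.0, 1.0), (1.0, 2.0)]:
        level = relative_level_db(mass, speed)
        print(f"m={mass} kg, v={speed} m/s -> {level:+.1f} dB")
```

Under this simplified assumption, a tenfold increase in mass raises the level by 10 dB, while doubling the speed raises it by about 6 dB, because energy grows with the square of velocity. Real impact sounds also depend on material, geometry, and contact duration, which this sketch ignores.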
While existing commercial AI models have focused on generating video and audio together, PAVAS is differentiated by its ability to analyze the movement and collision characteristics of objects in a video and generate realistic sound effects that precisely match the scene.

The research team explained that this technology presents new possibilities in the field of "Physical AI," or physically consistent generative AI. Physically consistent generative AI refers to AI that goes beyond simply producing plausible results and understands the laws of physics and causal relationships in the real world.
In the future, this technology is expected to provide more immersive user experiences in a wide range of fields, including the automation of content sound production, augmented reality (AR) and virtual reality (VR) content, the metaverse, and robotics simulation.
Professor Tae-Hyun Oh stated, "While existing generative AI has developed by increasing the scale of data and models, this research is meaningful in that it was designed so that AI directly understands physical quantities and causal relationships," adding, "In the future, it can be expanded into a core foundational technology for next-generation multimodal AI that simultaneously understands and processes diverse types of information, including text, video, and speech."
This study was led by Hyun-Bin Oh, an integrated M.S.-Ph.D. student at POSTECH, as first author, with KAIST Professor Tae-Hyun Oh and Sony AI researchers Yuhta Takida, Toshimitsu Uesaka, and Yuki Mitsufuji participating as co-authors. In recognition of its excellence, the paper was selected for an Oral presentation at CVPR 2026 (Computer Vision and Pattern Recognition 2026), the world's most prestigious academic conference in the field of computer vision (image-based artificial intelligence technology), where only the top 0.88% of all papers are selected for oral presentation. The presentation is scheduled for June 6.
※ Paper title: "PAVAS: Physics-Aware Video-to-Audio Synthesis," arXiv: https://arxiv.org/abs/2512.08282
This research was supported by the Mid-Career Research Program under the Basic Research Program of the Ministry of Science and ICT, the Pioneer Research Program for Future Converging Technology of the Ministry of Science, ICT and Future Planning, the AGI Program of the Ministry of Science and ICT, and the KAIST InnoCORE Program.