Abstract
A research team affiliated with UNIST has unveiled a novel machine learning approach that allows artificial intelligence (AI) systems to facilitate learning across different data types while training on only one modality. This breakthrough eliminates the need for extensive data alignment and matching, which are typically required in multimodal learning, thereby reducing the costs associated with dataset construction.
Led by Professor Sung Whan Yoon from the Graduate School of Artificial Intelligence at UNIST, the team announced the successful development of an AI multimodal learning technique capable of promoting model training across various data modalities without relying on paired datasets.
Multimodal learning involves the integrated understanding and processing of diverse data types such as audio, images, and text. Traditionally, effective multimodal learning depends heavily on aligning and matching data across these modalities, a process that demands significant time and resources. Performance often deteriorates when clearly paired data is scarce.
The proposed approach enables multimodal learning even with unpaired data. This innovation can significantly reduce costs and time in developing AI systems, such as voice assistants that interpret emotions by analyzing speech and facial expressions, or medical AI systems that combine CT images with patient records for diagnostics.
The team conducted experiments demonstrating that text models can assist image model training and that audio models can enhance language model performance. In these experiments, the approach achieved higher accuracy than existing methods, confirming the effectiveness of cross-modality learning. Notably, even combinations with weak inherent correlations, such as audio and images, showed significant performance improvements.
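The article does not describe the team's actual training procedure, so the following is only a minimal, hypothetical sketch of the general idea it reports: letting a frozen model from one modality (a stand-in "text" encoder here) influence training of a model for another modality (a small image classifier) without any paired image-text examples. The class and function names, the moment-matching loss, and the random stand-in data are all illustrative assumptions, not the method from the ICLR'25 paper.

```python
# Illustrative sketch only -- NOT the method from the ICLR'25 paper.
# A frozen "text" encoder shapes the feature distribution of an image
# classifier via batch-level moment matching, so no image-text pairing
# is needed.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Stand-in "pretrained text encoder": frozen, maps 300-dim text features
# to a shared 64-dim space. In practice this would be a real language model.
text_encoder = nn.Sequential(nn.Linear(300, 128), nn.ReLU(), nn.Linear(128, 64))
for p in text_encoder.parameters():
    p.requires_grad_(False)

# Trainable image model: a tiny CNN with a 64-dim feature head and a classifier.
class ImageModel(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(4),
            nn.Flatten(), nn.Linear(16 * 4 * 4, 64),
        )
        self.head = nn.Linear(64, num_classes)

    def forward(self, x):
        feats = self.backbone(x)
        return feats, self.head(feats)

model = ImageModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

def moment_matching_loss(a, b):
    """Penalize mismatch between the mean/variance of two unpaired feature batches."""
    return F.mse_loss(a.mean(0), b.mean(0)) + F.mse_loss(a.var(0), b.var(0))

for step in range(100):
    # Unpaired batches: random stand-ins for labeled images and unrelated text.
    images = torch.randn(32, 3, 32, 32)
    labels = torch.randint(0, 10, (32,))
    texts = torch.randn(32, 300)  # no correspondence with `images`

    img_feats, logits = model(images)
    with torch.no_grad():
        txt_feats = text_encoder(texts)

    # Supervised loss on images plus a distribution-level cross-modal term.
    loss = F.cross_entropy(logits, labels) + 0.1 * moment_matching_loss(img_feats, txt_feats)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

In this toy setup the only coupling between modalities is a batch-level statistic, which is why no example-by-example alignment is required; the published paper should be consulted for the actual mechanism.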
Jae-Jun Lee, the first author of the study, explained, "The performance gains observed in seemingly unrelated pairings like audio and images challenge conventional assumptions of multimodal learning, representing an exciting breakthrough."
Professor Yoon stated, "This approach holds great potential for application in fields where obtaining aligned datasets is challenging, such as healthcare, autonomous driving, and smart AI assistants."
The findings were accepted to the 13th International Conference on Learning Representations (ICLR 2025), one of the top three conferences in AI and deep learning. The conference took place in Singapore from April 24 to 28 and attracted researchers from over 70 countries this year. Out of 11,672 submissions, only 3,646 papers (31.24%) were accepted.
Journal Reference
Jae-Jun Lee and Sung Whan Yoon, "Can One Modality Model Synergize Training of Other Modality Models?," in Proceedings of the 13th International Conference on Learning Representations (ICLR), 2025.