AI Learning Boosted by Unsupervised Multimodal Data

Abstract

Learning with multiple modalities has recently demonstrated significant gains in many domains by maximizing the shared information across modalities. However, current approaches rely strongly on high-quality paired datasets, which allow co-training on paired labels from different modalities. In this context, we raise a pivotal question: Can a model with one modality synergize the training of other models with different modalities, even without paired multimodal labels? Our answer is 'Yes'. As a figurative description, we argue that a writer, i.e., a language model, can promote the training of a painter, i.e., a visual model, even without paired ground truth of text and image. We theoretically argue that a superior representation can be achieved by the synergy between two different modalities without paired supervision. As proofs of concept, we broadly confirm the considerable performance gains from the synergy among visual, language, and audio models. From a theoretical viewpoint, we first establish a mathematical foundation of the synergy between two models of different modalities, where each one is trained on its own modality. From a practical viewpoint, our work aims to broaden the scope of multimodal learning to encompass the synergistic use of single-modality models, relieving the strong limitation of paired supervision …

A research team affiliated with UNIST has unveiled a novel machine learning approach that allows artificial intelligence (AI) models trained on only one modality to facilitate learning across different data types. This breakthrough eliminates the need for extensive data alignment and matching, which are typically required in multimodal learning, thereby reducing the costs associated with dataset construction.

Led by Professor Sung Whan Yoon from the Graduate School of Artificial Intelligence at UNIST, the team announced the successful development of an AI multimodal learning technique capable of promoting model training across various data modalities without relying on paired datasets.

Multimodal learning involves the integrated understanding and processing of diverse data types such as audio, images, and text. Traditionally, effective multimodal learning depends heavily on aligning and matching data across these modalities, a process that demands significant time and resources. Performance often deteriorates when clearly paired data is scarce.

The proposed approach enables multimodal learning even with unpaired data. This innovation can significantly reduce costs and time in developing AI systems, such as voice assistants that interpret emotions by analyzing speech and facial expressions, or medical AI systems that combine CT images with patient records for diagnostics.
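
The press release does not describe the training objective itself, so the snippet below is only a minimal, hypothetical sketch of the general idea: a model trained on its own modality and labels receives an additional regularization signal from a frozen model of another modality, with no paired examples involved. The `ImageEncoder`, `moment_gap` loss, and the random text-embedding table are illustrative stand-ins introduced here, not the authors' published method.

```python
# Hypothetical sketch (not the authors' method): an image classifier trained on
# image-only labels, with an unpaired regularizer that nudges its embedding
# statistics toward those of a frozen "writer" (a stand-in text embedding table).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImageEncoder(nn.Module):
    def __init__(self, dim=128, num_classes=10):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Flatten(), nn.Linear(3 * 32 * 32, 256), nn.ReLU(), nn.Linear(256, dim)
        )
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):
        z = self.backbone(x)      # image embedding
        return z, self.head(z)    # embedding and class logits

def moment_gap(z_img, z_txt):
    """Distance between first/second moments of two unpaired embedding batches."""
    mean_gap = (z_img.mean(0) - z_txt.mean(0)).pow(2).sum()
    var_gap = (z_img.var(0) - z_txt.var(0)).pow(2).sum()
    return mean_gap + var_gap

# Frozen stand-in for a pretrained language model's embedding space (unpaired).
text_embeddings = torch.randn(1000, 128)
model = ImageEncoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(100):
    images = torch.randn(64, 3, 32, 32)            # toy image batch
    labels = torch.randint(0, 10, (64,))           # image-only supervision
    z_txt = text_embeddings[torch.randint(0, 1000, (64,))]  # unpaired text batch

    z_img, logits = model(images)
    loss = F.cross_entropy(logits, labels) + 0.1 * moment_gap(z_img, z_txt)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The key point the sketch tries to convey is that no image-text pairing is ever constructed: the two batches are sampled independently, and the cross-modal term only shapes the geometry of the image model's representation space.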

The team conducted experiments demonstrating that text models can assist image model training and that audio models can enhance language model performance. In these experiments, the proposed approach achieved higher accuracy than existing methods, confirming the effectiveness of cross-modality learning. Notably, even combinations with weak inherent correlations, such as audio and images, showed significant performance improvements.

Jae-Jun Lee, the first author of the study, explained, "The performance gains observed in seemingly unrelated pairings like audio and images challenge conventional assumptions of multimodal learning, representing an exciting breakthrough."

Professor Yoon stated, "This approach holds great potential for application in fields where obtaining aligned datasets is challenging, such as healthcare, autonomous driving, and smart AI assistants."

The findings have been accepted to the 13th International Conference on Learning Representations (ICLR), one of the top three conferences in AI and deep learning. The conference took place in Singapore from April 24 to 28, and this year attracted researchers from over 70 countries. Notably, out of 11,672 submissions, only 3,646 papers (31.24%) were accepted.

Journal Reference

Jae-Jun Lee and Sung Whan Yoon, "Can One Modality Model Synergize Training of Other Modality Models?," in the 13th International Conference on Learning Representations (ICLR), 2025.
