Cross-Modal Retrieval Benefits from Adequate Alignment, Interaction

Beijing Zhongke Journal Publishing Co. Ltd.

With the popularization of social networks, different modalities of data such as images, text, and audio are growing rapidly on the Internet. Consequently, cross-modal retrieval has emerged as a fundamental task in various applications and has received significant attention in recent years. The core idea of cross-modal retrieval is to learn an accurate and generalizable alignment between different modalities (e.g., visual and textual data) such that semantically similar objects can be correctly retrieved in one modality with a query from another modality.

This article proposes a novel framework for cross-modal retrieval that performs adequate alignment and interaction on the aggregated features of different modalities to effectively bridge the modality gap. The proposed framework contains two key components: a well-designed alignment module and a novel multimodal fusion encoder. Specifically, we leveraged an image/text encoder to extract a set of features from the input image/text, and maintained a momentum encoder corresponding to the image/text encoder to provide rich negative samples for model training. Inspired by recent work on feature aggregation, we adopted the design of a generalized pooling operator (GPO) to improve the quality of the global representations.

To ensure that the model learns adequately aligned relationships, we introduced an alignment module with three objectives: image-text contrastive learning (ITC), intra-modal separability (IMS), and local mutual information maximization (LMIM). ITC encourages the model to separate unmatched image-text pairs and pull together the embeddings of matched image-text pairs. IMS enables the model to learn representations that distinguish different objects within the same modality, which alleviates the problem of representation degradation to a certain extent. LMIM encourages the model to maximize the mutual information between the global representation (aggregated features) and local regional features (e.g., image patches or text tokens), aiming to capture information shared among all regions rather than being affected by certain noisy regions.

To endow the model with the capability to explore interaction information between different modalities, we incorporated a multimodal fusion encoder at the end of the model to perform interactions between the modalities after cross-modal alignment.
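The following PyTorch-style sketch illustrates how the three alignment objectives could be formulated on the aggregated global features. It is a minimal sketch rather than the authors' implementation: the function names, the temperature value, the uniformity-style penalty standing in for IMS, and the InfoNCE-style estimator standing in for LMIM are assumptions, and the momentum-encoder negative queue is omitted for brevity.

```python
# Illustrative sketch of the three alignment objectives (not the authors' code).
import torch
import torch.nn.functional as F

def itc_loss(img_g, txt_g, temperature=0.07):
    """Image-text contrastive loss: pull matched image-text pairs together
    and push unmatched pairs apart (symmetric InfoNCE)."""
    img_g = F.normalize(img_g, dim=-1)           # (B, D) global image embeddings
    txt_g = F.normalize(txt_g, dim=-1)           # (B, D) global text embeddings
    logits = img_g @ txt_g.t() / temperature     # (B, B) cosine-similarity matrix
    targets = torch.arange(img_g.size(0), device=img_g.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def ims_loss(feat_g, temperature=0.07):
    """Intra-modal separability: keep distinct samples of the same modality
    apart in the embedding space (a uniformity-style penalty, assumed here)."""
    feat_g = F.normalize(feat_g, dim=-1)         # (B, D)
    sim = feat_g @ feat_g.t() / temperature      # (B, B) intra-modal similarities
    sim.fill_diagonal_(float('-inf'))            # ignore self-similarity
    return torch.logsumexp(sim, dim=-1).mean()   # high similarity to other samples is penalized

def lmim_loss(global_g, local_feats, temperature=0.07):
    """Local mutual information maximization: tie the aggregated global vector
    to the local regions (image patches / text tokens) of the same sample,
    approximated with an InfoNCE-style bound averaged over regions."""
    B, L, D = local_feats.shape
    g = F.normalize(global_g, dim=-1)                            # (B, D)
    loc = F.normalize(local_feats, dim=-1)                       # (B, L, D)
    scores = torch.einsum('bd,cld->bcl', g, loc) / temperature   # (B, B, L)
    scores = scores.mean(dim=-1)                                 # average over local regions
    targets = torch.arange(B, device=g.device)
    return F.cross_entropy(scores, targets)                      # global i should match locals i
```

In a setup like this, the overall alignment objective would typically be a weighted sum of ITC, IMS applied to each modality, and LMIM applied to each modality, computed before the aligned features are passed to the multimodal fusion encoder for cross-modal interaction; the exact weighting and estimators are details of the original work.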
