CROSS-MODAL EMBEDDINGS: A COMPREHENSIVE SURVEY OF TEXT-IMAGE REPRESENTATION LEARNING
Abstract
The fusion of visual and textual modalities through cross-modal embeddings has become a central research direction in computer vision and natural language processing. This paper investigates embedding models that enable shared semantic understanding between text and images, with a focus on improving cross-modal retrieval performance. We analyze joint and coordinated embedding methods such as CLIP and DeViSE, and introduce two models: Cross-Modal Semantic Embedding Hashing (CMSEH) and the Visual-Textual Fusion Network (VTFN). These models use contrastive and generative learning strategies to bridge the semantic gap between modalities. Extensive experiments on the NUS-WIDE and MIR-Flickr25K benchmark datasets show that CMSEH significantly outperforms traditional approaches, achieving up to 82% mAP in text-to-image retrieval. An ablation study further confirms that the semantic fusion and hashing components each contribute to retrieval accuracy. Our findings highlight the scalability, efficiency, and robustness of the proposed models and their potential for real-world applications such as visual search, image captioning, and visual question answering. This work also identifies open research gaps, including modality imbalance, interpretability, and language bias, and outlines future directions for building fair, generalizable, and context-aware multimodal systems.
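
To make the contrastive objective referenced above concrete, the following is a minimal sketch of a CLIP-style symmetric contrastive (InfoNCE) loss for aligning image and text embeddings. It is an illustrative assumption of the general technique, not the exact formulation used by CLIP, CMSEH, or VTFN; the function name, batch layout, and temperature value are hypothetical.

# Minimal sketch of a CLIP-style symmetric contrastive loss for
# cross-modal (text-image) embedding alignment. Names, shapes, and the
# temperature value are illustrative assumptions, not the surveyed models.
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(image_emb: torch.Tensor,
                                text_emb: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    image_emb, text_emb: (batch, dim) outputs of the two encoders;
    matching pairs share the same row index.
    """
    # Project both modalities onto the unit hypersphere so the dot
    # product equals cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix; entry (i, j) compares image i with text j.
    logits = image_emb @ text_emb.t() / temperature

    # Ground-truth correspondence: diagonal pairs are the positives.
    targets = torch.arange(image_emb.size(0), device=image_emb.device)

    # Average the image-to-text and text-to-image cross-entropy terms.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)

if __name__ == "__main__":
    # Toy batch of 8 paired embeddings with dimension 512.
    img = torch.randn(8, 512)
    txt = torch.randn(8, 512)
    print(clip_style_contrastive_loss(img, txt))

In a hashing-based retrieval setting such as the one CMSEH targets, the learned embeddings would additionally be binarized (for example, by taking the sign of each dimension) so that retrieval can use fast Hamming-distance comparisons; the sketch above covers only the embedding-alignment step.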