| Masked Vision and Language Modeling for Multi-modal Representation Learning | Aug 3, 2022 | cross-modal alignment, Language Modeling | —Unverified | 0 | 0 |
| MCAD: Multi-teacher Cross-modal Alignment Distillation for efficient image-text retrieval | Oct 30, 2023 | cross-modal alignment, Image-text Retrieval | —Unverified | 0 | 0 |
| MCQA: Multimodal Co-attention Based Network for Question Answering | Apr 25, 2020 | cross-modal alignment, Question Answering | —Unverified | 0 | 0 |
| MDE: Modality Discrimination Enhancement for Multi-modal Recommendation | Feb 8, 2025 | cross-modal alignment, Multi-modal Recommendation | —Unverified | 0 | 0 |
| Mind the Modality Gap: Towards a Remote Sensing Vision-Language Model via Cross-modal Alignment | Feb 15, 2024 | cross-modal alignment, Cross-Modal Retrieval | —Unverified | 0 | 0 |
| Distributionally Robust Alignment for Medical Federated Vision-Language Pre-training Under Data Heterogeneity | Apr 5, 2024 | cross-modal alignment, Federated Learning | —Unverified | 0 | 0 |
| Mix and match networks: cross-modal alignment for zero-pair image-to-image translation | Mar 8, 2019 | cross-modal alignment, Decoder | —Unverified | 0 | 0 |
| MLLM as Video Narrator: Mitigating Modality Imbalance in Video Moment Retrieval | Jun 25, 2024 | cross-modal alignment, Moment Retrieval | —Unverified | 0 | 0 |
| MLLMs are Deeply Affected by Modality Bias | May 24, 2025 | cross-modal alignment | —Unverified | 0 | 0 |
| Modeling the Human Visual System: Comparative Insights from Response-Optimized and Task-Optimized Vision Models, Language Models, and different Readout Mechanisms | Oct 17, 2024 | cross-modal alignment, Large Language Model | —Unverified | 0 | 0 |
| mSLAM: Massively multilingual joint pre-training for speech and text | Feb 3, 2022 | cross-modal alignment, intent-classification | —Unverified | 0 | 0 |
| Multi-Grained Cross-modal Alignment for Learning Open-vocabulary Semantic Segmentation from Text Supervision | Mar 6, 2024 | Contrastive Learning, cross-modal alignment | —Unverified | 0 | 0 |
| Multi-level Cross-modal Alignment for Image Clustering | Jan 22, 2024 | Clustering, cross-modal alignment | —Unverified | 0 | 0 |
| Multi-modal Attribute Prompting for Vision-Language Models | Mar 1, 2024 | Attribute, cross-modal alignment | —Unverified | 0 | 0 |
| Multi-Modal Cross-Domain Alignment Network for Video Moment Retrieval | Sep 23, 2022 | cross-modal alignment, Information Retrieval | —Unverified | 0 | 0 |
| Multimodal Machine Learning in Mental Health: A Survey of Data, Algorithms, and Challenges | Jul 23, 2024 | cross-modal alignment, Fairness | —Unverified | 0 | 0 |
| Multimodal Reasoning with Multimodal Knowledge Graph | Jun 4, 2024 | cross-modal alignment, Graph Attention | —Unverified | 0 | 0 |
| Multi-path Exploration and Feedback Adjustment for Text-to-Image Person Retrieval | Oct 26, 2024 | cross-modal alignment, Person Retrieval | —Unverified | 0 | 0 |
| Multi-Prompts Learning with Cross-Modal Alignment for Attribute-based Person Re-Identification | Dec 28, 2023 | Attribute, cross-modal alignment | —Unverified | 0 | 0 |
| Multi-task Paired Masking with Alignment Modeling for Medical Vision-Language Pre-training | May 13, 2023 | cross-modal alignment | —Unverified | 0 | 0 |
| NeuroLIP: Interpretable and Fair Cross-Modal Alignment of fMRI and Phenotypic Text | Mar 27, 2025 | Attribute, Contrastive Learning | —Unverified | 0 | 0 |
| NEVLP: Noise-Robust Framework for Efficient Vision-Language Pre-training | Sep 15, 2024 | Contrastive Learning, cross-modal alignment | —Unverified | 0 | 0 |
| NOTA: Multimodal Music Notation Understanding for Visual Large Language Model | Feb 17, 2025 | cross-modal alignment, Language Modeling | —Unverified | 0 | 0 |
| Observation-Graph Interaction and Key-Detail Guidance for Vision and Language Navigation | Mar 14, 2025 | cross-modal alignment, Navigate | —Unverified | 0 | 0 |
| OMCAT: Omni Context Aware Transformer | Oct 15, 2024 | Audio-visual Question Answering, Audio-Visual Question Answering (AVQA) | —Unverified | 0 | 0 |
| OmniBind: Teach to Build Unequal-Scale Modality Interaction for Omni-Bind of All | May 25, 2024 | All, cross-modal alignment | —Unverified | 0 | 0 |
| OmniVL:One Foundation Model for Image-Language and Video-Language Tasks | Sep 15, 2022 | Action Classification, Action Recognition | —Unverified | 0 | 0 |
| OneEncoder: A Lightweight Framework for Progressive Alignment of Modalities | Sep 17, 2024 | cross-modal alignment, Question Answering | —Unverified | 0 | 0 |
| On the Language Encoder of Contrastive Cross-modal Models | Oct 20, 2023 | cross-modal alignment, Sentence | —Unverified | 0 | 0 |
| OpenSight: A Simple Open-Vocabulary Framework for LiDAR-Based Object Detection | Dec 12, 2023 | cross-modal alignment, object-detection | —Unverified | 0 | 0 |
| OV-SCAN: Semantically Consistent Alignment for Novel Object Discovery in Open-Vocabulary 3D Object Detection | Mar 9, 2025 | 3D Object Detection, Autonomous Driving | —Unverified | 0 | 0 |
| PhysLLM: Harnessing Large Language Models for Cross-Modal Remote Physiological Sensing | May 6, 2025 | cross-modal alignment | —Unverified | 0 | 0 |
| PMMTalk: Speech-Driven 3D Facial Animation from Complementary Pseudo Multi-modal Features | Dec 5, 2023 | cross-modal alignment, Decoder | —Unverified | 0 | 0 |
| Prompt-based Context- and Domain-aware Pretraining for Vision and Language Navigation | Sep 7, 2023 | Contrastive Learning, cross-modal alignment | —Unverified | 0 | 0 |
| Prototype-guided Cross-modal Completion and Alignment for Incomplete Text-based Person Re-identification | Sep 29, 2023 | cross-modal alignment, Person Re-Identification | —Unverified | 0 | 0 |
| RAC3: Retrieval-Augmented Corner Case Comprehension for Autonomous Driving with Vision-Language Models | Dec 15, 2024 | Autonomous Driving, Contrastive Learning | —Unverified | 0 | 0 |
| Reinforcement Learning for Weakly Supervised Temporal Grounding of Natural Language in Untrimmed Videos | Sep 18, 2020 | cross-modal alignment, reinforcement-learning | —Unverified | 0 | 0 |
| Representation Discrepancy Bridging Method for Remote Sensing Image-Text Retrieval | May 22, 2025 | cross-modal alignment, Image-text Retrieval | —Unverified | 0 | 0 |
| Retrieving-to-Answer: Zero-Shot Video Question Answering with Frozen Large Language Models | Jun 15, 2023 | cross-modal alignment, Domain Generalization | —Unverified | 0 | 0 |
| Revisiting Misalignment in Multispectral Pedestrian Detection: A Language-Driven Approach for Cross-modal Alignment Fusion | Nov 27, 2024 | cross-modal alignment, Pedestrian Detection | —Unverified | 0 | 0 |
| Scene-Intuitive Agent for Remote Embodied Visual Grounding | Mar 24, 2021 | cross-modal alignment, Navigate | —Unverified | 0 | 0 |
| SE4Lip: Speech-Lip Encoder for Talking Head Synthesis to Solve Phoneme-Viseme Alignment Ambiguity | Apr 8, 2025 | 3DGS, cross-modal alignment | —Unverified | 0 | 0 |