| Align and Prompt: Video-and-Language Pre-training with Entity Prompts | Dec 17, 2021 | cross-modal alignment, Entity Alignment | Code Available | 1 |
| Landmark-RxR: Solving Vision-and-Language Navigation with Fine-Grained Alignment Supervision | Dec 1, 2021 | cross-modal alignment, Navigate | Code Available | 1 |
| Dynamic Modality Interaction Modeling for Image-Text Retrieval | Jul 11, 2021 | cross-modal alignment, Cross-Modal Retrieval | Code Available | 1 |
| EPMF: Efficient Perception-aware Multi-sensor Fusion for 3D Semantic Segmentation | Jun 21, 2021 | 3D Semantic Segmentation, Autonomous Driving | Code Available | 1 |
| DanceIt: Music-inspired Dancing Video Synthesis | Sep 17, 2020 | cross-modal alignment, Rhythm | Code Available | 1 |
| Symbiotic Adversarial Learning for Attribute-based Person Search | Jul 19, 2020 | Attribute, cross-modal alignment | Code Available | 1 |
| Transformer-based Spatial Grounding: A Comprehensive Survey | Jul 17, 2025 | cross-modal alignment, Survey | Unverified | 0 |
| CATVis: Context-Aware Thought Visualization | Jul 15, 2025 | cross-modal alignment, EEG | Unverified | 0 |
| Bridge Feature Matching and Cross-Modal Alignment with Mutual-filtering for Zero-shot Anomaly Detection | Jul 15, 2025 | Anomaly Classification, Anomaly Detection | Unverified | 0 |
| Evaluating Attribute Confusion in Fashion Text-to-Image Generation | Jul 9, 2025 | Attribute, cross-modal alignment | Unverified | 0 |
| DALR: Dual-level Alignment Learning for Multimodal Sentence Representation Learning | Jun 26, 2025 | cross-modal alignment, Representation Learning | Unverified | 0 |
| TSDASeg: A Two-Stage Model with Direct Alignment for Interactive Point Cloud Segmentation | Jun 26, 2025 | cross-modal alignment, Interactive Segmentation | Unverified | 0 |
| HyperPath: Knowledge-Guided Hyperbolic Semantic Hierarchy Modeling for WSI Analysis | Jun 19, 2025 | cross-modal alignment, Multiple Instance Learning | Code Available | 0 |
| Speech-Language Models with Decoupled Tokenizers and Multi-Token Prediction | Jun 14, 2025 | cross-modal alignment | Unverified | 0 |
| TAViS: Text-bridged Audio-Visual Segmentation with Foundation Models | Jun 13, 2025 | cross-modal alignment, Segmentation | Unverified | 0 |
| Improving Medical Visual Representation Learning with Pathological-level Cross-Modal Alignment and Correlation Exploration | Jun 12, 2025 | cross-modal alignment, Image to text | Unverified | 0 |
| OmniDRCA: Parallel Speech-Text Foundation Model via Dual-Resolution Speech Representations and Contrastive Alignment | Jun 11, 2025 | cross-modal alignment, Question Answering | Code Available | 0 |
| Generating Vision-Language Navigation Instructions Incorporated Fine-Grained Alignment Annotations | Jun 10, 2025 | cross-modal alignment, Navigate | Unverified | 0 |
| Fusing Cross-modal and Uni-modal Representations: A Kronecker Product Approach | Jun 10, 2025 | cross-modal alignment | Unverified | 0 |
| WhisQ: Cross-Modal Representation Learning for Text-to-Music MOS Prediction | Jun 6, 2025 | cross-modal alignment, Language Modeling | Unverified | 0 |
| Does Your 3D Encoder Really Work? When Pretrain-SFT from 2D VLMs Meets 3D VLMs | Jun 5, 2025 | cross-modal alignment, Dense Captioning | Unverified | 0 |
| Towards LLM-Centric Multimodal Fusion: A Survey on Integration Strategies and Techniques | Jun 5, 2025 | cross-modal alignment, Large Language Model | Unverified | 0 |
| UniCUE: Unified Recognition and Generation Framework for Chinese Cued Speech Video-to-Speech Generation | Jun 4, 2025 | cross-modal alignment, Lipreading | Unverified | 0 |
| EmotionRankCLAP: Bridging Natural Language Speaking Styles and Ordinal Speech Emotion via Rank-N-Contrast | May 29, 2025 | Contrastive Learning, cross-modal alignment | Unverified | 0 |
| ALAS: Measuring Latent Speech-Text Alignment For Spoken Language Understanding In Multimodal LLMs | May 26, 2025 | cross-modal alignment, Emotion Recognition | Unverified | 0 |
| DiSa: Directional Saliency-Aware Prompt Learning for Generalizable Vision-Language Models | May 26, 2025 | cross-modal alignment, Domain Generalization | Unverified | 0 |
| ViTaPEs: Visuotactile Position Encodings for Cross-Modal Alignment in Multimodal Transformers | May 26, 2025 | cross-modal alignment, Position | Unverified | 0 |
| From Alignment to Advancement: Bootstrapping Audio-Language Alignment with Synthetic Data | May 26, 2025 | cross-modal alignment, Instruction Following | Unverified | 0 |
| Co-AttenDWG: Co-Attentive Dimension-Wise Gating and Expert Fusion for Multi-Modal Offensive Content Detection | May 25, 2025 | cross-modal alignment, Scene Understanding | Unverified | 0 |
| Deformable Attentive Visual Enhancement for Referring Segmentation Using Vision-Language Model | May 25, 2025 | cross-modal alignment, Image Segmentation | Unverified | 0 |
| MLLMs are Deeply Affected by Modality Bias | May 24, 2025 | cross-modal alignment | Unverified | 0 |
| Clip4Retrofit: Enabling Real-Time Image Labeling on Edge Devices via Cross-Architecture CLIP Distillation | May 23, 2025 | Autonomous Driving, cross-modal alignment | Unverified | 0 |
| ICPL-ReID: Identity-Conditional Prompt Learning for Multi-Spectral Object Re-Identification | May 23, 2025 | cross-modal alignment, Prompt Learning | Code Available | 0 |
| Representation Discrepancy Bridging Method for Remote Sensing Image-Text Retrieval | May 22, 2025 | cross-modal alignment, Image-text Retrieval | Unverified | 0 |
| ALN-P3: Unified Language Alignment for Perception, Prediction, and Planning in Autonomous Driving | May 21, 2025 | Autonomous Driving, cross-modal alignment | Unverified | 0 |
| CAD: A General Multimodal Framework for Video Deepfake Detection via Cross-Modal Alignment and Distillation | May 21, 2025 | cross-modal alignment, DeepFake Detection | Unverified | 0 |
| Enhancing LLMs for Time Series Forecasting via Structure-Guided Cross-Modal Alignment | May 19, 2025 | cross-modal alignment, Time Series | Unverified | 0 |
| FALCON: False-Negative Aware Learning of Contrastive Negatives in Vision-Language Pretraining | May 16, 2025 | cross-modal alignment | Unverified | 0 |
| Beyond Modality Collapse: Representations Blending for Multimodal Dataset Distillation | May 16, 2025 | cross-modal alignment, Dataset Distillation | Unverified | 0 |
| VISTA: Enhancing Vision-Text Alignment in MLLMs via Cross-Modal Mutual Information Maximization | May 16, 2025 | cross-modal alignment, MME | Unverified | 0 |
| Adaptive Spatial Transcriptomics Interpolation via Cross-modal Cross-slice Modeling | May 15, 2025 | cross-modal alignment | Code Available | 0 |
| Denoising and Alignment: Rethinking Domain Generalization for Multimodal Face Anti-Spoofing | May 14, 2025 | cross-modal alignment, Denoising | Unverified | 0 |
| Anatomical Attention Alignment representation for Radiology Report Generation | May 12, 2025 | cross-modal alignment, Decoder | Code Available | 0 |
| HCMA: Hierarchical Cross-model Alignment for Grounded Text-to-Image Generation | May 10, 2025 | cross-modal alignment, Image Generation | Code Available | 0 |
| Semantic-Space-Intervened Diffusive Alignment for Visual Classification | May 9, 2025 | Classification, cross-modal alignment | Unverified | 0 |
| Task-Adapter++: Task-specific Adaptation with Order-aware Alignment for Few-shot Action Recognition | May 9, 2025 | Action Recognition, cross-modal alignment | Code Available | 0 |
| Probabilistic Embeddings for Frozen Vision-Language Models: Uncertainty Quantification with Gaussian Process Latent Variable Models | May 8, 2025 | Active Learning, cross-modal alignment | Code Available | 0 |
| DenseGrounding: Improving Dense Language-Vision Semantics for Ego-Centric 3D Visual Grounding | May 8, 2025 | 3D visual grounding, cross-modal alignment | Unverified | 0 |
| PhysLLM: Harnessing Large Language Models for Cross-Modal Remote Physiological Sensing | May 6, 2025 | cross-modal alignment | Unverified | 0 |
| MicarVLMoE: A Modern Gated Cross-Aligned Vision-Language Mixture of Experts Model for Medical Image Captioning and Report Generation | Apr 29, 2025 | cross-modal alignment, Decoder | Code Available | 0 |