| A Cross-Modal Approach to Silent Speech with LLM-Enhanced Recognition | Mar 2, 2024 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | Code Available | 1 | 5 |
| LESS: Label-Efficient and Single-Stage Referring 3D Segmentation | Oct 17, 2024 | cross-modal alignment, Instance Segmentation | Code Available | 1 | 5 |
| Structural Entities Extraction and Patient Indications Incorporation for Chest X-ray Report Generation | May 23, 2024 | cross-modal alignment, Decoder | Code Available | 1 | 5 |
| HiCLIP: Contrastive Language-Image Pretraining with Hierarchy-aware Attention | Mar 6, 2023 | cross-modal alignment | Code Available | 1 | 5 |
| Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models | Aug 25, 2023 | cross-modal alignment, Position | Code Available | 1 | 5 |
| CLIP Behaves like a Bag-of-Words Model Cross-modally but not Uni-modally | Feb 5, 2025 | Attribute, cross-modal alignment | Code Available | 1 | 5 |
| ReForm-Eval: Evaluating Large Vision Language Models via Unified Re-Formulation of Task-Oriented Benchmarks | Oct 4, 2023 | cross-modal alignment | Code Available | 1 | 5 |
| CLIP-Driven Fine-grained Text-Image Person Re-identification | Oct 19, 2022 | cross-modal alignment, Person Re-Identification | Code Available | 1 | 5 |
| Boosting Masked ECG-Text Auto-Encoders as Discriminative Learners | Oct 3, 2024 | cross-modal alignment | Code Available | 1 | 5 |
| Align and Prompt: Video-and-Language Pre-training with Entity Prompts | Dec 17, 2021 | cross-modal alignment, Entity Alignment | Code Available | 1 | 5 |
| A Survey on Facial Expression Recognition of Static and Dynamic Emotions | Aug 28, 2024 | cross-modal alignment, Facial Expression Recognition | Code Available | 1 | 5 |
| Navigating Open Set Scenarios for Skeleton-based Action Recognition | Dec 11, 2023 | Action Recognition, Activity Recognition | Code Available | 1 | 5 |
| BrainVis: Exploring the Bridge between Brain and Visual Signals via Image Reconstruction | Dec 22, 2023 | cross-modal alignment, EEG | Code Available | 1 | 5 |
| Grounded Entity-Landmark Adaptive Pre-training for Vision-and-Language Navigation | Aug 24, 2023 | cross-modal alignment, Descriptive | Code Available | 1 | 5 |
| Multimodal Music Generation with Explicit Bridges and Retrieval Augmentation | Dec 12, 2024 | cross-modal alignment, Multimodal Music Generation | Code Available | 1 | 5 |
| Cross-modal Causal Relation Alignment for Video Question Grounding | Mar 5, 2025 | Contrastive Learning, cross-modal alignment | Code Available | 1 | 5 |
| GEAL: Generalizable 3D Affordance Learning with Cross-Modal Consistency | Dec 12, 2024 | cross-modal alignment, Transfer Learning | Code Available | 1 | 5 |
| Multi-Semantic Fusion Model for Generalized Zero-Shot Skeleton-Based Action Recognition | Sep 18, 2023 | Action Recognition, cross-modal alignment | Code Available | 1 | 5 |
| EPMF: Efficient Perception-aware Multi-sensor Fusion for 3D Semantic Segmentation | Jun 21, 2021 | 3D Semantic Segmentation, Autonomous Driving | Code Available | 1 | 5 |
| Fine-Grained Semantically Aligned Vision-Language Pre-Training | Aug 4, 2022 | cross-modal alignment, object-detection | Code Available | 1 | 5 |
| Align-KD: Distilling Cross-Modal Alignment Knowledge for Mobile Vision-Language Model | Dec 2, 2024 | cross-modal alignment, Knowledge Distillation | Code Available | 1 | 5 |
| mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections | May 24, 2022 | Computational Efficiency, cross-modal alignment | Code Available | 1 | 5 |
| MSCI: Addressing CLIP's Inherent Limitations for Compositional Zero-Shot Learning | May 15, 2025 | Compositional Zero-Shot Learning, cross-modal alignment | Code Available | 1 | 5 |
| Enhanced OoD Detection through Cross-Modal Alignment of Multi-Modal Representations | Mar 24, 2025 | cross-modal alignment, Image Classification | Code Available | 1 | 5 |
| Unraveling Instance Associations: A Closer Look for Audio-Visual Segmentation | Apr 6, 2023 | audio-visual learning, Contrastive Learning | Code Available | 1 | 5 |
| Efficient Vision-Language Pretraining with Visual Concepts and Hierarchical Alignment | Aug 29, 2022 | cross-modal alignment, Image-text Retrieval | Code Available | 1 | 5 |
| Beyond Uncertainty: Evidential Deep Learning for Robust Video Temporal Grounding | Aug 29, 2024 | cross-modal alignment, Deep Learning | Code Available | 1 | 5 |
| Factual Serialization Enhancement: A Key Innovation for Chest X-ray Report Generation | May 15, 2024 | Contrastive Learning, cross-modal alignment | Code Available | 1 | 5 |
| Aligning Sight and Sound: Advanced Sound Source Localization Through Audio-Visual Alignment | Jul 18, 2024 | cross-modal alignment, Cross-Modal Retrieval | Code Available | 1 | 5 |
| BiPVL-Seg: Bidirectional Progressive Vision-Language Fusion with Global-Local Alignment for Medical Image Segmentation | Mar 30, 2025 | cross-modal alignment, Image Segmentation | Code Available | 1 | 5 |
| Align-KD: Distilling Cross-Modal Alignment Knowledge for Mobile Vision-Language Large Model Enhancement | Jan 1, 2025 | cross-modal alignment, Knowledge Distillation | Code Available | 1 | 5 |
| Free Lunch Enhancements for Multi-modal Crowd Counting | Jan 1, 2025 | cross-modal alignment, Crowd Counting | Code Available | 1 | 5 |
| Modality Curation: Building Universal Embeddings for Advanced Multimodal Information Retrieval | May 26, 2025 | Contrastive Learning, cross-modal alignment | Code Available | 1 | 5 |
| Multi-Granularity Cross-modal Alignment for Generalized Medical Visual Representation Learning | Oct 12, 2022 | Contrastive Learning, cross-modal alignment | Code Available | 1 | 5 |
| MENTOR: Multi-level Self-supervised Learning for Multimodal Recommendation | Feb 29, 2024 | cross-modal alignment, Multimodal Recommendation | Code Available | 1 | 5 |
| Global and Local Semantic Completion Learning for Vision-Language Pre-training | Jun 12, 2023 | cross-modal alignment, Image-text Retrieval | Code Available | 1 | 5 |
| BridgeTower: Building Bridges Between Encoders in Vision-Language Representation Learning | Jun 17, 2022 | cross-modal alignment, Representation Learning | Code Available | 1 | 5 |
| CVT-SLR: Contrastive Visual-Textual Transformation for Sign Language Recognition with Variational Alignment | Mar 10, 2023 | cross-modal alignment, Sign Language Recognition | Code Available | 1 | 5 |
| Distractors-Immune Representation Learning with Cross-modal Contrastive Regularization for Change Captioning | Jul 16, 2024 | Caption Generation, cross-modal alignment | Code Available | 1 | 5 |
| DSGN++: Exploiting Visual-Spatial Relation for Stereo-based 3D Detectors | Apr 6, 2022 | 3D geometry, 3D Object Detection | Code Available | 1 | 5 |
| DanceIt: Music-inspired Dancing Video Synthesis | Sep 17, 2020 | cross-modal alignment, Rhythm | Code Available | 1 | 5 |
| AlignVSR: Audio-Visual Cross-Modal Alignment for Visual Speech Recognition | Oct 21, 2024 | cross-modal alignment, speech-recognition | Code Available | 1 | 5 |
| Progressive Multi-Modality Learning for Inverse Protein Folding | Dec 11, 2023 | cross-modal alignment, Data Augmentation | Code Available | 1 | 5 |
| Diffusion Bridge: Leveraging Diffusion Model to Reduce the Modality Gap Between Text and Vision for Zero-Shot Image Captioning | Jan 1, 2025 | cross-modal alignment, Denoising | Code Available | 1 | 5 |
| Conditional Variational Autoencoder for Sign Language Translation with Cross-Modal Alignment | Dec 25, 2023 | cross-modal alignment, Decoder | Code Available | 1 | 5 |
| Dynamic Modality Interaction Modeling for Image-Text Retrieval | Jul 11, 2021 | cross-modal alignment, Cross-Modal Retrieval | Code Available | 1 | 5 |
| CoMP: Continual Multimodal Pre-training for Vision Foundation Models | Mar 24, 2025 | cross-modal alignment | Code Available | 1 | 5 |
| Landmark-RxR: Solving Vision-and-Language Navigation with Fine-Grained Alignment Supervision | Dec 1, 2021 | cross-modal alignment, Navigate | Code Available | 1 | 5 |
| Multimodal Fusion and Vision-Language Models: A Survey for Robot Vision | Apr 3, 2025 | 3D Object Detection, cross-modal alignment | Code Available | 1 | 5 |
| Revisiting Multimodal Representation in Contrastive Learning: From Patch and Token Embeddings to Finite Discrete Tokens | Mar 27, 2023 | Contrastive Learning, cross-modal alignment | Code Available | 1 | 5 |