| mmE5: Improving Multimodal Multilingual Embeddings via High-quality Synthetic Data | Feb 12, 2025 | cross-modal alignmentLarge Language Model | CodeCode Available | 2 |
| MDE: Modality Discrimination Enhancement for Multi-modal Recommendation | Feb 8, 2025 | cross-modal alignmentMulti-modal Recommendation | —Unverified | 0 |
| Leveraging Pre-Trained Models for Multimodal Class-Incremental Learning under Adaptive Fusion | Feb 7, 2025 | class-incremental learningClass Incremental Learning | —Unverified | 0 |
| Ola: Pushing the Frontiers of Omni-Modal Language Model | Feb 6, 2025 | cross-modal alignmentLanguage Modeling | CodeCode Available | 3 |
| CLIP Behaves like a Bag-of-Words Model Cross-modally but not Uni-modally | Feb 5, 2025 | Attributecross-modal alignment | CodeCode Available | 1 |
| Cross-modal Context Fusion and Adaptive Graph Convolutional Network for Multimodal Conversational Emotion Recognition | Jan 25, 2025 | cross-modal alignmentEmotion Classification | —Unverified | 0 |
| Integrate Temporal Graph Learning into LLM-based Temporal Knowledge Graph Model | Jan 21, 2025 | cross-modal alignmentGraph Embedding | —Unverified | 0 |
| WhiSPA: Semantically and Psychologically Aligned Whisper with Self-Supervised Contrastive and Student-Teacher Learning | Jan 15, 2025 | cross-modal alignmentLanguage Modeling | CodeCode Available | 1 |
| CGP-Tuning: Structure-Aware Soft Prompt Tuning for Code Vulnerability Detection | Jan 8, 2025 | Computational Efficiencycross-modal alignment | —Unverified | 0 |
| Free Lunch Enhancements for Multi-modal Crowd Counting | Jan 1, 2025 | cross-modal alignmentCrowd Counting | CodeCode Available | 1 |
| Chat-based Person Retrieval via Dialogue-Refined Cross-Modal Alignment | Jan 1, 2025 | Attributecross-modal alignment | —Unverified | 0 |
| Diffusion Bridge: Leveraging Diffusion Model to Reduce the Modality Gap Between Text and Vision for Zero-Shot Image Captioning | Jan 1, 2025 | cross-modal alignmentDenoising | CodeCode Available | 1 |
| Align-KD: Distilling Cross-Modal Alignment Knowledge for Mobile Vision-Language Large Model Enhancement | Jan 1, 2025 | cross-modal alignmentKnowledge Distillation | CodeCode Available | 1 |
| Generalized Zero-Shot Classification via Semantics-Free Inter-Class Feature Generation | Jan 1, 2025 | Classificationcross-modal alignment | —Unverified | 0 |
| Audio-Visual Semantic Graph Network for Audio-Visual Event Localization | Jan 1, 2025 | audio-visual event localizationcross-modal alignment | —Unverified | 0 |
| Enhancing Multimodal Emotion Recognition through Multi-Granularity Cross-Modal Alignment | Dec 30, 2024 | cross-modal alignmentEmotion Recognition | —Unverified | 0 |
| ChartAdapter: Large Vision-Language Model for Chart Summarization | Dec 30, 2024 | Chart Understandingcross-modal alignment | —Unverified | 0 |
| Enhancing Visual Representation for Text-based Person Searching | Dec 30, 2024 | cross-modal alignmentPerson Search | CodeCode Available | 0 |
| Bag of Tricks for Multimodal AutoML with Image, Text, and Tabular Data | Dec 19, 2024 | AutoMLcross-modal alignment | —Unverified | 0 |
| ASAP: Advancing Semantic Alignment Promotes Multi-Modal Manipulation Detecting and Grounding | Dec 17, 2024 | cross-modal alignment | CodeCode Available | 1 |
| Wearable Accelerometer Foundation Models for Health via Knowledge Distillation | Dec 15, 2024 | Activity Recognitioncross-modal alignment | —Unverified | 0 |
| RAC3: Retrieval-Augmented Corner Case Comprehension for Autonomous Driving with Vision-Language Models | Dec 15, 2024 | Autonomous DrivingContrastive Learning | —Unverified | 0 |
| Dynamic Cross-Modal Alignment for Robust Semantic Location Prediction | Dec 13, 2024 | cross-modal alignmentPrediction | —Unverified | 0 |
| Enhancing Modality Representation and Alignment for Multimodal Cold-start Active Learning | Dec 12, 2024 | Active Learningcross-modal alignment | —Unverified | 0 |
| Multimodal Music Generation with Explicit Bridges and Retrieval Augmentation | Dec 12, 2024 | cross-modal alignmentMultimodal Music Generation | CodeCode Available | 1 |
| GEAL: Generalizable 3D Affordance Learning with Cross-Modal Consistency | Dec 12, 2024 | cross-modal alignmentTransfer Learning | CodeCode Available | 1 |
| GEXIA: Granularity Expansion and Iterative Approximation for Scalable Multi-grained Video-language Learning | Dec 10, 2024 | cross-modal alignmentVideo Understanding | —Unverified | 0 |
| Towards Brain Passage Retrieval -- An Investigation of EEG Query Representations | Dec 9, 2024 | cross-modal alignmentEEG | —Unverified | 0 |
| CLIP-PING: Boosting Lightweight Vision-Language Models with Proximus Intrinsic Neighbors Guidance | Dec 5, 2024 | Contrastive Learningcross-modal alignment | —Unverified | 0 |
| Align-KD: Distilling Cross-Modal Alignment Knowledge for Mobile Vision-Language Model | Dec 2, 2024 | cross-modal alignmentKnowledge Distillation | CodeCode Available | 1 |
| AlignMamba: Enhancing Multimodal Mamba with Local and Global Cross-modal Alignment | Dec 1, 2024 | cross-modal alignmentMamba | —Unverified | 0 |
| SimCMF: A Simple Cross-modal Fine-tuning Strategy from Vision Foundation Models to Any Imaging Modality | Nov 27, 2024 | cross-modal alignment | CodeCode Available | 1 |
| Revisiting Misalignment in Multispectral Pedestrian Detection: A Language-Driven Approach for Cross-modal Alignment Fusion | Nov 27, 2024 | cross-modal alignmentPedestrian Detection | —Unverified | 0 |
| Tiny-Align: Bridging Automatic Speech Recognition and Large Language Model on the Edge | Nov 21, 2024 | Automatic Speech RecognitionAutomatic Speech Recognition (ASR) | —Unverified | 0 |
| CTPD: Cross-Modal Temporal Pattern Discovery for Enhanced Multimodal Electronic Health Records Analysis | Nov 1, 2024 | cross-modal alignmentPhenotype classification | —Unverified | 0 |
| Towards Cross-Modal Text-Molecule Retrieval with Better Modality Alignment | Oct 31, 2024 | Contrastive Learningcross-modal alignment | CodeCode Available | 0 |
| Multi-path Exploration and Feedback Adjustment for Text-to-Image Person Retrieval | Oct 26, 2024 | cross-modal alignmentPerson Retrieval | —Unverified | 0 |
| AlignVSR: Audio-Visual Cross-Modal Alignment for Visual Speech Recognition | Oct 21, 2024 | cross-modal alignmentspeech-recognition | CodeCode Available | 1 |
| Modeling the Human Visual System: Comparative Insights from Response-Optimized and Task-Optimized Vision Models, Language Models, and different Readout Mechanisms | Oct 17, 2024 | cross-modal alignmentLarge Language Model | —Unverified | 0 |
| Let Me Finish My Sentence: Video Temporal Grounding with Holistic Text Understanding | Oct 17, 2024 | cross-modal alignmentSentence | —Unverified | 0 |
| LESS: Label-Efficient and Single-Stage Referring 3D Segmentation | Oct 17, 2024 | cross-modal alignmentInstance Segmentation | CodeCode Available | 1 |
| OMCAT: Omni Context Aware Transformer | Oct 15, 2024 | Audio-visual Question AnsweringAudio-Visual Question Answering (AVQA) | —Unverified | 0 |
| Evaluating Semantic Variation in Text-to-Image Synthesis: A Causal Perspective | Oct 14, 2024 | cross-modal alignmentImage Generation | CodeCode Available | 0 |
| Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate | Oct 9, 2024 | cross-modal alignmentVisual Question Answering | CodeCode Available | 2 |
| EMMA: Empowering Multi-modal Mamba with Structural and Hierarchical Alignment | Oct 8, 2024 | cross-modal alignmentHallucination | —Unverified | 0 |
| Intriguing Properties of Large Language and Vision Models | Oct 7, 2024 | cross-modal alignmentLarge Language Model | —Unverified | 0 |
| TANGO: Co-Speech Gesture Video Reenactment with Hierarchical Audio Motion Embedding and Diffusion Interpolation | Oct 5, 2024 | cross-modal alignmentRetrieval | —Unverified | 0 |
| Boosting Masked ECG-Text Auto-Encoders as Discriminative Learners | Oct 3, 2024 | cross-modal alignment | CodeCode Available | 1 |
| Melody-Guided Music Generation | Sep 30, 2024 | cross-modal alignmentMusic Generation | CodeCode Available | 2 |
| Fully Aligned Network for Referring Image Segmentation | Sep 29, 2024 | cross-modal alignmentDecoder | —Unverified | 0 |