SOTAVerified

cross-modal alignment

Papers

Showing 101–150 of 342 papers

| Title | Status | Hype |
| --- | --- | --- |
| mmE5: Improving Multimodal Multilingual Embeddings via High-quality Synthetic Data | Code | 2 |
| MDE: Modality Discrimination Enhancement for Multi-modal Recommendation | | 0 |
| Leveraging Pre-Trained Models for Multimodal Class-Incremental Learning under Adaptive Fusion | | 0 |
| Ola: Pushing the Frontiers of Omni-Modal Language Model | Code | 3 |
| CLIP Behaves like a Bag-of-Words Model Cross-modally but not Uni-modally | Code | 1 |
| Cross-modal Context Fusion and Adaptive Graph Convolutional Network for Multimodal Conversational Emotion Recognition | | 0 |
| Integrate Temporal Graph Learning into LLM-based Temporal Knowledge Graph Model | | 0 |
| WhiSPA: Semantically and Psychologically Aligned Whisper with Self-Supervised Contrastive and Student-Teacher Learning | Code | 1 |
| CGP-Tuning: Structure-Aware Soft Prompt Tuning for Code Vulnerability Detection | | 0 |
| Free Lunch Enhancements for Multi-modal Crowd Counting | Code | 1 |
| Chat-based Person Retrieval via Dialogue-Refined Cross-Modal Alignment | | 0 |
| Diffusion Bridge: Leveraging Diffusion Model to Reduce the Modality Gap Between Text and Vision for Zero-Shot Image Captioning | Code | 1 |
| Align-KD: Distilling Cross-Modal Alignment Knowledge for Mobile Vision-Language Large Model Enhancement | Code | 1 |
| Generalized Zero-Shot Classification via Semantics-Free Inter-Class Feature Generation | | 0 |
| Audio-Visual Semantic Graph Network for Audio-Visual Event Localization | | 0 |
| Enhancing Multimodal Emotion Recognition through Multi-Granularity Cross-Modal Alignment | | 0 |
| ChartAdapter: Large Vision-Language Model for Chart Summarization | | 0 |
| Enhancing Visual Representation for Text-based Person Searching | Code | 0 |
| Bag of Tricks for Multimodal AutoML with Image, Text, and Tabular Data | | 0 |
| ASAP: Advancing Semantic Alignment Promotes Multi-Modal Manipulation Detecting and Grounding | Code | 1 |
| Wearable Accelerometer Foundation Models for Health via Knowledge Distillation | | 0 |
| RAC3: Retrieval-Augmented Corner Case Comprehension for Autonomous Driving with Vision-Language Models | | 0 |
| Dynamic Cross-Modal Alignment for Robust Semantic Location Prediction | | 0 |
| Enhancing Modality Representation and Alignment for Multimodal Cold-start Active Learning | | 0 |
| Multimodal Music Generation with Explicit Bridges and Retrieval Augmentation | Code | 1 |
| GEAL: Generalizable 3D Affordance Learning with Cross-Modal Consistency | Code | 1 |
| GEXIA: Granularity Expansion and Iterative Approximation for Scalable Multi-grained Video-language Learning | | 0 |
| Towards Brain Passage Retrieval -- An Investigation of EEG Query Representations | | 0 |
| CLIP-PING: Boosting Lightweight Vision-Language Models with Proximus Intrinsic Neighbors Guidance | | 0 |
| Align-KD: Distilling Cross-Modal Alignment Knowledge for Mobile Vision-Language Model | Code | 1 |
| AlignMamba: Enhancing Multimodal Mamba with Local and Global Cross-modal Alignment | | 0 |
| SimCMF: A Simple Cross-modal Fine-tuning Strategy from Vision Foundation Models to Any Imaging Modality | Code | 1 |
| Revisiting Misalignment in Multispectral Pedestrian Detection: A Language-Driven Approach for Cross-modal Alignment Fusion | | 0 |
| Tiny-Align: Bridging Automatic Speech Recognition and Large Language Model on the Edge | | 0 |
| CTPD: Cross-Modal Temporal Pattern Discovery for Enhanced Multimodal Electronic Health Records Analysis | | 0 |
| Towards Cross-Modal Text-Molecule Retrieval with Better Modality Alignment | Code | 0 |
| Multi-path Exploration and Feedback Adjustment for Text-to-Image Person Retrieval | | 0 |
| AlignVSR: Audio-Visual Cross-Modal Alignment for Visual Speech Recognition | Code | 1 |
| Modeling the Human Visual System: Comparative Insights from Response-Optimized and Task-Optimized Vision Models, Language Models, and different Readout Mechanisms | | 0 |
| Let Me Finish My Sentence: Video Temporal Grounding with Holistic Text Understanding | | 0 |
| LESS: Label-Efficient and Single-Stage Referring 3D Segmentation | Code | 1 |
| OMCAT: Omni Context Aware Transformer | | 0 |
| Evaluating Semantic Variation in Text-to-Image Synthesis: A Causal Perspective | Code | 0 |
| Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate | Code | 2 |
| EMMA: Empowering Multi-modal Mamba with Structural and Hierarchical Alignment | | 0 |
| Intriguing Properties of Large Language and Vision Models | | 0 |
| TANGO: Co-Speech Gesture Video Reenactment with Hierarchical Audio Motion Embedding and Diffusion Interpolation | | 0 |
| Boosting Masked ECG-Text Auto-Encoders as Discriminative Learners | Code | 1 |
| Melody-Guided Music Generation | Code | 2 |
| Fully Aligned Network for Referring Image Segmentation | | 0 |

No leaderboard results yet.