SOTAVerified

cross-modal alignment

Papers

Showing 201-250 of 342 papers

Title | Status | Hype
Enhancing Modality Representation and Alignment for Multimodal Cold-start Active Learning | | 0
GEXIA: Granularity Expansion and Iterative Approximation for Scalable Multi-grained Video-language Learning | | 0
Towards Brain Passage Retrieval -- An Investigation of EEG Query Representations | | 0
CLIP-PING: Boosting Lightweight Vision-Language Models with Proximus Intrinsic Neighbors Guidance | | 0
AlignMamba: Enhancing Multimodal Mamba with Local and Global Cross-modal Alignment | | 0
Revisiting Misalignment in Multispectral Pedestrian Detection: A Language-Driven Approach for Cross-modal Alignment Fusion | | 0
Tiny-Align: Bridging Automatic Speech Recognition and Large Language Model on the Edge | | 0
CTPD: Cross-Modal Temporal Pattern Discovery for Enhanced Multimodal Electronic Health Records Analysis | | 0
Towards Cross-Modal Text-Molecule Retrieval with Better Modality Alignment | Code | 0
Multi-path Exploration and Feedback Adjustment for Text-to-Image Person Retrieval | | 0
Let Me Finish My Sentence: Video Temporal Grounding with Holistic Text Understanding | | 0
Modeling the Human Visual System: Comparative Insights from Response-Optimized and Task-Optimized Vision Models, Language Models, and different Readout Mechanisms | | 0
OMCAT: Omni Context Aware Transformer | | 0
Evaluating Semantic Variation in Text-to-Image Synthesis: A Causal Perspective | Code | 0
EMMA: Empowering Multi-modal Mamba with Structural and Hierarchical Alignment | | 0
Intriguing Properties of Large Language and Vision Models | | 0
TANGO: Co-Speech Gesture Video Reenactment with Hierarchical Audio Motion Embedding and Diffusion Interpolation | | 0
Fully Aligned Network for Referring Image Segmentation | | 0
Exploring Information-Theoretic Metrics Associated with Neural Collapse in Supervised Training | | 0
TS-HTFA: Advancing Time Series Forecasting via Hierarchical Text-Free Alignment with Large Language Models | | 0
Learning to Localize Actions in Instructional Videos with LLM-Based Multi-Pathway Text-Video Alignment | | 0
OneEncoder: A Lightweight Framework for Progressive Alignment of Modalities | | 0
CAST: Cross-modal Alignment Similarity Test for Vision Language Models | Code | 0
KALE: An Artwork Image Captioning System Augmented with Heterogeneous Graph | Code | 0
NEVLP: Noise-Robust Framework for Efficient Vision-Language Pre-training | | 0
Locality-aware Cross-modal Correspondence Learning for Dense Audio-Visual Events Localization | | 0
GALLa: Graph Aligned Large Language Models for Improved Source Code Understanding | | 0
Temporal Order Preserved Optimal Transport-based Cross-modal Knowledge Transfer Learning for ASR | | 0
Focus on Focus: Focus-oriented Representation Learning and Multi-view Cross-modal Alignment for Glioma Grading | Code | 0
Cross-Modal Denoising: A Novel Training Paradigm for Enhancing Speech-Image Retrieval | | 0
Coarse-to-fine Alignment Makes Better Speech-image Retrieval | | 0
Cross-aware Early Fusion with Stage-divided Vision and Language Transformer Encoders for Referring Image Segmentation | | 0
Disentangled Noisy Correspondence Learning | | 0
Unifying Visual and Semantic Feature Spaces with Diffusion Models for Enhanced Cross-Modal Alignment | | 0
DAC: 2D-3D Retrieval with Noisy Labels via Divide-and-Conquer Alignment and Correction | Code | 0
Multimodal Machine Learning in Mental Health: A Survey of Data, Algorithms, and Challenges | | 0
Craft: Cross-modal Aligned Features Improve Robustness of Prompt Tuning | Code | 0
Enhancing Emotion Recognition in Incomplete Data: A Novel Cross-Modal Alignment, Reconstruction, and Refinement Framework | | 0
EA-VTR: Event-Aware Video-Text Retrieval | | 0
Cross-Modal Attention Alignment Network with Auxiliary Text Description for zero-shot sketch-based image retrieval | | 0
MLLM as Video Narrator: Mitigating Modality Imbalance in Video Moment Retrieval | | 0
It is Never Too Late to Mend: Separate Learning for Multimedia Recommendation | Code | 0
Hire: Hybrid-modal Interaction with Multiple Relational Enhancements for Image-Text Matching | | 0
Multimodal Reasoning with Multimodal Knowledge Graph | | 0
OmniBind: Teach to Build Unequal-Scale Modality Interaction for Omni-Bind of All | | 0
AlignGPT: Multi-modal Large Language Models with Adaptive Alignment Capability | | 0
Context-Enhanced Video Moment Retrieval with Large Language Models | | 0
Listen Then See: Video Alignment with Speaker Attention | Code | 0
Distributionally Robust Alignment for Medical Federated Vision-Language Pre-training Under Data Heterogeneity | | 0
CIRP: Cross-Item Relational Pre-training for Multimodal Product Bundling | | 0

No leaderboard results yet.