SOTAVerified

cross-modal alignment

Papers

Showing 151-200 of 342 papers

Title | Status | Hype
Exploring Information-Theoretic Metrics Associated with Neural Collapse in Supervised Training | | 0
TS-HTFA: Advancing Time Series Forecasting via Hierarchical Text-Free Alignment with Large Language Models | | 0
Learning to Localize Actions in Instructional Videos with LLM-Based Multi-Pathway Text-Video Alignment | | 0
MaPPER: Multimodal Prior-guided Parameter Efficient Tuning for Referring Expression Comprehension | Code | 1
OneEncoder: A Lightweight Framework for Progressive Alignment of Modalities | | 0
CAST: Cross-modal Alignment Similarity Test for Vision Language Models | Code | 0
KALE: An Artwork Image Captioning System Augmented with Heterogeneous Graph | Code | 0
NEVLP: Noise-Robust Framework for Efficient Vision-Language Pre-training | | 0
Locality-aware Cross-modal Correspondence Learning for Dense Audio-Visual Events Localization | | 0
GALLa: Graph Aligned Large Language Models for Improved Source Code Understanding | | 0
Temporal Order Preserved Optimal Transport-based Cross-modal Knowledge Transfer Learning for ASR | | 0
Law of Vision Representation in MLLMs | Code | 2
Beyond Uncertainty: Evidential Deep Learning for Robust Video Temporal Grounding | Code | 1
A Survey on Facial Expression Recognition of Static and Dynamic Emotions | Code | 1
Focus on Focus: Focus-oriented Representation Learning and Multi-view Cross-modal Alignment for Glioma Grading | Code | 0
Coarse-to-fine Alignment Makes Better Speech-image Retrieval | | 0
Cross-Modal Denoising: A Novel Training Paradigm for Enhancing Speech-Image Retrieval | | 0
Advancing Multi-grained Alignment for Contrastive Language-Audio Pre-training | Code | 1
Cross-aware Early Fusion with Stage-divided Vision and Language Transformer Encoders for Referring Image Segmentation | | 0
Disentangled Noisy Correspondence Learning | | 0
Visible-Thermal Multiple Object Tracking: Large-scale Video Dataset and Progressive Fusion Approach | Code | 2
Unifying Visual and Semantic Feature Spaces with Diffusion Models for Enhanced Cross-Modal Alignment | | 0
DAC: 2D-3D Retrieval with Noisy Labels via Divide-and-Conquer Alignment and Correction | Code | 0
Multimodal Machine Learning in Mental Health: A Survey of Data, Algorithms, and Challenges | | 0
Craft: Cross-modal Aligned Features Improve Robustness of Prompt Tuning | Code | 0
Aligning Sight and Sound: Advanced Sound Source Localization Through Audio-Visual Alignment | Code | 1
Distractors-Immune Representation Learning with Cross-modal Contrastive Regularization for Change Captioning | Code | 1
Enhancing Emotion Recognition in Incomplete Data: A Novel Cross-Modal Alignment, Reconstruction, and Refinement Framework | | 0
EA-VTR: Event-Aware Video-Text Retrieval | | 0
Towards Bridging the Cross-modal Semantic Gap for Multi-modal Recommendation | Code | 1
Cross-Modal Attention Alignment Network with Auxiliary Text Description for zero-shot sketch-based image retrieval | | 0
MLLM as Video Narrator: Mitigating Modality Imbalance in Video Moment Retrieval | | 0
Mitigate the Gap: Investigating Approaches for Improving Cross-Modal Alignment in CLIP | Code | 2
PRESTO: Progressive Pretraining Enhances Synthetic Chemistry Outcomes | Code | 1
Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams | Code | 3
It is Never Too Late to Mend: Separate Learning for Multimedia Recommendation | Code | 0
MMPolymer: A Multimodal Multitask Pretraining Framework for Polymer Property Prediction | Code | 1
Hire: Hybrid-modal Interaction with Multiple Relational Enhancements for Image-Text Matching | | 0
Multimodal Reasoning with Multimodal Knowledge Graph | | 0
Collaborative Novel Object Discovery and Box-Guided Cross-Modal Alignment for Open-Vocabulary 3D Object Detection | Code | 3
DeCo: Decoupling Token Compression from Semantic Abstraction in Multimodal Large Language Models | Code | 2
Transcending Fusion: A Multi-Scale Alignment Method for Remote Sensing Image-Text Retrieval | Code | 1
Seeing the Image: Prioritizing Visual Correlation by Contrastive Alignment | Code | 2
OmniBind: Teach to Build Unequal-Scale Modality Interaction for Omni-Bind of All | | 0
Structural Entities Extraction and Patient Indications Incorporation for Chest X-ray Report Generation | Code | 1
AlignGPT: Multi-modal Large Language Models with Adaptive Alignment Capability | | 0
Context-Enhanced Video Moment Retrieval with Large Language Models | | 0
Factual Serialization Enhancement: A Key Innovation for Chest X-ray Report Generation | Code | 1
Listen Then See: Video Alignment with Speaker Attention | Code | 0
HiVG: Hierarchical Multimodal Fine-grained Modulation for Visual Grounding | Code | 2
Page 4 of 7

No leaderboard results yet.