SOTAVerified

cross-modal alignment

Papers

Showing 301–342 of 342 papers

Title | Hype

Masked Vision and Language Modeling for Multi-modal Representation Learning | 0
MCAD: Multi-teacher Cross-modal Alignment Distillation for efficient image-text retrieval | 0
MCQA: Multimodal Co-attention Based Network for Question Answering | 0
MDE: Modality Discrimination Enhancement for Multi-modal Recommendation | 0
Mind the Modality Gap: Towards a Remote Sensing Vision-Language Model via Cross-modal Alignment | 0
Distributionally Robust Alignment for Medical Federated Vision-Language Pre-training Under Data Heterogeneity | 0
Mix and match networks: cross-modal alignment for zero-pair image-to-image translation | 0
MLLM as Video Narrator: Mitigating Modality Imbalance in Video Moment Retrieval | 0
MLLMs are Deeply Affected by Modality Bias | 0
Modeling the Human Visual System: Comparative Insights from Response-Optimized and Task-Optimized Vision Models, Language Models, and different Readout Mechanisms | 0
mSLAM: Massively multilingual joint pre-training for speech and text | 0
Multi-Grained Cross-modal Alignment for Learning Open-vocabulary Semantic Segmentation from Text Supervision | 0
Multi-level Cross-modal Alignment for Image Clustering | 0
Multi-modal Attribute Prompting for Vision-Language Models | 0
Multi-Modal Cross-Domain Alignment Network for Video Moment Retrieval | 0
Multimodal Machine Learning in Mental Health: A Survey of Data, Algorithms, and Challenges | 0
Multimodal Reasoning with Multimodal Knowledge Graph | 0
Multi-path Exploration and Feedback Adjustment for Text-to-Image Person Retrieval | 0
Multi-Prompts Learning with Cross-Modal Alignment for Attribute-based Person Re-Identification | 0
Multi-task Paired Masking with Alignment Modeling for Medical Vision-Language Pre-training | 0
NeuroLIP: Interpretable and Fair Cross-Modal Alignment of fMRI and Phenotypic Text | 0
NEVLP: Noise-Robust Framework for Efficient Vision-Language Pre-training | 0
NOTA: Multimodal Music Notation Understanding for Visual Large Language Model | 0
Observation-Graph Interaction and Key-Detail Guidance for Vision and Language Navigation | 0
OMCAT: Omni Context Aware Transformer | 0
OmniBind: Teach to Build Unequal-Scale Modality Interaction for Omni-Bind of All | 0
OmniVL: One Foundation Model for Image-Language and Video-Language Tasks | 0
OneEncoder: A Lightweight Framework for Progressive Alignment of Modalities | 0
On the Language Encoder of Contrastive Cross-modal Models | 0
OpenSight: A Simple Open-Vocabulary Framework for LiDAR-Based Object Detection | 0
OV-SCAN: Semantically Consistent Alignment for Novel Object Discovery in Open-Vocabulary 3D Object Detection | 0
PhysLLM: Harnessing Large Language Models for Cross-Modal Remote Physiological Sensing | 0
PMMTalk: Speech-Driven 3D Facial Animation from Complementary Pseudo Multi-modal Features | 0
Prompt-based Context- and Domain-aware Pretraining for Vision and Language Navigation | 0
Prototype-guided Cross-modal Completion and Alignment for Incomplete Text-based Person Re-identification | 0
RAC3: Retrieval-Augmented Corner Case Comprehension for Autonomous Driving with Vision-Language Models | 0
Reinforcement Learning for Weakly Supervised Temporal Grounding of Natural Language in Untrimmed Videos | 0
Representation Discrepancy Bridging Method for Remote Sensing Image-Text Retrieval | 0
Retrieving-to-Answer: Zero-Shot Video Question Answering with Frozen Large Language Models | 0
Revisiting Misalignment in Multispectral Pedestrian Detection: A Language-Driven Approach for Cross-modal Alignment Fusion | 0
Scene-Intuitive Agent for Remote Embodied Visual Grounding | 0
SE4Lip: Speech-Lip Encoder for Talking Head Synthesis to Solve Phoneme-Viseme Alignment Ambiguity | 0
Page 7 of 7

No leaderboard results yet.