SOTAVerified

cross-modal alignment

Papers

Showing 201–250 of 342 papers

| Title | Status | Hype |
|---|---|---|
| GEXIA: Granularity Expansion and Iterative Approximation for Scalable Multi-grained Video-language Learning | | 0 |
| Hierarchical Cross-Modal Alignment for Open-Vocabulary 3D Object Detection | | 0 |
| Hire: Hybrid-modal Interaction with Multiple Relational Enhancements for Image-Text Matching | | 0 |
| HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training | | 0 |
| How do Cross-View and Cross-Modal Alignment Affect Representations in Contrastive Learning? | | 0 |
| Improving Cross-modal Alignment for Text-Guided Image Inpainting | | 0 |
| Improving Cross-modal Alignment with Synthetic Pairs for Text-only Image Captioning | | 0 |
| Improving Medical Visual Representation Learning with Pathological-level Cross-Modal Alignment and Correlation Exploration | | 0 |
| Improving speech translation by fusing speech and text | | 0 |
| InfoMAE: Pair-Efficient Cross-Modal Alignment for Multimodal Time-Series Sensing Signals | | 0 |
| Integrate Temporal Graph Learning into LLM-based Temporal Knowledge Graph Model | | 0 |
| Intriguing Properties of Large Language and Vision Models | | 0 |
| JPG - Jointly Learn to Align: Automated Disease Prediction and Radiology Report Generation | | 0 |
| KD-VLP: Improving End-to-End Vision-and-Language Pretraining with Object Knowledge Distillation | | 0 |
| LangBridge: Interpreting Image as a Combination of Language Embeddings | | 0 |
| Linguistic Query-Guided Mask Generation for Referring Image Segmentation | | 0 |
| Learning Better Visual Representations for Weakly-Supervised Object Detection Using Natural Language Supervision | | 0 |
| Learning by Hallucinating: Vision-Language Pre-training with Weak Supervision | | 0 |
| Learning Joint Embedding with Modality Alignments for Cross-Modal Retrieval of Recipes and Food Images | | 0 |
| Learning Multi-Modal Nonlinear Embeddings: Performance Bounds and an Algorithm | | 0 |
| Learning to Localize Actions in Instructional Videos with LLM-Based Multi-Pathway Text-Video Alignment | | 0 |
| Let Me Finish My Sentence: Video Temporal Grounding with Holistic Text Understanding | | 0 |
| Leveraging Modality Tags for Enhanced Cross-Modal Video Retrieval | | 0 |
| Leveraging Pre-Trained Models for Multimodal Class-Incremental Learning under Adaptive Fusion | | 0 |
| OV-SCAN: Semantically Consistent Alignment for Novel Object Discovery in Open-Vocabulary 3D Object Detection | | 0 |
| PhysLLM: Harnessing Large Language Models for Cross-Modal Remote Physiological Sensing | | 0 |
| PMMTalk: Speech-Driven 3D Facial Animation from Complementary Pseudo Multi-modal Features | | 0 |
| Prompt-based Context- and Domain-aware Pretraining for Vision and Language Navigation | | 0 |
| Prototype-guided Cross-modal Completion and Alignment for Incomplete Text-based Person Re-identification | | 0 |
| RAC3: Retrieval-Augmented Corner Case Comprehension for Autonomous Driving with Vision-Language Models | | 0 |
| Reinforcement Learning for Weakly Supervised Temporal Grounding of Natural Language in Untrimmed Videos | | 0 |
| Representation Discrepancy Bridging Method for Remote Sensing Image-Text Retrieval | | 0 |
| Retrieving-to-Answer: Zero-Shot Video Question Answering with Frozen Large Language Models | | 0 |
| Revisiting Misalignment in Multispectral Pedestrian Detection: A Language-Driven Approach for Cross-modal Alignment Fusion | | 0 |
| Scene-Intuitive Agent for Remote Embodied Visual Grounding | | 0 |
| SE4Lip: Speech-Lip Encoder for Talking Head Synthesis to Solve Phoneme-Viseme Alignment Ambiguity | | 0 |
| See What You See: Self-supervised Cross-modal Retrieval of Visual Stimuli from Brain Activity | | 0 |
| Self-supervised Feature Adaptation for 3D Industrial Anomaly Detection | | 0 |
| Semantics-enhanced Cross-modal Masked Image Modeling for Vision-Language Pre-training | | 0 |
| Semantic-Space-Intervened Diffusive Alignment for Visual Classification | | 0 |
| Separating Invisible Sounds Toward Universal Audiovisual Scene-Aware Sound Separation | | 0 |
| Shushing! Let's Imagine an Authentic Speech from the Silent Video | | 0 |
| SMAUG: Sparse Masked Autoencoder for Efficient Video-Language Pre-training | | 0 |
| SoftCLIP: Softer Cross-modal Alignment Makes CLIP Stronger | | 0 |
| Sound Source Localization is All about Cross-Modal Alignment | | 0 |
| Speech-Language Models with Decoupled Tokenizers and Multi-Token Prediction | | 0 |
| ST-BERT: Cross-modal Language Model Pre-training For End-to-end Spoken Language Understanding | | 0 |
| Structured Multi-modal Feature Embedding and Alignment for Image-Sentence Retrieval | | 0 |
| SViQA: A Unified Speech-Vision Multimodal Model for Textless Visual Question Answering | | 0 |
| TANGO: Co-Speech Gesture Video Reenactment with Hierarchical Audio Motion Embedding and Diffusion Interpolation | | 0 |

No leaderboard results yet.