SOTAVerified

cross-modal alignment

Papers

Showing 276-300 of 342 papers

| Title | Status | Hype |
| --- | --- | --- |
| HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training | | 0 |
| MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation | Code | 2 |
| SimVTP: Simple Video Text Pre-training with Masked Autoencoders | Code | 0 |
| Asymmetric Cross-Scale Alignment for Text-Based Person Search | Code | 0 |
| Seeing What You Miss: Vision-Language Pre-training with Semantic Completion Learning | Code | 1 |
| How do Cross-View and Cross-Modal Alignment Affect Representations in Contrastive Learning? | | 0 |
| SMAUG: Sparse Masked Autoencoder for Efficient Video-Language Pre-training | | 0 |
| CAMANet: Class Activation Map Guided Attention Network for Radiology Report Generation | Code | 1 |
| Learning by Hallucinating: Vision-Language Pre-training with Weak Supervision | | 0 |
| Fine-grained Semantic Alignment Network for Weakly Supervised Temporal Language Grounding | | 0 |
| CLIP-Driven Fine-grained Text-Image Person Re-identification | Code | 1 |
| Discrete Cross-Modal Alignment Enables Zero-Shot Speech Translation | Code | 0 |
| Cross-modal Semantic Enhanced Interaction for Image-Sentence Retrieval | | 0 |
| Low-resource Neural Machine Translation with Cross-modal Alignment | Code | 1 |
| Multi-Granularity Cross-modal Alignment for Generalized Medical Visual Representation Learning | Code | 1 |
| Video Referring Expression Comprehension via Transformer with Content-aware Query | | 0 |
| JPG - Jointly Learn to Align: Automated Disease Prediction and Radiology Report Generation | | 0 |
| Translation, Scale and Rotation: Cross-Modal Alignment Meets RGB-Infrared Vehicle Detection | | 0 |
| TokenFlow: Rethinking Fine-grained Cross-modal Alignment in Vision-Language Retrieval | | 0 |
| Multi-Modal Cross-Domain Alignment Network for Video Moment Retrieval | | 0 |
| OmniVL: One Foundation Model for Image-Language and Video-Language Tasks | | 0 |
| Efficient Vision-Language Pretraining with Visual Concepts and Hierarchical Alignment | Code | 1 |
| See What You See: Self-supervised Cross-modal Retrieval of Visual Stimuli from Brain Activity | | 0 |
| Fine-Grained Semantically Aligned Vision-Language Pre-Training | Code | 1 |
| Masked Vision and Language Modeling for Multi-modal Representation Learning | | 0 |
Page 12 of 14

No leaderboard results yet.