SOTAVerified

cross-modal alignment

Papers

Showing 251-300 of 342 papers

Title | Status | Hype
Contrast-augmented Diffusion Model with Fine-grained Sequence Alignment for Markup-to-Image Generation | Code | 0
Text-guided Image Restoration and Semantic Enhancement for Text-to-Image Person Retrieval | Code | 1
WiCo: Win-win Cooperation of Bottom-up and Top-down Referring Image Segmentation | | 0
Retrieving-to-Answer: Zero-Shot Video Question Answering with Frozen Large Language Models | | 0
Global and Local Semantic Completion Learning for Vision-Language Pre-training | Code | 1
ManagerTower: Aggregating the Insights of Uni-Modal Experts for Vision-Language Representation Learning | Code | 1
SOC: Semantic-Assisted Object Cluster for Referring Video Object Segmentation | Code | 1
Improving speech translation by fusing speech and text | | 0
Speech-Text Dialog Pre-training for Spoken Dialog Understanding with Explicit Cross-Modal Alignment | | 0
Multi-task Paired Masking with Alignment Modeling for Medical Vision-Language Pre-training | | 0
AlignSTS: Speech-to-Singing Conversion via Cross-Modal Alignment | | 0
Towards Medical Artificial General Intelligence via Knowledge-Enhanced Multimodal Pretraining | Code | 1
CoVLR: Coordinating Cross-Modal Consistency and Intra-Modal Structure for Vision-Language Retrieval | | 0
Unraveling Instance Associations: A Closer Look for Audio-Visual Segmentation | Code | 1
SoftCLIP: Softer Cross-modal Alignment Makes CLIP Stronger | | 0
Unmasked Teacher: Towards Training-Efficient Video Foundation Models | Code | 0
Revisiting Multimodal Representation in Contrastive Learning: From Patch and Token Embeddings to Finite Discrete Tokens | Code | 1
CVT-SLR: Contrastive Visual-Textual Transformation for Sign Language Recognition with Variational Alignment | Code | 1
LoGoNet: Towards Accurate 3D Object Detection with Local-to-Global Cross-Modal Fusion | Code | 0
HiCLIP: Contrastive Language-Image Pretraining with Hierarchy-aware Attention | Code | 1
TOT: Topology-Aware Optimal Transport For Multimodal Hate Detection | | 0
End-to-end Semantic Object Detection with Cross-Modal Alignment | | 0
Does Vision Accelerate Hierarchical Generalization in Neural Language Learners? | | 0
Improving Cross-modal Alignment for Text-Guided Image Inpainting | | 0
Linguistic Query-Guided Mask Generation for Referring Image Segmentation | | 0
HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training | | 0
MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation | Code | 2
SimVTP: Simple Video Text Pre-training with Masked Autoencoders | Code | 0
Asymmetric Cross-Scale Alignment for Text-Based Person Search | Code | 0
Seeing What You Miss: Vision-Language Pre-training with Semantic Completion Learning | Code | 1
How do Cross-View and Cross-Modal Alignment Affect Representations in Contrastive Learning? | | 0
SMAUG: Sparse Masked Autoencoder for Efficient Video-Language Pre-training | | 0
CAMANet: Class Activation Map Guided Attention Network for Radiology Report Generation | Code | 1
Learning by Hallucinating: Vision-Language Pre-training with Weak Supervision | | 0
Fine-grained Semantic Alignment Network for Weakly Supervised Temporal Language Grounding | | 0
CLIP-Driven Fine-grained Text-Image Person Re-identification | Code | 1
Discrete Cross-Modal Alignment Enables Zero-Shot Speech Translation | Code | 0
Cross-modal Semantic Enhanced Interaction for Image-Sentence Retrieval | | 0
Low-resource Neural Machine Translation with Cross-modal Alignment | Code | 1
Multi-Granularity Cross-modal Alignment for Generalized Medical Visual Representation Learning | Code | 1
Video Referring Expression Comprehension via Transformer with Content-aware Query | | 0
JPG - Jointly Learn to Align: Automated Disease Prediction and Radiology Report Generation | | 0
Translation, Scale and Rotation: Cross-Modal Alignment Meets RGB-Infrared Vehicle Detection | | 0
TokenFlow: Rethinking Fine-grained Cross-modal Alignment in Vision-Language Retrieval | | 0
Multi-Modal Cross-Domain Alignment Network for Video Moment Retrieval | | 0
OmniVL: One Foundation Model for Image-Language and Video-Language Tasks | | 0
Efficient Vision-Language Pretraining with Visual Concepts and Hierarchical Alignment | Code | 1
See What You See: Self-supervised Cross-modal Retrieval of Visual Stimuli from Brain Activity | | 0
Fine-Grained Semantically Aligned Vision-Language Pre-Training | Code | 1
Masked Vision and Language Modeling for Multi-modal Representation Learning | | 0

No leaderboard results yet.