SOTAVerified

cross-modal alignment

Papers

Showing 26-50 of 342 papers

Title | Status | Hype
AerialVLN: Vision-and-Language Navigation for UAVs | Code | 2
MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation | Code | 2
Vision-Language Pre-Training with Triple Contrastive Learning | Code | 2
RSRefSeg 2: Decoupling Referring Remote Sensing Image Segmentation with Foundation Models | Code | 1
Modality Curation: Building Universal Embeddings for Advanced Multimodal Information Retrieval | Code | 1
U-SAM: An audio language Model for Unified Speech, Audio, and Music Understanding | Code | 1
MSCI: Addressing CLIP's Inherent Limitations for Compositional Zero-Shot Learning | Code | 1
CAV-MAE Sync: Improving Contrastive Audio-Visual Mask Autoencoders via Fine-Grained Alignment | Code | 1
Multimodal Fusion and Vision-Language Models: A Survey for Robot Vision | Code | 1
BiPVL-Seg: Bidirectional Progressive Vision-Language Fusion with Global-Local Alignment for Medical Image Segmentation | Code | 1
LPOSS: Label Propagation Over Patches and Pixels for Open-vocabulary Semantic Segmentation | Code | 1
CoMP: Continual Multimodal Pre-training for Vision Foundation Models | Code | 1
Enhanced OoD Detection through Cross-Modal Alignment of Multi-Modal Representations | Code | 1
Cross-modal Causal Relation Alignment for Video Question Grounding | Code | 1
SwimVG: Step-wise Multimodal Fusion and Adaption for Visual Grounding | Code | 1
CLIP Behaves like a Bag-of-Words Model Cross-modally but not Uni-modally | Code | 1
WhiSPA: Semantically and Psychologically Aligned Whisper with Self-Supervised Contrastive and Student-Teacher Learning | Code | 1
Diffusion Bridge: Leveraging Diffusion Model to Reduce the Modality Gap Between Text and Vision for Zero-Shot Image Captioning | Code | 1
Align-KD: Distilling Cross-Modal Alignment Knowledge for Mobile Vision-Language Large Model Enhancement | Code | 1
Free Lunch Enhancements for Multi-modal Crowd Counting | Code | 1
ASAP: Advancing Semantic Alignment Promotes Multi-Modal Manipulation Detecting and Grounding | Code | 1
GEAL: Generalizable 3D Affordance Learning with Cross-Modal Consistency | Code | 1
Multimodal Music Generation with Explicit Bridges and Retrieval Augmentation | Code | 1
Align-KD: Distilling Cross-Modal Alignment Knowledge for Mobile Vision-Language Model | Code | 1
SimCMF: A Simple Cross-modal Fine-tuning Strategy from Vision Foundation Models to Any Imaging Modality | Code | 1
Page 2 of 14

No leaderboard results yet.