SOTAVerified

cross-modal alignment

Papers

Showing 1-25 of 342 papers

Title | Status | Hype
Transformer-based Spatial Grounding: A Comprehensive Survey | - | 0
Bridge Feature Matching and Cross-Modal Alignment with Mutual-filtering for Zero-shot Anomaly Detection | - | 0
CATVis: Context-Aware Thought Visualization | - | 0
Evaluating Attribute Confusion in Fashion Text-to-Image Generation | - | 0
RSRefSeg 2: Decoupling Referring Remote Sensing Image Segmentation with Foundation Models | Code | 1
Skywork-R1V3 Technical Report | Code | 7
DeSTA2.5-Audio: Toward General-Purpose Large Audio Language Model with Self-Generated Cross-Modal Alignment | Code | 2
Flash-VStream: Efficient Real-Time Understanding for Long Video Streams | Code | 3
DALR: Dual-level Alignment Learning for Multimodal Sentence Representation Learning | - | 0
TSDASeg: A Two-Stage Model with Direct Alignment for Interactive Point Cloud Segmentation | - | 0
HyperPath: Knowledge-Guided Hyperbolic Semantic Hierarchy Modeling for WSI Analysis | Code | 0
Speech-Language Models with Decoupled Tokenizers and Multi-Token Prediction | - | 0
TAViS: Text-bridged Audio-Visual Segmentation with Foundation Models | - | 0
Improving Medical Visual Representation Learning with Pathological-level Cross-Modal Alignment and Correlation Exploration | - | 0
OmniDRCA: Parallel Speech-Text Foundation Model via Dual-Resolution Speech Representations and Contrastive Alignment | Code | 0
ReID5o: Achieving Omni Multi-modal Person Re-identification in a Single Model | Code | 2
Fusing Cross-modal and Uni-modal Representations: A Kronecker Product Approach | - | 0
Generating Vision-Language Navigation Instructions Incorporated Fine-Grained Alignment Annotations | - | 0
WhisQ: Cross-Modal Representation Learning for Text-to-Music MOS Prediction | - | 0
Does Your 3D Encoder Really Work? When Pretrain-SFT from 2D VLMs Meets 3D VLMs | - | 0
Towards LLM-Centric Multimodal Fusion: A Survey on Integration Strategies and Techniques | - | 0
UniCUE: Unified Recognition and Generation Framework for Chinese Cued Speech Video-to-Speech Generation | - | 0
EmotionRankCLAP: Bridging Natural Language Speaking Styles and Ordinal Speech Emotion via Rank-N-Contrast | - | 0
DiSa: Directional Saliency-Aware Prompt Learning for Generalizable Vision-Language Models | - | 0
From Alignment to Advancement: Bootstrapping Audio-Language Alignment with Synthetic Data | - | 0

No leaderboard results yet.