SOTAVerified

cross-modal alignment

Papers

Showing 101-150 of 342 papers

| Title | Status | Hype |
| --- | --- | --- |
| CoMP: Continual Multimodal Pre-training for Vision Foundation Models | Code | 1 |
| Global and Local Semantic Completion Learning for Vision-Language Pre-training | Code | 1 |
| Mask Grounding for Referring Image Segmentation | Code | 1 |
| mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections | Code | 1 |
| RSRefSeg 2: Decoupling Referring Remote Sensing Image Segmentation with Foundation Models | Code | 1 |
| WhiSPA: Semantically and Psychologically Aligned Whisper with Self-Supervised Contrastive and Student-Teacher Learning | Code | 1 |
| Enhancing Multimodal Emotion Recognition through Multi-Granularity Cross-Modal Alignment | | 0 |
| Enhancing LLMs for Time Series Forecasting via Structure-Guided Cross-Modal Alignment | | 0 |
| Co-AttenDWG: Co-Attentive Dimension-Wise Gating and Expert Fusion for Multi-Modal Offensive Content Detection | | 0 |
| Enhancing Emotion Recognition in Incomplete Data: A Novel Cross-Modal Alignment, Reconstruction, and Refinement Framework | | 0 |
| EmotionRankCLAP: Bridging Natural Language Speaking Styles and Ordinal Speech Emotion via Rank-N-Contrast | | 0 |
| Coarse-to-fine Alignment Makes Better Speech-image Retrieval | | 0 |
| A Survey of Automatic Prompt Engineering: An Optimization Perspective | | 0 |
| EMMA: Empowering Multi-modal Mamba with Structural and Hierarchical Alignment | | 0 |
| EA-VTR: Event-Aware Video-Text Retrieval | | 0 |
| CLIP-PING: Boosting Lightweight Vision-Language Models with Proximus Intrinsic Neighbors Guidance | | 0 |
| Dynamic Cross-Modal Alignment for Robust Semantic Location Prediction | | 0 |
| DUNIA: Pixel-Sized Embeddings via Cross-Modal Alignment for Earth Observation Applications | | 0 |
| Technical Approach for the EMI Challenge in the 8th Affective Behavior Analysis in-the-Wild Competition | | 0 |
| End-to-end Semantic Object Detection with Cross-Modal Alignment | | 0 |
| Clip4Retrofit: Enabling Real-Time Image Labeling on Edge Devices via Cross-Architecture CLIP Distillation | | 0 |
| ALAS: Measuring Latent Speech-Text Alignment For Spoken Language Understanding In Multimodal LLMs | | 0 |
| 4D-ACFNet: A 4D Attention Mechanism-Based Prognostic Framework for Colorectal Cancer Liver Metastasis Integrating Multimodal Spatiotemporal Features | | 0 |
| Enhancing Modality Representation and Alignment for Multimodal Cold-start Active Learning | | 0 |
| Does Your 3D Encoder Really Work? When Pretrain-SFT from 2D VLMs Meets 3D VLMs | | 0 |
| Enhancing Vision-Language Compositional Understanding with Multimodal Synthetic Data | | 0 |
| Does Vision Accelerate Hierarchical Generalization in Neural Language Learners? | | 0 |
| CIRP: Cross-Item Relational Pre-training for Multimodal Product Bundling | | 0 |
| Disentangled Noisy Correspondence Learning | | 0 |
| Chat-based Person Retrieval via Dialogue-Refined Cross-Modal Alignment | | 0 |
| Language Model Mapping in Multimodal Music Learning: A Grand Challenge Proposal | | 0 |
| ChartAdapter: Large Vision-Language Model for Chart Summarization | | 0 |
| DiSa: Directional Saliency-Aware Prompt Learning for Generalizable Vision-Language Models | | 0 |
| CGP-Tuning: Structure-Aware Soft Prompt Tuning for Code Vulnerability Detection | | 0 |
| DiffCloth: Diffusion Based Garment Synthesis and Manipulation via Structural Cross-modal Semantic Alignment | | 0 |
| KD-VLP: Improving End-to-End Vision-and-Language Pretraining with Object Knowledge Distillation | | 0 |
| DF-Calib: Targetless LiDAR-Camera Calibration via Depth Flow | | 0 |
| A Multi-Agent Framework for Automated Qinqiang Opera Script Generation Using Large Language Models | | 0 |
| Detection-based Intermediate Supervision for Visual Question Answering | | 0 |
| CATVis: Context-Aware Thought Visualization | | 0 |
| Intriguing Properties of Large Language and Vision Models | | 0 |
| DenseGrounding: Improving Dense Language-Vision Semantics for Ego-Centric 3D Visual Grounding | | 0 |
| Denoising and Alignment: Rethinking Domain Generalization for Multimodal Face Anti-Spoofing | | 0 |
| ALN-P3: Unified Language Alignment for Perception, Prediction, and Planning in Autonomous Driving | | 0 |
| Deformable Attentive Visual Enhancement for Referring Segmentation Using Vision-Language Model | | 0 |
| Towards Brain Passage Retrieval -- An Investigation of EEG Query Representations | | 0 |
| Integrate Temporal Graph Learning into LLM-based Temporal Knowledge Graph Model | | 0 |
| JPG - Jointly Learn to Align: Automated Disease Prediction and Radiology Report Generation | | 0 |
| LangBridge: Interpreting Image as a Combination of Language Embeddings | | 0 |
| DAP: Domain-aware Prompt Learning for Vision-and-Language Navigation | | 0 |

No leaderboard results yet.