SOTAVerified

cross-modal alignment

Papers

Showing 101–150 of 342 papers

Title | Status | Hype
Align and Prompt: Video-and-Language Pre-training with Entity Prompts | Code | 1
Landmark-RxR: Solving Vision-and-Language Navigation with Fine-Grained Alignment Supervision | Code | 1
Dynamic Modality Interaction Modeling for Image-Text Retrieval | Code | 1
EPMF: Efficient Perception-aware Multi-sensor Fusion for 3D Semantic Segmentation | Code | 1
DanceIt: Music-inspired Dancing Video Synthesis | Code | 1
Symbiotic Adversarial Learning for Attribute-based Person Search | Code | 1
Transformer-based Spatial Grounding: A Comprehensive Survey | - | 0
CATVis: Context-Aware Thought Visualization | - | 0
Bridge Feature Matching and Cross-Modal Alignment with Mutual-filtering for Zero-shot Anomaly Detection | - | 0
Evaluating Attribute Confusion in Fashion Text-to-Image Generation | - | 0
DALR: Dual-level Alignment Learning for Multimodal Sentence Representation Learning | - | 0
TSDASeg: A Two-Stage Model with Direct Alignment for Interactive Point Cloud Segmentation | - | 0
HyperPath: Knowledge-Guided Hyperbolic Semantic Hierarchy Modeling for WSI Analysis | Code | 0
Speech-Language Models with Decoupled Tokenizers and Multi-Token Prediction | - | 0
TAViS: Text-bridged Audio-Visual Segmentation with Foundation Models | - | 0
Improving Medical Visual Representation Learning with Pathological-level Cross-Modal Alignment and Correlation Exploration | - | 0
OmniDRCA: Parallel Speech-Text Foundation Model via Dual-Resolution Speech Representations and Contrastive Alignment | Code | 0
Generating Vision-Language Navigation Instructions Incorporated Fine-Grained Alignment Annotations | - | 0
Fusing Cross-modal and Uni-modal Representations: A Kronecker Product Approach | - | 0
WhisQ: Cross-Modal Representation Learning for Text-to-Music MOS Prediction | - | 0
Does Your 3D Encoder Really Work? When Pretrain-SFT from 2D VLMs Meets 3D VLMs | - | 0
Towards LLM-Centric Multimodal Fusion: A Survey on Integration Strategies and Techniques | - | 0
UniCUE: Unified Recognition and Generation Framework for Chinese Cued Speech Video-to-Speech Generation | - | 0
EmotionRankCLAP: Bridging Natural Language Speaking Styles and Ordinal Speech Emotion via Rank-N-Contrast | - | 0
ALAS: Measuring Latent Speech-Text Alignment For Spoken Language Understanding In Multimodal LLMs | - | 0
DiSa: Directional Saliency-Aware Prompt Learning for Generalizable Vision-Language Models | - | 0
ViTaPEs: Visuotactile Position Encodings for Cross-Modal Alignment in Multimodal Transformers | - | 0
From Alignment to Advancement: Bootstrapping Audio-Language Alignment with Synthetic Data | - | 0
Co-AttenDWG: Co-Attentive Dimension-Wise Gating and Expert Fusion for Multi-Modal Offensive Content Detection | - | 0
Deformable Attentive Visual Enhancement for Referring Segmentation Using Vision-Language Model | - | 0
MLLMs are Deeply Affected by Modality Bias | - | 0
Clip4Retrofit: Enabling Real-Time Image Labeling on Edge Devices via Cross-Architecture CLIP Distillation | - | 0
ICPL-ReID: Identity-Conditional Prompt Learning for Multi-Spectral Object Re-Identification | Code | 0
Representation Discrepancy Bridging Method for Remote Sensing Image-Text Retrieval | - | 0
ALN-P3: Unified Language Alignment for Perception, Prediction, and Planning in Autonomous Driving | - | 0
CAD: A General Multimodal Framework for Video Deepfake Detection via Cross-Modal Alignment and Distillation | - | 0
Enhancing LLMs for Time Series Forecasting via Structure-Guided Cross-Modal Alignment | - | 0
FALCON: False-Negative Aware Learning of Contrastive Negatives in Vision-Language Pretraining | - | 0
Beyond Modality Collapse: Representations Blending for Multimodal Dataset Distillation | - | 0
VISTA: Enhancing Vision-Text Alignment in MLLMs via Cross-Modal Mutual Information Maximization | - | 0
Adaptive Spatial Transcriptomics Interpolation via Cross-modal Cross-slice Modeling | Code | 0
Denoising and Alignment: Rethinking Domain Generalization for Multimodal Face Anti-Spoofing | - | 0
Anatomical Attention Alignment representation for Radiology Report Generation | Code | 0
HCMA: Hierarchical Cross-model Alignment for Grounded Text-to-Image Generation | Code | 0
Semantic-Space-Intervened Diffusive Alignment for Visual Classification | - | 0
Task-Adapter++: Task-specific Adaptation with Order-aware Alignment for Few-shot Action Recognition | Code | 0
Probabilistic Embeddings for Frozen Vision-Language Models: Uncertainty Quantification with Gaussian Process Latent Variable Models | Code | 0
DenseGrounding: Improving Dense Language-Vision Semantics for Ego-Centric 3D Visual Grounding | - | 0
PhysLLM: Harnessing Large Language Models for Cross-Modal Remote Physiological Sensing | - | 0
MicarVLMoE: A Modern Gated Cross-Aligned Vision-Language Mixture of Experts Model for Medical Image Captioning and Report Generation | Code | 0

No leaderboard results yet.