SOTAVerified

cross-modal alignment

Papers

Showing 201-250 of 342 papers

Title | Status | Hype
MMA-DFER: MultiModal Adaptation of unimodal models for Dynamic Facial Expression Recognition in-the-wild | Code | 2
Distributionally Robust Alignment for Medical Federated Vision-Language Pre-training Under Data Heterogeneity | - | 0
CIRP: Cross-Item Relational Pre-training for Multimodal Product Bundling | - | 0
SeCG: Semantic-Enhanced 3D Visual Grounding via Cross-modal Graph Attention | Code | 0
Multi-Grained Cross-modal Alignment for Learning Open-vocabulary Semantic Segmentation from Text Supervision | - | 0
A Cross-Modal Approach to Silent Speech with LLM-Enhanced Recognition | Code | 1
Multi-modal Attribute Prompting for Vision-Language Models | - | 0
Semantics-enhanced Cross-modal Masked Image Modeling for Vision-Language Pre-training | - | 0
MENTOR: Multi-level Self-supervised Learning for Multimodal Recommendation | Code | 1
Mind the Modality Gap: Towards a Remote Sensing Vision-Language Model via Cross-modal Alignment | - | 0
Cross-Modal Prototype based Multimodal Federated Learning under Severely Missing Modality | - | 0
Multi-level Cross-modal Alignment for Image Clustering | - | 0
The Devil is in the Details: Boosting Guided Depth Super-Resolution via Rethinking Cross-Modal Alignment and Aggregation | Code | 1
Self-supervised Feature Adaptation for 3D Industrial Anomaly Detection | - | 0
Auffusion: Leveraging the Power of Diffusion and Large Language Models for Text-to-Audio Generation | Code | 2
Linguistic-Aware Patch Slimming Framework for Fine-grained Cross-Modal Alignment | Code | 2
Multi-Prompts Learning with Cross-Modal Alignment for Attribute-based Person Re-Identification | - | 0
Detection-based Intermediate Supervision for Visual Question Answering | - | 0
Conditional Variational Autoencoder for Sign Language Translation with Cross-Modal Alignment | Code | 1
BrainVis: Exploring the Bridge between Brain and Visual Signals via Image Reconstruction | Code | 1
Towards Balanced Alignment: Modal-Enhanced Semantic Modeling for Video Moment Retrieval | Code | 1
Mask Grounding for Referring Image Segmentation | Code | 1
M^2ConceptBase: A Fine-Grained Aligned Concept-Centric Multimodal Knowledge Base | Code | 0
Improving Cross-modal Alignment with Synthetic Pairs for Text-only Image Captioning | - | 0
ViLA: Efficient Video-Language Alignment for Video Question Answering | Code | 1
OpenSight: A Simple Open-Vocabulary Framework for LiDAR-Based Object Detection | - | 0
Navigating Open Set Scenarios for Skeleton-based Action Recognition | Code | 1
Progressive Multi-Modality Learning for Inverse Protein Folding | Code | 1
PMMTalk: Speech-Driven 3D Facial Animation from Complementary Pseudo Multi-modal Features | - | 0
DAP: Domain-aware Prompt Learning for Vision-and-Language Navigation | - | 0
MCAD: Multi-teacher Cross-modal Alignment Distillation for efficient image-text retrieval | - | 0
Video Referring Expression Comprehension via Transformer with Content-conditioned Query | - | 0
On the Language Encoder of Contrastive Cross-modal Models | - | 0
Separating Invisible Sounds Toward Universal Audiovisual Scene-Aware Sound Separation | - | 0
Robust Graph Matching Using An Unbalanced Hierarchical Optimal Transport Framework | Code | 0
CoDA: Collaborative Novel Box Discovery and Cross-modal Alignment for Open-vocabulary 3D Object Detection | Code | 2
ReForm-Eval: Evaluating Large Vision Language Models via Unified Re-Formulation of Task-Oriented Benchmarks | Code | 1
Prototype-guided Cross-modal Completion and Alignment for Incomplete Text-based Person Re-identification | - | 0
Align before Search: Aligning Ads Image to Text for Accurate Cross-Modal Sponsored Search | Code | 0
VDC: Versatile Data Cleanser based on Visual-Linguistic Inconsistency by Multimodal Large Language Models | Code | 1
Cross-modal Alignment with Optimal Transport for CTC-based ASR | - | 0
Sound Source Localization is All about Cross-Modal Alignment | - | 0
Multi-Semantic Fusion Model for Generalized Zero-Shot Skeleton-Based Action Recognition | Code | 1
Prompt-based Context- and Domain-aware Pretraining for Vision and Language Navigation | - | 0
Towards High-Fidelity Text-Guided 3D Face Generation and Manipulation Using only Images | - | 0
Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models | Code | 1
Grounded Entity-Landmark Adaptive Pre-training for Vision-and-Language Navigation | Code | 1
DiffCloth: Diffusion Based Garment Synthesis and Manipulation via Structural Cross-modal Semantic Alignment | - | 0
Language-Guided Diffusion Model for Visual Grounding | Code | 0
AerialVLN: Vision-and-Language Navigation for UAVs | Code | 2

No leaderboard results yet.