SOTAVerified

cross-modal alignment

Papers

Showing 151-200 of 342 papers

Title | Status | Hype
A Multi-Agent Framework for Automated Qinqiang Opera Script Generation Using Large Language Models | | 0
Cross-attention for State-based model RWKV-7 | | 0
TMCIR: Token Merge Benefits Composed Image Retrieval | | 0
3D CoCa: Contrastive Learners are 3D Captioners | Code | 0
InfoMAE: Pair-Efficient Cross-Modal Alignment for Multimodal Time-Series Sensing Signals | | 0
VLMT: Vision-Language Multimodal Transformer for Multimodal Multi-hop Question Answering | | 0
SE4Lip: Speech-Lip Encoder for Talking Head Synthesis to Solve Phoneme-Viseme Alignment Ambiguity | | 0
Gaze-Guided Learning: Avoiding Shortcut Bias in Visual Classification | Code | 0
Leveraging Modality Tags for Enhanced Cross-Modal Video Retrieval | | 0
DF-Calib: Targetless LiDAR-Camera Calibration via Depth Flow | | 0
COST: Contrastive One-Stage Transformer for Vision-Language Small Object Tracking | | 0
FineLIP: Extending CLIP's Reach via Fine-Grained Alignment with Longer Text Inputs | | 0
SViQA: A Unified Speech-Vision Multimodal Model for Textless Visual Question Answering | | 0
CADFormer: Fine-Grained Cross-modal Alignment and Decoding Transformer for Referring Remote Sensing Image Segmentation | | 0
NeuroLIP: Interpretable and Fair Cross-Modal Alignment of fMRI and Phenotypic Text | | 0
AutoRad-Lung: A Radiomic-Guided Prompting Autoregressive Vision-Language Model for Lung Nodule Malignancy Prediction | | 0
GatedxLSTM: A Multimodal Affective Computing Approach for Emotion Recognition in Conversations | | 0
LangBridge: Interpreting Image as a Combination of Language Embeddings | | 0
Language-based Image Colorization: A Benchmark and Beyond | Code | 0
Shushing! Let's Imagine an Authentic Speech from the Silent Video | | 0
Observation-Graph Interaction and Key-Detail Guidance for Vision and Language Navigation | | 0
Technical Approach for the EMI Challenge in the 8th Affective Behavior Analysis in-the-Wild Competition | | 0
4D-ACFNet: A 4D Attention Mechanism-Based Prognostic Framework for Colorectal Cancer Liver Metastasis Integrating Multimodal Spatiotemporal Features | | 0
Hierarchical Cross-Modal Alignment for Open-Vocabulary 3D Object Detection | | 0
LLaVA-RadZ: Can Multimodal Large Language Models Effectively Tackle Zero-shot Radiology Recognition? | | 0
OV-SCAN: Semantically Consistent Alignment for Novel Object Discovery in Open-Vocabulary 3D Object Detection | | 0
RCRank: Multimodal Ranking of Root Causes of Slow Queries in Cloud Database Systems | Code | 0
Enhancing Vision-Language Compositional Understanding with Multimodal Synthetic Data | | 0
Language Model Mapping in Multimodal Music Learning: A Grand Challenge Proposal | | 0
UniGS: Unified Language-Image-3D Pretraining with Gaussian Splatting | | 0
DUNIA: Pixel-Sized Embeddings via Cross-Modal Alignment for Earth Observation Applications | | 0
MV-CLAM: Multi-View Molecular Interpretation with Cross-Modal Projection via Language Model | Code | 0
CardiacMamba: A Multimodal RGB-RF Fusion Framework with State Space Models for Remote Physiological Measurement | Code | 0
NOTA: Multimodal Music Notation Understanding for Visual Large Language Model | | 0
A Survey of Automatic Prompt Engineering: An Optimization Perspective | | 0
MDE: Modality Discrimination Enhancement for Multi-modal Recommendation | | 0
Leveraging Pre-Trained Models for Multimodal Class-Incremental Learning under Adaptive Fusion | | 0
Cross-modal Context Fusion and Adaptive Graph Convolutional Network for Multimodal Conversational Emotion Recognition | | 0
Integrate Temporal Graph Learning into LLM-based Temporal Knowledge Graph Model | | 0
CGP-Tuning: Structure-Aware Soft Prompt Tuning for Code Vulnerability Detection | | 0
Audio-Visual Semantic Graph Network for Audio-Visual Event Localization | | 0
Generalized Zero-Shot Classification via Semantics-Free Inter-Class Feature Generation | | 0
Chat-based Person Retrieval via Dialogue-Refined Cross-Modal Alignment | | 0
Enhancing Multimodal Emotion Recognition through Multi-Granularity Cross-Modal Alignment | | 0
ChartAdapter: Large Vision-Language Model for Chart Summarization | | 0
Enhancing Visual Representation for Text-based Person Searching | Code | 0
Bag of Tricks for Multimodal AutoML with Image, Text, and Tabular Data | | 0
RAC3: Retrieval-Augmented Corner Case Comprehension for Autonomous Driving with Vision-Language Models | | 0
Wearable Accelerometer Foundation Models for Health via Knowledge Distillation | | 0
Dynamic Cross-Modal Alignment for Robust Semantic Location Prediction | | 0

No leaderboard results yet.