SOTAVerified

cross-modal alignment

Papers

Showing 151–200 of 342 papers

Title | Status | Hype
SimVTP: Simple Video Text Pre-training with Masked Autoencoders | Code | 0
Asymmetric Cross-Scale Alignment for Text-Based Person Search | Code | 0
LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking | Code | 0
KALE: An Artwork Image Captioning System Augmented with Heterogeneous Graph | Code | 0
KD-VLP: Improving End-to-End Vision-and-Language Pretraining with Object Knowledge Distillation | Code | 0
Shushing! Let's Imagine an Authentic Speech from the Silent Video |  | 0
SMAUG: Sparse Masked Autoencoder for Efficient Video-Language Pre-training |  | 0
SoftCLIP: Softer Cross-modal Alignment Makes CLIP Stronger |  | 0
Sound Source Localization is All about Cross-Modal Alignment |  | 0
Speech-Language Models with Decoupled Tokenizers and Multi-Token Prediction |  | 0
ST-BERT: Cross-modal Language Model Pre-training For End-to-end Spoken Language Understanding |  | 0
Structured Multi-modal Feature Embedding and Alignment for Image-Sentence Retrieval |  | 0
SViQA: A Unified Speech-Vision Multimodal Model for Textless Visual Question Answering |  | 0
TANGO: Co-Speech Gesture Video Reenactment with Hierarchical Audio Motion Embedding and Diffusion Interpolation |  | 0
TAViS: Text-bridged Audio-Visual Segmentation with Foundation Models |  | 0
Temporal Order Preserved Optimal Transport-based Cross-modal Knowledge Transfer Learning for ASR |  | 0
Tiny-Align: Bridging Automatic Speech Recognition and Large Language Model on the Edge |  | 0
TMCIR: Token Merge Benefits Composed Image Retrieval |  | 0
TokenFlow: Rethinking Fine-grained Cross-modal Alignment in Vision-Language Retrieval |  | 0
TOT: Topology-Aware Optimal Transport For Multimodal Hate Detection |  | 0
Towards High-Fidelity Text-Guided 3D Face Generation and Manipulation Using only Images |  | 0
Towards LLM-Centric Multimodal Fusion: A Survey on Integration Strategies and Techniques |  | 0
Transformer-based Spatial Grounding: A Comprehensive Survey |  | 0
Translation, Scale and Rotation: Cross-Modal Alignment Meets RGB-Infrared Vehicle Detection |  | 0
TSDASeg: A Two-Stage Model with Direct Alignment for Interactive Point Cloud Segmentation |  | 0
TS-HTFA: Advancing Time Series Forecasting via Hierarchical Text-Free Alignment with Large Language Models |  | 0
UniCUE: Unified Recognition and Generation Framework for Chinese Cued Speech Video-to-Speech Generation |  | 0
Unifying Visual and Semantic Feature Spaces with Diffusion Models for Enhanced Cross-Modal Alignment |  | 0
UniGS: Unified Language-Image-3D Pretraining with Gaussian Splatting |  | 0
Unsupervised Cross-Modal Alignment of Speech and Text Embedding Spaces |  | 0
Video Referring Expression Comprehension via Transformer with Content-aware Query |  | 0
Video Referring Expression Comprehension via Transformer with Content-conditioned Query |  | 0
ViTaPEs: Visuotactile Position Encodings for Cross-Modal Alignment in Multimodal Transformers |  | 0
VLMixer: Unpaired Vision-Language Pre-training via Cross-Modal CutMix |  | 0
VLMT: Vision-Language Multimodal Transformer for Multimodal Multi-hop Question Answering |  | 0
Wearable Accelerometer Foundation Models for Health via Knowledge Distillation |  | 0
WhisQ: Cross-Modal Representation Learning for Text-to-Music MOS Prediction |  | 0
WiCo: Win-win Cooperation of Bottom-up and Top-down Referring Image Segmentation |  | 0
Language Model Mapping in Multimodal Music Learning: A Grand Challenge Proposal |  | 0
VISTA: Enhancing Vision-Text Alignment in MLLMs via Cross-Modal Mutual Information Maximization |  | 0
FALCON: False-Negative Aware Learning of Contrastive Negatives in Vision-Language Pretraining |  | 0
4D-ACFNet: A 4D Attention Mechanism-Based Prognostic Framework for Colorectal Cancer Liver Metastasis Integrating Multimodal Spatiotemporal Features |  | 0
ACMM: Aligned Cross-Modal Memory for Few-Shot Image and Sentence Matching |  | 0
ALAS: Measuring Latent Speech-Text Alignment For Spoken Language Understanding In Multimodal LLMs |  | 0
AlignGPT: Multi-modal Large Language Models with Adaptive Alignment Capability |  | 0
AlignMamba: Enhancing Multimodal Mamba with Local and Global Cross-modal Alignment |  | 0
AlignSTS: Speech-to-Singing Conversion via Cross-Modal Alignment |  | 0
ALN-P3: Unified Language Alignment for Perception, Prediction, and Planning in Autonomous Driving |  | 0
A Multi-Agent Framework for Automated Qinqiang Opera Script Generation Using Large Language Models |  | 0
A Survey of Automatic Prompt Engineering: An Optimization Perspective |  | 0

No leaderboard results yet.