| Separating Invisible Sounds Toward Universal Audiovisual Scene-Aware Sound Separation | Oct 18, 2023 | cross-modal alignment | —Unverified | 0 |
| Shushing! Let's Imagine an Authentic Speech from the Silent Video | Mar 19, 2025 | cross-modal alignment, Language Modeling | —Unverified | 0 |
| SMAUG: Sparse Masked Autoencoder for Efficient Video-Language Pre-training | Nov 21, 2022 | cross-modal alignment, GPU | —Unverified | 0 |
| SoftCLIP: Softer Cross-modal Alignment Makes CLIP Stronger | Mar 30, 2023 | cross-modal alignment, zero-shot-classification | —Unverified | 0 |
| Sound Source Localization is All about Cross-Modal Alignment | Sep 19, 2023 | All, cross-modal alignment | —Unverified | 0 |
| Speech-Language Models with Decoupled Tokenizers and Multi-Token Prediction | Jun 14, 2025 | cross-modal alignment | —Unverified | 0 |
| Speech-Text Dialog Pre-training for Spoken Dialog Understanding with Explicit Cross-Modal Alignment | May 19, 2023 | cross-modal alignment, Emotion Recognition in Conversation | —Unverified | 0 |
| ST-BERT: Cross-modal Language Model Pre-training For End-to-end Spoken Language Understanding | Oct 23, 2020 | cross-modal alignment, Language Modeling | —Unverified | 0 |
| Structured Multi-modal Feature Embedding and Alignment for Image-Sentence Retrieval | Aug 5, 2021 | cross-modal alignment, Retrieval | —Unverified | 0 |
| SViQA: A Unified Speech-Vision Multimodal Model for Textless Visual Question Answering | Apr 1, 2025 | cross-modal alignment, Question Answering | —Unverified | 0 |