| LESS: Label-Efficient and Single-Stage Referring 3D Segmentation | Oct 17, 2024 | cross-modal alignment, Instance Segmentation | Code Available | 1 |
| OMCAT: Omni Context Aware Transformer | Oct 15, 2024 | Audio-visual Question Answering, Audio-Visual Question Answering (AVQA) | Unverified | 0 |
| Evaluating Semantic Variation in Text-to-Image Synthesis: A Causal Perspective | Oct 14, 2024 | cross-modal alignment, Image Generation | Code Available | 0 |
| Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate | Oct 9, 2024 | cross-modal alignment, Visual Question Answering | Code Available | 2 |
| EMMA: Empowering Multi-modal Mamba with Structural and Hierarchical Alignment | Oct 8, 2024 | cross-modal alignment, Hallucination | Unverified | 0 |
| Intriguing Properties of Large Language and Vision Models | Oct 7, 2024 | cross-modal alignment, Large Language Model | Unverified | 0 |
| TANGO: Co-Speech Gesture Video Reenactment with Hierarchical Audio Motion Embedding and Diffusion Interpolation | Oct 5, 2024 | cross-modal alignment, Retrieval | Unverified | 0 |
| Boosting Masked ECG-Text Auto-Encoders as Discriminative Learners | Oct 3, 2024 | cross-modal alignment | Code Available | 1 |
| Melody-Guided Music Generation | Sep 30, 2024 | cross-modal alignment, Music Generation | Code Available | 2 |
| Fully Aligned Network for Referring Image Segmentation | Sep 29, 2024 | cross-modal alignment, Decoder | Unverified | 0 |