| HiVG: Hierarchical Multimodal Fine-grained Modulation for Visual Grounding | Apr 20, 2024 | cross-modal alignmentVisual Grounding | CodeCode Available | 2 |
| MMA-DFER: MultiModal Adaptation of unimodal models for Dynamic Facial Expression Recognition in-the-wild | Apr 13, 2024 | cross-modal alignmentDynamic Facial Expression Recognition | CodeCode Available | 2 |
| Auffusion: Leveraging the Power of Diffusion and Large Language Models for Text-to-Audio Generation | Jan 2, 2024 | Audio Generationcross-modal alignment | CodeCode Available | 2 |
| Linguistic-Aware Patch Slimming Framework for Fine-grained Cross-Modal Alignment | Jan 1, 2024 | cross-modal alignmentCross-Modal Retrieval | CodeCode Available | 2 |
| CoDA: Collaborative Novel Box Discovery and Cross-modal Alignment for Open-vocabulary 3D Object Detection | Oct 4, 2023 | 3D Object Detectioncross-modal alignment | CodeCode Available | 2 |
| AerialVLN: Vision-and-Language Navigation for UAVs | Aug 13, 2023 | cross-modal alignmentNavigate | CodeCode Available | 2 |
| MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation | Dec 19, 2022 | cross-modal alignmentDenoising | CodeCode Available | 2 |
| Vision-Language Pre-Training with Triple Contrastive Learning | Feb 21, 2022 | Contrastive Learningcross-modal alignment | CodeCode Available | 2 |
| RSRefSeg 2: Decoupling Referring Remote Sensing Image Segmentation with Foundation Models | Jul 8, 2025 | cross-modal alignmentImage Segmentation | CodeCode Available | 1 |
| Modality Curation: Building Universal Embeddings for Advanced Multimodal Information Retrieval | May 26, 2025 | Contrastive Learningcross-modal alignment | CodeCode Available | 1 |