| Skywork-R1V3 Technical Report | Jul 8, 2025 | cross-modal alignment, Mathematical Reasoning | Code Available | 7 | 5 |
| Phantom: Subject-consistent video generation via cross-modal alignment | Feb 16, 2025 | cross-modal alignment, Human-Domain Subject-to-Video | Code Available | 5 | 5 |
| CrossOver: 3D Scene Cross-Modal Alignment | Feb 20, 2025 | cross-modal alignment, Object | Code Available | 3 | 5 |
| MathCoder-VL: Bridging Vision and Code for Enhanced Multimodal Mathematical Reasoning | May 15, 2025 | cross-modal alignment, Geometry Problem Solving | Code Available | 3 | 5 |
| Ola: Pushing the Frontiers of Omni-Modal Language Model | Feb 6, 2025 | cross-modal alignment, Language Modeling | Code Available | 3 | 5 |
| Collaborative Novel Object Discovery and Box-Guided Cross-Modal Alignment for Open-Vocabulary 3D Object Detection | Jun 2, 2024 | 3D Object Detection, cross-modal alignment | Code Available | 3 | 5 |
| Ask in Any Modality: A Comprehensive Survey on Multimodal Retrieval-Augmented Generation | Feb 12, 2025 | cross-modal alignment, multimodal generation | Code Available | 3 | 5 |
| Flash-VStream: Efficient Real-Time Understanding for Long Video Streams | Jun 30, 2025 | cross-modal alignment, EgoSchema | Code Available | 3 | 5 |
| Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams | Jun 12, 2024 | cross-modal alignment, Language Modelling | Code Available | 3 | 5 |
| GEM: Empowering MLLM for Grounded ECG Understanding with Time Series and Images | Mar 8, 2025 | cross-modal alignment, Diagnostic | Code Available | 3 | 5 |
| DeCo: Decoupling Token Compression from Semantic Abstraction in Multimodal Large Language Models | May 31, 2024 | cross-modal alignment, Visual Localization | Code Available | 2 | 5 |
| AerialVLN: Vision-and-Language Navigation for UAVs | Aug 13, 2023 | cross-modal alignment, Navigate | Code Available | 2 | 5 |
| DeSTA2.5-Audio: Toward General-Purpose Large Audio Language Model with Self-Generated Cross-Modal Alignment | Jul 3, 2025 | cross-modal alignment, Instruction Following | Code Available | 2 | 5 |
| Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate | Oct 9, 2024 | cross-modal alignment, Visual Question Answering | Code Available | 2 | 5 |
| CoDA: Collaborative Novel Box Discovery and Cross-modal Alignment for Open-vocabulary 3D Object Detection | Oct 4, 2023 | 3D Object Detection, cross-modal alignment | Code Available | 2 | 5 |
| Law of Vision Representation in MLLMs | Aug 29, 2024 | cross-modal alignment, Language Modeling | Code Available | 2 | 5 |
| ReID5o: Achieving Omni Multi-modal Person Re-identification in a Single Model | Jun 11, 2025 | cross-modal alignment, Descriptive | Code Available | 2 | 5 |
| Seeing the Image: Prioritizing Visual Correlation by Contrastive Alignment | May 28, 2024 | cross-modal alignment | Code Available | 2 | 5 |
| MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation | Dec 19, 2022 | cross-modal alignment, Denoising | Code Available | 2 | 5 |
| Vision-Language Pre-Training with Triple Contrastive Learning | Feb 21, 2022 | Contrastive Learning, cross-modal alignment | Code Available | 2 | 5 |
| Mitigate the Gap: Investigating Approaches for Improving Cross-Modal Alignment in CLIP | Jun 25, 2024 | cross-modal alignment, Image Classification | Code Available | 2 | 5 |
| mmE5: Improving Multimodal Multilingual Embeddings via High-quality Synthetic Data | Feb 12, 2025 | cross-modal alignment, Large Language Model | Code Available | 2 | 5 |
| Auffusion: Leveraging the Power of Diffusion and Large Language Models for Text-to-Audio Generation | Jan 2, 2024 | Audio Generation, cross-modal alignment | Code Available | 2 | 5 |
| MMA-DFER: MultiModal Adaptation of unimodal models for Dynamic Facial Expression Recognition in-the-wild | Apr 13, 2024 | cross-modal alignment, Dynamic Facial Expression Recognition | Code Available | 2 | 5 |
| Linguistic-Aware Patch Slimming Framework for Fine-grained Cross-Modal Alignment | Jan 1, 2024 | cross-modal alignment, Cross-Modal Retrieval | Code Available | 2 | 5 |
| Melody-Guided Music Generation | Sep 30, 2024 | cross-modal alignment, Music Generation | Code Available | 2 | 5 |
| Visible-Thermal Multiple Object Tracking: Large-scale Video Dataset and Progressive Fusion Approach | Aug 2, 2024 | cross-modal alignment, Multiple Object Tracking | Code Available | 2 | 5 |
| HiVG: Hierarchical Multimodal Fine-grained Modulation for Visual Grounding | Apr 20, 2024 | cross-modal alignment, Visual Grounding | Code Available | 2 | 5 |
| A Survey on Facial Expression Recognition of Static and Dynamic Emotions | Aug 28, 2024 | cross-modal alignment, Facial Expression Recognition | Code Available | 1 | 5 |
| Align and Prompt: Video-and-Language Pre-training with Entity Prompts | Dec 17, 2021 | cross-modal alignment, Entity Alignment | Code Available | 1 | 5 |
| ASAP: Advancing Semantic Alignment Promotes Multi-Modal Manipulation Detecting and Grounding | Dec 17, 2024 | cross-modal alignment | Code Available | 1 | 5 |
| Global and Local Semantic Completion Learning for Vision-Language Pre-training | Jun 12, 2023 | cross-modal alignment, Image-text Retrieval | Code Available | 1 | 5 |
| Free Lunch Enhancements for Multi-modal Crowd Counting | Jan 1, 2025 | cross-modal alignment, Crowd Counting | Code Available | 1 | 5 |
| GEAL: Generalizable 3D Affordance Learning with Cross-Modal Consistency | Dec 12, 2024 | cross-modal alignment, Transfer Learning | Code Available | 1 | 5 |
| CVT-SLR: Contrastive Visual-Textual Transformation for Sign Language Recognition with Variational Alignment | Mar 10, 2023 | cross-modal alignment, Sign Language Recognition | Code Available | 1 | 5 |
| DanceIt: Music-inspired Dancing Video Synthesis | Sep 17, 2020 | cross-modal alignment, Rhythm | Code Available | 1 | 5 |
| Advancing Multi-grained Alignment for Contrastive Language-Audio Pre-training | Aug 15, 2024 | cross-modal alignment | Code Available | 1 | 5 |
| Grounded Entity-Landmark Adaptive Pre-training for Vision-and-Language Navigation | Aug 24, 2023 | cross-modal alignment, Descriptive | Code Available | 1 | 5 |
| Efficient Vision-Language Pretraining with Visual Concepts and Hierarchical Alignment | Aug 29, 2022 | cross-modal alignment, Image-text Retrieval | Code Available | 1 | 5 |
| BridgeTower: Building Bridges Between Encoders in Vision-Language Representation Learning | Jun 17, 2022 | cross-modal alignment, Representation Learning | Code Available | 1 | 5 |
| Dynamic Modality Interaction Modeling for Image-Text Retrieval | Jul 11, 2021 | cross-modal alignment, Cross-Modal Retrieval | Code Available | 1 | 5 |
| Enhanced OoD Detection through Cross-Modal Alignment of Multi-Modal Representations | Mar 24, 2025 | cross-modal alignment, Image Classification | Code Available | 1 | 5 |
| Conditional Variational Autoencoder for Sign Language Translation with Cross-Modal Alignment | Dec 25, 2023 | cross-modal alignment, Decoder | Code Available | 1 | 5 |
| A Cross-Modal Approach to Silent Speech with LLM-Enhanced Recognition | Mar 2, 2024 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | Code Available | 1 | 5 |
| AlignVSR: Audio-Visual Cross-Modal Alignment for Visual Speech Recognition | Oct 21, 2024 | cross-modal alignment, speech-recognition | Code Available | 1 | 5 |
| CAMANet: Class Activation Map Guided Attention Network for Radiology Report Generation | Nov 2, 2022 | cross-modal alignment, Decision Making | Code Available | 1 | 5 |
| BrainVis: Exploring the Bridge between Brain and Visual Signals via Image Reconstruction | Dec 22, 2023 | cross-modal alignment, EEG | Code Available | 1 | 5 |
| Distractors-Immune Representation Learning with Cross-modal Contrastive Regularization for Change Captioning | Jul 16, 2024 | Caption Generation, cross-modal alignment | Code Available | 1 | 5 |
| Align-KD: Distilling Cross-Modal Alignment Knowledge for Mobile Vision-Language Large Model Enhancement | Jan 1, 2025 | cross-modal alignment, Knowledge Distillation | Code Available | 1 | 5 |
| BiPVL-Seg: Bidirectional Progressive Vision-Language Fusion with Global-Local Alignment for Medical Image Segmentation | Mar 30, 2025 | cross-modal alignment, Image Segmentation | Code Available | 1 | 5 |