| Transformer-based Spatial Grounding: A Comprehensive Survey | Jul 17, 2025 | cross-modal alignment, Survey | —Unverified | 0 |
| Bridge Feature Matching and Cross-Modal Alignment with Mutual-filtering for Zero-shot Anomaly Detection | Jul 15, 2025 | Anomaly Classification, Anomaly Detection | —Unverified | 0 |
| CATVis: Context-Aware Thought Visualization | Jul 15, 2025 | cross-modal alignment, EEG | —Unverified | 0 |
| Evaluating Attribute Confusion in Fashion Text-to-Image Generation | Jul 9, 2025 | Attribute, cross-modal alignment | —Unverified | 0 |
| RSRefSeg 2: Decoupling Referring Remote Sensing Image Segmentation with Foundation Models | Jul 8, 2025 | cross-modal alignment, Image Segmentation | Code Available | 1 |
| Skywork-R1V3 Technical Report | Jul 8, 2025 | cross-modal alignment, Mathematical Reasoning | Code Available | 7 |
| DeSTA2.5-Audio: Toward General-Purpose Large Audio Language Model with Self-Generated Cross-Modal Alignment | Jul 3, 2025 | cross-modal alignment, Instruction Following | Code Available | 2 |
| Flash-VStream: Efficient Real-Time Understanding for Long Video Streams | Jun 30, 2025 | cross-modal alignment, EgoSchema | Code Available | 3 |
| DALR: Dual-level Alignment Learning for Multimodal Sentence Representation Learning | Jun 26, 2025 | cross-modal alignment, Representation Learning | —Unverified | 0 |
| TSDASeg: A Two-Stage Model with Direct Alignment for Interactive Point Cloud Segmentation | Jun 26, 2025 | cross-modal alignment, Interactive Segmentation | —Unverified | 0 |
| HyperPath: Knowledge-Guided Hyperbolic Semantic Hierarchy Modeling for WSI Analysis | Jun 19, 2025 | cross-modal alignment, Multiple Instance Learning | Code Available | 0 |
| Speech-Language Models with Decoupled Tokenizers and Multi-Token Prediction | Jun 14, 2025 | cross-modal alignment | —Unverified | 0 |
| TAViS: Text-bridged Audio-Visual Segmentation with Foundation Models | Jun 13, 2025 | cross-modal alignment, Segmentation | —Unverified | 0 |
| Improving Medical Visual Representation Learning with Pathological-level Cross-Modal Alignment and Correlation Exploration | Jun 12, 2025 | cross-modal alignment, Image to text | —Unverified | 0 |
| OmniDRCA: Parallel Speech-Text Foundation Model via Dual-Resolution Speech Representations and Contrastive Alignment | Jun 11, 2025 | cross-modal alignment, Question Answering | Code Available | 0 |
| ReID5o: Achieving Omni Multi-modal Person Re-identification in a Single Model | Jun 11, 2025 | cross-modal alignment, Descriptive | Code Available | 2 |
| Fusing Cross-modal and Uni-modal Representations: A Kronecker Product Approach | Jun 10, 2025 | cross-modal alignment | —Unverified | 0 |
| Generating Vision-Language Navigation Instructions Incorporated Fine-Grained Alignment Annotations | Jun 10, 2025 | cross-modal alignment, Navigate | —Unverified | 0 |
| WhisQ: Cross-Modal Representation Learning for Text-to-Music MOS Prediction | Jun 6, 2025 | cross-modal alignment, Language Modeling | —Unverified | 0 |
| Does Your 3D Encoder Really Work? When Pretrain-SFT from 2D VLMs Meets 3D VLMs | Jun 5, 2025 | cross-modal alignment, Dense Captioning | —Unverified | 0 |
| Towards LLM-Centric Multimodal Fusion: A Survey on Integration Strategies and Techniques | Jun 5, 2025 | cross-modal alignment, Large Language Model | —Unverified | 0 |
| UniCUE: Unified Recognition and Generation Framework for Chinese Cued Speech Video-to-Speech Generation | Jun 4, 2025 | cross-modal alignment, Lipreading | —Unverified | 0 |
| EmotionRankCLAP: Bridging Natural Language Speaking Styles and Ordinal Speech Emotion via Rank-N-Contrast | May 29, 2025 | Contrastive Learning, cross-modal alignment | —Unverified | 0 |
| DiSa: Directional Saliency-Aware Prompt Learning for Generalizable Vision-Language Models | May 26, 2025 | cross-modal alignment, Domain Generalization | —Unverified | 0 |
| From Alignment to Advancement: Bootstrapping Audio-Language Alignment with Synthetic Data | May 26, 2025 | cross-modal alignment, Instruction Following | —Unverified | 0 |