| Stream-Omni: Simultaneous Multimodal Interactions with Large Language-Vision-Speech Model | Jun 16, 2025 | Large Language Model, multimodal interaction | Code Available | 5 |
| Ming-Lite-Uni: Advancements in Unified Architecture for Natural Multimodal Interaction | May 5, 2025 | Image Generation, multimodal interaction | Code Available | 4 |
| Segment and Track Anything | May 11, 2023 | Autonomous Driving, multimodal interaction | Code Available | 4 |
| I2MoE: Interpretable Multimodal Interaction-aware Mixture-of-Experts | May 25, 2025 | Mixture-of-Experts, multimodal interaction | Code Available | 2 |
| OVTR: End-to-End Open-Vocabulary Multiple Object Tracking with Transformer | Mar 13, 2025 | Decoder, multimodal interaction | Code Available | 2 |
| Foundations and Recent Trends in Multimodal Mobile Agents: A Survey | Nov 4, 2024 | multimodal interaction, Survey | Code Available | 2 |
| TIP: Tabular-Image Pre-training for Multimodal Classification with Incomplete Data | Jul 10, 2024 | Contrastive Learning, multimodal interaction | Code Available | 2 |
| LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference | Jun 26, 2024 | multimodal interaction | Code Available | 2 |
| Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want | Mar 29, 2024 | Instruction Following, Language Modelling | Code Available | 2 |
| Agent AI: Surveying the Horizons of Multimodal Interaction | Jan 7, 2024 | multimodal interaction | Code Available | 2 |