| LISA: Reasoning Segmentation via Large Language Model | Aug 1, 2023 | Language Modeling, Language Modelling | Code Available | 4 | 5 |
| WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation | Mar 10, 2025 | Common Sense Reasoning, Image Generation | Code Available | 4 | 5 |
| VILA: On Pre-training for Visual Language Models | Dec 12, 2023 | In-Context Learning, Language Modelling | Code Available | 4 | 5 |
| Text2SQL is Not Enough: Unifying AI and Databases with TAG | Aug 27, 2024 | RAG, Retrieval-augmented Generation | Code Available | 4 | 5 |
| V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs | Jan 1, 2024 | Visual Grounding, World Knowledge | Code Available | 4 | 5 |
| Are We on the Right Way for Evaluating Large Vision-Language Models? | Mar 29, 2024 | World Knowledge | Code Available | 3 | 5 |
| DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge | Jul 6, 2025 | Image Generation, Multimodal Reasoning | Code Available | 3 | 5 |
| AutoVLA: A Vision-Language-Action Model for End-to-End Autonomous Driving with Adaptive Reasoning and Reinforcement Fine-Tuning | Jun 16, 2025 | Action Generation, Autonomous Driving | Code Available | 3 | 5 |
| LLaRA: Supercharging Robot Learning Data for Vision-Language Policy | Jun 28, 2024 | Vision-Language-Action, World Knowledge | Code Available | 3 | 5 |
| HERMES: A Unified Self-Driving World Model for Simultaneous 3D Scene Understanding and Generation | Jan 24, 2025 | Autonomous Driving, Language Modeling | Code Available | 3 | 5 |