| Visually Descriptive Language Model for Vector Graphics Reasoning | Apr 9, 2024 | DescriptiveLanguage Modeling | CodeCode Available | 9 | 5 |
| VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model | Apr 10, 2025 | Language ModelingLanguage Modelling | CodeCode Available | 9 | 5 |
| Skywork-R1V3 Technical Report | Jul 8, 2025 | cross-modal alignmentMathematical Reasoning | CodeCode Available | 7 | 5 |
| MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models | Apr 20, 2023 | Image DescriptionLanguage Modelling | CodeCode Available | 7 | 5 |
| Visual Instruction Tuning | Apr 17, 2023 | 1 Image, 2*2 Stitching3D Question Answering (3D-QA) | CodeCode Available | 6 | 5 |
| BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation | Jan 28, 2022 | Image CaptioningImage-text matching | CodeCode Available | 5 | 5 |
| DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning | May 20, 2025 | HallucinationMathematical Reasoning | CodeCode Available | 5 | 5 |
| MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI | Nov 27, 2023 | Complex Query AnsweringLogical Reasoning | CodeCode Available | 5 | 5 |
| BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | Jan 30, 2023 | Generative Visual Question AnsweringImage Captioning | CodeCode Available | 4 | 5 |
| R1-Onevision:An Open-Source Multimodal Large Language Model Capable of Deep Reasoning | Feb 24, 2025 | Language ModelingLanguage Modelling | CodeCode Available | 4 | 5 |
| Otter: A Multi-Modal Model with In-Context Instruction Tuning | May 5, 2023 | GPUIn-Context Learning | CodeCode Available | 4 | 5 |
| R1-Zero's "Aha Moment" in Visual Reasoning on a 2B Non-SFT Model | Mar 7, 2025 | Multimodal Reasoningreinforcement-learning | CodeCode Available | 4 | 5 |
| How is ChatGPT's behavior changing over time? | Jul 18, 2023 | Code GenerationLanguage Modelling | CodeCode Available | 4 | 5 |
| Reason-RFT: Reinforcement Fine-Tuning for Visual Reasoning | Mar 26, 2025 | Few-Shot LearningVisual Reasoning | CodeCode Available | 3 | 5 |
| DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-to-Image Generation Models | Feb 8, 2022 | DiagnosticImage Captioning | CodeCode Available | 3 | 5 |
| UniBench: Visual Reasoning Requires Rethinking Vision-Language Beyond Scaling | Aug 9, 2024 | GPULanguage Modeling | CodeCode Available | 3 | 5 |
| Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens | Jun 20, 2025 | Image GenerationMultimodal Reasoning | CodeCode Available | 3 | 5 |
| Neural networks for abstraction and reasoning: Towards broad generalization in machines | Feb 5, 2024 | ARCVisual Reasoning | CodeCode Available | 3 | 5 |
| Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models | Nov 21, 2024 | Visual Reasoning | CodeCode Available | 3 | 5 |
| CogCoM: Train Large Vision-Language Models Diving into Details through Chain of Manipulations | Feb 6, 2024 | Visual Reasoning | CodeCode Available | 3 | 5 |
| LaViDa: A Large Diffusion Language Model for Multimodal Understanding | May 22, 2025 | Instruction FollowingLanguage Modeling | CodeCode Available | 3 | 5 |
| TokenPacker: Efficient Visual Projector for Multimodal LLM | Jul 2, 2024 | Language ModellingLarge Language Model | CodeCode Available | 3 | 5 |
| LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs | Jan 10, 2025 | 4kVisual Reasoning | CodeCode Available | 3 | 5 |
| OpenThinkIMG: Learning to Think with Images via Visual Tool Reinforcement Learning | May 13, 2025 | Reinforcement Learning (RL)Visual Reasoning | CodeCode Available | 3 | 5 |
| Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models | Jun 13, 2024 | Mathobject-detection | CodeCode Available | 3 | 5 |
| SegAgent: Exploring Pixel Understanding Capabilities in MLLMs by Imitating Human Annotator Trajectories | Mar 11, 2025 | Decision MakingInteractive Segmentation | CodeCode Available | 2 | 5 |
| DanmakuTPPBench: A Multi-modal Benchmark for Temporal Point Process Modeling and Understanding | May 23, 2025 | Language ModelingLanguage Modelling | CodeCode Available | 2 | 5 |
| SegAgent: Exploring Pixel Understanding Capabilities in MLLMs by Imitating Human Annotator Trajectories | Mar 11, 2025 | Decision MakingInteractive Segmentation | CodeCode Available | 2 | 5 |
| SoTA with Less: MCTS-Guided Sample Selection for Data-Efficient Visual Reasoning Self-Improvement | Apr 10, 2025 | Knowledge DistillationVisual Reasoning | CodeCode Available | 2 | 5 |
| SFT or RL? An Early Investigation into Training R1-Like Reasoning Large Vision-Language Models | Apr 10, 2025 | Reinforcement Learning (RL)Visual Reasoning | CodeCode Available | 2 | 5 |
| ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding | Jan 9, 2025 | Visual Question Answering (VQA)Visual Reasoning | CodeCode Available | 2 | 5 |
| PALO: A Polyglot Large Multimodal Model for 5B People | Feb 22, 2024 | Language ModelingLanguage Modelling | CodeCode Available | 2 | 5 |
| Q-Insight: Understanding Image Quality via Visual Reinforcement Learning | Mar 28, 2025 | DescriptiveImage Quality Assessment | CodeCode Available | 2 | 5 |
| ChatGPT Asks, BLIP-2 Answers: Automatic Questioning Towards Enriched Visual Descriptions | Mar 12, 2023 | Image CaptioningQuestion Answering | CodeCode Available | 2 | 5 |
| Rethinking RL Scaling for Vision Language Models: A Transparent, From-Scratch Framework and Comprehensive Evaluation Scheme | Apr 3, 2025 | Reinforcement Learning (RL)Visual Reasoning | CodeCode Available | 2 | 5 |
| NoisyRollout: Reinforcing Visual Reasoning with Data Augmentation | Apr 17, 2025 | Data AugmentationDiversity | CodeCode Available | 2 | 5 |
| Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning Instruction Using Language Model | Jul 9, 2024 | Chart UnderstandingLanguage Modeling | CodeCode Available | 2 | 5 |
| OmniCaptioner: One Captioner to Rule Them All | Apr 9, 2025 | AllImage Captioning | CodeCode Available | 2 | 5 |
| MINT-CoT: Enabling Interleaved Visual Tokens in Mathematical Chain-of-Thought Reasoning | Jun 5, 2025 | MathMathematical Reasoning | CodeCode Available | 2 | 5 |
| MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning | Sep 14, 2023 | HallucinationIn-Context Learning | CodeCode Available | 2 | 5 |
| LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models | Mar 22, 2024 | Language ModellingLarge Language Model | CodeCode Available | 2 | 5 |
| Open CaptchaWorld: A Comprehensive Web-based Platform for Testing and Benchmarking Multimodal LLM Agents | May 30, 2025 | BenchmarkingBlocking | CodeCode Available | 2 | 5 |
| Learning to Compose Dynamic Tree Structures for Visual Contexts | Dec 5, 2018 | Graph GenerationPanoptic Scene Graph Generation | CodeCode Available | 2 | 5 |
| AlphaMaze: Enhancing Large Language Models' Spatial Intelligence via GRPO | Feb 20, 2025 | Autonomous NavigationNavigate | CodeCode Available | 2 | 5 |
| Learning Transferable Visual Models From Natural Language Supervision | Feb 26, 2021 | Action RecognitionBenchmarking | CodeCode Available | 2 | 5 |
| Neurosymbolic Diffusion Models | May 19, 2025 | Autonomous DrivingUncertainty Quantification | CodeCode Available | 2 | 5 |
| 3D-VisTA: Pre-trained Transformer for 3D Vision and Text Alignment | Aug 8, 2023 | 3D Question Answering (3D-QA)Dense Captioning | CodeCode Available | 2 | 5 |
| List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs | Apr 25, 2024 | Visual GroundingVisual Question Answering | CodeCode Available | 2 | 5 |
| High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning | Jul 8, 2025 | MMEReinforcement Learning (RL) | CodeCode Available | 2 | 5 |
| Embodied-Reasoner: Synergizing Visual Search, Reasoning, and Action for Embodied Interactive Tasks | Mar 27, 2025 | Imitation LearningMathematical Reasoning | CodeCode Available | 2 | 5 |