| ColorBench: Can VLMs See and Understand the Colorful World? A Comprehensive Benchmark for Color Perception, Reasoning, and Robustness | Apr 10, 2025 | Visual Reasoning | Code Available | 1 |
| SoTA with Less: MCTS-Guided Sample Selection for Data-Efficient Visual Reasoning Self-Improvement | Apr 10, 2025 | Knowledge Distillation, Visual Reasoning | Code Available | 2 |
| VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model | Apr 10, 2025 | Language Modeling | Code Available | 9 |
| OmniCaptioner: One Captioner to Rule Them All | Apr 9, 2025 | All, Image Captioning | Code Available | 2 |
| V-MAGE: A Game Evaluation Framework for Assessing Vision-Centric Capabilities in Multimodal Large Language Models | Apr 8, 2025 | Benchmarking, Visual Reasoning | Code Available | 1 |
| TGraphX: Tensor-Aware Graph Neural Network for Multi-Dimensional Feature Learning | Apr 4, 2025 | Graph Neural Network, object-detection | Code Available | 0 |
| Rethinking RL Scaling for Vision Language Models: A Transparent, From-Scratch Framework and Comprehensive Evaluation Scheme | Apr 3, 2025 | Reinforcement Learning (RL), Visual Reasoning | Code Available | 2 |
| On Data Synthesis and Post-training for Visual Abstract Reasoning | Apr 2, 2025 | Visual Reasoning | Unverified | 0 |
| TDBench: Benchmarking Vision-Language Models in Understanding Top-Down Images | Apr 1, 2025 | Autonomous Navigation, Benchmarking | Code Available | 0 |
| GenVP: Generating Visual Puzzles with Contrastive Hierarchical VAEs | Mar 30, 2025 | Visual Reasoning | Unverified | 0 |
| Q-Insight: Understanding Image Quality via Visual Reinforcement Learning | Mar 28, 2025 | Descriptive, Image Quality Assessment | Code Available | 2 |
| Embodied-Reasoner: Synergizing Visual Search, Reasoning, and Action for Embodied Interactive Tasks | Mar 27, 2025 | Imitation Learning, Mathematical Reasoning | Code Available | 2 |
| Reason-RFT: Reinforcement Fine-Tuning for Visual Reasoning | Mar 26, 2025 | Few-Shot Learning, Visual Reasoning | Code Available | 3 |
| DWIM: Towards Tool-aware Visual Reasoning via Discrepancy-aware Workflow Generation & Instruct-Masking Tuning | Mar 25, 2025 | Visual Reasoning | Unverified | 0 |
| RGB-Th-Bench: A Dense benchmark for Visual-Thermal Understanding of Vision Language Models | Mar 25, 2025 | Image Comprehension, Visual Reasoning | Unverified | 0 |
| Neuro-Symbolic Scene Graph Conditioning for Synthetic Image Dataset Generation | Mar 21, 2025 | Dataset Generation, Graph Generation | Unverified | 0 |
| Chain of Functions: A Programmatic Pipeline for Fine-Grained Chart Reasoning Data | Mar 20, 2025 | Diversity, Visual Reasoning | Unverified | 0 |
| Agentic Keyframe Search for Video Question Answering | Mar 20, 2025 | EgoSchema, Question Answering | Code Available | 1 |
| From Head to Tail: Towards Balanced Representation in Large Vision-Language Models through Adaptive Data Calibration | Mar 17, 2025 | Denoising, Question Answering | Unverified | 0 |
| VERIFY: A Benchmark of Visual Explanation and Reasoning for Investigating Multimodal Reasoning Fidelity | Mar 14, 2025 | Benchmarking, Decision Making | Unverified | 0 |
| Interpretable Image Classification via Non-parametric Part Prototype Learning | Mar 13, 2025 | Image Classification | Code Available | 1 |
| SciVerse: Unveiling the Knowledge Comprehension and Visual Reasoning of LMMs on Multi-modal Scientific Problems | Mar 13, 2025 | Visual Reasoning | Unverified | 0 |
| How Do Multimodal Large Language Models Handle Complex Multimodal Reasoning? Placing Them in An Extensible Escape Game | Mar 13, 2025 | Multimodal Reasoning, Question Answering | Code Available | 1 |
| DriveLMM-o1: A Step-by-Step Reasoning Dataset and Large Multimodal Model for Driving Scenario Understanding | Mar 13, 2025 | 4k, Autonomous Driving | Code Available | 2 |
| SegAgent: Exploring Pixel Understanding Capabilities in MLLMs by Imitating Human Annotator Trajectories | Mar 11, 2025 | Decision Making, Interactive Segmentation | Code Available | 2 |
| PhysVLM: Enabling Visual Language Models to Understand Robotic Physical Reachability | Mar 11, 2025 | Visual Reasoning | Code Available | 1 |
| VisRL: Intention-Driven Visual Perception via Reinforced Reasoning | Mar 10, 2025 | Reinforcement Learning (RL), Visual Reasoning | Code Available | 1 |
| Does Acceleration Cause Hidden Instability in Vision Language Models? Uncovering Instance-Level Divergence Through a Large-Scale Empirical Study | Mar 9, 2025 | Quantization, Token Reduction | Unverified | 0 |
| Poisoned-MRAG: Knowledge Poisoning Attacks to Multimodal Retrieval Augmented Generation | Mar 8, 2025 | RAG, Retrieval | Unverified | 0 |
| R1-Zero's "Aha Moment" in Visual Reasoning on a 2B Non-SFT Model | Mar 7, 2025 | Multimodal Reasoning, reinforcement-learning | Code Available | 4 |
| LVLM-Compress-Bench: Benchmarking the Broader Impact of Large Vision-Language Model Compression | Mar 6, 2025 | Benchmarking, Common Sense Reasoning | Code Available | 0 |
| Towards Visual Discrimination and Reasoning of Real-World Physical Dynamics: Physics-Grounded Anomaly Detection | Mar 5, 2025 | Anomaly Detection, Object | Unverified | 0 |
| EXCLAIM: An Explainable Cross-Modal Agentic System for Misinformation Detection with Hierarchical Retrieval | Mar 1, 2025 | Explanation Generation, Misinformation | Unverified | 0 |
| MMSciBench: Benchmarking Language Models on Multimodal Scientific Problems | Feb 27, 2025 | Benchmarking, Visual Reasoning | Unverified | 0 |
| M-LLM Based Video Frame Selection for Efficient Video Understanding | Feb 27, 2025 | EgoSchema, Language Modeling | Unverified | 0 |
| End-to-End Chart Summarization via Visual Chain-of-Thought in Vision-Language Models | Feb 24, 2025 | Visual Reasoning | Unverified | 0 |
| Shakti-VLMs: Scalable Vision-Language Models for Enterprise AI | Feb 24, 2025 | document understanding, Multimodal Reasoning | Unverified | 0 |
| Unraveling the geometry of visual relational reasoning | Feb 24, 2025 | Relational Reasoning, Relation Network | Code Available | 0 |
| R1-Onevision: An Open-Source Multimodal Large Language Model Capable of Deep Reasoning | Feb 24, 2025 | Language Modeling | Code Available | 4 |
| VisFactor: Benchmarking Fundamental Visual Cognition in Multimodal Large Language Models | Feb 23, 2025 | Benchmarking, Spatial Reasoning | Code Available | 0 |
| Visual Reasoning Evaluation of Grok, Deepseek Janus, Gemini, Qwen, Mistral, and ChatGPT | Feb 23, 2025 | Bias Detection, Visual Reasoning | Unverified | 0 |
| Chitrarth: Bridging Vision and Language for a Billion People | Feb 21, 2025 | Diversity, Language Modeling | Unverified | 0 |
| Forgotten Polygons: Multimodal Large Language Models are Shape-Blind | Feb 21, 2025 | Math, Mathematical Problem-Solving | Code Available | 1 |
| KnowZRel: Common Sense Knowledge-based Zero-Shot Relationship Retrieval for Generalised Scene Graph Generation | Feb 21, 2025 | Common Sense Reasoning, Graph Generation | Code Available | 0 |
| AlphaMaze: Enhancing Large Language Models' Spatial Intelligence via GRPO | Feb 20, 2025 | Autonomous Navigation, Navigate | Code Available | 2 |
| Enhancing Cognition and Explainability of Multimodal Foundation Models with Self-Synthesized Data | Feb 19, 2025 | Fine-Grained Visual Recognition, Pneumonia Detection | Code Available | 1 |
| CityEQA: A Hierarchical LLM Agent on Embodied Question Answering Benchmark in City Space | Feb 18, 2025 | Embodied Question Answering, Question Answering | Code Available | 1 |
| Do we Really Need Visual Instructions? Towards Visual Instruction-Free Fine-tuning for Large Vision-Language Models | Feb 17, 2025 | Instruction Following, visual instruction following | Unverified | 0 |
| Learning to Stop Overthinking at Test Time | Feb 16, 2025 | Visual Reasoning | Unverified | 0 |