| RGB-Th-Bench: A Dense benchmark for Visual-Thermal Understanding of Vision Language Models | Mar 25, 2025 | Image Comprehension, Visual Reasoning | Unverified | 0 |
| Neuro-Symbolic Scene Graph Conditioning for Synthetic Image Dataset Generation | Mar 21, 2025 | Dataset Generation, Graph Generation | Unverified | 0 |
| Chain of Functions: A Programmatic Pipeline for Fine-Grained Chart Reasoning Data | Mar 20, 2025 | Diversity, Visual Reasoning | Unverified | 0 |
| From Head to Tail: Towards Balanced Representation in Large Vision-Language Models through Adaptive Data Calibration | Mar 17, 2025 | Denoising, Question Answering | Unverified | 0 |
| VERIFY: A Benchmark of Visual Explanation and Reasoning for Investigating Multimodal Reasoning Fidelity | Mar 14, 2025 | Benchmarking, Decision Making | Unverified | 0 |
| SciVerse: Unveiling the Knowledge Comprehension and Visual Reasoning of LMMs on Multi-modal Scientific Problems | Mar 13, 2025 | Visual Reasoning | Unverified | 0 |
| Does Acceleration Cause Hidden Instability in Vision Language Models? Uncovering Instance-Level Divergence Through a Large-Scale Empirical Study | Mar 9, 2025 | Quantization, Token Reduction | Unverified | 0 |
| Poisoned-MRAG: Knowledge Poisoning Attacks to Multimodal Retrieval Augmented Generation | Mar 8, 2025 | RAG, Retrieval | Unverified | 0 |
| LVLM-Compress-Bench: Benchmarking the Broader Impact of Large Vision-Language Model Compression | Mar 6, 2025 | Benchmarking, Common Sense Reasoning | Code Available | 0 |
| Towards Visual Discrimination and Reasoning of Real-World Physical Dynamics: Physics-Grounded Anomaly Detection | Mar 5, 2025 | Anomaly Detection, Object | Unverified | 0 |
| EXCLAIM: An Explainable Cross-Modal Agentic System for Misinformation Detection with Hierarchical Retrieval | Mar 1, 2025 | Explanation Generation, Misinformation | Unverified | 0 |
| M-LLM Based Video Frame Selection for Efficient Video Understanding | Feb 27, 2025 | EgoSchema, Language Modeling | Unverified | 0 |
| MMSciBench: Benchmarking Language Models on Multimodal Scientific Problems | Feb 27, 2025 | Benchmarking, Visual Reasoning | Unverified | 0 |
| Shakti-VLMs: Scalable Vision-Language Models for Enterprise AI | Feb 24, 2025 | document understanding, Multimodal Reasoning | Unverified | 0 |
| End-to-End Chart Summarization via Visual Chain-of-Thought in Vision-Language Models | Feb 24, 2025 | Visual Reasoning | Unverified | 0 |
| Unraveling the geometry of visual relational reasoning | Feb 24, 2025 | Relational Reasoning, Relation Network | Code Available | 0 |
| Visual Reasoning Evaluation of Grok, Deepseek Janus, Gemini, Qwen, Mistral, and ChatGPT | Feb 23, 2025 | Bias Detection, Visual Reasoning | Unverified | 0 |
| VisFactor: Benchmarking Fundamental Visual Cognition in Multimodal Large Language Models | Feb 23, 2025 | Benchmarking, Spatial Reasoning | Code Available | 0 |
| Chitrarth: Bridging Vision and Language for a Billion People | Feb 21, 2025 | Diversity, Language Modeling | Unverified | 0 |
| KnowZRel: Common Sense Knowledge-based Zero-Shot Relationship Retrieval for Generalised Scene Graph Generation | Feb 21, 2025 | Common Sense Reasoning, Graph Generation | Code Available | 0 |
| Do we Really Need Visual Instructions? Towards Visual Instruction-Free Fine-tuning for Large Vision-Language Models | Feb 17, 2025 | Instruction Following, visual instruction following | Unverified | 0 |
| Learning to Stop Overthinking at Test Time | Feb 16, 2025 | Visual Reasoning | Unverified | 0 |
| MET-Bench: Multimodal Entity Tracking for Evaluating the Limitations of Vision-Language and Reasoning Models | Feb 15, 2025 | Natural Language Understanding, Visual Reasoning | Unverified | 0 |
| ZeroBench: An Impossible Visual Benchmark for Contemporary Large Multimodal Models | Feb 13, 2025 | Visual Reasoning | Unverified | 0 |
| Visual Agentic AI for Spatial Reasoning with a Dynamic API | Feb 10, 2025 | Program Synthesis, Spatial Reasoning | Unverified | 0 |
| Boosting Multimodal Reasoning with MCTS-Automated Structured Thinking | Feb 4, 2025 | Computational Efficiency, Multimodal Reasoning | Unverified | 0 |
| Integrating LMM Planners and 3D Skill Policies for Generalizable Manipulation | Jan 30, 2025 | Memorization, Scene Understanding | Unverified | 0 |
| Rethinking Bottlenecks in Safety Fine-Tuning of Vision Language Models | Jan 30, 2025 | Instruction Following, Visual Reasoning | Unverified | 0 |
| 3D-MoE: A Mixture-of-Experts Multi-modal LLM for 3D Vision and Pose Diffusion via Rectified Flow | Jan 28, 2025 | Instruction Following, Mixture-of-Experts | Unverified | 0 |
| A Cognitive Paradigm Approach to Probe the Perception-Reasoning Interface in VLMs | Jan 23, 2025 | Descriptive, Diagnostic | Unverified | 0 |
| Systematic Abductive Reasoning via Diverse Relation Representations in Vector-symbolic Architecture | Jan 21, 2025 | Attribute, Diversity | Unverified | 0 |
| MAPS: Advancing Multi-Modal Reasoning in Expert-Level Physical Science | Jan 18, 2025 | Visual Reasoning | Unverified | 0 |
| CityLoc: 6DoF Pose Distributional Localization for Text Descriptions in Large-Scale Scenes with Gaussian Representation | Jan 15, 2025 | Language Modeling, Language Modelling | Unverified | 0 |
| DRIVINGVQA: Analyzing Visual Chain-of-Thought Reasoning of Vision Language Models in Real-World Scenarios with Driving Theory Tests | Jan 8, 2025 | Multimodal Reasoning, Multiple-choice | Unverified | 0 |
| From Code to Compliance: Assessing ChatGPT's Utility in Designing an Accessible Webpage -- A Case Study | Jan 7, 2025 | Prompt Engineering, Visual Reasoning | Unverified | 0 |
| Socratic Questioning: Learn to Self-guide Multimodal Reasoning in the Wild | Jan 6, 2025 | Hallucination, Multimodal Reasoning | Code Available | 0 |
| LogicAD: Explainable Anomaly Detection via VLM-based Text Feature Extraction | Jan 3, 2025 | Anomaly Detection, Visual Reasoning | Unverified | 0 |
| Language-Guided Salient Object Ranking | Jan 1, 2025 | Object, Saliency Ranking | Unverified | 0 |
| Probing Visual Language Priors in VLMs | Dec 31, 2024 | Question Answering, Visual Question Answering | Unverified | 0 |
| Slow Perception: Let's Perceive Geometric Figures Step-by-step | Dec 30, 2024 | Math, Visual Reasoning | Unverified | 0 |
| HALLUCINOGEN: A Benchmark for Evaluating Object Hallucination in Large Visual-Language Models | Dec 29, 2024 | Hallucination, Object | Code Available | 0 |
| Revisiting MLLMs: An In-Depth Analysis of Image Classification Abilities | Dec 21, 2024 | Attribute, Classification | Unverified | 0 |
| EarthDial: Turning Multi-sensory Earth Observations to Interactive Dialogues | Dec 19, 2024 | Change Detection, Disaster Response | Unverified | 0 |
| ViUniT: Visual Unit Tests for More Robust Visual Programming | Dec 12, 2024 | Image Generation, Image-text matching | Unverified | 0 |
| Progressive Multi-granular Alignments for Grounded Reasoning in Large Vision-Language Models | Dec 11, 2024 | Question Answering, Visual Grounding | Code Available | 0 |
| MM-PoE: Multiple Choice Reasoning via. Process of Elimination using Multi-Modal Models | Dec 10, 2024 | Multiple-choice, Question Answering | Code Available | 0 |
| Perception Tokens Enhance Visual Reasoning in Multimodal Language Models | Dec 4, 2024 | Depth Estimation, object-detection | Unverified | 0 |
| VISCO: Benchmarking Fine-Grained Critique and Correction Towards Self-Improvement in Visual Reasoning | Dec 3, 2024 | Benchmarking, Visual Reasoning | Unverified | 0 |
| Learning Visual Abstract Reasoning through Dual-Stream Networks | Nov 29, 2024 | Visual Reasoning | Code Available | 0 |
| Abductive Symbolic Solver on Abstraction and Reasoning Corpus | Nov 27, 2024 | ARC, Visual Reasoning | Unverified | 0 |