| MET-Bench: Multimodal Entity Tracking for Evaluating the Limitations of Vision-Language and Reasoning Models | Feb 15, 2025 | Natural Language Understanding, Visual Reasoning | Unverified | 0 |
| ZeroBench: An Impossible Visual Benchmark for Contemporary Large Multimodal Models | Feb 13, 2025 | Visual Reasoning | Unverified | 0 |
| Visual Agentic AI for Spatial Reasoning with a Dynamic API | Feb 10, 2025 | Program Synthesis, Spatial Reasoning | Unverified | 0 |
| Boosting Multimodal Reasoning with MCTS-Automated Structured Thinking | Feb 4, 2025 | Computational Efficiency, Multimodal Reasoning | Unverified | 0 |
| Integrating LMM Planners and 3D Skill Policies for Generalizable Manipulation | Jan 30, 2025 | Memorization, Scene Understanding | Unverified | 0 |
| Rethinking Bottlenecks in Safety Fine-Tuning of Vision Language Models | Jan 30, 2025 | Instruction Following, Visual Reasoning | Unverified | 0 |
| 3D-MoE: A Mixture-of-Experts Multi-modal LLM for 3D Vision and Pose Diffusion via Rectified Flow | Jan 28, 2025 | Instruction Following, Mixture-of-Experts | Unverified | 0 |
| A Cognitive Paradigm Approach to Probe the Perception-Reasoning Interface in VLMs | Jan 23, 2025 | Descriptive, Diagnostic | Unverified | 0 |
| Systematic Abductive Reasoning via Diverse Relation Representations in Vector-symbolic Architecture | Jan 21, 2025 | Attribute, Diversity | Unverified | 0 |
| MAPS: Advancing Multi-Modal Reasoning in Expert-Level Physical Science | Jan 18, 2025 | Visual Reasoning | Unverified | 0 |
| CityLoc: 6DoF Pose Distributional Localization for Text Descriptions in Large-Scale Scenes with Gaussian Representation | Jan 15, 2025 | Language Modeling, Language Modelling | Unverified | 0 |
| LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs | Jan 10, 2025 | 4k, Visual Reasoning | Code Available | 3 |
| ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding | Jan 9, 2025 | Visual Question Answering (VQA), Visual Reasoning | Code Available | 2 |
| DRIVINGVQA: Analyzing Visual Chain-of-Thought Reasoning of Vision Language Models in Real-World Scenarios with Driving Theory Tests | Jan 8, 2025 | Multimodal Reasoning, Multiple-choice | Unverified | 0 |
| From Code to Compliance: Assessing ChatGPT's Utility in Designing an Accessible Webpage -- A Case Study | Jan 7, 2025 | Prompt Engineering, Visual Reasoning | Unverified | 0 |
| Socratic Questioning: Learn to Self-guide Multimodal Reasoning in the Wild | Jan 6, 2025 | Hallucination, Multimodal Reasoning | Code Available | 0 |
| Generalizing from SIMPLE to HARD Visual Reasoning: Can We Mitigate Modality Imbalance in VLMs? | Jan 5, 2025 | Image Captioning, Image to text | Code Available | 1 |
| LogicAD: Explainable Anomaly Detection via VLM-based Text Feature Extraction | Jan 3, 2025 | Anomaly Detection, Visual Reasoning | Unverified | 0 |
| Virgo: A Preliminary Exploration on Reproducing o1-like MLLM | Jan 3, 2025 | Language Modeling, Language Modelling | Code Available | 2 |
| Language-Guided Salient Object Ranking | Jan 1, 2025 | Object, Saliency Ranking | Unverified | 0 |
| Probing Visual Language Priors in VLMs | Dec 31, 2024 | Question Answering, Visual Question Answering | Unverified | 0 |
| Slow Perception: Let's Perceive Geometric Figures Step-by-step | Dec 30, 2024 | Math, Visual Reasoning | Unverified | 0 |
| HALLUCINOGEN: A Benchmark for Evaluating Object Hallucination in Large Visual-Language Models | Dec 29, 2024 | Hallucination, Object | Code Available | 0 |
| Revisiting MLLMs: An In-Depth Analysis of Image Classification Abilities | Dec 21, 2024 | Attribute, Classification | Unverified | 0 |
| EarthDial: Turning Multi-sensory Earth Observations to Interactive Dialogues | Dec 19, 2024 | Change Detection, Disaster Response | Unverified | 0 |
| ViUniT: Visual Unit Tests for More Robust Visual Programming | Dec 12, 2024 | Image Generation, Image-text matching | Unverified | 0 |
| Progressive Multi-granular Alignments for Grounded Reasoning in Large Vision-Language Models | Dec 11, 2024 | Question Answering, Visual Grounding | Code Available | 0 |
| MM-PoE: Multiple Choice Reasoning via. Process of Elimination using Multi-Modal Models | Dec 10, 2024 | Multiple-choice, Question Answering | Code Available | 0 |
| Perception Tokens Enhance Visual Reasoning in Multimodal Language Models | Dec 4, 2024 | Depth Estimation, object-detection | Unverified | 0 |
| VISCO: Benchmarking Fine-Grained Critique and Correction Towards Self-Improvement in Visual Reasoning | Dec 3, 2024 | Benchmarking, Visual Reasoning | Unverified | 0 |
| Learning Visual Abstract Reasoning through Dual-Stream Networks | Nov 29, 2024 | Visual Reasoning | Code Available | 0 |
| Enhancing Visual Reasoning with Autonomous Imagination in Multimodal Large Language Models | Nov 27, 2024 | Visual Reasoning | Unverified | 0 |
| Immune: Improving Safety Against Jailbreaks in Multi-modal LLMs via Inference-Time Alignment | Nov 27, 2024 | Safety Alignment, Visual Reasoning | Code Available | 1 |
| Abductive Symbolic Solver on Abstraction and Reasoning Corpus | Nov 27, 2024 | ARC, Visual Reasoning | Unverified | 0 |
| Visual Contexts Clarify Ambiguous Expressions: A Benchmark Dataset | Nov 21, 2024 | Question Answering, Visual Grounding | Code Available | 0 |
| Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models | Nov 21, 2024 | Visual Reasoning | Code Available | 3 |
| Beyond Visual Understanding: Introducing PARROT-360V for Vision Language Model Benchmarking | Nov 20, 2024 | Benchmarking, Language Modeling | Unverified | 0 |
| Learning to Reason Iteratively and Parallelly for Complex Visual Reasoning Scenarios | Nov 20, 2024 | Question Answering, Visual Question Answering (VQA) | Unverified | 0 |
| Automated 3D Physical Simulation of Open-world Scene with Gaussian Splatting | Nov 19, 2024 | 3D Generation, GPU | Unverified | 0 |
| ClevrSkills: Compositional Language and Visual Reasoning in Robotics | Nov 13, 2024 | Visual Reasoning | Code Available | 1 |
| On Erroneous Agreements of CLIP Image Embeddings | Nov 7, 2024 | Visual Reasoning | Code Available | 0 |
| HourVideo: 1-Hour Video-Language Understanding | Nov 7, 2024 | Benchmarking, counterfactual | Code Available | 2 |
| Inference Optimal VLMs Need Fewer Visual Tokens and More Parameters | Nov 5, 2024 | Token Reduction, Visual Reasoning | Code Available | 1 |
| Bootstrapping Top-down Information for Self-modulating Slot Attention | Nov 4, 2024 | Object, Object Discovery | Unverified | 0 |
| Reasoning Limitations of Multimodal Large Language Models. A case study of Bongard Problems | Nov 2, 2024 | Specificity, Visual Reasoning | Unverified | 0 |
| Replace-then-Perturb: Targeted Adversarial Attacks With Visual Reasoning for Vision-Language Models | Nov 1, 2024 | Adversarial Attack, Contrastive Learning | Unverified | 0 |
| LogiCity: Advancing Neuro-Symbolic AI with Abstract Urban Simulation | Nov 1, 2024 | Logical Reasoning, Sequential Decision Making | Code Available | 1 |
| VisAidMath: Benchmarking Visual-Aided Mathematical Reasoning | Oct 30, 2024 | Benchmarking, Hallucination | Unverified | 0 |
| Improving Generalization in Visual Reasoning via Self-Ensemble | Oct 28, 2024 | Visual Question Answering (VQA), Visual Reasoning | Unverified | 0 |
| Bongard in Wonderland: Visual Puzzles that Still Make AI Go Mad? | Oct 25, 2024 | Visual Reasoning | Code Available | 0 |