| Selective Vision is the Challenge for Visual Reasoning: A Benchmark for Visual Argument Understanding | Jun 27, 2024 | Visual Reasoning | Code Available | 1 |
| Disentangling Knowledge-based and Visual Reasoning by Question Decomposition in KB-VQA | Jun 27, 2024 | General Knowledge, Question Answering | Unverified | 0 |
| Visual Reasoning and Multi-Agent Approach in Multimodal Large Language Models (MLLMs): Solving TSP and mTSP Combinatorial Challenges | Jun 26, 2024 | In-Context Learning, Traveling Salesman Problem | Code Available | 0 |
| Evaluating Visual and Cultural Interpretation: The K-Viscuit Benchmark with Human-VLM Collaboration | Jun 24, 2024 | Diversity, Multiple-choice | Unverified | 0 |
| Beyond the Doors of Perception: Vision Transformers Represent Relations Between Objects | Jun 22, 2024 | Relational Reasoning, Visual Reasoning | Code Available | 0 |
| Whiteboard-of-Thought: Thinking Step-by-Step Across Modalities | Jun 20, 2024 | Spatial Reasoning, Visual Reasoning | Unverified | 0 |
| GSR-BENCH: A Benchmark for Grounded Spatial Reasoning Evaluation via Multimodal LLMs | Jun 19, 2024 | Spatial Reasoning, Visual Reasoning | Unverified | 0 |
| VDebugger: Harnessing Execution Feedback for Debugging Visual Programs | Jun 19, 2024 | Visual Reasoning | Code Available | 0 |
| RS-GPT4V: A Unified Multimodal Instruction-Following Dataset for Remote Sensing Image Understanding | Jun 18, 2024 | Attribute, Instruction Following | Code Available | 1 |
| Beyond Visual Appearances: Privacy-sensitive Objects Identification via Hybrid Graph Reasoning | Jun 18, 2024 | Data Augmentation, Graph Generation | Unverified | 0 |
| Slot State Space Models | Jun 18, 2024 | Mamba, State Space Models | Code Available | 1 |
| ClawMachine: Learning to Fetch Visual Tokens for Referential Comprehension | Jun 17, 2024 | Decoder, Visual Reasoning | Code Available | 1 |
| A Unified View of Abstract Visual Reasoning Problems | Jun 16, 2024 | Transfer Learning, Visual Reasoning | Unverified | 0 |
| A-I-RAVEN and I-RAVEN-Mesh: Two New Benchmarks for Abstract Visual Reasoning | Jun 16, 2024 | Transfer Learning, Visual Reasoning | Unverified | 0 |
| What is the Visual Cognition Gap between Humans and Multimodal LLMs? | Jun 14, 2024 | Object Detection | Code Available | 0 |
| Neural Concept Binder | Jun 14, 2024 | Descriptive, Retrieval | Code Available | 1 |
| Comparison Visual Instruction Tuning | Jun 13, 2024 | Instruction Following, Novelty Detection | Unverified | 0 |
| Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models | Jun 13, 2024 | Math, object-detection | Code Available | 3 |
| INS-MMBench: A Comprehensive Benchmark for Evaluating LVLMs' Performance in Insurance | Jun 13, 2024 | Multiple-choice, Visual Reasoning | Code Available | 1 |
| Eyeballing Combinatorial Problems: A Case Study of Using Multimodal Large Language Models to Solve Traveling Salesman Problems | Jun 11, 2024 | In-Context Learning, Traveling Salesman Problem | Unverified | 0 |
| HENASY: Learning to Assemble Scene-Entities for Egocentric Video-Language Model | Jun 1, 2024 | Action Recognition, Activity Recognition | Unverified | 0 |
| MMCTAgent: Multi-modal Critical Thinking Agent Framework for Complex Visual Reasoning | May 28, 2024 | Decision Making, Video Understanding | Unverified | 0 |
| Do Vision-Language Transformers Exhibit Visual Commonsense? An Empirical Study of VCR | May 27, 2024 | Question Answering, TAG | Unverified | 0 |
| Code Repair with LLMs gives an Exploration-Exploitation Tradeoff | May 26, 2024 | Code Repair, Language Modeling | Unverified | 0 |
| Visual Description Grounding Reduces Hallucinations and Boosts Reasoning in LVLMs | May 24, 2024 | Hallucination, Response Generation | Code Available | 1 |
| Image-of-Thought Prompting for Visual Reasoning Refinement in Multimodal Large Language Models | May 22, 2024 | Multimodal Reasoning, Visual Question Answering | Unverified | 0 |
| Analogist: Out-of-the-box Visual In-Context Learning with Image Diffusion Model | May 16, 2024 | Image Inpainting, In-Context Learning | Unverified | 0 |
| CLIP-Powered TASS: Target-Aware Single-Stream Network for Audio-Visual Question Answering | May 13, 2024 | Audio-Visual Question Answering (AVQA) | Unverified | 0 |
| Learning to Compose: Improving Object Centric Learning by Injecting Compositionality | May 1, 2024 | Object, Systematic Generalization | Code Available | 0 |
| Naturally Supervised 3D Visual Grounding with Language-Regularized Concept Learners | Apr 30, 2024 | 3D visual grounding, Visual Grounding | Unverified | 0 |
| BlenderAlchemy: Editing 3D Graphics with Vision-Language Models | Apr 26, 2024 | Game Design, Image Generation | Unverified | 0 |
| List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs | Apr 25, 2024 | Visual Grounding, Visual Question Answering | Code Available | 2 |
| Cantor: Inspiring Multimodal Chain-of-Thought of MLLM | Apr 24, 2024 | Decision Making, Logical Reasoning | Unverified | 0 |
| Think-Program-reCtify: 3D Situated Reasoning with Large Language Models | Apr 23, 2024 | Visual Reasoning | Unverified | 0 |
| MARVEL: Multidimensional Abstraction and Reasoning through Visual Evaluation and Learning | Apr 21, 2024 | Visual Reasoning | Code Available | 0 |
| Automated Evaluation of Large Vision-Language Models on Self-driving Corner Cases | Apr 16, 2024 | Autonomous Driving, Visual Reasoning | Unverified | 0 |
| MMCode: Benchmarking Multimodal Large Language Models for Code Generation with Visually Rich Programming Problems | Apr 15, 2024 | Benchmarking, Code Generation | Code Available | 1 |
| Visually Descriptive Language Model for Vector Graphics Reasoning | Apr 9, 2024 | Descriptive, Language Modeling | Code Available | 9 |
| Wu's Method can Boost Symbolic AI to Rival Silver Medalists and AlphaGeometry to Outperform Gold Medalists at IMO Geometry | Apr 9, 2024 | Automated Theorem Proving, CPU | Unverified | 0 |
| Plug-and-Play Grounding of Reasoning in Multimodal Large Language Models | Mar 28, 2024 | Instruction Following, Visual Reasoning | Unverified | 0 |
| Beyond Embeddings: The Promise of Visual Table in Visual Reasoning | Mar 27, 2024 | Representation Learning, Visual Question Answering | Code Available | 1 |
| PropTest: Automatic Property Testing for Improved Visual Programming | Mar 25, 2024 | Question Answering, Referring Expression | Unverified | 0 |
| LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models | Mar 22, 2024 | Language Modelling, Large Language Model | Code Available | 2 |
| VURF: A General-purpose Reasoning and Self-refinement Framework for Video Understanding | Mar 21, 2024 | Pose Estimation, Video Understanding | Code Available | 0 |
| Just Shift It: Test-Time Prototype Shifting for Zero-Shot Generalization with Vision-Language Models | Mar 19, 2024 | Image Classification | Code Available | 1 |
| HYDRA: A Hyper Agent for Dynamic Compositional Visual Reasoning | Mar 19, 2024 | Reinforcement Learning (RL), Visual Grounding | Code Available | 1 |
| Just Say the Name: Online Continual Learning with Category Names Only via Data Generation | Mar 16, 2024 | Continual Learning, Diversity | Unverified | 0 |
| Test-time Distribution Learning Adapter for Cross-modal Visual Reasoning | Mar 10, 2024 | Human-Object Interaction Detection, Prediction | Unverified | 0 |
| How Far Are We from Intelligent Visual Deductive Reasoning? | Mar 7, 2024 | In-Context Learning, Visual Reasoning | Code Available | 1 |
| Slot Abstractors: Toward Scalable Abstract Visual Reasoning | Mar 6, 2024 | Object, Systematic Generalization | Code Available | 0 |