| Whiteboard-of-Thought: Thinking Step-by-Step Across Modalities | Jun 20, 2024 | Spatial Reasoning, Visual Reasoning | Unverified | 0 |
| VDebugger: Harnessing Execution Feedback for Debugging Visual Programs | Jun 19, 2024 | Visual Reasoning | Code Available | 0 |
| GSR-BENCH: A Benchmark for Grounded Spatial Reasoning Evaluation via Multimodal LLMs | Jun 19, 2024 | Spatial Reasoning, Visual Reasoning | Unverified | 0 |
| Beyond Visual Appearances: Privacy-sensitive Objects Identification via Hybrid Graph Reasoning | Jun 18, 2024 | Data Augmentation, Graph Generation | Unverified | 0 |
| A Unified View of Abstract Visual Reasoning Problems | Jun 16, 2024 | Transfer Learning, Visual Reasoning | Unverified | 0 |
| A-I-RAVEN and I-RAVEN-Mesh: Two New Benchmarks for Abstract Visual Reasoning | Jun 16, 2024 | Transfer Learning, Visual Reasoning | Unverified | 0 |
| What is the Visual Cognition Gap between Humans and Multimodal LLMs? | Jun 14, 2024 | Object Detection | Code Available | 0 |
| Comparison Visual Instruction Tuning | Jun 13, 2024 | Instruction Following, Novelty Detection | Unverified | 0 |
| Eyeballing Combinatorial Problems: A Case Study of Using Multimodal Large Language Models to Solve Traveling Salesman Problems | Jun 11, 2024 | In-Context Learning, Traveling Salesman Problem | Unverified | 0 |
| HENASY: Learning to Assemble Scene-Entities for Egocentric Video-Language Model | Jun 1, 2024 | Action Recognition, Activity Recognition | Unverified | 0 |
| MMCTAgent: Multi-modal Critical Thinking Agent Framework for Complex Visual Reasoning | May 28, 2024 | Decision Making, Video Understanding | Unverified | 0 |
| Do Vision-Language Transformers Exhibit Visual Commonsense? An Empirical Study of VCR | May 27, 2024 | Question Answering, TAG | Unverified | 0 |
| Code Repair with LLMs gives an Exploration-Exploitation Tradeoff | May 26, 2024 | Code Repair, Language Modeling | Unverified | 0 |
| Image-of-Thought Prompting for Visual Reasoning Refinement in Multimodal Large Language Models | May 22, 2024 | Multimodal Reasoning, Visual Question Answering | Unverified | 0 |
| Analogist: Out-of-the-box Visual In-Context Learning with Image Diffusion Model | May 16, 2024 | Image Inpainting, In-Context Learning | Unverified | 0 |
| CLIP-Powered TASS: Target-Aware Single-Stream Network for Audio-Visual Question Answering | May 13, 2024 | Audio-Visual Question Answering (AVQA) | Unverified | 0 |
| Learning to Compose: Improving Object Centric Learning by Injecting Compositionality | May 1, 2024 | Object, Systematic Generalization | Code Available | 0 |
| Naturally Supervised 3D Visual Grounding with Language-Regularized Concept Learners | Apr 30, 2024 | 3D visual grounding, Visual Grounding | Unverified | 0 |
| BlenderAlchemy: Editing 3D Graphics with Vision-Language Models | Apr 26, 2024 | Game Design, Image Generation | Unverified | 0 |
| Cantor: Inspiring Multimodal Chain-of-Thought of MLLM | Apr 24, 2024 | Decision Making, Logical Reasoning | Unverified | 0 |
| Think-Program-reCtify: 3D Situated Reasoning with Large Language Models | Apr 23, 2024 | Visual Reasoning | Unverified | 0 |
| MARVEL: Multidimensional Abstraction and Reasoning through Visual Evaluation and Learning | Apr 21, 2024 | Visual Reasoning | Code Available | 0 |
| Automated Evaluation of Large Vision-Language Models on Self-driving Corner Cases | Apr 16, 2024 | Autonomous Driving, Visual Reasoning | Unverified | 0 |
| Wu's Method can Boost Symbolic AI to Rival Silver Medalists and AlphaGeometry to Outperform Gold Medalists at IMO Geometry | Apr 9, 2024 | Automated Theorem Proving, CPU | Unverified | 0 |
| Plug-and-Play Grounding of Reasoning in Multimodal Large Language Models | Mar 28, 2024 | Instruction Following, Visual Reasoning | Unverified | 0 |
| PropTest: Automatic Property Testing for Improved Visual Programming | Mar 25, 2024 | Question Answering, Referring Expression | Unverified | 0 |
| VURF: A General-purpose Reasoning and Self-refinement Framework for Video Understanding | Mar 21, 2024 | Pose Estimation, Video Understanding | Code Available | 0 |
| Just Say the Name: Online Continual Learning with Category Names Only via Data Generation | Mar 16, 2024 | Continual Learning, Diversity | Unverified | 0 |
| Test-time Distribution Learning Adapter for Cross-modal Visual Reasoning | Mar 10, 2024 | Human-Object Interaction Detection, Prediction | Unverified | 0 |
| Slot Abstractors: Toward Scalable Abstract Visual Reasoning | Mar 6, 2024 | Object, Systematic Generalization | Code Available | 0 |
| SNIFFER: Multimodal Large Language Model for Explainable Out-of-Context Misinformation Detection | Mar 5, 2024 | Concept Alignment, Explanation Generation | Unverified | 0 |
| What Is Missing in Multilingual Visual Reasoning and How to Fix It | Mar 3, 2024 | Image Captioning, Visual Reasoning | Code Available | 0 |
| Revisiting Disentanglement in Downstream Tasks: A Study on Its Necessity for Abstract Visual Reasoning | Mar 1, 2024 | Disentanglement, Informativeness | Code Available | 0 |
| VISREAS: Complex Visual Reasoning with Unanswerable Questions | Feb 23, 2024 | Question Answering, Visual Question Answering | Unverified | 0 |
| Visual Reasoning in Object-Centric Deep Neural Networks: A Comparative Cognition Approach | Feb 20, 2024 | Object, Relational Reasoning | Code Available | 0 |
| Visual In-Context Learning for Large Vision-Language Models | Feb 18, 2024 | In-Context Learning, Position | Unverified | 0 |
| ViGoR: Improving Visual Grounding of Large Vision Language Models with Fine-Grained Reward Modeling | Feb 9, 2024 | Hallucination, Natural Language Understanding | Code Available | 0 |
| Muffin or Chihuahua? Challenging Multimodal Large Language Models with Multipanel VQA | Jan 29, 2024 | Benchmarking, Image Comprehension | Unverified | 0 |
| Prompting Large Vision-Language Models for Compositional Reasoning | Jan 20, 2024 | Retrieval, Visual Reasoning | Code Available | 0 |
| Image Safeguarding: Reasoning with Conditional Vision Language Model and Obfuscating Unsafe Content Counterfactually | Jan 19, 2024 | counterfactual, Counterfactual Explanation | Code Available | 0 |
| Towards Generative Abstract Reasoning: Completing Raven's Progressive Matrix via Rule Abstraction and Selection | Jan 18, 2024 | Answer Generation, Attribute | Unverified | 0 |
| Language-Conditioned Robotic Manipulation with Fast and Slow Thinking | Jan 8, 2024 | Decision Making, Intent Recognition | Unverified | 0 |
| CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs | Jan 5, 2024 | Image Comprehension, Image to text | Unverified | 0 |
| Towards Truly Zero-shot Compositional Visual Reasoning with LLMs as Programmers | Jan 3, 2024 | Question Answering, Visual Grounding | Unverified | 0 |
| Generate Subgoal Images before Act: Unlocking the Chain-of-Thought Reasoning in Diffusion Model for Robot Manipulation with Multimodal Prompts | Jan 1, 2024 | Image Generation, Instruction Following | Unverified | 0 |
| ChartBench: A Benchmark for Complex Visual Reasoning in Charts | Dec 26, 2023 | Visual Reasoning | Unverified | 0 |
| A Challenger to GPT-4V? Early Explorations of Gemini in Visual Expertise | Dec 19, 2023 | MME, Visual Reasoning | Unverified | 0 |
| One Self-Configurable Model to Solve Many Abstract Visual Reasoning Problems | Dec 15, 2023 | Odd One Out, Transfer Learning | Code Available | 0 |
| GPT4SGG: Synthesizing Scene Graphs from Holistic and Region-specific Narratives | Dec 7, 2023 | Graph Generation, Language Modelling | Code Available | 0 |
| Leveraging VLM-Based Pipelines to Annotate 3D Objects | Nov 29, 2023 | In-Context Learning, Language Modeling | Unverified | 0 |