| Distill Visual Chart Reasoning Ability from LLMs to MLLMs | Oct 24, 2024 | Multimodal Reasoning, Visual Reasoning | Code Available | 2 |
| CAMEL-Bench: A Comprehensive Arabic LMM Benchmark | Oct 24, 2024 | document understanding, Video Understanding | Code Available | 1 |
| ProReason: Multi-Modal Proactive Reasoning with Decoupled Eyesight and Wisdom | Oct 18, 2024 | Visual Reasoning | Unverified | 0 |
| HumanEval-V: Evaluating Visual Understanding and Reasoning Abilities of Large Multimodal Models Through Coding Tasks | Oct 16, 2024 | Code Generation, HumanEval | Code Available | 1 |
| MCTBench: Multimodal Cognition towards Text-Rich Visual Scenes Benchmark | Oct 15, 2024 | Fairness, Scene Text Recognition | Code Available | 0 |
| ForgeryGPT: Multimodal Large Language Model For Explainable Image Forgery Detection and Localization | Oct 14, 2024 | Explanation Generation, Image Forgery Detection | Unverified | 0 |
| Towards Efficient Visual-Language Alignment of the Q-Former for Visual Reasoning Tasks | Oct 12, 2024 | parameter-efficient fine-tuning, Visual Reasoning | Code Available | 1 |
| TVBench: Redesigning Video-Language Evaluation | Oct 10, 2024 | Multiple-choice, Open-Ended Question Answering | Unverified | 0 |
| Tackling the Abstraction and Reasoning Corpus with Vision Transformers: the Importance of 2D Representation, Positions, and Objects | Oct 8, 2024 | ARC, Program Synthesis | Code Available | 1 |
| Transformers Utilization in Chart Understanding: A Review of Recent Advances & Future Trends | Oct 5, 2024 | Benchmarking, Chart Understanding | Unverified | 0 |
| Mind the GAP: Glimpse-based Active Perception improves generalization and sample efficiency of visual reasoning | Sep 30, 2024 | Visual Reasoning | Code Available | 0 |
| From Seconds to Hours: Reviewing MultiModal Large Language Models on Comprehensive Long Video Understanding | Sep 27, 2024 | Video Understanding, Visual Reasoning | Code Available | 1 |
| Advancing Object Detection in Transportation with Multimodal Large Language Models (MLLMs): A Comprehensive Review and Empirical Testing | Sep 26, 2024 | Event Detection, Object | Unverified | 0 |
| GSON: A Group-based Social Navigation Framework with Large Multimodal Model | Sep 26, 2024 | Autonomous Vehicles, Motion Planning | Unverified | 0 |
| FineCops-Ref: A new Dataset and Task for Fine-Grained Compositional Referring Expression Comprehension | Sep 23, 2024 | Image Comprehension, Referring Expression | Code Available | 1 |
| Enhancing Advanced Visual Reasoning Ability of Large Language Models | Sep 21, 2024 | In-Context Learning, Visual Reasoning | Unverified | 0 |
| Impact of ML Optimization Tactics on Greener Pre-Trained ML Models | Sep 19, 2024 | GPU, image-classification | Unverified | 0 |
| JourneyBench: A Challenging One-Stop Vision-Language Understanding Benchmark of Generated Images | Sep 19, 2024 | Hallucination, Image Captioning | Code Available | 0 |
| What Makes a Maze Look Like a Maze? | Sep 12, 2024 | Visual Reasoning | Unverified | 0 |
| Critical Features Tracking on Triangulated Irregular Networks by a Scale-Space Method | Sep 10, 2024 | Visual Reasoning | Unverified | 0 |
| MMEvol: Empowering Multimodal Large Language Models with Evol-Instruct | Sep 9, 2024 | Diversity, Visual Reasoning | Unverified | 0 |
| How to Determine the Preferred Image Distribution of a Black-Box Vision-Language Model? | Sep 3, 2024 | In-Context Learning, Language Modeling | Code Available | 0 |
| Zero-Shot Visual Reasoning by Vision-Language Models: Benchmarking and Analysis | Aug 27, 2024 | Benchmarking, Large Language Model | Unverified | 0 |
| Multi-Modal Dialogue State Tracking for Playing GuessWhich Game | Aug 15, 2024 | Dialogue State Tracking, Visual Reasoning | Code Available | 0 |
| UniBench: Visual Reasoning Requires Rethinking Vision-Language Beyond Scaling | Aug 9, 2024 | GPU, Language Modeling | Code Available | 3 |
| ArtVLM: Attribute Recognition Through Vision-Based Prefix Language Modeling | Aug 7, 2024 | Attribute, Language Modeling | Code Available | 0 |
| Compromising Embodied Agents with Contextual Backdoor Attacks | Aug 6, 2024 | Autonomous Driving, Robot Manipulation | Unverified | 0 |
| ExoViP: Step-by-step Verification and Exploration with Exoskeleton Modules for Compositional Visual Reasoning | Aug 5, 2024 | Visual Reasoning | Unverified | 0 |
| Chat2Layout: Interactive 3D Furniture Layout with a Multimodal LLM | Jul 31, 2024 | In-Context Learning, Layout Design | Unverified | 0 |
| A Plug-and-Play Method for Rare Human-Object Interactions Detection by Bridging Domain Gap | Jul 31, 2024 | Human-Object Interaction Detection, Image Reconstruction | Code Available | 0 |
| Pyramid Coder: Hierarchical Code Generator for Compositional Visual Question Answering | Jul 30, 2024 | Code Generation, Question Answering | Unverified | 0 |
| Take A Step Back: Rethinking the Two Stages in Visual Reasoning | Jul 29, 2024 | Logical Reasoning, Question Answering | Unverified | 0 |
| Wonderful Team: Zero-Shot Physical Task Planning with Visual LLMs | Jul 26, 2024 | Action Generation, Large Language Model | Code Available | 1 |
| KiVA: Kid-inspired Visual Analogies for Testing Large Multimodal Models | Jul 25, 2024 | Visual Analogies, Visual Reasoning | Code Available | 1 |
| Untrained neural networks can demonstrate memorization-independent abstract reasoning | Jul 25, 2024 | Memorization, Visual Reasoning | Code Available | 0 |
| LLMs can be Dangerous Reasoners: Analyzing-based Jailbreak Attack on Large Language Models | Jul 23, 2024 | Multimodal Reasoning, Prompt Engineering | Code Available | 1 |
| Can VLMs be used on videos for action recognition? LLMs are Visual Reasoning Coordinators | Jul 20, 2024 | Action Recognition, CoLA | Unverified | 0 |
| I Know About "Up"! Enhancing Spatial Reasoning in Visual Language Models Through 3D Reconstruction | Jul 19, 2024 | 3D Reconstruction, Spatial Reasoning | Unverified | 0 |
| X-Former: Unifying Contrastive and Reconstruction Learning for MLLMs | Jul 18, 2024 | Contrastive Learning, Representation Learning | Unverified | 0 |
| Open-World Visual Reasoning by a Neuro-Symbolic Program of Zero-Shot Symbols | Jul 18, 2024 | Visual Reasoning | Unverified | 0 |
| SwitchCIT: Switching for Continual Instruction Tuning | Jul 16, 2024 | Text Generation, Visual Reasoning | Unverified | 0 |
| NTSEBENCH: Cognitive Reasoning Benchmark for Vision Language Models | Jul 15, 2024 | Common Sense Reasoning, Multiple-choice | Unverified | 0 |
| Affordance-Guided Reinforcement Learning via Visual Prompting | Jul 14, 2024 | reinforcement-learning, Reinforcement Learning | Unverified | 0 |
| NODE-Adapter: Neural Ordinary Differential Equations for Better Vision-Language Reasoning | Jul 11, 2024 | Domain Generalization, Human-Object Interaction Detection | Unverified | 0 |
| Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning Instruction Using Language Model | Jul 9, 2024 | Chart Understanding, Language Modeling | Code Available | 2 |
| TokenPacker: Efficient Visual Projector for Multimodal LLM | Jul 2, 2024 | Language Modelling, Large Language Model | Code Available | 3 |
| We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning? | Jul 1, 2024 | Math, Mathematical Reasoning | Code Available | 2 |
| From the Least to the Most: Building a Plug-and-Play Visual Reasoner via Data Synthesis | Jun 28, 2024 | Visual Question Answering (VQA), Visual Reasoning | Code Available | 1 |
| MMRo: Are Multimodal LLMs Eligible as the Brain for In-Home Robotics? | Jun 28, 2024 | Task Planning, Visual Reasoning | Unverified | 0 |
| Disentangling Knowledge-based and Visual Reasoning by Question Decomposition in KB-VQA | Jun 27, 2024 | General Knowledge, Question Answering | Unverified | 0 |