| ARB: A Comprehensive Arabic Multimodal Reasoning Benchmark | May 22, 2025 | document understanding, Multimodal Reasoning | Code Available | 1 | 5 |
| Immune: Improving Safety Against Jailbreaks in Multi-modal LLMs via Inference-Time Alignment | Nov 27, 2024 | Safety Alignment, Visual Reasoning | Code Available | 1 | 5 |
| Grounded Situation Recognition with Transformers | Nov 19, 2021 | Decoder, Grounded Situation Recognition | Code Available | 1 | 5 |
| Are Vision Language Models Ready for Clinical Diagnosis? A 3D Medical Benchmark for Tumor-centric Visual Question Answering | May 25, 2025 | Anatomy, Benchmarking | Code Available | 1 | 5 |
| ClawMachine: Learning to Fetch Visual Tokens for Referential Comprehension | Jun 17, 2024 | Decoder, Visual Reasoning | Code Available | 1 | 5 |
| Comprehensive Visual Question Answering on Point Clouds through Compositional Scene Manipulation | Dec 22, 2021 | Common Sense Reasoning, Question Answering | Code Available | 1 | 5 |
| MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering | Dec 19, 2022 | Form, Question Answering | Code Available | 1 | 5 |
| MixGen: A New Multi-Modal Data Augmentation | Jun 16, 2022 | Data Augmentation, Image-text Retrieval | Code Available | 1 | 5 |
| A Survey on Interpretable Cross-modal Reasoning | Sep 5, 2023 | Cross-Modal Retrieval, Decision Making | Code Available | 1 | 5 |
| MMCode: Benchmarking Multimodal Large Language Models for Code Generation with Visually Rich Programming Problems | Apr 15, 2024 | Benchmarking, Code Generation | Code Available | 1 | 5 |
| Measuring Progress in Fine-grained Vision-and-Language Understanding | May 12, 2023 | Visual Reasoning | Code Available | 1 | 5 |
| GRE Suite: Geo-localization Inference via Fine-Tuned Vision-Language Models and Enhanced Reasoning Chains | May 24, 2025 | geo-localization, Visual Reasoning | Code Available | 1 | 5 |
| Interpreting and Controlling Vision Foundation Models via Text Explanations | Oct 16, 2023 | Model Editing, Visual Reasoning | Code Available | 1 | 5 |
| Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone | Jun 15, 2022 | Described Object Detection, Image Captioning | Code Available | 1 | 5 |
| GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering | Feb 25, 2019 | Question Answering, Visual Question Answering (VQA) | Code Available | 1 | 5 |
| Attention-Based Context Aware Reasoning for Situation Recognition | Jun 1, 2020 | Action Recognition, Fine-grained Action Recognition | Code Available | 1 | 5 |
| A Benchmark for Compositional Visual Reasoning | Jun 11, 2022 | Visual Reasoning | Code Available | 1 | 5 |
| How Do Multimodal Large Language Models Handle Complex Multimodal Reasoning? Placing Them in An Extensible Escape Game | Mar 13, 2025 | Multimodal Reasoning, Question Answering | Code Available | 1 | 5 |
| Just Shift It: Test-Time Prototype Shifting for Zero-Shot Generalization with Vision-Language Models | Mar 19, 2024 | image-classification, Image Classification | Code Available | 1 | 5 |
| Collaborative Transformers for Grounded Situation Recognition | Mar 30, 2022 | Grounded Situation Recognition, Image Classification | Code Available | 1 | 5 |
| Perceive, Ground, Reason, and Act: A Benchmark for General-purpose Visual Representation | Nov 28, 2022 | 3D Reconstruction, Decoder | Code Available | 1 | 5 |
| ColorBench: Can VLMs See and Understand the Colorful World? A Comprehensive Benchmark for Color Perception, Reasoning, and Robustness | Apr 10, 2025 | Visual Reasoning | Code Available | 1 | 5 |
| From the Least to the Most: Building a Plug-and-Play Visual Reasoner via Data Synthesis | Jun 28, 2024 | Visual Question Answering (VQA), Visual Reasoning | Code Available | 1 | 5 |
| Large-Scale Adversarial Training for Vision-and-Language Representation Learning | Jun 11, 2020 | Image-text Retrieval, Question Answering | Code Available | 1 | 5 |
| Belief Revision based Caption Re-ranker with Visual Semantic Information | Sep 16, 2022 | Caption Generation, Image Captioning | Code Available | 1 | 5 |
| Compositional Attention Networks for Machine Reasoning | Mar 8, 2018 | Referring Expression Comprehension, Visual Question Answering (VQA) | Code Available | 1 | 5 |
| From Seconds to Hours: Reviewing MultiModal Large Language Models on Comprehensive Long Video Understanding | Sep 27, 2024 | Video Understanding, Visual Reasoning | Code Available | 1 | 5 |
| Generalizing from SIMPLE to HARD Visual Reasoning: Can We Mitigate Modality Imbalance in VLMs? | Jan 5, 2025 | Image Captioning, Image to text | Code Available | 1 | 5 |
| CAMEL-Bench: A Comprehensive Arabic LMM Benchmark | Oct 24, 2024 | document understanding, Video Understanding | Code Available | 1 | 5 |
| An Examination of the Compositionality of Large Generative Vision-Language Models | Aug 21, 2023 | Visual Reasoning | Code Available | 1 | 5 |
| From EduVisBench to EduVisAgent: A Benchmark and Multi-Agent Framework for Pedagogical Visualization | May 22, 2025 | Visual Reasoning | Code Available | 1 | 5 |
| GENOME: GenerativE Neuro-symbOlic visual reasoning by growing and reusing ModulEs | Nov 8, 2023 | Question Answering, Referring Expression | Code Available | 1 | 5 |
| Forgotten Polygons: Multimodal Large Language Models are Shape-Blind | Feb 21, 2025 | Math, Mathematical Problem-Solving | Code Available | 1 | 5 |
| An Empirical Study of Training End-to-End Vision-and-Language Transformers | Nov 3, 2021 | Cross-Modal Retrieval, Decoder | Code Available | 1 | 5 |
| Forward Prediction for Physical Reasoning | Jun 18, 2020 | Prediction, Visual Reasoning | Code Available | 1 | 5 |
| FineCops-Ref: A new Dataset and Task for Fine-Grained Compositional Referring Expression Comprehension | Sep 23, 2024 | Image Comprehension, Referring Expression | Code Available | 1 | 5 |
| FiLM: Visual Reasoning with a General Conditioning Layer | Sep 22, 2017 | Image Retrieval with Multi-Modal Query, Visual Question Answering (VQA) | Code Available | 1 | 5 |
| FLAVA: A Foundational Language And Vision Alignment Model | Dec 8, 2021 | Image Retrieval, Image-to-Text Retrieval | Code Available | 1 | 5 |
| Going Beyond Nouns With Vision & Language Models Using Synthetic Data | Mar 30, 2023 | Sentence, Visual Reasoning | Code Available | 1 | 5 |
| How Far Are We from Intelligent Visual Deductive Reasoning? | Mar 7, 2024 | In-Context Learning, Visual Reasoning | Code Available | 1 | 5 |
| Bongard-OpenWorld: Few-Shot Reasoning for Free-form Visual Concepts in the Real World | Oct 16, 2023 | Few-Shot Learning, Form | Code Available | 1 | 5 |
| Bongard-LOGO: A New Benchmark for Human-Level Concept Learning and Reasoning | Oct 2, 2020 | Novel Concepts, Representation Learning | Code Available | 1 | 5 |
| Bongard-HOI: Benchmarking Few-Shot Visual Reasoning for Human-Object Interactions | May 27, 2022 | Benchmarking, Few-Shot Image Classification | Code Available | 1 | 5 |
| Expressive Scene Graph Generation Using Commonsense Knowledge Infusion for Visual Understanding and Reasoning | May 31, 2022 | Common Sense Reasoning, Graph Generation | Code Available | 1 | 5 |
| Machine Number Sense: A Dataset of Visual Arithmetic Problems for Abstract and Relational Reasoning | Apr 25, 2020 | Relational Reasoning, Visual Reasoning | Code Available | 1 | 5 |
| Align before Fuse: Vision and Language Representation Learning with Momentum Distillation | Jul 16, 2021 | Cross-Modal Retrieval, Grounded language learning | Code Available | 1 | 5 |
| Equivariant Similarity for Vision-Language Foundation Models | Mar 25, 2023 | Image-text Retrieval, Retrieval | Code Available | 1 | 5 |
| LogiCity: Advancing Neuro-Symbolic AI with Abstract Urban Simulation | Nov 1, 2024 | Logical Reasoning, Sequential Decision Making | Code Available | 1 | 5 |
| Enhancing Cognition and Explainability of Multimodal Foundation Models with Self-Synthesized Data | Feb 19, 2025 | Fine-Grained Visual Recognition, Pneumonia Detection | Code Available | 1 | 5 |
| LLMs can be Dangerous Reasoners: Analyzing-based Jailbreak Attack on Large Language Models | Jul 23, 2024 | Multimodal Reasoning, Prompt Engineering | Code Available | 1 | 5 |