| Large-Scale Adversarial Training for Vision-and-Language Representation Learning | Jun 11, 2020 | Image-text Retrieval, Question Answering | Code Available | 1 |
| MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering | Dec 19, 2022 | Form, Question Answering | Code Available | 1 |
| Learning Differentiable Logic Programs for Abstract Visual Reasoning | Jul 3, 2023 | Program induction, Visual Reasoning | Code Available | 1 |
| Are Vision Language Models Ready for Clinical Diagnosis? A 3D Medical Benchmark for Tumor-centric Visual Question Answering | May 25, 2025 | Anatomy, Benchmarking | Code Available | 1 |
| ClawMachine: Learning to Fetch Visual Tokens for Referential Comprehension | Jun 17, 2024 | Decoder, Visual Reasoning | Code Available | 1 |
| Comprehensive Visual Question Answering on Point Clouds through Compositional Scene Manipulation | Dec 22, 2021 | Common Sense Reasoning, Question Answering | Code Available | 1 |
| Just Shift It: Test-Time Prototype Shifting for Zero-Shot Generalization with Vision-Language Models | Mar 19, 2024 | image-classification, Image Classification | Code Available | 1 |
| DrVD-Bench: Do Vision-Language Models Reason Like Human Doctors in Medical Image Diagnosis? | May 30, 2025 | Diagnostic, Medical Image Analysis | Code Available | 1 |
| A Survey on Interpretable Cross-modal Reasoning | Sep 5, 2023 | Cross-Modal Retrieval, Decision Making | Code Available | 1 |
| A Benchmark for Compositional Visual Reasoning | Jun 11, 2022 | Visual Reasoning | Code Available | 1 |
| ClevrSkills: Compositional Language and Visual Reasoning in Robotics | Nov 13, 2024 | Visual Reasoning | Code Available | 1 |
| CLEVR-X: A Visual Reasoning Dataset for Natural Language Explanations | Apr 5, 2022 | Explanation Generation, Question Answering | Code Available | 1 |
| KiVA: Kid-inspired Visual Analogies for Testing Large Multimodal Models | Jul 25, 2024 | Visual Analogies, Visual Reasoning | Code Available | 1 |
| Learning Long-term Visual Dynamics with Region Proposal Interaction Networks | Aug 5, 2020 | Common Sense Reasoning, Object | Code Available | 1 |
| Interpretable Image Classification via Non-parametric Part Prototype Learning | Mar 13, 2025 | image-classification, Image Classification | Code Available | 1 |
| Inter-GPS: Interpretable Geometry Problem Solving with Formal Language and Symbolic Reasoning | May 10, 2021 | Arithmetic Reasoning, Geometry Problem Solving | Code Available | 1 |
| Interpreting and Controlling Vision Foundation Models via Text Explanations | Oct 16, 2023 | Model Editing, Visual Reasoning | Code Available | 1 |
| CAMEL-Bench: A Comprehensive Arabic LMM Benchmark | Oct 24, 2024 | document understanding, Video Understanding | Code Available | 1 |
| An Examination of the Compositionality of Large Generative Vision-Language Models | Aug 21, 2023 | Visual Reasoning | Code Available | 1 |
| INS-MMBench: A Comprehensive Benchmark for Evaluating LVLMs' Performance in Insurance | Jun 13, 2024 | Multiple-choice, Visual Reasoning | Code Available | 1 |
| IRFL: Image Recognition of Figurative Language | Mar 27, 2023 | Classification, Visual Reasoning | Code Available | 1 |
| ColorBench: Can VLMs See and Understand the Colorful World? A Comprehensive Benchmark for Color Perception, Reasoning, and Robustness | Apr 10, 2025 | Visual Reasoning | Code Available | 1 |
| Dynamic Language Binding in Relational Visual Reasoning | Apr 30, 2020 | Object, Question Answering | Code Available | 1 |
| PHYRE: A New Benchmark for Physical Reasoning | Aug 15, 2019 | Visual Reasoning | Code Available | 1 |
| Belief Revision based Caption Re-ranker with Visual Semantic Information | Sep 16, 2022 | Caption Generation, Image Captioning | Code Available | 1 |
| Compositional Attention Networks for Machine Reasoning | Mar 8, 2018 | Referring Expression Comprehension, Visual Question Answering (VQA) | Code Available | 1 |
| Immune: Improving Safety Against Jailbreaks in Multi-modal LLMs via Inference-Time Alignment | Nov 27, 2024 | Safety Alignment, Visual Reasoning | Code Available | 1 |
| An Empirical Study of Training End-to-End Vision-and-Language Transformers | Nov 3, 2021 | Cross-Modal Retrieval, Decoder | Code Available | 1 |
| Inference Optimal VLMs Need Fewer Visual Tokens and More Parameters | Nov 5, 2024 | Token Reduction, Visual Reasoning | Code Available | 1 |
| HYDRA: A Hyper Agent for Dynamic Compositional Visual Reasoning | Mar 19, 2024 | Reinforcement Learning (RL), Visual Grounding | Code Available | 1 |
| HumanEval-V: Evaluating Visual Understanding and Reasoning Abilities of Large Multimodal Models Through Coding Tasks | Oct 16, 2024 | Code Generation, HumanEval | Code Available | 1 |
| Image Retrieval on Real-life Images with Pre-trained Vision-and-Language Models | Aug 9, 2021 | Composed Image Retrieval (CoIR), Image Retrieval | Code Available | 1 |
| Is BERT Blind? Exploring the Effect of Vision-and-Language Pretraining on Visual Language Understanding | Mar 21, 2023 | Knowledge Probing, Language Modelling | Code Available | 1 |
| LogiCity: Advancing Neuro-Symbolic AI with Abstract Urban Simulation | Nov 1, 2024 | Logical Reasoning, Sequential Decision Making | Code Available | 1 |
| Bongard-OpenWorld: Few-Shot Reasoning for Free-form Visual Concepts in the Real World | Oct 16, 2023 | Few-Shot Learning, Form | Code Available | 1 |
| Bongard-LOGO: A New Benchmark for Human-Level Concept Learning and Reasoning | Oct 2, 2020 | Novel Concepts, Representation Learning | Code Available | 1 |
| How Do Multimodal Large Language Models Handle Complex Multimodal Reasoning? Placing Them in An Extensible Escape Game | Mar 13, 2025 | Multimodal Reasoning, Question Answering | Code Available | 1 |
| Bongard-HOI: Benchmarking Few-Shot Visual Reasoning for Human-Object Interactions | May 27, 2022 | Benchmarking, Few-Shot Image Classification | Code Available | 1 |
| How Far Are We from Intelligent Visual Deductive Reasoning? | Mar 7, 2024 | In-Context Learning, Visual Reasoning | Code Available | 1 |
| Align before Fuse: Vision and Language Representation Learning with Momentum Distillation | Jul 16, 2021 | Cross-Modal Retrieval, Grounded language learning | Code Available | 1 |
| Grounded Situation Recognition with Transformers | Nov 19, 2021 | Decoder, Grounded Situation Recognition | Code Available | 1 |
| Distilled Dual-Encoder Model for Vision-Language Understanding | Dec 16, 2021 | Image to text, model | Code Available | 1 |
| Divide and Conquer: Answering Questions with Object Factorization and Compositional Reasoning | Mar 18, 2023 | Decision Making, Visual Reasoning | Code Available | 1 |
| Daily-Omni: Towards Audio-Visual Reasoning with Temporal Alignment across Modalities | May 23, 2025 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | Code Available | 1 |
| GRE Suite: Geo-localization Inference via Fine-Tuned Vision-Language Models and Enhanced Reasoning Chains | May 24, 2025 | geo-localization, Visual Reasoning | Code Available | 1 |
| CyCLIP: Cyclic Contrastive Language-Image Pretraining | May 28, 2022 | Representation Learning, Visual Reasoning | Code Available | 1 |
| From the Least to the Most: Building a Plug-and-Play Visual Reasoner via Data Synthesis | Jun 28, 2024 | Visual Question Answering (VQA), Visual Reasoning | Code Available | 1 |
| Generalizing from SIMPLE to HARD Visual Reasoning: Can We Mitigate Modality Imbalance in VLMs? | Jan 5, 2025 | Image Captioning, Image to text | Code Available | 1 |
| Cross-Modality Relevance for Reasoning on Language and Vision | May 12, 2020 | Question Answering, Visual Question Answering | Code Available | 1 |
| Cross-modal Attention Congruence Regularization for Vision-Language Relation Alignment | Dec 20, 2022 | Relation, Visual Reasoning | Code Available | 1 |