| Grounded Situation Recognition with Transformers | Nov 19, 2021 | DecoderGrounded Situation Recognition | CodeCode Available | 1 | 5 |
| Align before Fuse: Vision and Language Representation Learning with Momentum Distillation | Jul 16, 2021 | Cross-Modal RetrievalGrounded language learning | CodeCode Available | 1 | 5 |
| How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs | Nov 27, 2023 | Adversarial RobustnessVisual Question Answering (VQA) | CodeCode Available | 1 | 5 |
| Differentiable Adaptive Computation Time for Visual Reasoning | Apr 27, 2020 | Visual Reasoning | CodeCode Available | 1 | 5 |
| Bongard-LOGO: A New Benchmark for Human-Level Concept Learning and Reasoning | Oct 2, 2020 | Novel ConceptsRepresentation Learning | CodeCode Available | 1 | 5 |
| How Far Are We from Intelligent Visual Deductive Reasoning? | Mar 7, 2024 | In-Context LearningVisual Reasoning | CodeCode Available | 1 | 5 |
| Distilled Dual-Encoder Model for Vision-Language Understanding | Dec 16, 2021 | Image to textmodel | CodeCode Available | 1 | 5 |
| Peacock: A Family of Arabic Multimodal Large Language Models and Benchmarks | Mar 1, 2024 | Visual Reasoning | CodeCode Available | 1 | 5 |
| GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering | Feb 25, 2019 | Question AnsweringVisual Question Answering (VQA) | CodeCode Available | 1 | 5 |
| HumanEval-V: Evaluating Visual Understanding and Reasoning Abilities of Large Multimodal Models Through Coding Tasks | Oct 16, 2024 | Code GenerationHumanEval | CodeCode Available | 1 | 5 |
| Immune: Improving Safety Against Jailbreaks in Multi-modal LLMs via Inference-Time Alignment | Nov 27, 2024 | Safety AlignmentVisual Reasoning | CodeCode Available | 1 | 5 |
| Image Retrieval on Real-life Images with Pre-trained Vision-and-Language Models | Aug 9, 2021 | Composed Image Retrieval (CoIR)Image Retrieval | CodeCode Available | 1 | 5 |
| GRE Suite: Geo-localization Inference via Fine-Tuned Vision-Language Models and Enhanced Reasoning Chains | May 24, 2025 | geo-localizationVisual Reasoning | CodeCode Available | 1 | 5 |
| Inference Optimal VLMs Need Fewer Visual Tokens and More Parameters | Nov 5, 2024 | Token ReductionVisual Reasoning | CodeCode Available | 1 | 5 |
| Daily-Omni: Towards Audio-Visual Reasoning with Temporal Alignment across Modalities | May 23, 2025 | Automatic Speech RecognitionAutomatic Speech Recognition (ASR) | CodeCode Available | 1 | 5 |
| Perceive, Ground, Reason, and Act: A Benchmark for General-purpose Visual Representation | Nov 28, 2022 | 3D ReconstructionDecoder | CodeCode Available | 1 | 5 |
| ProTo: Program-Guided Transformer for Program-Guided Tasks | Oct 2, 2021 | Decision MakingLearning to Execute | CodeCode Available | 1 | 5 |
| CyCLIP: Cyclic Contrastive Language-Image Pretraining | May 28, 2022 | Representation LearningVisual Reasoning | CodeCode Available | 1 | 5 |
| DrVD-Bench: Do Vision-Language Models Reason Like Human Doctors in Medical Image Diagnosis? | May 30, 2025 | DiagnosticMedical Image Analysis | CodeCode Available | 1 | 5 |
| CAMEL-Bench: A Comprehensive Arabic LMM Benchmark | Oct 24, 2024 | document understandingVideo Understanding | CodeCode Available | 1 | 5 |
| From the Least to the Most: Building a Plug-and-Play Visual Reasoner via Data Synthesis | Jun 28, 2024 | Visual Question Answering (VQA)Visual Reasoning | CodeCode Available | 1 | 5 |
| IRFL: Image Recognition of Figurative Language | Mar 27, 2023 | ClassificationVisual Reasoning | CodeCode Available | 1 | 5 |
| Is BERT Blind? Exploring the Effect of Vision-and-Language Pretraining on Visual Language Understanding | Mar 21, 2023 | Knowledge ProbingLanguage Modelling | CodeCode Available | 1 | 5 |
| Dynamic Language Binding in Relational Visual Reasoning | Apr 30, 2020 | ObjectQuestion Answering | CodeCode Available | 1 | 5 |
| Generalizing from SIMPLE to HARD Visual Reasoning: Can We Mitigate Modality Imbalance in VLMs? | Jan 5, 2025 | Image CaptioningImage to text | CodeCode Available | 1 | 5 |
| Cross-Modality Relevance for Reasoning on Language and Vision | May 12, 2020 | Question AnsweringVisual Question Answering | CodeCode Available | 1 | 5 |
| Cross-modal Attention Congruence Regularization for Vision-Language Relation Alignment | Dec 20, 2022 | RelationVisual Reasoning | CodeCode Available | 1 | 5 |
| Efficient Vision-Language Pretraining with Visual Concepts and Hierarchical Alignment | Aug 29, 2022 | cross-modal alignmentImage-text Retrieval | CodeCode Available | 1 | 5 |
| RS-GPT4V: A Unified Multimodal Instruction-Following Dataset for Remote Sensing Image Understanding | Jun 18, 2024 | AttributeInstruction Following | CodeCode Available | 1 | 5 |
| A Closer Look at Generalisation in RAVEN | Aug 1, 2020 | Visual Reasoning | CodeCode Available | 1 | 5 |
| Learning Differentiable Logic Programs for Abstract Visual Reasoning | Jul 3, 2023 | Program inductionVisual Reasoning | CodeCode Available | 1 | 5 |
| Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning | Apr 7, 2021 | Representation LearningRetrieval | CodeCode Available | 1 | 5 |
| GENOME: GenerativE Neuro-symbOlic visual reasoning by growing and reusing ModulEs | Nov 8, 2023 | Question AnsweringReferring Expression | CodeCode Available | 1 | 5 |
| CrossGET: Cross-Guided Ensemble of Tokens for Accelerating Vision-Language Transformers | May 27, 2023 | Image CaptioningImage Retrieval | CodeCode Available | 1 | 5 |
| From EduVisBench to EduVisAgent: A Benchmark and Multi-Agent Framework for Pedagogical Visualization | May 22, 2025 | Visual Reasoning | CodeCode Available | 1 | 5 |
| Learning Long-term Visual Dynamics with Region Proposal Interaction Networks | Aug 5, 2020 | Common Sense ReasoningObject | CodeCode Available | 1 | 5 |
| From Seconds to Hours: Reviewing MultiModal Large Language Models on Comprehensive Long Video Understanding | Sep 27, 2024 | Video UnderstandingVisual Reasoning | CodeCode Available | 1 | 5 |
| ARB: A Comprehensive Arabic Multimodal Reasoning Benchmark | May 22, 2025 | document understandingMultimodal Reasoning | CodeCode Available | 1 | 5 |
| Forgotten Polygons: Multimodal Large Language Models are Shape-Blind | Feb 21, 2025 | MathMathematical Problem-Solving | CodeCode Available | 1 | 5 |
| Learning to Discretely Compose Reasoning Module Networks for Video Captioning | Jul 17, 2020 | DecoderQuestion Answering | CodeCode Available | 1 | 5 |
| Agent-X: Evaluating Deep Multimodal Reasoning in Vision-Centric Agentic Tasks | May 30, 2025 | Autonomous DrivingMath | CodeCode Available | 1 | 5 |
| Forward Prediction for Physical Reasoning | Jun 18, 2020 | PredictionVisual Reasoning | CodeCode Available | 1 | 5 |
| Going Beyond Nouns With Vision & Language Models Using Synthetic Data | Mar 30, 2023 | SentenceVisual Reasoning | CodeCode Available | 1 | 5 |
| FiLM: Visual Reasoning with a General Conditioning Layer | Sep 22, 2017 | Image Retrieval with Multi-Modal QueryVisual Question Answering (VQA) | CodeCode Available | 1 | 5 |
| Agentic Keyframe Search for Video Question Answering | Mar 20, 2025 | EgoSchemaQuestion Answering | CodeCode Available | 1 | 5 |
| FineCops-Ref: A new Dataset and Task for Fine-Grained Compositional Referring Expression Comprehension | Sep 23, 2024 | Image ComprehensionReferring Expression | CodeCode Available | 1 | 5 |
| NS3D: Neuro-Symbolic Grounding of 3D Objects and Relations | Mar 23, 2023 | Question AnsweringReferring Expression | CodeCode Available | 1 | 5 |
| Super-CLEVR: A Virtual Benchmark to Diagnose Domain Robustness in Visual Reasoning | Dec 1, 2022 | Domain GeneralizationQuestion Answering | CodeCode Available | 1 | 5 |
| Beyond Embeddings: The Promise of Visual Table in Visual Reasoning | Mar 27, 2024 | Representation LearningVisual Question Answering | CodeCode Available | 1 | 5 |
| OpenSeg-R: Improving Open-Vocabulary Segmentation via Step-by-Step Visual Reasoning | May 22, 2025 | Open Vocabulary Panoptic SegmentationOpen Vocabulary Semantic Segmentation | CodeCode Available | 1 | 5 |