| VinVL: Revisiting Visual Representations in Vision-Language Models | Jan 2, 2021 | Image Captioning, Image-text matching | Code Available | 2 |
| VCoder: Versatile Vision Encoders for Multimodal Large Language Models | Dec 21, 2023 | Image Captioning, Image Generation | Code Available | 2 |
| SFT or RL? An Early Investigation into Training R1-Like Reasoning Large Vision-Language Models | Apr 10, 2025 | Reinforcement Learning (RL), Visual Reasoning | Code Available | 2 |
| Q-Insight: Understanding Image Quality via Visual Reinforcement Learning | Mar 28, 2025 | Descriptive, Image Quality Assessment | Code Available | 2 |
| AlphaMaze: Enhancing Large Language Models' Spatial Intelligence via GRPO | Feb 20, 2025 | Autonomous Navigation, Navigate | Code Available | 2 |
| Open CaptchaWorld: A Comprehensive Web-based Platform for Testing and Benchmarking Multimodal LLM Agents | May 30, 2025 | Benchmarking, Blocking | Code Available | 2 |
| 3D-VisTA: Pre-trained Transformer for 3D Vision and Text Alignment | Aug 8, 2023 | 3D Question Answering (3D-QA), Dense Captioning | Code Available | 2 |
| OmniCaptioner: One Captioner to Rule Them All | Apr 9, 2025 | All, Image Captioning | Code Available | 2 |
| Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning Instruction Using Language Model | Jul 9, 2024 | Chart Understanding, Language Modeling | Code Available | 2 |
| MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning | Sep 14, 2023 | Hallucination, In-Context Learning | Code Available | 2 |
| Neurosymbolic Diffusion Models | May 19, 2025 | Autonomous Driving, Uncertainty Quantification | Code Available | 2 |
| LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models | Mar 22, 2024 | Language Modelling, Large Language Model | Code Available | 2 |
| Learning Transferable Visual Models From Natural Language Supervision | Feb 26, 2021 | Action Recognition, Benchmarking | Code Available | 2 |
| Learning to Compose Dynamic Tree Structures for Visual Contexts | Dec 5, 2018 | Graph Generation, Panoptic Scene Graph Generation | Code Available | 2 |
| List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs | Apr 25, 2024 | Visual Grounding, Visual Question Answering | Code Available | 2 |
| NoisyRollout: Reinforcing Visual Reasoning with Data Augmentation | Apr 17, 2025 | Data Augmentation, Diversity | Code Available | 2 |
| Belief Revision based Caption Re-ranker with Visual Semantic Information | Sep 16, 2022 | Caption Generation, Image Captioning | Code Available | 1 |
| Is BERT Blind? Exploring the Effect of Vision-and-Language Pretraining on Visual Language Understanding | Mar 21, 2023 | Knowledge Probing, Language Modelling | Code Available | 1 |
| Just Shift It: Test-Time Prototype Shifting for Zero-Shot Generalization with Vision-Language Models | Mar 19, 2024 | image-classification, Image Classification | Code Available | 1 |
| Interpreting and Controlling Vision Foundation Models via Text Explanations | Oct 16, 2023 | Model Editing, Visual Reasoning | Code Available | 1 |
| Interpretable Image Classification via Non-parametric Part Prototype Learning | Mar 13, 2025 | image-classification, Image Classification | Code Available | 1 |
| IRFL: Image Recognition of Figurative Language | Mar 27, 2023 | Classification, Visual Reasoning | Code Available | 1 |
| KiVA: Kid-inspired Visual Analogies for Testing Large Multimodal Models | Jul 25, 2024 | Visual Analogies, Visual Reasoning | Code Available | 1 |
| Attention-Based Context Aware Reasoning for Situation Recognition | Jun 1, 2020 | Action Recognition, Fine-grained Action Recognition | Code Available | 1 |
| INS-MMBench: A Comprehensive Benchmark for Evaluating LVLMs' Performance in Insurance | Jun 13, 2024 | Multiple-choice, Visual Reasoning | Code Available | 1 |
| Abstract Visual Reasoning: An Algebraic Approach for Solving Raven's Progressive Matrices | Mar 21, 2023 | Visual Reasoning | Code Available | 1 |
| Inference Optimal VLMs Need Fewer Visual Tokens and More Parameters | Nov 5, 2024 | Token Reduction, Visual Reasoning | Code Available | 1 |
| Inter-GPS: Interpretable Geometry Problem Solving with Formal Language and Symbolic Reasoning | May 10, 2021 | Arithmetic Reasoning, Geometry Problem Solving | Code Available | 1 |
| Large-Scale Adversarial Training for Vision-and-Language Representation Learning | Jun 11, 2020 | Image-text Retrieval, Question Answering | Code Available | 1 |
| A Survey on Interpretable Cross-modal Reasoning | Sep 5, 2023 | Cross-Modal Retrieval, Decision Making | Code Available | 1 |
| HumanEval-V: Evaluating Visual Understanding and Reasoning Abilities of Large Multimodal Models Through Coding Tasks | Oct 16, 2024 | Code Generation, HumanEval | Code Available | 1 |
| HYDRA: A Hyper Agent for Dynamic Compositional Visual Reasoning | Mar 19, 2024 | Reinforcement Learning (RL), Visual Grounding | Code Available | 1 |
| CoCa: Contrastive Captioners are Image-Text Foundation Models | May 4, 2022 | Action Classification, Decoder | Code Available | 1 |
| Are Vision Language Models Ready for Clinical Diagnosis? A 3D Medical Benchmark for Tumor-centric Visual Question Answering | May 25, 2025 | Anatomy, Benchmarking | Code Available | 1 |
| How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs | Nov 27, 2023 | Adversarial Robustness, Visual Question Answering (VQA) | Code Available | 1 |
| Image Retrieval on Real-life Images with Pre-trained Vision-and-Language Models | Aug 9, 2021 | Composed Image Retrieval (CoIR), Image Retrieval | Code Available | 1 |
| How Do Multimodal Large Language Models Handle Complex Multimodal Reasoning? Placing Them in An Extensible Escape Game | Mar 13, 2025 | Multimodal Reasoning, Question Answering | Code Available | 1 |
| Benchmarking Robustness of Multimodal Image-Text Models under Distribution Shift | Dec 15, 2022 | Benchmarking, Image Captioning | Code Available | 1 |
| How Far Are We from Intelligent Visual Deductive Reasoning? | Mar 7, 2024 | In-Context Learning, Visual Reasoning | Code Available | 1 |
| Grounded Situation Recognition with Transformers | Nov 19, 2021 | Decoder, Grounded Situation Recognition | Code Available | 1 |
| Collaborative Transformers for Grounded Situation Recognition | Mar 30, 2022 | Grounded Situation Recognition, Image Classification | Code Available | 1 |
| Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone | Jun 15, 2022 | Described Object Detection, Image Captioning | Code Available | 1 |
| Going Beyond Nouns With Vision & Language Models Using Synthetic Data | Mar 30, 2023 | Sentence, Visual Reasoning | Code Available | 1 |
| Immune: Improving Safety Against Jailbreaks in Multi-modal LLMs via Inference-Time Alignment | Nov 27, 2024 | Safety Alignment, Visual Reasoning | Code Available | 1 |
| Forward Prediction for Physical Reasoning | Jun 18, 2020 | Prediction, Visual Reasoning | Code Available | 1 |
| ARB: A Comprehensive Arabic Multimodal Reasoning Benchmark | May 22, 2025 | document understanding, Multimodal Reasoning | Code Available | 1 |
| GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering | Feb 25, 2019 | Question Answering, Visual Question Answering (VQA) | Code Available | 1 |
| GRE Suite: Geo-localization Inference via Fine-Tuned Vision-Language Models and Enhanced Reasoning Chains | May 24, 2025 | geo-localization, Visual Reasoning | Code Available | 1 |
| From EduVisBench to EduVisAgent: A Benchmark and Multi-Agent Framework for Pedagogical Visualization | May 22, 2025 | Visual Reasoning | Code Available | 1 |
| FLAVA: A Foundational Language And Vision Alignment Model | Dec 8, 2021 | Image Retrieval, Image-to-Text Retrieval | Code Available | 1 |