| A Survey on Efficient Vision-Language Models | Apr 13, 2025 | Image Captioning, Question Answering | Code Available | 1 | 5 |
| FaceBench: A Multi-View Multi-Level Facial Attribute VQA Dataset for Benchmarking Face Perception MLLMs | Mar 27, 2025 | Attribute, Benchmarking | Code Available | 1 | 5 |
| Attention in Reasoning: Dataset, Analysis, and Modeling | Apr 20, 2022 | Question Answering, Visual Question Answering | Code Available | 1 | 5 |
| Florence: A New Foundation Model for Computer Vision | Nov 22, 2021 | Action Classification, Action Recognition | Code Available | 1 | 5 |
| MAP: Multimodal Uncertainty-Aware Vision-Language Pre-training Model | Oct 11, 2022 | Contrastive Learning, Image-text matching | Code Available | 1 | 5 |
| COBRA: Contrastive Bi-Modal Representation Algorithm | May 7, 2020 | Cross-Modal Retrieval, Image Captioning | Code Available | 1 | 5 |
| CAT-ViL: Co-Attention Gated Vision-Language Embedding for Visual Question Localized-Answering in Robotic Surgery | Jul 11, 2023 | Question Answering, Scene Understanding | Code Available | 1 | 5 |
| Explaining Autonomous Driving Actions with Visual Question Answering | Jul 19, 2023 | Autonomous Driving, Autonomous Vehicles | Code Available | 1 | 5 |
| A Survey of Medical Vision-and-Language Applications and Their Techniques | Nov 19, 2024 | Decision Making, Diagnostic | Code Available | 1 | 5 |
| Look, Listen, and Answer: Overcoming Biases for Audio-Visual Question Answering | Apr 18, 2024 | Audio-visual Question Answering, Audio-Visual Question Answering (AVQA) | Code Available | 1 | 5 |
| Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone | Jun 15, 2022 | Described Object Detection, Image Captioning | Code Available | 1 | 5 |
| Coarse-to-Fine Reasoning for Visual Question Answering | Oct 6, 2021 | Question Answering, Visual Question Answering | Code Available | 1 | 5 |
| GPT-4V-AD: Exploring Grounding Potential of VQA-oriented GPT-4V for Zero-shot Anomaly Detection | Nov 5, 2023 | Anomaly Detection, Question Answering | Code Available | 1 | 5 |
| Consistency-preserving Visual Question Answering in Medical Imaging | Jun 27, 2022 | Question Answering, Visual Question Answering | Code Available | 1 | 5 |
| ActiView: Evaluating Active Perception Ability for Multimodal Large Language Models | Oct 7, 2024 | Question Answering, Visual Question Answering | Code Available | 1 | 5 |
| ConTEXTual Net: A Multimodal Vision-Language Model for Segmentation of Pneumothorax | Mar 2, 2023 | Descriptive, Image Captioning | Code Available | 1 | 5 |
| Expert Knowledge-Aware Image Difference Graph Representation Learning for Difference-Aware Medical Visual Question Answering | Jul 22, 2023 | Graph Representation Learning, Language Modeling | Code Available | 1 | 5 |
| Location-Free Scene Graph Generation | Mar 20, 2023 | Graph Generation, Image Retrieval | Code Available | 1 | 5 |
| LRTA: A Transparent Neural-Symbolic Reasoning Framework with Modular Supervision for Visual Question Answering | Nov 21, 2020 | Answer Generation, Question Answering | Code Available | 1 | 5 |
| Closed Loop Neural-Symbolic Learning via Integrating Neural Perception, Grammar Parsing, and Symbolic Reasoning | Jun 11, 2020 | Question Answering, Reinforcement Learning (RL) | Code Available | 1 | 5 |
| Evaluating Image Hallucination in Text-to-Image Generation with Question-Answering | Sep 19, 2024 | Hallucination, Hallucination Evaluation | Code Available | 1 | 5 |
| A Stitch in Time Saves Nine: Small VLM is a Precise Guidance for Accelerating Large VLMs | Dec 4, 2024 | Visual Question Answering | Code Available | 1 | 5 |
| Enhancing Visual Question Answering through Question-Driven Image Captions as Prompts | Apr 12, 2024 | Image Captioning, Question Answering | Code Available | 1 | 5 |
| Counterfactual Samples Synthesizing and Training for Robust Visual Question Answering | Oct 3, 2021 | counterfactual, Diagnostic | Code Available | 1 | 5 |
| A Stitch in Time Saves Nine: A Train-Time Regularizing Loss for Improved Neural Network Calibration | Mar 25, 2022 | image-classification, Image Classification | Code Available | 1 | 5 |
| Evaluating Multimodal Representations on Visual Semantic Textual Similarity | Apr 4, 2020 | Benchmarking, Image Captioning | Code Available | 1 | 5 |
| Automated Generation of Challenging Multiple-Choice Questions for Vision Language Model Evaluation | Jan 6, 2025 | Language Model Evaluation, Language Modeling | Code Available | 1 | 5 |
| CLIP-Guided Vision-Language Pre-training for Question Answering in 3D Scenes | Apr 12, 2023 | Question Answering, Visual Question Answering | Code Available | 1 | 5 |
| ERNIE-Layout: Layout Knowledge Enhanced Pre-training for Visually-rich Document Understanding | Oct 12, 2022 | document-image-classification, Document Image Classification | Code Available | 1 | 5 |
| Fine-Grained Evaluation of Large Vision-Language Models in Autonomous Driving | Mar 27, 2025 | Attribute, Autonomous Driving | Code Available | 1 | 5 |
| Localized Questions in Medical Visual Question Answering | Jul 3, 2023 | Medical Visual Question Answering, Question Answering | Code Available | 1 | 5 |
| MapQA: A Dataset for Question Answering on Choropleth Maps | Nov 15, 2022 | Articles, Question Answering | Code Available | 1 | 5 |
| Mitigating Hallucinations in Vision-Language Models through Image-Guided Head Suppression | May 22, 2025 | Hallucination, Image Description | Code Available | 1 | 5 |
| CLEVR-X: A Visual Reasoning Dataset for Natural Language Explanations | Apr 5, 2022 | Explanation Generation, Question Answering | Code Available | 1 | 5 |
| CLEVR-Math: A Dataset for Compositional Language, Visual and Mathematical Reasoning | Aug 10, 2022 | Math, Mathematical Reasoning | Code Available | 1 | 5 |
| LIME: Less Is More for MLLM Evaluation | Sep 10, 2024 | Image Captioning, Question Answering | Code Available | 1 | 5 |
| Genixer: Empowering Multimodal Large Language Models as a Powerful Data Generator | Dec 11, 2023 | Image Captioning, Question Answering | Code Available | 1 | 5 |
| Less is More: A Simple yet Effective Token Reduction Method for Efficient Multi-modal LLMs | Sep 17, 2024 | Question Answering, Token Reduction | Code Available | 1 | 5 |
| Linearly Mapping from Image to Text Space | Sep 30, 2022 | Image Captioning, Image to text | Code Available | 1 | 5 |
| CrossGET: Cross-Guided Ensemble of Tokens for Accelerating Vision-Language Transformers | May 27, 2023 | Image Captioning, Image Retrieval | Code Available | 1 | 5 |
| CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning | Dec 20, 2016 | Diagnostic, Question Answering | Code Available | 1 | 5 |
| An Approach to Solving the Abstraction and Reasoning Corpus (ARC) Challenge | Jun 6, 2023 | ARC, Question Answering | Code Available | 1 | 5 |
| Comprehensive Visual Question Answering on Point Clouds through Compositional Scene Manipulation | Dec 22, 2021 | Common Sense Reasoning, Question Answering | Code Available | 1 | 5 |
| MixPHM: Redundancy-Aware Parameter-Efficient Tuning for Low-Resource Visual Question Answering | Mar 2, 2023 | Mixture-of-Experts, Question Answering | Code Available | 1 | 5 |
| EHRXQA: A Multi-Modal Question Answering Dataset for Electronic Health Records with Chest X-ray Images | Oct 28, 2023 | Decision Making, Medical Visual Question Answering | Code Available | 1 | 5 |
| Cross-modal Information Flow in Multimodal Large Language Models | Nov 27, 2024 | Question Answering, Visual Question Answering | Code Available | 1 | 5 |
| Learning to Contrast the Counterfactual Samples for Robust Visual Question Answering | Nov 1, 2020 | Contrastive Learning, counterfactual | Code Available | 1 | 5 |
| BackdoorMBTI: A Backdoor Learning Multimodal Benchmark Tool Kit for Backdoor Defense Evaluation | Nov 17, 2024 | Action Recognition, backdoor defense | Code Available | 1 | 5 |
| Learning to Answer Visual Questions from Web Videos | May 10, 2022 | Dataset Generation, Question Answering | Code Available | 1 | 5 |
| Learning to Discretely Compose Reasoning Module Networks for Video Captioning | Jul 17, 2020 | Decoder, Question Answering | Code Available | 1 | 5 |