| An Open-Source Software Toolkit & Benchmark Suite for the Evaluation and Adaptation of Multimodal Action Models | Jun 10, 2025 | Action Generation, Image Captioning | —Unverified | 0 |
| Integrating Object Detection Modality into Visual Language Model for Enhanced Autonomous Driving Agent | Nov 8, 2024 | Autonomous Driving, Language Modeling | —Unverified | 0 |
| InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model | Jan 29, 2024 | Form, Language Modeling | —Unverified | 0 |
| Detecting and Evaluating Medical Hallucinations in Large Vision Language Models | Jun 14, 2024 | Hallucination, Medical Visual Question Answering | —Unverified | 0 |
| Detect, Describe, Discriminate: Moving Beyond VQA for MLLM Evaluation | Sep 23, 2024 | Multiple-choice, Question Answering | —Unverified | 0 |
| Annotation Methodologies for Vision and Language Dataset Creation | Jul 10, 2016 | Action Recognition, Image Description | —Unverified | 0 |
| Advancing Multimodal Medical Capabilities of Gemini | May 6, 2024 | Computed Tomography (CT), image-classification | —Unverified | 0 |
| Detect2Interact: Localizing Object Key Field in Visual Question Answering (VQA) with LLMs | Apr 1, 2024 | Common Sense Reasoning, Object | —Unverified | 0 |
| Designing a Robust Radiology Report Generation System | Nov 2, 2024 | Decision Making, Diagnostic | —Unverified | 0 |
| Beyond Logit Lens: Contextual Embeddings for Robust Hallucination Detection & Grounding in VLMs | Nov 28, 2024 | Attribute, Hallucination | —Unverified | 0 |
| Achieving Human Parity on Visual Question Answering | Nov 17, 2021 | Question Answering, Visual Question Answering | —Unverified | 0 |
| Integrating Knowledge and Reasoning in Image Understanding | Jun 24, 2019 | Object Recognition, Question Answering | —Unverified | 0 |
| Interpretable Visual Question Answering via Reasoning Supervision | Sep 7, 2023 | Common Sense Reasoning, Question Answering | —Unverified | 0 |
| Beyond Human Vision: The Role of Large Vision Language Models in Microscope Image Analysis | May 1, 2024 | Image Captioning, Question Answering | —Unverified | 0 |
| Advancing Medical Imaging with Language Models: A Journey from N-grams to ChatGPT | Apr 11, 2023 | Diagnostic, Image Captioning | —Unverified | 0 |
| Instruction-augmented Multimodal Alignment for Image-Text and Element Matching | Apr 16, 2025 | Image Augmentation, Image Generation | —Unverified | 0 |
| DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effective for LMMs | Jun 6, 2024 | Language Modelling, Large Language Model | —Unverified | 0 |
| Beyond Captioning: Task-Specific Prompting for Improved VLM Performance in Mathematical Reasoning | Oct 8, 2024 | Image Retrieval, Math | —Unverified | 0 |
| An experimental study of the vision-bottleneck in VQA | Feb 14, 2022 | Object, Question Answering | —Unverified | 0 |
| An Evaluation of GPT-4V and Gemini in Online VQA | Dec 17, 2023 | Question Answering, Visual Question Answering | —Unverified | 0 |
| Deep learning evaluation using deep linguistic processing | Jun 5, 2017 | Deep Learning, Multimodal Deep Learning | —Unverified | 0 |
| Inquire, Interact, and Integrate: A Proactive Agent Collaborative Framework for Zero-Shot Multimodal Medical Reasoning | May 19, 2024 | Multimodal Reasoning, Question Answering | —Unverified | 0 |
| Deep Exemplar Networks for VQA and VQG | Dec 19, 2019 | Decoder, Question Answering | —Unverified | 0 |
| Deep Bayesian Active Learning for Multiple Correct Outputs | Dec 2, 2019 | Active Learning, Answer Generation | —Unverified | 0 |
| BESTMVQA: A Benchmark Evaluation System for Medical Visual Question Answering | Dec 13, 2023 | Medical Visual Question Answering, Question Answering | —Unverified | 0 |
| Instance-Level Trojan Attacks on Visual Question Answering via Adversarial Learning in Neuron Activation Space | Apr 2, 2023 | Question Answering, Visual Question Answering | —Unverified | 0 |
| Deep Attention Neural Tensor Network for Visual Question Answering | Sep 1, 2018 | Deep Attention, Question Answering | —Unverified | 0 |
| Decoupled Box Proposal and Featurization with Ultrafine-Grained Semantic Labels Improve Image Captioning and Visual Question Answering | Sep 4, 2019 | Image Captioning, Object | —Unverified | 0 |
| Benchmarking Vision Language Models for Cultural Understanding | Jul 15, 2024 | Benchmarking, Question Answering | —Unverified | 0 |
| Decouple Before Interact: Multi-Modal Prompt Learning for Continual Visual Question Answering | Jan 1, 2023 | Continual Learning, Language Modelling | —Unverified | 0 |
| Accuracy vs. Complexity: A Trade-off in Visual Question Answering Models | Jan 20, 2020 | Question Answering, Visual Question Answering | —Unverified | 0 |
| InfographicVQA | Apr 26, 2021 | Question Answering, Visual Question Answering | —Unverified | 0 |
| An Empirical Study on the Language Modal in Visual Question Answering | May 17, 2023 | Question Answering, Visual Question Answering | —Unverified | 0 |
| Debating for Better Reasoning: An Unsupervised Multimodal Approach | May 20, 2025 | Question Answering, Visual Question Answering | —Unverified | 0 |
| An Empirical Study on the Generalization Power of Neural Representations Learned via Visual Guessing Games | Jan 31, 2021 | Question Answering, Visual Question Answering | —Unverified | 0 |
| DDRprog: A CLEVR Differentiable Dynamic Reasoning Programmer | Mar 30, 2018 | Question Answering, Visual Question Answering | —Unverified | 0 |
| Davidsonian Scene Graph: Improving Reliability in Fine-grained Evaluation for Text-to-Image Generation | Oct 27, 2023 | Image Generation, Question Answering | —Unverified | 0 |
| Dataset Bias Mitigation in Multiple-Choice Visual Question Answering and Beyond | Oct 23, 2023 | counterfactual, Multiple-choice | —Unverified | 0 |
| Benchmarking Large Multimodal Models for Ophthalmic Visual Question Answering with OphthalWeChat | May 26, 2025 | Benchmarking, Question Answering | —Unverified | 0 |
| Accounting for Focus Ambiguity in Visual Questions | Jan 4, 2025 | Question Answering, Visual Question Answering | —Unverified | 0 |
| Data Metabolism: An Efficient Data Design Schema For Vision Language Model | Apr 10, 2025 | Language Modeling, Language Modelling | —Unverified | 0 |
| Data-Driven Calibration of Prediction Sets in Large Vision-Language Models Based on Inductive Conformal Prediction | Apr 24, 2025 | Conformal Prediction, Hallucination | —Unverified | 0 |
| Data Augmentation for Visual Question Answering | Sep 1, 2017 | Data Augmentation, General Classification | —Unverified | 0 |
| DARE: Diverse Visual Question Answering with Robustness Evaluation | Sep 26, 2024 | image-classification, Image Classification | —Unverified | 0 |
| @Bench: Benchmarking Vision-Language Models for Human-centered Assistive Technology | Sep 21, 2024 | Benchmarking, Depth Estimation | —Unverified | 0 |
| Damage Assessment after Natural Disasters with UAVs: Semantic Feature Extraction using Deep Learning | Dec 14, 2024 | Decision Making, Question Answering | —Unverified | 0 |
| An Empirical Study on Leveraging Scene Graphs for Visual Question Answering | Jul 28, 2019 | Knowledge Graphs, Question Answering | —Unverified | 0 |
| Cycle-Consistency for Robust Visual Question Answering | Feb 15, 2019 | Question Answering, Question Generation | —Unverified | 0 |
| Being Negative but Constructively: Lessons Learnt from Creating Better Visual Question Answering Datasets | Apr 24, 2017 | Multiple-choice, Question Answering | —Unverified | 0 |
| InfiMM-HD: A Leap Forward in High-Resolution Multimodal Understanding | Mar 3, 2024 | Visual Question Answering | —Unverified | 0 |