| LLARVA: Vision-Action Instruction Tuning Enhances Robot Learning | Jun 17, 2024 | Image Captioning, Question Answering | —Unverified | 0 | 0 |
| Few-Shot Image Classification and Segmentation as Visual Question Answering Using Vision-Language Models | Mar 15, 2024 | Few-Shot Image Classification, image-classification | —Unverified | 0 | 0 |
| FedPIA -- Permuting and Integrating Adapters leveraging Wasserstein Barycenters for Finetuning Foundation Models in Multi-Modal Federated Learning | Dec 19, 2024 | Federated Learning, parameter-efficient fine-tuning | —Unverified | 0 | 0 |
| LLaVA-Octopus: Unlocking Instruction-Driven Adaptive Projector Fusion for Video Understanding | Jan 9, 2025 | Language Modeling, Language Modelling | —Unverified | 0 | 0 |
| Zero-Shot Transfer VQA Dataset | Nov 2, 2018 | Question Answering, Transfer Learning | —Unverified | 0 | 0 |
| Feature4X: Bridging Any Monocular Video to 4D Agentic AI with Versatile Gaussian Feature Fields | Mar 26, 2025 | Question Answering, Visual Question Answering | —Unverified | 0 | 0 |
| Fast or Slow? Integrating Fast Intuition and Deliberate Thinking for Enhancing Visual Question Answering | Jun 1, 2025 | All, MME | —Unverified | 0 | 0 |
| FashionVQA: A Domain-Specific Visual Question Answering System | Aug 24, 2022 | Question Answering, Visual Question Answering | —Unverified | 0 | 0 |
| LLaVA-Ultra: Large Chinese Language and Vision Assistant for Ultrasound | Oct 19, 2024 | Instruction Following, Knowledge Distillation | —Unverified | 0 | 0 |
| Face-MLLM: A Large Face Perception Model | Oct 28, 2024 | Attribute, model | —Unverified | 0 | 0 |
| VGNMN: Video-grounded Neural Module Networks for Video-Grounded Dialogue Systems | Jul 1, 2022 | Information Retrieval, Question Answering | —Unverified | 0 | 0 |
| Look Before You Decide: Prompting Active Deduction of MLLMs for Assumptive Reasoning | Apr 19, 2024 | Benchmarking, counterfactual | —Unverified | 0 | 0 |
| LMME3DHF: Benchmarking and Evaluating Multimodal 3D Human Face Generation with LMMs | Apr 29, 2025 | Benchmarking, Face Generation | —Unverified | 0 | 0 |
| EyeFound: A Multimodal Generalist Foundation Model for Ophthalmic Imaging | May 18, 2024 | Question Answering, Visual Question Answering | —Unverified | 0 | 0 |
| Localize, Group, and Select: Boosting Text-VQA by Scene Text Modeling | Aug 20, 2021 | Data Ablation, Optical Character Recognition | —Unverified | 0 | 0 |
| VGNMN: Video-grounded Neural Module Network to Video-Grounded Language Tasks | Apr 16, 2021 | Information Retrieval, Question Answering | —Unverified | 0 | 0 |
| Locate Then Generate: Bridging Vision and Language with Bounding Box for Scene-Text VQA | Apr 4, 2023 | Answer Generation, Language Modelling | —Unverified | 0 | 0 |
| Extracting Training Data from Document-Based VQA Models | Jul 11, 2024 | Memorization, Question Answering | —Unverified | 0 | 0 |
| Achieving Human Parity on Visual Question Answering | Nov 17, 2021 | Question Answering, Visual Question Answering | —Unverified | 0 | 0 |
| Logically Consistent Loss for Visual Question Answering | Nov 19, 2020 | Multi-Task Learning, Question Answering | —Unverified | 0 | 0 |
| LOIS: Looking Out of Instance Semantics for Visual Question Answering | Jul 26, 2023 | Question Answering, Visual Question Answering | —Unverified | 0 | 0 |
| Looking Beyond Text: Reducing Language bias in Large Vision-Language Models via Multimodal Dual-Attention and Soft-Image Guidance | Nov 21, 2024 | Visual Question Answering | —Unverified | 0 | 0 |
| Look, Learn and Leverage (L^3): Mitigating Visual-Domain Shift and Discovering Intrinsic Relations via Symbolic Alignment | Aug 30, 2024 | Question Answering, Representation Learning | —Unverified | 0 | 0 |
| Exploring Weaknesses of VQA Models through Attribution Driven Insights | Jun 11, 2020 | Question Answering, Visual Question Answering | —Unverified | 0 | 0 |
| Look, Read and Ask: Learning to Ask Questions by Reading Text in Images | Nov 23, 2022 | Optical Character Recognition (OCR), Question Answering | —Unverified | 0 | 0 |
| When are Lemons Purple? The Concept Association Bias of Vision-Language Models | Dec 22, 2022 | Attribute, image-classification | —Unverified | 0 | 0 |
| Accuracy vs. Complexity: A Trade-off in Visual Question Answering Models | Jan 20, 2020 | Question Answering, Visual Question Answering | —Unverified | 0 | 0 |
| Exploring the Frontier of Vision-Language Models: A Survey of Current Methodologies and Future Directions | Feb 20, 2024 | Image Captioning, Question Answering | —Unverified | 0 | 0 |
| An Empirical Study on Leveraging Scene Graphs for Visual Question Answering | Jul 28, 2019 | Knowledge Graphs, Question Answering | —Unverified | 0 | 0 |
| LRRA: A Transparent Neural-Symbolic Reasoning Framework for Real-World Visual Question Answering | Aug 1, 2021 | Question Answering, Visual Question Answering | —Unverified | 0 | 0 |
| Exploring the Effectiveness of Object-Centric Representations in Visual Question Answering: Comparative Insights with Foundation Models | Jul 22, 2024 | Question Answering, Representation Learning | —Unverified | 0 | 0 |
| Can You Explain That? Lucid Explanations Help Human-AI Collaborative Image Retrieval | Apr 5, 2019 | Image Retrieval, Question Answering | —Unverified | 0 | 0 |
| LVLM_CSP: Accelerating Large Vision Language Models via Clustering, Scattering, and Pruning for Reasoning Segmentation | Apr 15, 2025 | Image Captioning, Question Answering | —Unverified | 0 | 0 |
| Exploring Spatial Language Grounding Through Referring Expressions | Feb 4, 2025 | Image Captioning, Negation | —Unverified | 0 | 0 |
| Exploring Sparse Spatial Relation in Graph Inference for Text-Based VQA | Oct 13, 2023 | Graph Learning, Object | —Unverified | 0 | 0 |
| An Empirical Study of Batch Normalization and Group Normalization in Conditional Computation | Jul 31, 2019 | Conditional Image Generation, Few-Shot Learning | —Unverified | 0 | 0 |
| Exploring Question Decomposition for Zero-Shot VQA | Oct 25, 2023 | Question Answering, Visual Question Answering | —Unverified | 0 | 0 |
| Exploring Human-like Attention Supervision in Visual Question Answering | Sep 19, 2017 | Question Answering, Visual Question Answering | —Unverified | 0 | 0 |
| M3DocRAG: Multi-modal Retrieval is What You Need for Multi-page Multi-document Understanding | Nov 7, 2024 | document understanding, Optical Character Recognition | —Unverified | 0 | 0 |
| M4CXR: Exploring Multi-task Potentials of Multi-modal Large Language Models for Chest X-ray Interpretation | Aug 29, 2024 | Instruction Following, Medical Report Generation | —Unverified | 0 | 0 |
| MagiC: Evaluating Multimodal Cognition Toward Grounded Visual Reasoning | Jul 9, 2025 | Diagnostic, Multimodal Reasoning | —Unverified | 0 | 0 |
| MAGIC-VQA: Multimodal And Grounded Inference with Commonsense Knowledge for Visual Question Answering | Mar 24, 2025 | Graph Neural Network, Question Answering | —Unverified | 0 | 0 |
| Exploring Diverse Methods in Visual Question Answering | Apr 21, 2024 | Question Answering, Visual Question Answering | —Unverified | 0 | 0 |
| Exploring Advanced Techniques for Visual Question Answering: A Comprehensive Comparison | Feb 20, 2025 | Diversity, Language Modeling | —Unverified | 0 | 0 |
| Making the Most of What You Have: Adapting Pre-trained Visual Language Models in the Low-data Regime | May 3, 2023 | Image Captioning, Question Answering | —Unverified | 0 | 0 |
| An Empirical Evaluation of Visual Question Answering for Novel Objects | Apr 8, 2017 | Question Answering, Visual Question Answering | —Unverified | 0 | 0 |
| Explore the Hallucination on Low-level Perception for MLLMs | Sep 15, 2024 | Hallucination, Question Answering | —Unverified | 0 | 0 |
| Video Question Answering for People with Visual Impairments Using an Egocentric 360-Degree Camera | May 30, 2024 | Question Answering, Video Question Answering | —Unverified | 0 | 0 |
| MAMO: Masked Multimodal Modeling for Fine-Grained Vision-Language Representation Learning | Oct 9, 2022 | Image-text Retrieval, multimodal interaction | —Unverified | 0 | 0 |
| Explicit Reasoning over End-to-End Neural Architectures for Visual Question Answering | Mar 23, 2018 | Question Answering, Visual Question Answering | —Unverified | 0 | 0 |