| Language bias in Visual Question Answering: A Survey and Taxonomy | Nov 16, 2021 | Question Answering, Visual Question Answering | —Unverified | 0 | 0 |
| Language Features Matter: Effective Language Representations for Vision-Language Tasks | Aug 17, 2019 | Image Captioning, Language Modelling | —Unverified | 0 | 0 |
| From Pixels to Prose: Advancing Multi-Modal Language Models for Remote Sensing | Nov 5, 2024 | Change Detection, Contrastive Learning | —Unverified | 0 | 0 |
| Language-Image Models with 3D Understanding | May 6, 2024 | Question Answering, Visual Question Answering | —Unverified | 0 | 0 |
| From Pixels to Objects: Cubic Visual Attention for Visual Question Answering | Jun 4, 2022 | Object, Question Answering | —Unverified | 0 | 0 |
| Language Is Not All You Need: Aligning Perception with Language Models | Feb 27, 2023 | All, Image Captioning | —Unverified | 0 | 0 |
| From Known to the Unknown: Transferring Knowledge to Answer Questions about Novel Visual and Semantic Concepts | Nov 30, 2018 | Novel Concepts, Question Answering | —Unverified | 0 | 0 |
| From Image to Language: A Critical Analysis of Visual Question Answering (VQA) Approaches, Challenges, and Opportunities | Nov 1, 2023 | Navigate, Question Answering | —Unverified | 0 | 0 |
| From Images to Textual Prompts: Zero-Shot Visual Question Answering With Frozen Large Language Models | Jan 1, 2023 | Question Answering, Visual Question Answering | —Unverified | 0 | 0 |
| From Head to Tail: Towards Balanced Representation in Large Vision-Language Models through Adaptive Data Calibration | Mar 17, 2025 | Denoising, Question Answering | —Unverified | 0 | 0 |
| Zero-Shot Visual Reasoning by Vision-Language Models: Benchmarking and Analysis | Aug 27, 2024 | Benchmarking, Large Language Model | —Unverified | 0 | 0 |
| UPME: An Unsupervised Peer Review Framework for Multimodal Large Language Model Evaluation | Mar 19, 2025 | Language Model Evaluation, Language Modeling | —Unverified | 0 | 0 |
| From Easy to Hard: Learning Language-guided Curriculum for Visual Question Answering on Remote Sensing Data | May 6, 2022 | Question Answering, Visual Question Answering | —Unverified | 0 | 0 |
| freePruner: A Training-free Approach for Large Multimodal Model Acceleration | Nov 23, 2024 | Quantization, Question Answering | —Unverified | 0 | 0 |
| Free Form Medical Visual Question Answering in Radiology | Jan 23, 2024 | Diagnostic, Form | —Unverified | 0 | 0 |
| Foundational Model for Electron Micrograph Analysis: Instruction-Tuning Small-Scale Language-and-Vision Assistant for Enterprise Adoption | Aug 23, 2024 | Instruction Following, Knowledge Distillation | —Unverified | 0 | 0 |
| Large Scale Scene Text Verification with Guided Attention | Apr 23, 2018 | Question Answering, Scene Text Detection | —Unverified | 0 | 0 |
| Large Vision-Language Models for Remote Sensing Visual Question Answering | Nov 16, 2024 | Language Modeling, Language Modelling | —Unverified | 0 | 0 |
| Using Visual Cropping to Enhance Fine-Detail Question Answering of BLIP-Family Models | May 31, 2023 | Question Answering, Visual Question Answering | —Unverified | 0 | 0 |
| Latent Variable Models for Visual Question Answering | Jan 16, 2021 | Benchmarking, Question Answering | —Unverified | 0 | 0 |
| Fooling Vision and Language Models Despite Localization and Attention Mechanism | Sep 25, 2017 | Dense Captioning, Natural Language Understanding | —Unverified | 0 | 0 |
| LaVida Drive: Vision-Text Interaction VLM for Autonomous Driving with Token Selection, Recovery and Enhancement | Nov 20, 2024 | Autonomous Driving, Computational Efficiency | —Unverified | 0 | 0 |
| LAVIS: A Library for Language-Vision Intelligence | Sep 15, 2022 | Benchmarking, Image Captioning | —Unverified | 0 | 0 |
| VALSE: A Task-Independent Benchmark for Vision and Language Models centered on Linguistic Phenomena | Aug 17, 2021 | Question Answering, Visual Question Answering | —Unverified | 0 | 0 |
| Answer-Me: Multi-Task Open-Vocabulary Visual Question Answering | May 2, 2022 | Decoder, Image Captioning | —Unverified | 0 | 0 |
| LCV2: An Efficient Pretraining-Free Framework for Grounded Visual Question Answering | Jan 29, 2024 | Language Modeling, Language Modelling | —Unverified | 0 | 0 |
| FocusLLaVA: A Coarse-to-Fine Approach for Efficient and Effective Visual Token Compression | Nov 21, 2024 | Visual Question Answering | —Unverified | 0 | 0 |
| Answer-checking in Context: A Multi-modal FullyAttention Network for Visual Question Answering | Oct 17, 2020 | Question Answering, Visual Question Answering | —Unverified | 0 | 0 |
| Learning Answer Embeddings for Visual Question Answering | Jun 10, 2018 | Question Answering, Transfer Learning | —Unverified | 0 | 0 |
| Learning by Asking Questions | Dec 4, 2017 | Question Answering, Visual Question Answering | —Unverified | 0 | 0 |
| A Novel Framework for Robustness Analysis of Visual QA Models | Nov 16, 2017 | Question Answering, Visual Question Answering | —Unverified | 0 | 0 |
| Learning by Hallucinating: Vision-Language Pre-training with Weak Supervision | Oct 24, 2022 | cross-modal alignment, Cross-Modal Retrieval | —Unverified | 0 | 0 |
| Learning Compositional Representation for Few-shot Visual Question Answering | Feb 21, 2021 | Attribute, Question Answering | —Unverified | 0 | 0 |
| Variational Disentangled Attention for Regularized Visual Dialog | Sep 29, 2021 | Question Answering, Visual Dialog | —Unverified | 0 | 0 |
| Variational Visual Question Answering | May 14, 2025 | Question Answering, Visual Question Answering | —Unverified | 0 | 0 |
| A Novel Attention-based Aggregation Function to Combine Vision and Language | Apr 27, 2020 | General Classification, Image Captioning | —Unverified | 0 | 0 |
| FOCUS: Internal MLLM Representations for Efficient Fine-Grained Visual Question Answering | Jun 25, 2025 | Question Answering, Visual Question Answering | —Unverified | 0 | 0 |
| VCD: Knowledge Base Guided Visual Commonsense Discovery in Images | Feb 27, 2024 | Decision Making, Language Modelling | —Unverified | 0 | 0 |
| Learning How To Ask: Cycle-Consistency Refines Prompts in Multimodal Foundation Models | Feb 13, 2024 | Code Generation, HumanEval | —Unverified | 0 | 0 |
| Learning Models for Actions and Person-Object Interactions with Transfer to Question Answering | Apr 16, 2016 | General Classification, Human-Object Interaction Detection | —Unverified | 0 | 0 |
| Learning Reasoning Paths over Semantic Graphs for Video-grounded Dialogues | Mar 1, 2021 | Question Answering, Visual Question Answering | —Unverified | 0 | 0 |
| An Open-Source Software Toolkit & Benchmark Suite for the Evaluation and Adaptation of Multimodal Action Models | Jun 10, 2025 | Action Generation, Image Captioning | —Unverified | 0 | 0 |
| Learning Rich Image Region Representation for Visual Question Answering | Oct 29, 2019 | Language Modeling, Language Modelling | —Unverified | 0 | 0 |
| FMBench: Benchmarking Fairness in Multimodal Large Language Models on Medical Tasks | Oct 1, 2024 | Benchmarking, Fairness | —Unverified | 0 | 0 |
| Learning Sparse Mixture of Experts for Visual Question Answering | Sep 19, 2019 | Mixture-of-Experts, Question Answering | —Unverified | 0 | 0 |
| Learning Sparsity for Effective and Efficient Music Performance Question Answering | Jun 2, 2025 | Audio-visual Question Answering, Question Answering | —Unverified | 0 | 0 |
| Annotation Methodologies for Vision and Language Dataset Creation | Jul 10, 2016 | Action Recognition, Image Description | —Unverified | 0 | 0 |
| FlowVQA: Mapping Multimodal Logic in Visual Question Answering with Flowcharts | Jun 27, 2024 | Decision Making, Logical Reasoning | —Unverified | 0 | 0 |
| FlexCap: Describe Anything in Images in Controllable Detail | Mar 18, 2024 | Attribute, Dense Captioning | —Unverified | 0 | 0 |
| Learning to Compose Diversified Prompts for Image Emotion Classification | Jan 26, 2022 | Classification, Emotion Classification | —Unverified | 0 | 0 |