| Why Does the VQA Model Answer No?: Improving Reasoning through Visual and Linguistic Inference | Sep 25, 2019 | Common Sense Reasoning, Question Answering | —Unverified | 0 |
| Wiki-LLaVA: Hierarchical Retrieval-Augmented Generation for Multimodal LLMs | Apr 23, 2024 | Question Answering, Retrieval | —Unverified | 0 |
| WoLF: Wide-scope Large Language Model Framework for CXR Understanding | Mar 19, 2024 | Anatomy, Instruction Following | —Unverified | 0 |
| xGQA: Cross-Lingual Visual Question Answering | Oct 16, 2021 | Cross-Lingual Transfer, Language Modeling | —Unverified | 0 |
| Yin and Yang: Balancing and Answering Binary Visual Questions | Nov 16, 2015 | Question Answering, Visual Question Answering | —Unverified | 0 |
| YouMakeup: A Large-Scale Domain-Specific Multimodal Dataset for Fine-Grained Semantic Comprehension | Nov 1, 2019 | Caption Generation, Question Answering | —Unverified | 0 |
| ZALM3: Zero-Shot Enhancement of Vision-Language Alignment via In-Context Information in Multi-Turn Multimodal Medical Dialogue | Sep 26, 2024 | Medical Visual Question Answering, Question Answering | —Unverified | 0 |
| Zero-shot Action Localization via the Confidence of Large Vision-Language Models | Oct 18, 2024 | Action Localization, Language Modelling | —Unverified | 0 |
| Zero-Shot Anomaly Detection in Battery Thermal Images Using Visual Question Answering with Prior Knowledge | May 22, 2025 | Anomaly Detection, Question Answering | —Unverified | 0 |
| Zero-Shot Transfer VQA Dataset | Nov 2, 2018 | Question Answering, Transfer Learning | —Unverified | 0 |
| Zero-Shot Visual Question Answering | Nov 17, 2016 | Question Answering, Retrieval | —Unverified | 0 |
| Zero-Shot Visual Reasoning by Vision-Language Models: Benchmarking and Analysis | Aug 27, 2024 | Benchmarking, Large Language Model | —Unverified | 0 |
| Ziya-Visual: Bilingual Large Vision-Language Model via Multi-Task Instruction Tuning | Oct 12, 2023 | Image Captioning, Image-text Retrieval | —Unverified | 0 |
| Natural Language Understanding and Inference with MLLM in Visual Question Answering: A Survey | Nov 26, 2024 | Natural Language Understanding, Question Answering | —Unverified | 0 |
| Natural Reflection Backdoor Attack on Vision Language Model for Autonomous Driving | May 9, 2025 | Autonomous Driving, Backdoor Attack | —Unverified | 0 |
| Negative Object Presence Evaluation (NOPE) to Measure Object Hallucination in Vision-Language Models | Oct 9, 2023 | Hallucination, Object | —Unverified | 0 |
| Neglected Risks: The Disturbing Reality of Children's Images in Datasets and the Urgent Call for Accountability | Apr 20, 2025 | Question Answering, Visual Question Answering | —Unverified | 0 |
| NegVQA: Can Vision Language Models Understand Negation? | May 28, 2025 | Negation, Question Answering | —Unverified | 0 |
| Neural Attention Models for Sequence Classification: Analysis and Application to Key Term Extraction and Dialogue Act Detection | Mar 31, 2016 | Caption Generation, Classification | —Unverified | 0 |
| Neural Memory Plasticity for Anomaly Detection | Oct 12, 2019 | Anomaly Detection, EEG | —Unverified | 0 |
| Neural Self Talk: Image Understanding via Continuous Questioning and Answering | Dec 10, 2015 | Question Answering, Question Generation | —Unverified | 0 |
| NeurIPS 2023 Competition: Privacy Preserving Federated Learning Document VQA | Nov 6, 2024 | Federated Learning, Language Modelling | —Unverified | 0 |
| Neuro-Symbolic Spatio-Temporal Reasoning | Nov 28, 2022 | AI Agent, Image Segmentation | —Unverified | 0 |
| Neuro-Symbolic Visual Reasoning: Disentangling "Visual" from "Reasoning" | Jun 20, 2020 | Graph Generation, Question Answering | —Unverified | 0 |
| Neuro-Symbolic VQA: A review from the perspective of AGI desiderata | Apr 13, 2021 | Question Answering, Visual Question Answering | —Unverified | 0 |
| NEVLP: Noise-Robust Framework for Efficient Vision-Language Pre-training | Sep 15, 2024 | Contrastive Learning, cross-modal alignment | —Unverified | 0 |
| New Ideas and Trends in Deep Multimodal Content Understanding: A Review | Oct 16, 2020 | Cross-Modal Retrieval, Deep Learning | —Unverified | 0 |
| NEWSKVQA: Knowledge-Aware News Video Question Answering | Feb 8, 2022 | Common Sense Reasoning, Management | —Unverified | 0 |
| NMT-Keras: a Very Flexible Toolkit with a Focus on Interactive NMT and Online Learning | Jul 9, 2018 | General Classification, Machine Translation | —Unverified | 0 |
| Non-monotonic Logical Reasoning Guiding Deep Learning for Explainable Visual Question Answering | Sep 23, 2019 | Inductive Learning, Logical Reasoning | —Unverified | 0 |
| Normalized and Geometry-Aware Self-Attention Network for Image Captioning | Mar 19, 2020 | Image Captioning, Machine Translation | —Unverified | 0 |
| NoTeS-Bank: Benchmarking Neural Transcription and Search for Scientific Notes Understanding | Apr 12, 2025 | Benchmarking, Document AI | —Unverified | 0 |
| Not-So-CLEVR: Visual Relations Strain Feedforward Neural Networks | Jan 1, 2018 | Memorization, Question Answering | —Unverified | 0 |
| Object-based reasoning in VQA | Jan 29, 2018 | Object, object-detection | —Unverified | 0 |
| Object-Centric Diagnosis of Visual Reasoning | Dec 21, 2020 | Diagnostic, Object | —Unverified | 0 |
| Object-Centric Temporal Consistency via Conditional Autoregressive Inductive Biases | Oct 21, 2024 | Object, Question Answering | —Unverified | 0 |
| OccLLaMA: An Occupancy-Language-Action Generative World Model for Autonomous Driving | Sep 5, 2024 | Autonomous Driving, Motion Planning | —Unverified | 0 |
| OMCAT: Omni Context Aware Transformer | Oct 15, 2024 | Audio-visual Question Answering, Audio-Visual Question Answering (AVQA) | —Unverified | 0 |
| OMGM: Orchestrate Multiple Granularities and Modalities for Efficient Multimodal Retrieval | May 10, 2025 | Cross-Modal Retrieval, Question Answering | —Unverified | 0 |
| On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization | May 24, 2022 | Descriptive, Image Captioning | —Unverified | 0 |
| OneEncoder: A Lightweight Framework for Progressive Alignment of Modalities | Sep 17, 2024 | cross-modal alignment, Question Answering | —Unverified | 0 |
| One VLM to Keep it Learning: Generation and Balancing for Data-free Continual Visual Question Answering | Nov 4, 2024 | Continual Learning, Question Answering | —Unverified | 0 |
| On Incorporating Semantic Prior Knowlegde in Deep Learning Through Embedding-Space Constraints | Sep 25, 2019 | Data Augmentation, Question Answering | —Unverified | 0 |
| On Incorporating Semantic Prior Knowledge in Deep Learning Through Embedding-Space Constraints | Sep 30, 2019 | Data Augmentation, Question Answering | —Unverified | 0 |
| On the Cognition of Visual Question Answering Models and Human Intelligence: A Comparative Study | Oct 4, 2023 | Question Answering, Visual Question Answering | —Unverified | 0 |
| On the Effects of Video Grounding on Language Models | Oct 1, 2022 | Image Captioning, Question Answering | —Unverified | 0 |
| On the Efficacy of Co-Attention Transformer Layers in Visual Question Answering | Jan 11, 2022 | POS, Question Answering | —Unverified | 0 |
| On the Flip Side: Identifying Counterexamples in Visual Question Answering | Jun 3, 2018 | Question Answering, Visual Question Answering | —Unverified | 0 |
| On the General Value of Evidence, and Bilingual Scene-Text Visual Question Answering | Feb 24, 2020 | Question Answering, Referring Expression | —Unverified | 0 |
| On the Limitations of Vision-Language Models in Understanding Image Transforms | Mar 12, 2025 | Question Answering, Video Generation | —Unverified | 0 |