| Normalized and Geometry-Aware Self-Attention Network for Image Captioning | Mar 19, 2020 | Image Captioning, Machine Translation | —Unverified | 0 | 0 |
| NoTeS-Bank: Benchmarking Neural Transcription and Search for Scientific Notes Understanding | Apr 12, 2025 | Benchmarking, Document AI | —Unverified | 0 | 0 |
| Detect, Describe, Discriminate: Moving Beyond VQA for MLLM Evaluation | Sep 23, 2024 | Multiple-choice, Question Answering | —Unverified | 0 | 0 |
| Not-So-CLEVR: Visual Relations Strain Feedforward Neural Networks | Jan 1, 2018 | Memorization, Question Answering | —Unverified | 0 | 0 |
| Detect2Interact: Localizing Object Key Field in Visual Question Answering (VQA) with LLMs | Apr 1, 2024 | Common Sense Reasoning, Object | —Unverified | 0 | 0 |
| Designing a Robust Radiology Report Generation System | Nov 2, 2024 | Decision Making, Diagnostic | —Unverified | 0 | 0 |
| DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effective for LMMs | Jun 6, 2024 | Language Modelling, Large Language Model | —Unverified | 0 | 0 |
| Aligned Vector Quantization for Edge-Cloud Collabrative Vision-Language Models | Nov 8, 2024 | Quantization, Question Answering | —Unverified | 0 | 0 |
| Visual Commonsense based Heterogeneous Graph Contrastive Learning | Nov 11, 2023 | Contrastive Learning, Question Answering | —Unverified | 0 | 0 |
| Object-based reasoning in VQA | Jan 29, 2018 | Object, object-detection | —Unverified | 0 | 0 |
| Object-Centric Diagnosis of Visual Reasoning | Dec 21, 2020 | Diagnostic, Object | —Unverified | 0 | 0 |
| Object-Centric Temporal Consistency via Conditional Autoregressive Inductive Biases | Oct 21, 2024 | Object, Question Answering | —Unverified | 0 | 0 |
| OccLLaMA: An Occupancy-Language-Action Generative World Model for Autonomous Driving | Sep 5, 2024 | Autonomous Driving, Motion Planning | —Unverified | 0 | 0 |
| Ziya-Visual: Bilingual Large Vision-Language Model via Multi-Task Instruction Tuning | Oct 12, 2023 | Image Captioning, Image-text Retrieval | —Unverified | 0 | 0 |
| OMCAT: Omni Context Aware Transformer | Oct 15, 2024 | Audio-visual Question Answering, Audio-Visual Question Answering (AVQA) | —Unverified | 0 | 0 |
| OMGM: Orchestrate Multiple Granularities and Modalities for Efficient Multimodal Retrieval | May 10, 2025 | Cross-Modal Retrieval, Question Answering | —Unverified | 0 | 0 |
| Deep learning evaluation using deep linguistic processing | Jun 5, 2017 | Deep Learning, Multimodal Deep Learning | —Unverified | 0 | 0 |
| Why context matters in VQA and Reasoning: Semantic interventions for VLM input modalities | Oct 2, 2024 | Question Answering, Visual Question Answering | —Unverified | 0 | 0 |
| Deep Exemplar Networks for VQA and VQG | Dec 19, 2019 | Decoder, Question Answering | —Unverified | 0 | 0 |
| Aligned Image-Word Representations Improve Inductive Transfer Across Vision-Language Tasks | Apr 2, 2017 | Multi-Task Learning, Question Answering | —Unverified | 0 | 0 |
| On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization | May 24, 2022 | Descriptive, Image Captioning | —Unverified | 0 | 0 |
| OneEncoder: A Lightweight Framework for Progressive Alignment of Modalities | Sep 17, 2024 | cross-modal alignment, Question Answering | —Unverified | 0 | 0 |
| Deep Bayesian Active Learning for Multiple Correct Outputs | Dec 2, 2019 | Active Learning, Answer Generation | —Unverified | 0 | 0 |
| Deep Attention Neural Tensor Network for Visual Question Answering | Sep 1, 2018 | Deep Attention, Question Answering | —Unverified | 0 | 0 |
| One VLM to Keep it Learning: Generation and Balancing for Data-free Continual Visual Question Answering | Nov 4, 2024 | Continual Learning, Question Answering | —Unverified | 0 | 0 |