| Breaking Neural Network Scaling Laws with Modularity | Sep 9, 2024 | Question Answering, Visual Question Answering | Unverified | 0 | 0 |
| Spatial Attention as an Interface for Image Captioning Models | Sep 29, 2020 | Image Captioning, Question Answering | Unverified | 0 | 0 |
| Spatial Knowledge Distillation to aid Visual Reasoning | Dec 10, 2018 | Diagnostic, Knowledge Distillation | Unverified | 0 | 0 |
| SpatialReasoner: Towards Explicit and Generalizable 3D Spatial Reasoning | Apr 28, 2025 | Question Answering, Spatial Reasoning | Unverified | 0 | 0 |
| SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities | Jan 22, 2024 | Question Answering, Spatial Reasoning | Unverified | 0 | 0 |
| Advancing Surgical VQA with Scene Graph Knowledge | Dec 15, 2023 | Question Answering, Visual Question Answering | Unverified | 0 | 0 |
| Breaking Down Questions for Outside-Knowledge Visual Question Answering | Nov 16, 2021 | Graph Neural Network, Question Answering | Unverified | 0 | 0 |
| Breaking Down Questions for Outside-Knowledge VQA | Sep 29, 2021 | Graph Neural Network, Question Answering | Unverified | 0 | 0 |
| SplatTalk: 3D VQA with Gaussian Splatting | Mar 8, 2025 | 3DGS, Question Answering | Unverified | 0 | 0 |
| Breaking Common Sense: WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images | Mar 13, 2023 | Common Sense Reasoning, Explanation Generation | Unverified | 0 | 0 |
| Boosting Cross-task Transferability of Adversarial Patches with Visual Relations | Apr 11, 2023 | Image Captioning, Object Recognition | Unverified | 0 | 0 |
| Stacked Latent Attention for Multimodal Reasoning | Jun 1, 2018 | Image Captioning, Multimodal Reasoning | Unverified | 0 | 0 |
| Stacking with Auxiliary Features for Visual Question Answering | Jun 1, 2018 | Common Sense Reasoning, Question Answering | Unverified | 0 | 0 |
| StackOverflowVQA: Stack Overflow Visual Question Answering Dataset | May 17, 2024 | Question Answering, Sentence | Unverified | 0 | 0 |
| Steering LVLMs via Sparse Autoencoder for Hallucination Mitigation | May 22, 2025 | Hallucination, Image Captioning | Unverified | 0 | 0 |
| BOK-VQA: Bilingual outside Knowledge-Based Visual Question Answering via Graph Representation Pretraining | Jan 12, 2024 | Question Answering, Visual Question Answering | Unverified | 0 | 0 |
| Blocks as Probes: Dissecting Categorization Ability of Large Multimodal Models | Sep 3, 2024 | Question Answering, Visual Question Answering | Unverified | 0 | 0 |
| Story Generation from Visual Inputs: Techniques, Related Tasks, and Challenges | Jun 4, 2024 | Question Answering, Story Generation | Unverified | 0 | 0 |
| Straight to the Facts: Learning Knowledge Base Retrieval for Factual Visual Question Answering | Sep 4, 2018 | Factual Visual Question Answering, General Knowledge | Unverified | 0 | 0 |
| Strengthening Multimodal Large Language Model with Bootstrapped Preference Optimization | Mar 13, 2024 | Language Modeling, Language Modelling | Unverified | 0 | 0 |
| StructuralLM: Structural Pre-training for Form Understanding | May 24, 2021 | document-image-classification, Document Image Classification | Unverified | 0 | 0 |
| Structure Causal Models and LLMs Integration in Medical Visual Question Answering | May 5, 2025 | Causal Inference, Medical Visual Question Answering | Unverified | 0 | 0 |
| Advancing Multimodal Medical Capabilities of Gemini | May 6, 2024 | Computed Tomography (CT), image-classification | Unverified | 0 | 0 |
| xGQA: Cross-Lingual Visual Question Answering | Oct 16, 2021 | Cross-Lingual Transfer, Language Modeling | Unverified | 0 | 0 |
| Structured Two-stream Attention Network for Video Question Answering | Jun 2, 2022 | Question Answering, Video Question Answering | Unverified | 0 | 0 |