| Breaking Neural Network Scaling Laws with Modularity | Sep 9, 2024 | Question Answering, Visual Question Answering | —Unverified | 0 | 0 |
| Spatial Attention as an Interface for Image Captioning Models | Sep 29, 2020 | Image Captioning, Question Answering | —Unverified | 0 | 0 |
| Spatial Knowledge Distillation to aid Visual Reasoning | Dec 10, 2018 | Diagnostic, Knowledge Distillation | —Unverified | 0 | 0 |
| SpatialReasoner: Towards Explicit and Generalizable 3D Spatial Reasoning | Apr 28, 2025 | Question Answering, Spatial Reasoning | —Unverified | 0 | 0 |
| SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities | Jan 22, 2024 | Question Answering, Spatial Reasoning | —Unverified | 0 | 0 |
| Advancing Surgical VQA with Scene Graph Knowledge | Dec 15, 2023 | Question Answering, Visual Question Answering | —Unverified | 0 | 0 |
| Breaking Down Questions for Outside-Knowledge Visual Question Answering | Nov 16, 2021 | Graph Neural Network, Question Answering | —Unverified | 0 | 0 |
| Breaking Down Questions for Outside-Knowledge VQA | Sep 29, 2021 | Graph Neural Network, Question Answering | —Unverified | 0 | 0 |
| SplatTalk: 3D VQA with Gaussian Splatting | Mar 8, 2025 | 3DGS, Question Answering | —Unverified | 0 | 0 |
| Breaking Common Sense: WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images | Mar 13, 2023 | Common Sense Reasoning, Explanation Generation | —Unverified | 0 | 0 |
| Boosting Cross-task Transferability of Adversarial Patches with Visual Relations | Apr 11, 2023 | Image Captioning, Object Recognition | —Unverified | 0 | 0 |
| Stacked Latent Attention for Multimodal Reasoning | Jun 1, 2018 | Image Captioning, Multimodal Reasoning | —Unverified | 0 | 0 |
| Stacking with Auxiliary Features for Visual Question Answering | Jun 1, 2018 | Common Sense Reasoning, Question Answering | —Unverified | 0 | 0 |
| StackOverflowVQA: Stack Overflow Visual Question Answering Dataset | May 17, 2024 | Question Answering, Sentence | —Unverified | 0 | 0 |
| Steering LVLMs via Sparse Autoencoder for Hallucination Mitigation | May 22, 2025 | Hallucination, Image Captioning | —Unverified | 0 | 0 |
| BOK-VQA: Bilingual outside Knowledge-Based Visual Question Answering via Graph Representation Pretraining | Jan 12, 2024 | Question Answering, Visual Question Answering | —Unverified | 0 | 0 |
| Blocks as Probes: Dissecting Categorization Ability of Large Multimodal Models | Sep 3, 2024 | Question Answering, Visual Question Answering | —Unverified | 0 | 0 |
| Story Generation from Visual Inputs: Techniques, Related Tasks, and Challenges | Jun 4, 2024 | Question Answering, Story Generation | —Unverified | 0 | 0 |
| Straight to the Facts: Learning Knowledge Base Retrieval for Factual Visual Question Answering | Sep 4, 2018 | Factual Visual Question Answering, General Knowledge | —Unverified | 0 | 0 |
| Strengthening Multimodal Large Language Model with Bootstrapped Preference Optimization | Mar 13, 2024 | Language Modeling, Language Modelling | —Unverified | 0 | 0 |
| StructuralLM: Structural Pre-training for Form Understanding | May 24, 2021 | document-image-classification, Document Image Classification | —Unverified | 0 | 0 |
| Structure Causal Models and LLMs Integration in Medical Visual Question Answering | May 5, 2025 | Causal Inference, Medical Visual Question Answering | —Unverified | 0 | 0 |
| Advancing Multimodal Medical Capabilities of Gemini | May 6, 2024 | Computed Tomography (CT), image-classification | —Unverified | 0 | 0 |
| xGQA: Cross-Lingual Visual Question Answering | Oct 16, 2021 | Cross-Lingual Transfer, Language Modeling | —Unverified | 0 | 0 |
| Structured Two-stream Attention Network for Video Question Answering | Jun 2, 2022 | Question Answering, Video Question Answering | —Unverified | 0 | 0 |
| Structure Guided Multi-modal Pre-trained Transformer for Knowledge Graph Reasoning | Jul 6, 2023 | Knowledge Graphs, Question Answering | —Unverified | 0 | 0 |
| Structure Learning for Neural Module Networks | May 27, 2019 | Question Answering, Visual Question Answering | —Unverified | 0 | 0 |
| Sunny and Dark Outside?! Improving Answer Consistency in VQA through Entailed Question Generation | Sep 10, 2019 | Common Sense Reasoning, Data Augmentation | —Unverified | 0 | 0 |
| Beyond VQA: Generating Multi-word Answer and Rationale to Visual Questions | Oct 24, 2020 | General Classification, Multiple-choice | —Unverified | 0 | 0 |
| Feedback-Driven Vision-Language Alignment with Minimal Human Supervision | Jan 8, 2025 | Hallucination, Question Answering | —Unverified | 0 | 0 |
| VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks | Oct 7, 2024 | Information Retrieval, Language Modeling | —Unverified | 0 | 0 |
| Beyond the Hype: A dispassionate look at vision-language models in medical scenario | Aug 16, 2024 | Question Answering, Spatial Reasoning | —Unverified | 0 | 0 |
| Surgical-LLaVA: Toward Surgical Scenario Understanding via Large Language and Vision Models | Oct 13, 2024 | Instruction Following, Question Answering | —Unverified | 0 | 0 |
| Surgical-LVLM: Learning to Adapt Large Vision-Language Model for Grounded Visual Question Answering in Robotic Surgery | Mar 22, 2024 | Language Modeling, Language Modelling | —Unverified | 0 | 0 |
| SurgicalVLM-Agent: Towards an Interactive AI Co-Pilot for Pituitary Surgery | Mar 12, 2025 | Activity Recognition, Anatomy | —Unverified | 0 | 0 |
| Beyond the Frame: Generating 360° Panoramic Videos from Perspective Videos | Apr 10, 2025 | Question Answering, Video Generation | —Unverified | 0 | 0 |
| Beyond Logit Lens: Contextual Embeddings for Robust Hallucination Detection & Grounding in VLMs | Nov 28, 2024 | Attribute, Hallucination | —Unverified | 0 | 0 |
| Survey of Large Multimodal Model Datasets, Application Categories and Taxonomy | Dec 23, 2024 | Image Captioning, Question Answering | —Unverified | 0 | 0 |
| Survey of Recent Advances in Visual Question Answering | Sep 24, 2017 | Question Answering, Survey | —Unverified | 0 | 0 |
| Survey of Visual Question Answering: Datasets and Techniques | May 10, 2017 | Deep Learning, Question Answering | —Unverified | 0 | 0 |
| Survey of Visual-Semantic Embedding Methods for Zero-Shot Image Retrieval | May 16, 2021 | Graph Generation, Image Captioning | —Unverified | 0 | 0 |
| SViQA: A Unified Speech-Vision Multimodal Model for Textless Visual Question Answering | Apr 1, 2025 | cross-modal alignment, Question Answering | —Unverified | 0 | 0 |
| Beyond Human Vision: The Role of Large Vision Language Models in Microscope Image Analysis | May 1, 2024 | Image Captioning, Question Answering | —Unverified | 0 | 0 |
| Swarm Intelligence in Geo-Localization: A Multi-Agent Large Vision-Language Model Collaborative Framework | Aug 21, 2024 | geo-localization, Language Modeling | —Unverified | 0 | 0 |
| Beyond Captioning: Task-Specific Prompting for Improved VLM Performance in Mathematical Reasoning | Oct 8, 2024 | Image Retrieval, Math | —Unverified | 0 | 0 |
| Switch-BERT: Learning to Model Multimodal Interactions by Switching Attention and Input | Jun 25, 2023 | Diversity, Image-text Retrieval | —Unverified | 0 | 0 |
| SyCoCa: Symmetrizing Contrastive Captioners with Attentive Masking for Multimodal Alignment | Jan 4, 2024 | Image Captioning, image-classification | —Unverified | 0 | 0 |
| BESTMVQA: A Benchmark Evaluation System for Medical Visual Question Answering | Dec 13, 2023 | Medical Visual Question Answering, Question Answering | —Unverified | 0 | 0 |
| Syntax Tree Constrained Graph Network for Visual Question Answering | Sep 17, 2023 | Question Answering, Visual Question Answering | —Unverified | 0 | 0 |
| Synthesize Step-by-Step: Tools, Templates and LLMs as Data Generators for Reasoning-Based Chart VQA | Mar 25, 2024 | Chart Question Answering, Data Augmentation | —Unverified | 0 | 0 |