| SimpleLLM4AD: An End-to-End Vision-Language Model with Graph Visual Question Answering for Autonomous Driving | Jul 31, 2024 | Autonomous Driving, Language Modeling | —Unverified | 0 |
| SimpleVQA: Multimodal Factuality Evaluation for Multimodal Large Language Models | Feb 18, 2025 | Image Comprehension, Question Answering | —Unverified | 0 |
| SimpsonsVQA: Enhancing Inquiry-Based Learning with a Tailored Dataset | Oct 30, 2024 | Question Answering, Visual Question Answering | —Unverified | 0 |
| SimVQA: Exploring Simulated Environments for Visual Question Answering | Mar 31, 2022 | Data Augmentation, Diversity | —Unverified | 0 |
| Single-Modal Entropy based Active Learning for Visual Question Answering | Oct 21, 2021 | Active Learning, Question Answering | —Unverified | 0 |
| SITE: towards Spatial Intelligence Thorough Evaluation | May 8, 2025 | Question Answering, Spatial Reasoning | —Unverified | 0 |
| Small Language Model Meets with Reinforced Vision Vocabulary | Jan 23, 2024 | Language Modeling, Language Modelling | —Unverified | 0 |
| SMMILE: An Expert-Driven Benchmark for Multimodal Medical In-Context Learning | Jun 26, 2025 | In-Context Learning, Medical Visual Question Answering | —Unverified | 0 |
| SnapNTell: Enhancing Entity-Centric Visual Question Answering with Retrieval Augmented Multimodal LLM | Mar 7, 2024 | Question Answering, Retrieval | —Unverified | 0 |
| SocialGesture: Delving into Multi-person Gesture Understanding | Apr 3, 2025 | Gesture Recognition, Question Answering | —Unverified | 0 |
| Solution for SMART-101 Challenge of CVPR Multi-modal Algorithmic Reasoning Task 2024 | Jun 10, 2024 | Language Modelling, object-detection | —Unverified | 0 |
| Solution for SMART-101 Challenge of ICCV Multi-modal Algorithmic Reasoning Task 2023 | Oct 10, 2023 | Decoder, object-detection | —Unverified | 0 |
| Solving Visual Madlibs with Multiple Cues | Aug 11, 2016 | Activity Prediction, Attribute | —Unverified | 0 |
| Sparks of Artificial General Intelligence(AGI) in Semiconductor Material Science: Early Explorations into the Next Frontier of Generative AI-Assisted Electron Micrograph Analysis | Sep 17, 2024 | In-Context Learning, Question Answering | —Unverified | 0 |
| Sparse Attention Vectors: Generative Multimodal Model Features Are Discriminative Vision-Language Classifiers | Nov 28, 2024 | Image Captioning, image-classification | —Unverified | 0 |
| Spatial Attention as an Interface for Image Captioning Models | Sep 29, 2020 | Image Captioning, Question Answering | —Unverified | 0 |
| Spatial Knowledge Distillation to aid Visual Reasoning | Dec 10, 2018 | Diagnostic, Knowledge Distillation | —Unverified | 0 |
| SpatialReasoner: Towards Explicit and Generalizable 3D Spatial Reasoning | Apr 28, 2025 | Question Answering, Spatial Reasoning | —Unverified | 0 |
| SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities | Jan 22, 2024 | Question Answering, Spatial Reasoning | —Unverified | 0 |
| SplatTalk: 3D VQA with Gaussian Splatting | Mar 8, 2025 | 3DGS, Question Answering | —Unverified | 0 |
| Stacked Latent Attention for Multimodal Reasoning | Jun 1, 2018 | Image Captioning, Multimodal Reasoning | —Unverified | 0 |
| Stacking with Auxiliary Features for Visual Question Answering | Jun 1, 2018 | Common Sense Reasoning, Question Answering | —Unverified | 0 |
| StackOverflowVQA: Stack Overflow Visual Question Answering Dataset | May 17, 2024 | Question Answering, Sentence | —Unverified | 0 |
| Steering LVLMs via Sparse Autoencoder for Hallucination Mitigation | May 22, 2025 | Hallucination, Image Captioning | —Unverified | 0 |
| Story Generation from Visual Inputs: Techniques, Related Tasks, and Challenges | Jun 4, 2024 | Question Answering, Story Generation | —Unverified | 0 |
| Straight to the Facts: Learning Knowledge Base Retrieval for Factual Visual Question Answering | Sep 4, 2018 | Factual Visual Question Answering, General Knowledge | —Unverified | 0 |
| Strengthening Multimodal Large Language Model with Bootstrapped Preference Optimization | Mar 13, 2024 | Language Modeling, Language Modelling | —Unverified | 0 |
| StructuralLM: Structural Pre-training for Form Understanding | May 24, 2021 | document-image-classification, Document Image Classification | —Unverified | 0 |
| Structure Causal Models and LLMs Integration in Medical Visual Question Answering | May 5, 2025 | Causal Inference, Medical Visual Question Answering | —Unverified | 0 |
| Structured Two-stream Attention Network for Video Question Answering | Jun 2, 2022 | Question Answering, Video Question Answering | —Unverified | 0 |
| Structure Guided Multi-modal Pre-trained Transformer for Knowledge Graph Reasoning | Jul 6, 2023 | Knowledge Graphs, Question Answering | —Unverified | 0 |
| Structure Learning for Neural Module Networks | May 27, 2019 | Question Answering, Visual Question Answering | —Unverified | 0 |
| Sunny and Dark Outside?! Improving Answer Consistency in VQA through Entailed Question Generation | Sep 10, 2019 | Common Sense Reasoning, Data Augmentation | —Unverified | 0 |
| Feedback-Driven Vision-Language Alignment with Minimal Human Supervision | Jan 8, 2025 | Hallucination, Question Answering | —Unverified | 0 |
| Surgical-LLaVA: Toward Surgical Scenario Understanding via Large Language and Vision Models | Oct 13, 2024 | Instruction Following, Question Answering | —Unverified | 0 |
| Surgical-LVLM: Learning to Adapt Large Vision-Language Model for Grounded Visual Question Answering in Robotic Surgery | Mar 22, 2024 | Language Modeling, Language Modelling | —Unverified | 0 |
| SurgicalVLM-Agent: Towards an Interactive AI Co-Pilot for Pituitary Surgery | Mar 12, 2025 | Activity Recognition, Anatomy | —Unverified | 0 |
| Survey of Large Multimodal Model Datasets, Application Categories and Taxonomy | Dec 23, 2024 | Image Captioning, Question Answering | —Unverified | 0 |
| Survey of Recent Advances in Visual Question Answering | Sep 24, 2017 | Question Answering, Survey | —Unverified | 0 |
| Survey of Visual Question Answering: Datasets and Techniques | May 10, 2017 | Deep Learning, Question Answering | —Unverified | 0 |
| Survey of Visual-Semantic Embedding Methods for Zero-Shot Image Retrieval | May 16, 2021 | Graph Generation, Image Captioning | —Unverified | 0 |
| SViQA: A Unified Speech-Vision Multimodal Model for Textless Visual Question Answering | Apr 1, 2025 | cross-modal alignment, Question Answering | —Unverified | 0 |
| Swarm Intelligence in Geo-Localization: A Multi-Agent Large Vision-Language Model Collaborative Framework | Aug 21, 2024 | geo-localization, Language Modeling | —Unverified | 0 |
| Switch-BERT: Learning to Model Multimodal Interactions by Switching Attention and Input | Jun 25, 2023 | Diversity, Image-text Retrieval | —Unverified | 0 |
| SyCoCa: Symmetrizing Contrastive Captioners with Attentive Masking for Multimodal Alignment | Jan 4, 2024 | Image Captioning, image-classification | —Unverified | 0 |
| Syntax Tree Constrained Graph Network for Visual Question Answering | Sep 17, 2023 | Question Answering, Visual Question Answering | —Unverified | 0 |
| Synthesize Step-by-Step: Tools, Templates and LLMs as Data Generators for Reasoning-Based Chart VQA | Mar 25, 2024 | Chart Question Answering, Data Augmentation | —Unverified | 0 |
| Synthesize Step-by-Step: Tools Templates and LLMs as Data Generators for Reasoning-Based Chart VQA | Jan 1, 2024 | Chart Question Answering, Data Augmentation | —Unverified | 0 |
| T2I-FactualBench: Benchmarking the Factuality of Text-to-Image Models with Knowledge-Intensive Concepts | Dec 5, 2024 | Benchmarking, Image Generation | —Unverified | 0 |
| Tackling VQA with Pretrained Foundation Models without Further Training | Sep 27, 2023 | Question Answering, Visual Question Answering | —Unverified | 0 |