| Separation of Powers: On Segregating Knowledge from Observation in LLM-enabled Knowledge-based Visual Question Answering | Jan 1, 2025 | Multiple-choice, Question Answering | —Unverified | 0 | 0 |
| Is the House Ready For Sleeptime? Generating and Evaluating Situational Queries for Embodied Question Answering | May 8, 2024 | 2k, Embodied Question Answering | —Unverified | 0 | 0 |
| Serving and Optimizing Machine Learning Workflows on Heterogeneous Infrastructures | May 10, 2022 | AutoML, BIG-bench Machine Learning | —Unverified | 0 | 0 |
| Visual Superordinate Abstraction for Robust Concept Learning | May 28, 2022 | Attribute, Question Answering | —Unverified | 0 | 0 |
| 3D Question Answering | Dec 15, 2021 | 3D geometry, Question Answering | —Unverified | 0 | 0 |
| Can Pre-training help VQA with Lexical Variations? | Nov 1, 2020 | Question Answering, Visual Question Answering | —Unverified | 0 | 0 |
| SHMamba: Structured Hyperbolic State Space Model for Audio-Visual Question Answering | Jun 14, 2024 | Audio-visual Question Answering, Audio-Visual Question Answering (AVQA) | —Unverified | 0 | 0 |
| Visual TTR - Modelling Visual Question Answering in Type Theory with Records | May 1, 2019 | Question Answering, Visual Question Answering | —Unverified | 0 | 0 |
| Can Open Domain Question Answering Systems Answer Visual Knowledge Questions? | Feb 9, 2022 | Open-Domain Question Answering, Question Answering | —Unverified | 0 | 0 |
| Can Multimodal LLMs do Visual Temporal Understanding and Reasoning? The answer is No! | Jan 18, 2025 | Multiple-choice, Question Answering | —Unverified | 0 | 0 |
| Show Why the Answer is Correct! Towards Explainable AI using Compositional Temporal Attention | May 15, 2021 | Question Answering, Visual Question Answering | —Unverified | 0 | 0 |
| ViT3D Alignment of LLaMA3: 3D Medical Image Report Generation | Oct 11, 2024 | Diagnostic, Language Modeling | —Unverified | 0 | 0 |
| SILC: Improving Vision Language Pretraining with Self-Distillation | Oct 20, 2023 | Classification, Contrastive Learning | —Unverified | 0 | 0 |
| Silkie: Preference Distillation for Large Visual Language Models | Dec 17, 2023 | Hallucination, MME | —Unverified | 0 | 0 |
| Generating Question Relevant Captions to Aid Visual Question Answering | Jun 3, 2019 | General Knowledge, Image Captioning | —Unverified | 0 | 0 |
| ViUniT: Visual Unit Tests for More Robust Visual Programming | Dec 12, 2024 | Image Generation, Image-text matching | —Unverified | 0 | 0 |
| Can LLMs Generate Human-Like Wayfinding Instructions? Towards Platform-Agnostic Embodied Instruction Synthesis | Mar 18, 2024 | In-Context Learning, Question Answering | —Unverified | 0 | 0 |
| Adversarial Representation Learning for Text-to-Image Matching | Aug 28, 2019 | Image Captioning, Language Modeling | —Unverified | 0 | 0 |
| Can Large Language Models Unveil the Mysteries? An Exploration of Their Ability to Unlock Information in Complex Scenarios | Feb 27, 2025 | Data Integration, Question Answering | —Unverified | 0 | 0 |
| Simple is not Easy: A Simple Strong Baseline for TextVQA and TextCaps | Dec 9, 2020 | Decoder, Image Captioning | —Unverified | 0 | 0 |
| SimpleLLM4AD: An End-to-End Vision-Language Model with Graph Visual Question Answering for Autonomous Driving | Jul 31, 2024 | Autonomous Driving, Language Modeling | —Unverified | 0 | 0 |
| SimpleVQA: Multimodal Factuality Evaluation for Multimodal Large Language Models | Feb 18, 2025 | Image Comprehension, Question Answering | —Unverified | 0 | 0 |
| SimpsonsVQA: Enhancing Inquiry-Based Learning with a Tailored Dataset | Oct 30, 2024 | Question Answering, Visual Question Answering | —Unverified | 0 | 0 |
| Can Common VLMs Rival Medical VLMs? Evaluation and Strategic Insights | Jun 19, 2025 | Question Answering, Visual Question Answering | —Unverified | 0 | 0 |
| SimVQA: Exploring Simulated Environments for Visual Question Answering | Mar 31, 2022 | Data Augmentation, Diversity | —Unverified | 0 | 0 |
| Single-Modal Entropy based Active Learning for Visual Question Answering | Oct 21, 2021 | Active Learning, Question Answering | —Unverified | 0 | 0 |
| Adversarial Regularization for Visual Question Answering: Strengths, Shortcomings, and Side Effects | Jun 20, 2019 | Question Answering, Visual Question Answering | —Unverified | 0 | 0 |
| SITE: towards Spatial Intelligence Thorough Evaluation | May 8, 2025 | Question Answering, Spatial Reasoning | —Unverified | 0 | 0 |
| Can CLIP Count Stars? An Empirical Study on Quantity Bias in CLIP | Sep 23, 2024 | Image Generation, Question Answering | —Unverified | 0 | 0 |
| Calibrating Uncertainty Quantification of Multi-Modal LLMs using Grounding | Apr 30, 2025 | Medical Question Answering, Question Answering | —Unverified | 0 | 0 |
| CAD -- Contextual Multi-modal Alignment for Dynamic AVQA | Oct 25, 2023 | Audio-visual Question Answering, Audio-Visual Question Answering (AVQA) | —Unverified | 0 | 0 |
| Building Trustworthy Multimodal AI: A Review of Fairness, Transparency, and Ethics in Vision-Language Tasks | Apr 14, 2025 | Ethics, Fairness | —Unverified | 0 | 0 |
| BuDDIE: A Business Document Dataset for Multi-task Information Extraction | Apr 5, 2024 | Document Classification, document understanding | —Unverified | 0 | 0 |
| 3D-CT-GPT: Generating 3D Radiology Reports through Integration of Large Vision-Language Models | Sep 28, 2024 | Diagnostic, Language Modeling | —Unverified | 0 | 0 |
| Small Language Model Meets with Reinforced Vision Vocabulary | Jan 23, 2024 | Language Modeling, Language Modelling | —Unverified | 0 | 0 |
| SMMILE: An Expert-Driven Benchmark for Multimodal Medical In-Context Learning | Jun 26, 2025 | In-Context Learning, Medical Visual Question Answering | —Unverified | 0 | 0 |
| Adversarial Multimodal Network for Movie Question Answering | Jun 24, 2019 | Question Answering, Video Question Answering | —Unverified | 0 | 0 |
| SnapNTell: Enhancing Entity-Centric Visual Question Answering with Retrieval Augmented Multimodal LLM | Mar 7, 2024 | Question Answering, Retrieval | —Unverified | 0 | 0 |
| SocialGesture: Delving into Multi-person Gesture Understanding | Apr 3, 2025 | Gesture Recognition, Question Answering | —Unverified | 0 | 0 |
| Bridging the Semantic Gaps: Improving Medical VQA Consistency with LLM-Augmented Question Sets | Apr 16, 2025 | Diversity, Medical Visual Question Answering | —Unverified | 0 | 0 |
| VL-BEiT: Generative Vision-Language Pretraining | Jun 2, 2022 | image-classification, Image Classification | —Unverified | 0 | 0 |
| Solution for SMART-101 Challenge of CVPR Multi-modal Algorithmic Reasoning Task 2024 | Jun 10, 2024 | Language Modelling, object-detection | —Unverified | 0 | 0 |
| Solution for SMART-101 Challenge of ICCV Multi-modal Algorithmic Reasoning Task 2023 | Oct 10, 2023 | Decoder, object-detection | —Unverified | 0 | 0 |
| Solving Visual Madlibs with Multiple Cues | Aug 11, 2016 | Activity Prediction, Attribute | —Unverified | 0 | 0 |
| Adversarial Attacks Beyond the Image Space | Nov 20, 2017 | Question Answering, Visual Question Answering | —Unverified | 0 | 0 |
| Sparks of Artificial General Intelligence(AGI) in Semiconductor Material Science: Early Explorations into the Next Frontier of Generative AI-Assisted Electron Micrograph Analysis | Sep 17, 2024 | In-Context Learning, Question Answering | —Unverified | 0 | 0 |
| Adventurer's Treasure Hunt: A Transparent System for Visually Grounded Compositional Visual Question Answering based on Scene Graphs | Jun 28, 2021 | Question Answering, Task 2 | —Unverified | 0 | 0 |
| Bridge Damage Cause Estimation Using Multiple Images Based on Visual Question Answering | Feb 18, 2023 | Question Answering, Visual Question Answering | —Unverified | 0 | 0 |
| VLFeedback: A Large-Scale AI Feedback Dataset for Large Vision-Language Models Alignment | Oct 12, 2024 | Diversity, Hallucination | —Unverified | 0 | 0 |
| Sparse Attention Vectors: Generative Multimodal Model Features Are Discriminative Vision-Language Classifiers | Nov 28, 2024 | Image Captioning, image-classification | —Unverified | 0 | 0 |