| Single-Modal Entropy based Active Learning for Visual Question Answering | Oct 21, 2021 | Active Learning, Question Answering | —Unverified | 0 | 0 |
| Adversarial Regularization for Visual Question Answering: Strengths, Shortcomings, and Side Effects | Jun 20, 2019 | Question Answering, Visual Question Answering | —Unverified | 0 | 0 |
| SITE: towards Spatial Intelligence Thorough Evaluation | May 8, 2025 | Question Answering, Spatial Reasoning | —Unverified | 0 | 0 |
| Can CLIP Count Stars? An Empirical Study on Quantity Bias in CLIP | Sep 23, 2024 | Image Generation, Question Answering | —Unverified | 0 | 0 |
| Calibrating Uncertainty Quantification of Multi-Modal LLMs using Grounding | Apr 30, 2025 | Medical Question Answering, Question Answering | —Unverified | 0 | 0 |
| CAD -- Contextual Multi-modal Alignment for Dynamic AVQA | Oct 25, 2023 | Audio-visual Question Answering, Audio-Visual Question Answering (AVQA) | —Unverified | 0 | 0 |
| Building Trustworthy Multimodal AI: A Review of Fairness, Transparency, and Ethics in Vision-Language Tasks | Apr 14, 2025 | Ethics, Fairness | —Unverified | 0 | 0 |
| BuDDIE: A Business Document Dataset for Multi-task Information Extraction | Apr 5, 2024 | Document Classification, document understanding | —Unverified | 0 | 0 |
| 3D-CT-GPT: Generating 3D Radiology Reports through Integration of Large Vision-Language Models | Sep 28, 2024 | Diagnostic, Language Modeling | —Unverified | 0 | 0 |
| Small Language Model Meets with Reinforced Vision Vocabulary | Jan 23, 2024 | Language Modeling, Language Modelling | —Unverified | 0 | 0 |
| SMMILE: An Expert-Driven Benchmark for Multimodal Medical In-Context Learning | Jun 26, 2025 | In-Context Learning, Medical Visual Question Answering | —Unverified | 0 | 0 |
| Adversarial Multimodal Network for Movie Question Answering | Jun 24, 2019 | Question Answering, Video Question Answering | —Unverified | 0 | 0 |
| SnapNTell: Enhancing Entity-Centric Visual Question Answering with Retrieval Augmented Multimodal LLM | Mar 7, 2024 | Question Answering, Retrieval | —Unverified | 0 | 0 |
| SocialGesture: Delving into Multi-person Gesture Understanding | Apr 3, 2025 | Gesture Recognition, Question Answering | —Unverified | 0 | 0 |
| Bridging the Semantic Gaps: Improving Medical VQA Consistency with LLM-Augmented Question Sets | Apr 16, 2025 | Diversity, Medical Visual Question Answering | —Unverified | 0 | 0 |
| VL-BEiT: Generative Vision-Language Pretraining | Jun 2, 2022 | image-classification, Image Classification | —Unverified | 0 | 0 |
| Solution for SMART-101 Challenge of CVPR Multi-modal Algorithmic Reasoning Task 2024 | Jun 10, 2024 | Language Modelling, object-detection | —Unverified | 0 | 0 |
| Solution for SMART-101 Challenge of ICCV Multi-modal Algorithmic Reasoning Task 2023 | Oct 10, 2023 | Decoder, object-detection | —Unverified | 0 | 0 |
| Solving Visual Madlibs with Multiple Cues | Aug 11, 2016 | Activity Prediction, Attribute | —Unverified | 0 | 0 |
| Adversarial Attacks Beyond the Image Space | Nov 20, 2017 | Question Answering, Visual Question Answering | —Unverified | 0 | 0 |
| Sparks of Artificial General Intelligence(AGI) in Semiconductor Material Science: Early Explorations into the Next Frontier of Generative AI-Assisted Electron Micrograph Analysis | Sep 17, 2024 | In-Context Learning, Question Answering | —Unverified | 0 | 0 |
| Adventurer's Treasure Hunt: A Transparent System for Visually Grounded Compositional Visual Question Answering based on Scene Graphs | Jun 28, 2021 | Question Answering, Task 2 | —Unverified | 0 | 0 |
| Bridge Damage Cause Estimation Using Multiple Images Based on Visual Question Answering | Feb 18, 2023 | Question Answering, Visual Question Answering | —Unverified | 0 | 0 |
| VLFeedback: A Large-Scale AI Feedback Dataset for Large Vision-Language Models Alignment | Oct 12, 2024 | Diversity, Hallucination | —Unverified | 0 | 0 |
| Sparse Attention Vectors: Generative Multimodal Model Features Are Discriminative Vision-Language Classifiers | Nov 28, 2024 | Image Captioning, image-classification | —Unverified | 0 | 0 |