| GPT-4V Explorations: Mining Autonomous Driving | Jun 24, 2024 | Autonomous Driving, Decision Making | —Unverified | 0 |
| Can You Explain That? Lucid Explanations Help Human-AI Collaborative Image Retrieval | Apr 5, 2019 | Image Retrieval, Question Answering | —Unverified | 0 |
| DLIP: Distilling Language-Image Pre-training | Aug 24, 2023 | Image Captioning, Image-text Retrieval | —Unverified | 0 |
| Generating Question Relevant Captions to Aid Visual Question Answering | Jun 3, 2019 | General Knowledge, Image Captioning | —Unverified | 0 |
| M4CXR: Exploring Multi-task Potentials of Multi-modal Large Language Models for Chest X-ray Interpretation | Aug 29, 2024 | Instruction Following, Medical Report Generation | —Unverified | 0 |
| Good, Better, Best: Textual Distractors Generation for Multiple-Choice Visual Question Answering via Reinforcement Learning | Oct 21, 2019 | Data Augmentation, Decision Making | —Unverified | 0 |
| Look, Read and Ask: Learning to Ask Questions by Reading Text in Images | Nov 23, 2022 | Optical Character Recognition (OCR), Question Answering | —Unverified | 0 |
| Language bias in Visual Question Answering: A Survey and Taxonomy | Nov 16, 2021 | Question Answering, Visual Question Answering | —Unverified | 0 |
| Goal-Oriented Semantic Communication for Wireless Visual Question Answering | Nov 3, 2024 | Edge-computing, Question Answering | —Unverified | 0 |
| γ-MoD: Exploring Mixture-of-Depth Adaptation for Multimodal Large Language Models | Oct 17, 2024 | Visual Question Answering | —Unverified | 0 |
| Language-Image Models with 3D Understanding | May 6, 2024 | Question Answering, Visual Question Answering | —Unverified | 0 |
| A Multimodal Social Agent | Dec 11, 2024 | Common Sense Reasoning, Decision Making | —Unverified | 0 |
| Does CLIP Benefit Visual Question Answering in the Medical Domain as Much as it Does in the General Domain? | Dec 27, 2021 | Articles, Medical Visual Question Answering | —Unverified | 0 |
| Consistency and Uncertainty: Identifying Unreliable Responses From Black-Box Vision-Language Models for Selective Visual Question Answering | Apr 16, 2024 | Language Modelling, Prediction | —Unverified | 0 |
| GiVE: Guiding Visual Encoder to Perceive Overlooked Information | Oct 26, 2024 | Object, Question Answering | —Unverified | 0 |
| Connecting Language and Vision to Actions | Jul 1, 2018 | Image Captioning, Language Modeling | —Unverified | 0 |
| Attentive Explanations: Justifying Decisions and Pointing to the Evidence | Dec 14, 2016 | Decision Making, Question Answering | —Unverified | 0 |
| GeoRSMLLM: A Multimodal Large Language Model for Vision-Language Tasks in Geoscience and Remote Sensing | Mar 16, 2025 | Change Detection, Image Captioning | —Unverified | 0 |
| Does my multimodal model learn cross-modal interactions? It's harder to tell than you might think! | Oct 13, 2020 | Diagnostic, Image-text Classification | —Unverified | 0 |
| GeoPix: Multi-Modal Large Language Model for Pixel-level Image Understanding in Remote Sensing | Jan 12, 2025 | Image Captioning, Language Modeling | —Unverified | 0 |
| Attention Overlap Is Responsible for The Entity Missing Problem in Text-to-image Diffusion Models! | Oct 28, 2024 | Denoising, Question Answering | —Unverified | 0 |
| A Multimodal Memes Classification: A Survey and Open Research Issues | Sep 17, 2020 | Classification, General Classification | —Unverified | 0 |
| LRRA: A Transparent Neural-Symbolic Reasoning Framework for Real-World Visual Question Answering | Aug 1, 2021 | Question Answering, Visual Question Answering | —Unverified | 0 |
| Compressing Visual-linguistic Model via Knowledge Distillation | Apr 5, 2021 | Image Captioning, Knowledge Distillation | —Unverified | 0 |
| Generic Attention-model Explainability by Weighted Relevance Accumulation | Aug 20, 2023 | Image Captioning, Question Answering | —Unverified | 0 |
| Do Explanations make VQA Models more Predictable to a Human? | Oct 29, 2018 | Question Answering, Visual Question Answering | —Unverified | 0 |
| Latent Variable Models for Visual Question Answering | Jan 16, 2021 | Benchmarking, Question Answering | —Unverified | 0 |
| Generative Visual Question Answering | Jul 18, 2023 | Generative Visual Question Answering, Question Answering | —Unverified | 0 |
| American == White in Multimodal Language-and-Image AI | Jul 1, 2022 | Image Captioning, Question Answering | —Unverified | 0 |
| Abduction of Domain Relationships from Data for VQA | Feb 13, 2025 | Question Answering, Visual Question Answering | —Unverified | 0 |
| Compound Tokens: Channel Fusion for Vision-Language Representation Learning | Dec 2, 2022 | Decoder, Language Modeling | —Unverified | 0 |
| MM-SpuBench: Towards Better Understanding of Spurious Biases in Multimodal LLMs | Jun 24, 2024 | Question Answering, Visual Question Answering | —Unverified | 0 |
| Generating Triples with Adversarial Networks for Scene Graph Construction | Feb 7, 2018 | Attribute, graph construction | —Unverified | 0 |
| Compositional Memory for Visual Question Answering | Nov 18, 2015 | Question Answering, Visual Question Answering | —Unverified | 0 |
| Double Visual Defense: Adversarial Pre-training and Instruction Tuning for Improving Vision-Language Model Robustness | Jan 16, 2025 | Adversarial Defense, Adversarial Robustness | —Unverified | 0 |
| Learning Answer Embeddings for Visual Question Answering | Jun 10, 2018 | Question Answering, Transfer Learning | —Unverified | 0 |
| Attention Mechanism based Cognition-level Scene Understanding | Apr 17, 2022 | Question Answering, Scene Understanding | —Unverified | 0 |
| Learning by Asking Questions | Dec 4, 2017 | Question Answering, Visual Question Answering | —Unverified | 0 |
| Look, Learn and Leverage (L^3): Mitigating Visual-Domain Shift and Discovering Intrinsic Relations via Symbolic Alignment | Aug 30, 2024 | Question Answering, Representation Learning | —Unverified | 0 |
| Learning Compositional Representation for Few-shot Visual Question Answering | Feb 21, 2021 | Attribute, Question Answering | —Unverified | 0 |
| Generating Rationales in Visual Question Answering | Apr 4, 2020 | Question Answering, Visual Question Answering | —Unverified | 0 |
| Generating Natural Questions from Images for Multimodal Assistants | Nov 17, 2020 | Common Sense Reasoning, Natural Questions | —Unverified | 0 |
| DreamSync: Aligning Text-to-Image Generation with Image Understanding Feedback | Nov 29, 2023 | Image Generation, Question Answering | —Unverified | 0 |
| Attention Guided Semantic Relationship Parsing for Visual Question Answering | Oct 5, 2020 | Object, Question Answering | —Unverified | 0 |
| Generating Natural Language Explanations for Visual Question Answering using Scene Graphs and Visual Attention | Feb 15, 2019 | Explanation Generation, Language Modeling | —Unverified | 0 |
| Learning How To Ask: Cycle-Consistency Refines Prompts in Multimodal Foundation Models | Feb 13, 2024 | Code Generation, HumanEval | —Unverified | 0 |
| Generate then Select: Open-ended Visual Question Answering Guided by World Knowledge | May 30, 2023 | Answer Selection, Question Answering | —Unverified | 0 |
| Learning Reasoning Paths over Semantic Graphs for Video-grounded Dialogues | Mar 1, 2021 | Question Answering, Visual Question Answering | —Unverified | 0 |
| Explainable High-order Visual Question Reasoning: A New Benchmark and Knowledge-routed Network | Sep 23, 2019 | Question Answering, Triplet | —Unverified | 0 |
| Compositional Attention Networks for Interpretability in Natural Language Question Answering | Oct 30, 2018 | Logical Reasoning, Question Answering | —Unverified | 0 |