| Hummingbird: High Fidelity Image Generation via Multimodal Context Alignment | Feb 7, 2025 | Diversity, Human-Object Interaction Detection | —Unverified | 0 | 0 |
| Hyperbolic Attention Networks | May 24, 2018 | Machine Translation, Question Answering | —Unverified | 0 | 0 |
| Hyper-dimensional computing for a visual question-answering system that is trainable end-to-end | Nov 28, 2017 | Question Answering, Visual Question Answering | —Unverified | 0 | 0 |
| Graph Relation Transformer: Incorporating pairwise object features into the Transformer architecture | Nov 11, 2021 | Graph Attention, Question Answering | —Unverified | 0 | 0 |
| Bilinear Graph Networks for Visual Question Answering | Jul 23, 2019 | Question Answering, Visual Question Answering | —Unverified | 0 | 0 |
| Hypo3D: Exploring Hypothetical Reasoning in 3D | Feb 2, 2025 | Question Answering, Visual Question Answering | —Unverified | 0 | 0 |
| Graph Neural Networks in Vision-Language Image Understanding: A Survey | Mar 7, 2023 | Image Captioning, Image Retrieval | —Unverified | 0 | 0 |
| Graph-based Heuristic Search for Module Selection Procedure in Neural Module Network | Sep 30, 2020 | Heuristic Search, Question Answering | —Unverified | 0 | 0 |
| ICDAR 2019 Competition on Scene Text Visual Question Answering | Jun 30, 2019 | Question Answering, Visual Question Answering | —Unverified | 0 | 0 |
| GRAM: Global Reasoning for Multi-Page VQA | Jan 7, 2024 | Question Answering, Visual Question Answering | —Unverified | 0 | 0 |
| i-Code Studio: A Configurable and Composable Framework for Integrative AI | May 23, 2023 | Question Answering, Retrieval | —Unverified | 0 | 0 |
| GRADE: Quantifying Sample Diversity in Text-to-Image Models | Oct 29, 2024 | Attribute, Diversity | —Unverified | 0 | 0 |
| GPT-4V Explorations: Mining Autonomous Driving | Jun 24, 2024 | Autonomous Driving, Decision Making | —Unverified | 0 | 0 |
| Understanding Knowledge Gaps in Visual Question Answering: Implications for Gap Identification and Testing | Apr 8, 2020 | Diversity, Question Answering | —Unverified | 0 | 0 |
| What is needed for simple spatial language capabilities in VQA? | Aug 17, 2019 | Diagnostic, Question Answering | —Unverified | 0 | 0 |
| Good, Better, Best: Textual Distractors Generation for Multiple-Choice Visual Question Answering via Reinforcement Learning | Oct 21, 2019 | Data Augmentation, Decision Making | —Unverified | 0 | 0 |
| ILLUME: Illuminating Your LLMs to See, Draw, and Self-Enhance | Dec 9, 2024 | Image Generation, Language Modeling | —Unverified | 0 | 0 |
| Understanding the Role of Scene Graphs in Visual Question Answering | Jan 14, 2021 | Graph Generation, Question Answering | —Unverified | 0 | 0 |
| Goal-Oriented Semantic Communication for Wireless Visual Question Answering | Nov 3, 2024 | Edge-computing, Question Answering | —Unverified | 0 | 0 |
| ZALM3: Zero-Shot Enhancement of Vision-Language Alignment via In-Context Information in Multi-Turn Multimodal Medical Dialogue | Sep 26, 2024 | Medical Visual Question Answering, Question Answering | —Unverified | 0 | 0 |
| Are You Smarter Than a Sixth Grader? Textbook Question Answering for Multimodal Machine Comprehension | Jul 1, 2017 | Question Answering, Reading Comprehension | —Unverified | 0 | 0 |
| CLIPPO: Image-and-Language Understanding from Pixels Only | Dec 15, 2022 | Contrastive Learning, image-classification | —Unverified | 0 | 0 |
| UnICLAM: Contrastive Representation Learning with Adversarial Masking for Unified and Interpretable Medical Vision Question Answering | Dec 21, 2022 | Data Augmentation, Decision Making | —Unverified | 0 | 0 |
| Image as a Foreign Language: BEiT Pretraining for Vision and Vision-Language Tasks | Jan 1, 2023 | Cross-Modal Retrieval, Image Captioning | —Unverified | 0 | 0 |
| Image Captioning and Visual Question Answering Based on Attributes and External Knowledge | Mar 9, 2016 | General Knowledge, Image Captioning | —Unverified | 0 | 0 |
| Bidirectional Contrastive Split Learning for Visual Question Answering | Aug 24, 2022 | Adversarial Attack, Backdoor Attack | —Unverified | 0 | 0 |
| Image Captioning with Compositional Neural Module Networks | Jul 10, 2020 | Image Captioning, Question Answering | —Unverified | 0 | 0 |
| Uni-EDEN: Universal Encoder-Decoder Network by Multi-Granular Vision-Language Pre-training | Jan 11, 2022 | Decoder, Image Captioning | —Unverified | 0 | 0 |
| Image Manipulation via Multi-Hop Instructions -- A New Dataset and Weakly-Supervised Neuro-Symbolic Approach | May 23, 2023 | Image Manipulation, Question Answering | —Unverified | 0 | 0 |
| Image-of-Thought Prompting for Visual Reasoning Refinement in Multimodal Large Language Models | May 22, 2024 | Multimodal Reasoning, Visual Question Answering | —Unverified | 0 | 0 |
| Image Position Prediction in Multimodal Documents | May 1, 2020 | Articles, Caption Generation | —Unverified | 0 | 0 |
| Image Semantic Relation Generation | Oct 19, 2022 | Image Retrieval, Image Segmentation | —Unverified | 0 | 0 |
| ImageTTR: Grounding Type Theory with Records in Image Classification for Visual Question Answering | Jun 1, 2019 | General Classification, image-classification | —Unverified | 0 | 0 |
| Img-Diff: Contrastive Data Synthesis for Multimodal Large Language Models | Aug 8, 2024 | Contrastive Learning, Fine-Grained Image Recognition | —Unverified | 0 | 0 |
| γ-MoD: Exploring Mixture-of-Depth Adaptation for Multimodal Large Language Models | Oct 17, 2024 | Visual Question Answering | —Unverified | 0 | 0 |
| Imperfect Vision Encoders: Efficient and Robust Tuning for Vision-Language Models | Jul 23, 2024 | Computational Efficiency, Image Captioning | —Unverified | 0 | 0 |
| GiVE: Guiding Visual Encoder to Perceive Overlooked Information | Oct 26, 2024 | Object, Question Answering | —Unverified | 0 | 0 |
| Improved Alignment of Modalities in Large Vision Language Models | Mar 25, 2025 | GPU, Image Captioning | —Unverified | 0 | 0 |
| Improved Baselines for Data-efficient Perceptual Augmentation of LLMs | Mar 20, 2024 | Audio captioning, Image Captioning | —Unverified | 0 | 0 |
| GeoRSMLLM: A Multimodal Large Language Model for Vision-Language Tasks in Geoscience and Remote Sensing | Mar 16, 2025 | Change Detection, Image Captioning | —Unverified | 0 | 0 |
| Improved Bilinear Pooling with CNNs | Jul 21, 2017 | GPU, Question Answering | —Unverified | 0 | 0 |
| Unified Multimodal Pre-training and Prompt-based Tuning for Vision-Language Understanding and Generation | Dec 10, 2021 | Image-text matching, Image-text Retrieval | —Unverified | 0 | 0 |
| Are we asking the right questions in MovieQA? | Nov 8, 2019 | Question Answering, Visual Question Answering | —Unverified | 0 | 0 |
| Improving and Diagnosing Knowledge-Based Visual Question Answering via Entity Enhanced Knowledge Injection | Dec 13, 2021 | Common Sense Reasoning, Knowledge Graph Embeddings | —Unverified | 0 | 0 |
| Improving Automatic VQA Evaluation Using Large Language Models | Oct 4, 2023 | In-Context Learning, Question Answering | —Unverified | 0 | 0 |
| Improving Cross-Modal Understanding in Visual Dialog via Contrastive Learning | Apr 15, 2022 | Contrastive Learning, Question Answering | —Unverified | 0 | 0 |
| Improving Data Augmentation for Robust Visual Question Answering with Effective Curriculum Learning | Jan 28, 2024 | Data Augmentation, Question Answering | —Unverified | 0 | 0 |
| Improving mitosis detection on histopathology images using large vision-language models | Oct 11, 2023 | Domain Generalization, Image Captioning | —Unverified | 0 | 0 |
| Improving Multi-modal Large Language Model through Boosting Vision Capabilities | Oct 17, 2024 | Decoder, Language Modeling | —Unverified | 0 | 0 |
| GeoPix: Multi-Modal Large Language Model for Pixel-level Image Understanding in Remote Sensing | Jan 12, 2025 | Image Captioning, Language Modeling | —Unverified | 0 | 0 |