| How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites | Apr 25, 2024 | 4k, Language Modeling | —Unverified | 0 |
| How good are deep models in understanding the generated images? | Aug 23, 2022 | Object, Object Recognition | —Unverified | 0 |
| How Much Can CLIP Benefit Vision-and-Language Tasks? | Sep 29, 2021 | Question Answering, Visual Entailment | —Unverified | 0 |
| How to Design Sample and Computationally Efficient VQA Models | Mar 22, 2021 | Question Answering, Visual Question Answering | —Unverified | 0 |
| How to find a good image-text embedding for remote sensing visual question answering? | Sep 24, 2021 | Question Answering, Visual Question Answering | —Unverified | 0 |
| How Transferable are Reasoning Patterns in VQA? | Apr 8, 2021 | Question Answering, Visual Question Answering | —Unverified | 0 |
| How Vision-Language Tasks Benefit from Large Pre-trained Models: A Survey | Dec 11, 2024 | Image Captioning, Question Answering | —Unverified | 0 |
| How Well Can Vison-Language Models Understand Humans' Intention? An Open-ended Theory of Mind Question Evaluation Benchmark | Mar 28, 2025 | Question Answering, Visual Question Answering | —Unverified | 0 |
| HRVQA: A Visual Question Answering Benchmark for High-Resolution Aerial Images | Jan 23, 2023 | Attribute, Question Answering | —Unverified | 0 |
| Human-Adversarial Visual Question Answering | Jun 4, 2021 | Question Answering, Visual Question Answering | —Unverified | 0 |
| Human Attention in Visual Question Answering: Do Humans and Deep Networks Look at the Same Regions? | Jun 17, 2016 | Question Answering, Visual Question Answering | —Unverified | 0 |
| Human Attention in Visual Question Answering: Do Humans and Deep Networks Look at the Same Regions? | Jun 11, 2016 | Question Answering, Visual Question Answering | —Unverified | 0 |
| Human-centered Interactive Learning via MLLMs for Text-to-Image Person Re-identification | May 21, 2025 | Data Augmentation, Large Language Model | —Unverified | 0 |
| Human Mobility Question Answering (Vision Paper) | Oct 2, 2023 | Management, Question Answering | —Unverified | 0 |
| Hummingbird: High Fidelity Image Generation via Multimodal Context Alignment | Feb 7, 2025 | Diversity, Human-Object Interaction Detection | —Unverified | 0 |
| Hyperbolic Attention Networks | May 24, 2018 | Machine Translation, Question Answering | —Unverified | 0 |
| Hyper-dimensional computing for a visual question-answering system that is trainable end-to-end | Nov 28, 2017 | Question Answering, Visual Question Answering | —Unverified | 0 |
| Hypo3D: Exploring Hypothetical Reasoning in 3D | Feb 2, 2025 | Question Answering, Visual Question Answering | —Unverified | 0 |
| ICDAR 2019 Competition on Scene Text Visual Question Answering | Jun 30, 2019 | Question Answering, Visual Question Answering | —Unverified | 0 |
| i-Code Studio: A Configurable and Composable Framework for Integrative AI | May 23, 2023 | Question Answering, Retrieval | —Unverified | 0 |
| ILLUME: Illuminating Your LLMs to See, Draw, and Self-Enhance | Dec 9, 2024 | Image Generation, Language Modeling | —Unverified | 0 |
| CLIPPO: Image-and-Language Understanding from Pixels Only | Dec 15, 2022 | Contrastive Learning, image-classification | —Unverified | 0 |
| Image as a Foreign Language: BEiT Pretraining for Vision and Vision-Language Tasks | Jan 1, 2023 | Cross-Modal Retrieval, Image Captioning | —Unverified | 0 |
| Image Captioning and Visual Question Answering Based on Attributes and External Knowledge | Mar 9, 2016 | General Knowledge, Image Captioning | —Unverified | 0 |
| Image Captioning with Compositional Neural Module Networks | Jul 10, 2020 | Image Captioning, Question Answering | —Unverified | 0 |
| Image Manipulation via Multi-Hop Instructions -- A New Dataset and Weakly-Supervised Neuro-Symbolic Approach | May 23, 2023 | Image Manipulation, Question Answering | —Unverified | 0 |
| Image-of-Thought Prompting for Visual Reasoning Refinement in Multimodal Large Language Models | May 22, 2024 | Multimodal Reasoning, Visual Question Answering | —Unverified | 0 |
| Image Position Prediction in Multimodal Documents | May 1, 2020 | Articles, Caption Generation | —Unverified | 0 |
| Image Semantic Relation Generation | Oct 19, 2022 | Image Retrieval, Image Segmentation | —Unverified | 0 |
| ImageTTR: Grounding Type Theory with Records in Image Classification for Visual Question Answering | Jun 1, 2019 | General Classification, image-classification | —Unverified | 0 |
| Img-Diff: Contrastive Data Synthesis for Multimodal Large Language Models | Aug 8, 2024 | Contrastive Learning, Fine-Grained Image Recognition | —Unverified | 0 |
| Imperfect Vision Encoders: Efficient and Robust Tuning for Vision-Language Models | Jul 23, 2024 | Computational Efficiency, Image Captioning | —Unverified | 0 |
| Improved Alignment of Modalities in Large Vision Language Models | Mar 25, 2025 | GPU, Image Captioning | —Unverified | 0 |
| Improved Baselines for Data-efficient Perceptual Augmentation of LLMs | Mar 20, 2024 | Audio captioning, Image Captioning | —Unverified | 0 |
| Improved Bilinear Pooling with CNNs | Jul 21, 2017 | GPU, Question Answering | —Unverified | 0 |
| Improving and Diagnosing Knowledge-Based Visual Question Answering via Entity Enhanced Knowledge Injection | Dec 13, 2021 | Common Sense Reasoning, Knowledge Graph Embeddings | —Unverified | 0 |
| Improving Automatic VQA Evaluation Using Large Language Models | Oct 4, 2023 | In-Context Learning, Question Answering | —Unverified | 0 |
| Improving Cross-Modal Understanding in Visual Dialog via Contrastive Learning | Apr 15, 2022 | Contrastive Learning, Question Answering | —Unverified | 0 |
| Improving Data Augmentation for Robust Visual Question Answering with Effective Curriculum Learning | Jan 28, 2024 | Data Augmentation, Question Answering | —Unverified | 0 |
| Improving mitosis detection on histopathology images using large vision-language models | Oct 11, 2023 | Domain Generalization, Image Captioning | —Unverified | 0 |
| Improving Multi-modal Large Language Model through Boosting Vision Capabilities | Oct 17, 2024 | Decoder, Language Modeling | —Unverified | 0 |
| Improving Users' Mental Model with Attention-directed Counterfactual Edits | Oct 13, 2021 | counterfactual, Question Answering | —Unverified | 0 |
| Improving Visual Question Answering by Referring to Generated Paragraph Captions | Jun 14, 2019 | Decoder, Image Captioning | —Unverified | 0 |
| Improving Visual Question Answering Models through Robustness Analysis and In-Context Learning with a Chain of Basic Questions | Apr 6, 2023 | In-Context Learning, Question Answering | —Unverified | 0 |
| Improving VQA and its Explanations by Comparing Competing Explanations | Jun 28, 2020 | Question Answering, Visual Question Answering | —Unverified | 0 |
| Incorporating External Knowledge to Answer Open-Domain Visual Questions with Dynamic Memory Networks | Dec 3, 2017 | Question Answering, Visual Question Answering | —Unverified | 0 |
| In Factuality: Efficient Integration of Relevant Facts for Visual Question Answering | Aug 1, 2021 | Question Answering, Visual Question Answering | —Unverified | 0 |
| InfiMM-HD: A Leap Forward in High-Resolution Multimodal Understanding | Mar 3, 2024 | Visual Question Answering | —Unverified | 0 |
| InfographicVQA | Apr 26, 2021 | Question Answering, Visual Question Answering | —Unverified | 0 |
| Inquire, Interact, and Integrate: A Proactive Agent Collaborative Framework for Zero-Shot Multimodal Medical Reasoning | May 19, 2024 | Multimodal Reasoning, Question Answering | —Unverified | 0 |