| How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites | Apr 25, 2024 | 4k, Language Modeling | —Unverified | 0 |
| How good are deep models in understanding the generated images? | Aug 23, 2022 | Object, Object Recognition | —Unverified | 0 |
| How Much Can CLIP Benefit Vision-and-Language Tasks? | Sep 29, 2021 | Question Answering, Visual Entailment | —Unverified | 0 |
| How to Design Sample and Computationally Efficient VQA Models | Mar 22, 2021 | Question Answering, Visual Question Answering | —Unverified | 0 |
| How to find a good image-text embedding for remote sensing visual question answering? | Sep 24, 2021 | Question Answering, Visual Question Answering | —Unverified | 0 |
| How Transferable are Reasoning Patterns in VQA? | Apr 8, 2021 | Question Answering, Visual Question Answering | —Unverified | 0 |
| How Vision-Language Tasks Benefit from Large Pre-trained Models: A Survey | Dec 11, 2024 | Image Captioning, Question Answering | —Unverified | 0 |
| How Well Can Vision-Language Models Understand Humans' Intention? An Open-ended Theory of Mind Question Evaluation Benchmark | Mar 28, 2025 | Question Answering, Visual Question Answering | —Unverified | 0 |
| HRVQA: A Visual Question Answering Benchmark for High-Resolution Aerial Images | Jan 23, 2023 | Attribute, Question Answering | —Unverified | 0 |
| Human-Adversarial Visual Question Answering | Jun 4, 2021 | Question Answering, Visual Question Answering | —Unverified | 0 |
| Human Attention in Visual Question Answering: Do Humans and Deep Networks Look at the Same Regions? | Jun 17, 2016 | Question Answering, Visual Question Answering | —Unverified | 0 |
| Human Attention in Visual Question Answering: Do Humans and Deep Networks Look at the Same Regions? | Jun 11, 2016 | Question Answering, Visual Question Answering | —Unverified | 0 |
| Human-centered Interactive Learning via MLLMs for Text-to-Image Person Re-identification | May 21, 2025 | Data Augmentation, Large Language Model | —Unverified | 0 |
| Human Mobility Question Answering (Vision Paper) | Oct 2, 2023 | Management, Question Answering | —Unverified | 0 |
| Hummingbird: High Fidelity Image Generation via Multimodal Context Alignment | Feb 7, 2025 | Diversity, Human-Object Interaction Detection | —Unverified | 0 |
| Hyperbolic Attention Networks | May 24, 2018 | Machine Translation, Question Answering | —Unverified | 0 |
| Hyper-dimensional computing for a visual question-answering system that is trainable end-to-end | Nov 28, 2017 | Question Answering, Visual Question Answering | —Unverified | 0 |
| Hypo3D: Exploring Hypothetical Reasoning in 3D | Feb 2, 2025 | Question Answering, Visual Question Answering | —Unverified | 0 |
| ICDAR 2019 Competition on Scene Text Visual Question Answering | Jun 30, 2019 | Question Answering, Visual Question Answering | —Unverified | 0 |
| i-Code Studio: A Configurable and Composable Framework for Integrative AI | May 23, 2023 | Question Answering, Retrieval | —Unverified | 0 |
| ILLUME: Illuminating Your LLMs to See, Draw, and Self-Enhance | Dec 9, 2024 | Image Generation, Language Modeling | —Unverified | 0 |
| CLIPPO: Image-and-Language Understanding from Pixels Only | Dec 15, 2022 | Contrastive Learning, image-classification | —Unverified | 0 |
| Image as a Foreign Language: BEiT Pretraining for Vision and Vision-Language Tasks | Jan 1, 2023 | Cross-Modal Retrieval, Image Captioning | —Unverified | 0 |
| Image Captioning and Visual Question Answering Based on Attributes and External Knowledge | Mar 9, 2016 | General Knowledge, Image Captioning | —Unverified | 0 |
| Image Captioning with Compositional Neural Module Networks | Jul 10, 2020 | Image Captioning, Question Answering | —Unverified | 0 |