| Hierarchical Modeling for Medical Visual Question Answering with Cross-Attention Fusion | Apr 4, 2025 | Diagnostic, Medical Visual Question Answering | —Unverified | 0 | 0 |
| Grounded Knowledge-Enhanced Medical VLP for Chest X-Ray | Apr 23, 2024 | Medical Visual Question Answering, Question Answering | —Unverified | 0 | 0 |
| GRILL: Grounded Vision-language Pre-training via Aligning Text and Image Regions | May 24, 2023 | Object, Question Answering | —Unverified | 0 | 0 |
| What If We Recaption Billions of Web Images with LLaMA-3? | Jun 12, 2024 | Cross-Modal Retrieval, Image Generation | —Unverified | 0 | 0 |
| HOI-Ref: Hand-Object Interaction Referral in Egocentric Vision | Apr 15, 2024 | Object, Question Answering | —Unverified | 0 | 0 |
| Griffon-G: Bridging Vision-Language and Vision-Centric Tasks via Large Multimodal Models | Oct 21, 2024 | Instruction Following, object-detection | —Unverified | 0 | 0 |
| How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites | Apr 25, 2024 | 4k, Language Modeling | —Unverified | 0 | 0 |
| How good are deep models in understanding the generated images? | Aug 23, 2022 | Object, Object Recognition | —Unverified | 0 | 0 |
| Understanding Complexity in VideoQA via Visual Program Generation | May 19, 2025 | Code Generation, Question Answering | —Unverified | 0 | 0 |
| GraspCorrect: Robotic Grasp Correction via Vision-Language Model-Guided Feedback | Mar 19, 2025 | Language Modeling, Language Modelling | —Unverified | 0 | 0 |
| How Much Can CLIP Benefit Vision-and-Language Tasks? | Sep 29, 2021 | Question Answering, Visual Entailment | —Unverified | 0 | 0 |
| Graph-Structured Representations for Visual Question Answering | Sep 19, 2016 | Multiple-choice, Question Answering | —Unverified | 0 | 0 |
| How to Design Sample and Computationally Efficient VQA Models | Mar 22, 2021 | Question Answering, Visual Question Answering | —Unverified | 0 | 0 |
| Understanding in Artificial Intelligence | Jan 17, 2021 | Natural Language Understanding, Question Answering | —Unverified | 0 | 0 |
| How to find a good image-text embedding for remote sensing visual question answering? | Sep 24, 2021 | Question Answering, Visual Question Answering | —Unverified | 0 | 0 |
| How Transferable are Reasoning Patterns in VQA? | Apr 8, 2021 | Question Answering, Visual Question Answering | —Unverified | 0 | 0 |
| How Vision-Language Tasks Benefit from Large Pre-trained Models: A Survey | Dec 11, 2024 | Image Captioning, Question Answering | —Unverified | 0 | 0 |
| How Well Can Vison-Language Models Understand Humans' Intention? An Open-ended Theory of Mind Question Evaluation Benchmark | Mar 28, 2025 | Question Answering, Visual Question Answering | —Unverified | 0 | 0 |
| Understanding Information Storage and Transfer in Multi-modal Large Language Models | Jun 6, 2024 | Factual Visual Question Answering, Model Editing | —Unverified | 0 | 0 |
| HRVQA: A Visual Question Answering Benchmark for High-Resolution Aerial Images | Jan 23, 2023 | Attribute, Question Answering | —Unverified | 0 | 0 |
| Human-Adversarial Visual Question Answering | Jun 4, 2021 | Question Answering, Visual Question Answering | —Unverified | 0 | 0 |
| Human Attention in Visual Question Answering: Do Humans and Deep Networks Look at the Same Regions? | Jun 17, 2016 | Question Answering, Visual Question Answering | —Unverified | 0 | 0 |
| Human Attention in Visual Question Answering: Do Humans and Deep Networks Look at the Same Regions? | Jun 11, 2016 | Question Answering, Visual Question Answering | —Unverified | 0 | 0 |
| Human-centered Interactive Learning via MLLMs for Text-to-Image Person Re-identification | May 21, 2025 | Data Augmentation, Large Language Model | —Unverified | 0 | 0 |
| Human Mobility Question Answering (Vision Paper) | Oct 2, 2023 | Management, Question Answering | —Unverified | 0 | 0 |