Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering Jul 25, 2017 Image Captioning Visual Question Answering
Harnessing the Power of Multi-Task Pretraining for Ground-Truth Level Natural Language Explanations Dec 8, 2022 Explanation Generation Visual Entailment
Graphhopper: Multi-Hop Scene Graph Reasoning for Visual Question Answering Jul 13, 2021 Navigate Question Answering
Graph Optimal Transport for Cross-Domain Alignment Jun 26, 2020 Graph Matching Image Captioning
Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization Oct 7, 2016 General Classification Image Attribution
An Empirical Study of Training End-to-End Vision-and-Language Transformers Nov 3, 2021 Cross-Modal Retrieval Decoder
GraghVQA: Language-Guided Graph Neural Networks for Graph-based Visual Question Answering Apr 20, 2021 Graph Neural Network Graph Question Answering
Greedy Gradient Ensemble for Robust Visual Question Answering Jul 27, 2021 Question Answering Visual Question Answering
HIDRO-VQA: High Dynamic Range Oracle for Video Quality Assessment Nov 18, 2023 Video Quality Assessment Visual Question Answering (VQA)
An Empirical Study of Multimodal Model Merging Apr 28, 2023 model Retrieval
An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA Sep 10, 2021 Image Captioning Question Answering
An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling Sep 4, 2022 Fill Mask Optical Flow Estimation
A Comprehensive Evaluation of GPT-4V on Knowledge-Intensive Visual Question Answering Nov 13, 2023 Decision Making Explanation Generation
Going Full-TILT Boogie on Document Understanding with Text-Image-Layout Transformer Feb 18, 2021 Decoder Document Image Classification
3D-Aware Visual Question Answering about Parts, Poses and Occlusions Oct 27, 2023 Question Answering Visual Question Answering
An Empirical Analysis on Spatial Reasoning Capabilities of Large Multimodal Models Nov 9, 2024 Object Detection
Visual Grounding Methods for VQA are Working for the Wrong Reasons! Apr 12, 2020 Question Answering Visual Grounding
A Comparison of Pre-trained Vision-and-Language Models for Multimodal Representation Learning across Medical Images and Reports Sep 3, 2020 Image-text Retrieval Medical Visual Question Answering
GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering Feb 25, 2019 Question Answering Visual Question Answering (VQA)
Hierarchical Conditional Relation Networks for Video Question Answering Feb 25, 2020 Audio-Visual Question Answering (AVQA) Question Answering
Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers Mar 29, 2021 Decoder Image Segmentation
Blindly Assess Quality of In-the-Wild Videos via Quality-aware Pre-training and Motion Perception Aug 19, 2021 Action Recognition Image Quality Assessment
Genixer: Empowering Multimodal Large Language Models as a Powerful Data Generator Dec 11, 2023 Image Captioning Question Answering
GeneAnnotator: A Semi-automatic Annotation Tool for Visual Scene Graph Sep 6, 2021 Graph Generation Graph Learning
Gemini Goes to Med School: Exploring the Capabilities of Multimodal Large Language Models on Medical Challenge Problems & Hallucinations Feb 10, 2024 Diagnostic Hallucination
Generalizing from SIMPLE to HARD Visual Reasoning: Can We Mitigate Modality Imbalance in VLMs? Jan 5, 2025 Image Captioning Image to text
Analysis of Video Quality Datasets via Design of Minimalistic Video Quality Models Jul 26, 2023 Image Quality Assessment No-Reference Image Quality Assessment
FunQA: Towards Surprising Video Comprehension Jun 26, 2023 Question Answering Text Generation
Generative Bias for Robust Visual Question Answering Aug 1, 2022 Knowledge Distillation Question Answering
GeoLLaVA-8K: Scaling Remote-Sensing Multimodal Large Language Models to 8K Resolution May 27, 2025 8k Avg
FoodieQA: A Multimodal Dataset for Fine-Grained Understanding of Chinese Food Culture Jun 16, 2024 Diversity Multiple-choice
AMD-Hummingbird: Towards an Efficient Text-to-Video Model Mar 24, 2025 Computational Efficiency Video Generation
Found a Reason for me? Weakly-supervised Grounded Visual Question Answering using Capsules May 11, 2021 Question Answering Visual Question Answering
A Dataset and Baselines for Visual Question Answering on Art Aug 28, 2020 Question Answering Question Generation
FlowLearn: Evaluating Large Vision-Language Models on Flowchart Understanding Jul 6, 2024 Optical Character Recognition (OCR) Visual Question Answering (VQA)
Fine-grained Image Classification and Retrieval by Combining Visual and Locally Pooled Textual Features Jan 14, 2020 Classification Diversity
FiLM: Visual Reasoning with a General Conditioning Layer Sep 22, 2017 Image Retrieval with Multi-Modal Query Visual Question Answering (VQA)
FloodNet: A High Resolution Aerial Imagery Dataset for Post Flood Scene Understanding Dec 5, 2020 Image Classification
Fast Prompt Alignment for Text-to-Image Generation Dec 11, 2024 Image Generation In-Context Learning
Boosting the Power of Small Multimodal Reasoning Models to Match Larger Models with Self-Consistency Training Nov 23, 2023 Multimodal Reasoning Science Question Answering
FAVER: Blind Quality Prediction of Variable Frame Rate Videos Jan 5, 2022 Cloud Computing Video Quality Assessment
2BiVQA: Double Bi-LSTM based Video Quality Assessment of UGC Videos Aug 31, 2022 Video Quality Assessment Visual Question Answering (VQA)
Attention in Reasoning: Dataset, Analysis, and Modeling Apr 20, 2022 Question Answering Visual Question Answering
FFAA: Multimodal Large Language Model based Explainable Open-World Face Forgery Analysis Assistant Aug 19, 2024 Descriptive Face Swapping
Florence: A New Foundation Model for Computer Vision Nov 22, 2021 Action Classification Action Recognition
Bilateral Cross-Modality Graph Matching Attention for Feature Fusion in Visual Question Answering Dec 14, 2021 Graph Matching Question Answering
From the Least to the Most: Building a Plug-and-Play Visual Reasoner via Data Synthesis Jun 28, 2024 Visual Question Answering (VQA) Visual Reasoning
Hierarchical multimodal transformers for Multi-Page DocVQA Dec 7, 2022 Decoder Question Answering
Expert Knowledge-Aware Image Difference Graph Representation Learning for Difference-Aware Medical Visual Question Answering Jul 22, 2023 Graph Representation Learning Language Modeling
Attention-Based Context Aware Reasoning for Situation Recognition Jun 1, 2020 Action Recognition Fine-grained Action Recognition