Investigating Prompting Techniques for Zero- and Few-Shot Visual Question Answering Jun 16, 2023 Image Captioning Question Answering
Introspective Distillation for Robust Question Answering Nov 1, 2021 counterfactual Inductive Bias
JDocQA: Japanese Document Question Answering Dataset for Generative Language Models Mar 28, 2024 Hallucination Question Answering
Fast Prompt Alignment for Text-to-Image Generation Dec 11, 2024 Image Generation In-Context Learning
AssistQ: Affordance-centric Question-driven Task Completion for Egocentric Assistant Mar 8, 2022 Visual Question Answering (VQA)
Interpreting Chest X-rays Like a Radiologist: A Benchmark with Clinical Reasoning May 29, 2025 Diagnostic Question Answering
Just Ask: Learning to Answer Questions from Millions of Narrated Videos Dec 1, 2020 Question Answering Question Generation
DeVLBert: Learning Deconfounded Visio-Linguistic Representations Aug 16, 2020 Image Retrieval Question Answering
Align before Fuse: Vision and Language Representation Learning with Momentum Distillation Jul 16, 2021 Cross-Modal Retrieval Grounded language learning
Detecting Hate Speech in Multi-modal Memes Dec 29, 2020 Binary Classification Hate Speech Detection
DocVQA: A Dataset for VQA on Document Images Jul 1, 2020 Question Answering Reading Comprehension
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks Dec 21, 2023 Image Retrieval Image-to-Text Retrieval
KAT: A Knowledge Augmented Transformer for Vision-and-Language Dec 16, 2021 Answer Generation Decoder
Language-Informed Visual Concept Learning Dec 6, 2023 Disentanglement Novel Concepts
Align and Prompt: Video-and-Language Pre-training with Entity Prompts Dec 17, 2021 cross-modal alignment Entity Alignment
InfMLLM: A Unified Framework for Visual-Language Tasks Nov 12, 2023 GPU Image Captioning
Describe Anything Model for Visual Question Answering on Text-rich Images Jul 16, 2025 Descriptive Language Modeling
Deep Multimodal Neural Architecture Search Apr 25, 2020 Decoder Image-text matching
In Defense of Grid Features for Visual Question Answering Jan 10, 2020 Image Captioning Question Answering
IMPACT: A Large-scale Integrated Multimodal Patent Analysis and Creation Dataset for Design Patents Dec 10, 2024 Cross-Modal Retrieval Image Classification
IllusionVQA: A Challenging Optical Illusion Dataset for Vision Language Models Mar 23, 2024 Common Sense Reasoning In-Context Learning
IconQA: A New Benchmark for Abstract Diagram Understanding and Visual Language Reasoning Oct 25, 2021 Arithmetic Reasoning Mathematical Question Answering
AIM 2024 Challenge on Compressed Video Quality Assessment: Methods and Results Aug 21, 2024 Image Manipulation valid
IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and Languages Jan 27, 2022 Cross-Modal Retrieval Few-Shot Learning
Improving Selective Visual Question Answering by Learning from Your Peers Jun 14, 2023 Question Answering Visual Question Answering
Instruction-Guided Visual Masking May 30, 2024 Instruction Following Visual Grounding
Declaration-based Prompt Tuning for Visual Question Answering May 5, 2022 Image-text matching Language Modeling
AIGV-Assessor: Benchmarking and Evaluating the Perceptual Quality of Text-to-Video Generation with LMM Nov 26, 2024 Benchmarking Text-to-Video Generation
ActiView: Evaluating Active Perception Ability for Multimodal Large Language Models Oct 7, 2024 Question Answering Visual Question Answering
AI2-THOR: An Interactive 3D Environment for Visual AI Dec 14, 2017 Deep Reinforcement Learning Imitation Learning
Debiasing Multimodal Models via Causal Information Minimization Nov 28, 2023 Visual Question Answering (VQA)
Hypergraph Transformer: Weakly-supervised Multi-hop Reasoning for Knowledge-based Visual Question Answering Apr 22, 2022 Question Answering Visual Question Answering
Does Vision-and-Language Pretraining Improve Lexical Grounding? Sep 21, 2021 Question Answering Visual Question Answering
Detecting and Preventing Hallucinations in Large Vision Language Models Aug 11, 2023 16k Hallucination
Debiased Visual Question Answering from Feature and Sample Perspectives Dec 1, 2021 Bias Detection Question Answering
DataEnvGym: Data Generation Agents in Teacher Environments with Student Feedback Oct 8, 2024 Math Sequential Decision Making
Decoupled Seg Tokens Make Stronger Reasoning Video Segmenter and Grounder Jun 28, 2025 Image Segmentation Large Language Model
HYDRA: A Hyper Agent for Dynamic Compositional Visual Reasoning Mar 19, 2024 Reinforcement Learning (RL) Visual Grounding
I Can't Believe There's No Images! Learning Visual Tasks Using only Language Supervision Nov 17, 2022 Image Captioning Question Answering
Inter-GPS: Interpretable Geometry Problem Solving with Formal Language and Symbolic Reasoning May 10, 2021 Arithmetic Reasoning Geometry Problem Solving
Language Prior Is Not the Only Shortcut: A Benchmark for Shortcut Learning in VQA Oct 10, 2022 Question Answering Visual Question Answering
Cross-modal Retrieval for Knowledge-based Visual Question Answering Jan 11, 2024 Cross-Modal Retrieval Question Answering
A Hitchhikers Guide to Fine-Grained Face Forgery Detection Using Common Sense Reasoning Oct 1, 2024 Common Sense Reasoning DeepFake Detection
Hierarchical multimodal transformers for Multi-Page DocVQA Dec 7, 2022 Decoder Question Answering
Cross-Modality Relevance for Reasoning on Language and Vision May 12, 2020 Question Answering Visual Question Answering
Hierarchical Conditional Relation Networks for Video Question Answering Feb 25, 2020 Audio-Visual Question Answering (AVQA) Question Answering
Hierarchical Question-Image Co-Attention for Visual Question Answering May 31, 2016 Visual Dialog Visual Question Answering
A Good Prompt Is Worth Millions of Parameters: Low-resource Prompt-based Learning for Vision-Language Models Oct 16, 2021 Image Captioning Language Modeling
Are Vision Language Models Ready for Clinical Diagnosis? A 3D Medical Benchmark for Tumor-centric Visual Question Answering May 25, 2025 Anatomy Benchmarking
Harnessing the Power of Multi-Task Pretraining for Ground-Truth Level Natural Language Explanations Dec 8, 2022 Explanation Generation Visual Entailment