Learning to Discretely Compose Reasoning Module Networks for Video Captioning Jul 17, 2020 Decoder Question Answering
Light-VQA: A Multi-Dimensional Quality Assessment Model for Low-Light Video Enhancement May 16, 2023 Video Enhancement Video Quality Assessment
Don't Take the Easy Way Out: Ensemble Based Methods for Avoiding Known Dataset Biases Sep 9, 2019 Natural Language Inference Question Answering
Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning Mar 20, 2017 Deep Reinforcement Learning reinforcement-learning
Does Vision-and-Language Pretraining Improve Lexical Grounding? Sep 21, 2021 Question Answering Visual Question Answering
Align before Fuse: Vision and Language Representation Learning with Momentum Distillation Jul 16, 2021 Cross-Modal Retrieval Grounded language learning
FiLM: Visual Reasoning with a General Conditioning Layer Sep 22, 2017 Image Retrieval with Multi-Modal Query Visual Question Answering (VQA)
Combo of Thinking and Observing for Outside-Knowledge VQA May 10, 2023 Decoder Question Answering
LIVE: Learnable In-Context Vector for Visual Question Answering Jun 19, 2024 In-Context Learning Question Answering
Learning Situation Hyper-Graphs for Video Question Answering Apr 18, 2023 Decoder Question Answering
DocFormerv2: Local Features for Document Understanding Jun 2, 2023 Decoder document understanding
Enhancing Vision-Language Pre-Training with Jointly Learned Questioner and Dense Captioner May 19, 2023 Dense Captioning Image Captioning
LaTr: Layout-Aware Transformer for Scene-Text VQA Dec 23, 2021 Optical Character Recognition (OCR) Question Answering
FlowLearn: Evaluating Large Vision-Language Models on Flowchart Understanding Jul 6, 2024 Optical Character Recognition (OCR) Visual Question Answering (VQA)
FoodieQA: A Multimodal Dataset for Fine-Grained Understanding of Chinese Food Culture Jun 16, 2024 Diversity Multiple-choice
Distilled Dual-Encoder Model for Vision-Language Understanding Dec 16, 2021 Image to text model
OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge May 31, 2019 object-detection Object Detection
DocVQA: A Dataset for VQA on Document Images Jul 1, 2020 Question Answering Reading Comprehension
Layout and Task Aware Instruction Prompt for Zero-shot Document Image Question Answering Jun 1, 2023 Optical Character Recognition (OCR) Question Answering
From the Least to the Most: Building a Plug-and-Play Visual Reasoner via Data Synthesis Jun 28, 2024 Visual Question Answering (VQA) Visual Reasoning
Learning to Answer Questions in Dynamic Audio-Visual Scenarios Mar 26, 2022 audio-visual learning Audio-visual Question Answering
Light-VQA+: A Video Quality Assessment Model for Exposure Correction with Vision-Language Guidance May 6, 2024 Exposure Correction Video Enhancement
Maintaining Reasoning Consistency in Compositional Visual Question Answering Jan 1, 2022 Question Answering Visual Question Answering
mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections May 24, 2022 Computational Efficiency cross-modal alignment
ConceptBert: Concept-Aware Representation for Visual Question Answering Nov 1, 2020 Common Sense Reasoning Question Answering
Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts Feb 17, 2021 Caption Generation Diversity
Language Prior Is Not the Only Shortcut: A Benchmark for Shortcut Learning in VQA Oct 10, 2022 Question Answering Visual Question Answering
Generalizing from SIMPLE to HARD Visual Reasoning: Can We Mitigate Modality Imbalance in VLMs? Jan 5, 2025 Image Captioning Image to text
Language-Informed Visual Concept Learning Dec 6, 2023 Disentanglement Novel Concepts
Content-Rich AIGC Video Quality Assessment via Intricate Text Alignment and Motion-Aware Consistency Feb 6, 2025 Video Generation Video Quality Assessment
Multi-Agent VQA: Exploring Multi-Agent Foundation Models in Zero-Shot Visual Question Answering Mar 21, 2024 object-detection Object Detection
ConTEXTual Net: A Multimodal Vision-Language Model for Segmentation of Pneumothorax Mar 2, 2023 Descriptive Image Captioning
LaPA: Latent Prompt Assist Model For Medical Visual Question Answering Apr 19, 2024 Medical Visual Question Answering Question Answering
Can Pre-trained Vision and Language Models Answer Visual Information-Seeking Questions? Feb 23, 2023 Open-Domain Question Answering Question Answering
Contrast and Classify: Training Robust VQA Models Oct 13, 2020 Contrastive Learning Data Augmentation
2BiVQA: Double Bi-LSTM based Video Quality Assessment of UGC Videos Aug 31, 2022 Video Quality Assessment Visual Question Answering (VQA)
GeoLLaVA-8K: Scaling Remote-Sensing Multimodal Large Language Models to 8K Resolution May 27, 2025 8k Avg
A-OKVQA: A Benchmark for Visual Question Answering using World Knowledge Jun 3, 2022 Question Answering Visual Question Answering
Language Guided Visual Question Answering: Elevate Your Multimodal Language Model Using Knowledge-Enriched Prompts Oct 31, 2023 Image Captioning Language Modeling
Multimodal Inverse Cloze Task for Knowledge-based Visual Question Answering Jan 11, 2023 Question Answering Reading Comprehension
Large Language Models are Temporal and Causal Reasoners for Video Question Answering Oct 24, 2023 Natural Language Understanding Question Answering
Can I Trust Your Answer? Visually Grounded Video Question Answering Sep 4, 2023 Grounded Video Question Answering Question Answering
Counterfactual Samples Synthesizing and Training for Robust Visual Question Answering Oct 3, 2021 counterfactual Diagnostic
Counterfactual Samples Synthesizing for Robust Visual Question Answering Mar 14, 2020 counterfactual Question Answering
Label-Descriptive Patterns and Their Application to Characterizing Classification Errors Oct 18, 2021 Descriptive named-entity-recognition
Counterfactual VQA: A Cause-Effect Look at Language Bias Jun 8, 2020 Causal Inference counterfactual
GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering Feb 25, 2019 Question Answering Visual Question Answering (VQA)
Multi-modal Understanding and Generation for Medical Images and Text via Vision-Language Pre-Training May 24, 2021 Image Captioning Medical Visual Question Answering
Detecting Hate Speech in Multi-modal Memes Dec 29, 2020 Binary Classification Hate Speech Detection
DeVLBert: Learning Deconfounded Visio-Linguistic Representations Aug 16, 2020 Image Retrieval Question Answering