A Unified End-to-End Retriever-Reader Framework for Knowledge-based VQA | Jun 30, 2022 | Question Answering, Retrieval
A Dataset and Baselines for Visual Question Answering on Art | Aug 28, 2020 | Question Answering, Question Generation
AMD-Hummingbird: Towards an Efficient Text-to-Video Model | Mar 24, 2025 | Computational Efficiency, Video Generation
Hierarchical Conditional Relation Networks for Video Question Answering | Feb 25, 2020 | Audio-Visual Question Answering (AVQA), Question Answering
HAAR: Text-Conditioned Generative Model of 3D Strand-based Human Hairstyles | Dec 18, 2023 | Question Answering, Visual Question Answering
CrossGET: Cross-Guided Ensemble of Tokens for Accelerating Vision-Language Transformers | May 27, 2023 | Image Captioning, Image Retrieval
UniTAB: Unifying Text and Box Outputs for Grounded Vision-Language Modeling | Nov 23, 2021 | Image Captioning, Image Description
Debiasing Multimodal Models via Causal Information Minimization | Nov 28, 2023 | Visual Question Answering (VQA)
Analysis of Video Quality Datasets via Design of Minimalistic Video Quality Models | Jul 26, 2023 | Image Quality Assessment, No-Reference Image Quality Assessment
Debiased Visual Question Answering from Feature and Sample Perspectives | Dec 1, 2021 | Bias Detection, Question Answering
Decoupled Seg Tokens Make Stronger Reasoning Video Segmenter and Grounder | Jun 28, 2025 | Image Segmentation, Large Language Model
Declaration-based Prompt Tuning for Visual Question Answering | May 5, 2022 | Image-text matching, Language Modeling
Awaker2.5-VL: Stably Scaling MLLMs with Parameter-Efficient Mixture of Experts | Nov 16, 2024 | Mixture-of-Experts, Optical Character Recognition (OCR)
HallE-Control: Controlling Object Hallucination in Large Multimodal Models | Oct 3, 2023 | Attribute, Decoder
Graph Optimal Transport for Cross-Domain Alignment | Jun 26, 2020 | Graph Matching, Image Captioning
CRAFT: A Benchmark for Causal Reasoning About Forces and inTeractions | Dec 8, 2020 | counterfactual, Descriptive
BadCM: Invisible Backdoor Attack Against Cross-Modal Learning | Oct 3, 2024 | Backdoor Attack, Cross-Modal Retrieval
IllusionVQA: A Challenging Optical Illusion Dataset for Vision Language Models | Mar 23, 2024 | Common Sense Reasoning, In-Context Learning
Detecting and Preventing Hallucinations in Large Vision Language Models | Aug 11, 2023 | 16k, Hallucination
Describe Anything Model for Visual Question Answering on Text-rich Images | Jul 16, 2025 | Descriptive, Language Modeling
Greedy Gradient Ensemble for Robust Visual Question Answering | Jul 27, 2021 | Question Answering, Visual Question Answering
In Defense of Grid Features for Visual Question Answering | Jan 10, 2020 | Image Captioning, Question Answering
Visual Grounding Methods for VQA are Working for the Wrong Reasons! | Apr 12, 2020 | Question Answering, Visual Grounding
A Comparison of Pre-trained Vision-and-Language Models for Multimodal Representation Learning across Medical Images and Reports | Sep 3, 2020 | Image-text Retrieval, Medical Visual Question Answering
An Empirical Analysis on Spatial Reasoning Capabilities of Large Multimodal Models | Nov 9, 2024 | object-detection, Object Detection
2BiVQA: Double Bi-LSTM based Video Quality Assessment of UGC Videos | Aug 31, 2022 | Video Quality Assessment, Visual Question Answering (VQA)
3D-Aware Visual Question Answering about Parts, Poses and Occlusions | Oct 27, 2023 | Question Answering, Visual Question Answering
Attention in Reasoning: Dataset, Analysis, and Modeling | Apr 20, 2022 | Question Answering, Visual Question Answering
Benchmarking Vision Language Model Unlearning via Fictitious Facial Identity Dataset | Nov 5, 2024 | Benchmarking, Language Modeling
Investigating Prompting Techniques for Zero- and Few-Shot Visual Question Answering | Jun 16, 2023 | Image Captioning, Question Answering
A Comprehensive Evaluation of GPT-4V on Knowledge-Intensive Visual Question Answering | Nov 13, 2023 | Decision Making, Explanation Generation
Disentangling 3D Prototypical Networks For Few-Shot Concept Learning | Nov 6, 2020 | 3D geometry, 3D Object Detection
An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling | Sep 4, 2022 | Fill Mask, Optical Flow Estimation
Knowledge-Routed Visual Question Reasoning: Challenges for Deep Representation Embedding | Dec 14, 2020 | Question Answering, Visual Question Answering
An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA | Sep 10, 2021 | Image Captioning, Question Answering
DualVGR: A Dual-Visual Graph Reasoning Unit for Video Question Answering | Jul 10, 2021 | Graph Attention, Question Answering
GRIT: General Robust Image Task Benchmark | Apr 28, 2022 | Instance Segmentation, Keypoint Detection
An Empirical Study of Multimodal Model Merging | Apr 28, 2023 | model, Retrieval
Beyond Task Performance: Evaluating and Reducing the Flaws of Large Multimodal Models with In-Context Learning | Oct 1, 2023 | In-Context Learning, Instruction Following
An Empirical Study of Training End-to-End Vision-and-Language Transformers | Nov 3, 2021 | Cross-Modal Retrieval, Decoder
How Much Can CLIP Benefit Vision-and-Language Tasks? | Jul 13, 2021 | Question Answering, Vision and Language Navigation
Bilateral Cross-Modality Graph Matching Attention for Feature Fusion in Visual Question Answering | Dec 14, 2021 | Graph Matching, Question Answering
Inter-GPS: Interpretable Geometry Problem Solving with Formal Language and Symbolic Reasoning | May 10, 2021 | Arithmetic Reasoning, Geometry Problem Solving
Learning to Answer Visual Questions from Web Videos | May 10, 2022 | Dataset Generation, Question Answering
Don't Take the Easy Way Out: Ensemble Based Methods for Avoiding Known Dataset Biases | Sep 9, 2019 | Natural Language Inference, Question Answering
Language Prior Is Not the Only Shortcut: A Benchmark for Shortcut Learning in VQA | Oct 10, 2022 | Question Answering, Visual Question Answering
Going Full-TILT Boogie on Document Understanding with Text-Image-Layout Transformer | Feb 18, 2021 | Decoder, Document Image Classification
Blindly Assess Quality of In-the-Wild Videos via Quality-aware Pre-training and Motion Perception | Aug 19, 2021 | Action Recognition, Image Quality Assessment
BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs | Mar 2, 2023 | Articles, Medical Visual Question Answering
Attention-Based Context Aware Reasoning for Situation Recognition | Jun 1, 2020 | Action Recognition, Fine-grained Action Recognition