Language-Informed Visual Concept Learning Dec 6, 2023 Disentanglement Novel Concepts
Language Guided Visual Question Answering: Elevate Your Multimodal Language Model Using Knowledge-Enriched Prompts Oct 31, 2023 Image Captioning Language Modeling
Language Prior Is Not the Only Shortcut: A Benchmark for Shortcut Learning in VQA Oct 10, 2022 Question Answering Visual Question Answering
LaKo: Knowledge-driven Visual Question Answering via Late Knowledge-to-Text Injection Jul 26, 2022 Decoder Knowledge Graphs
LaPA: Latent Prompt Assist Model For Medical Visual Question Answering Apr 19, 2024 Medical Visual Question Answering Question Answering
BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs Mar 2, 2023 Articles Medical Visual Question Answering
Learning Situation Hyper-Graphs for Video Question Answering Apr 18, 2023 Decoder Question Answering
Change Detection Meets Visual Question Answering Dec 12, 2021 Answer Generation Change Detection
A Good Prompt Is Worth Millions of Parameters: Low-resource Prompt-based Learning for Vision-Language Models Oct 16, 2021 Image Captioning Language Modeling
Disentangling 3D Prototypical Networks For Few-Shot Concept Learning Nov 6, 2020 3D geometry 3D Object Detection
Don't Take the Easy Way Out: Ensemble Based Methods for Avoiding Known Dataset Biases Sep 9, 2019 Natural Language Inference Question Answering
Distilled Dual-Encoder Model for Vision-Language Understanding Dec 16, 2021 Image to text model
DocFormerv2: Local Features for Document Understanding Jun 2, 2023 Decoder document understanding
Check It Again: Progressive Visual Question Answering via Visual Entailment Jun 8, 2021 Question Answering Visual Entailment
Check It Again: Progressive Visual Question Answering via Visual Entailment Aug 1, 2021 Question Answering Visual Entailment
ChipQA: No-Reference Video Quality Prediction via Space-Time Chips Sep 17, 2021 Video Quality Assessment Visual Question Answering (VQA)
ChiQA: A Large Scale Image-based Real-World Question Answering Dataset for Multi-Modal Understanding Aug 5, 2022 Image Retrieval Question Answering
DocVQA: A Dataset for VQA on Document Images Jul 1, 2020 Question Answering Reading Comprehension
Kosmos-2: Grounding Multimodal Large Language Models to the World Jun 26, 2023 Image Captioning In-Context Learning
Detecting and Preventing Hallucinations in Large Vision Language Models Aug 11, 2023 16k Hallucination
Describe Anything Model for Visual Question Answering on Text-rich Images Jul 16, 2025 Descriptive Language Modeling
Detecting Hate Speech in Multi-modal Memes Dec 29, 2020 Binary Classification Hate Speech Detection
Dual-Key Multimodal Backdoors for Visual Question Answering Dec 14, 2021 Question Answering Visual Question Answering
Comprehensive Visual Question Answering on Point Clouds through Compositional Scene Manipulation Dec 22, 2021 Common Sense Reasoning Question Answering
CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning Dec 20, 2016 Diagnostic Question Answering
KVQ: Boosting Video Quality Assessment via Saliency-guided Local Perception Mar 13, 2025 Video Quality Assessment Visual Question Answering (VQA)
KAT: A Knowledge Augmented Transformer for Vision-and-Language Dec 16, 2021 Answer Generation Decoder
CLEVR-Math: A Dataset for Compositional Language, Visual and Mathematical Reasoning Aug 10, 2022 Math Mathematical Reasoning
ActiView: Evaluating Active Perception Ability for Multimodal Large Language Models Oct 7, 2024 Question Answering Visual Question Answering
Light-VQA: A Multi-Dimensional Quality Assessment Model for Low-Light Video Enhancement May 16, 2023 Video Enhancement Video Quality Assessment
Deep Multimodal Neural Architecture Search Apr 25, 2020 Decoder Image-text matching
AIGV-Assessor: Benchmarking and Evaluating the Perceptual Quality of Text-to-Video Generation with LMM Nov 26, 2024 Benchmarking Text-to-Video Generation
Knowledge-Based Video Question Answering with Unsupervised Scene Descriptions Jul 17, 2020 Question Answering Video Question Answering
JDocQA: Japanese Document Question Answering Dataset for Generative Language Models Mar 28, 2024 Hallucination Question Answering
DeVLBert: Learning Deconfounded Visio-Linguistic Representations Aug 16, 2020 Image Retrieval Question Answering
Just Ask: Learning to Answer Questions from Millions of Narrated Videos Dec 1, 2020 Question Answering Question Generation
Knowledge-Routed Visual Question Reasoning: Challenges for Deep Representation Embedding Dec 14, 2020 Question Answering Visual Question Answering
Declaration-based Prompt Tuning for Visual Question Answering May 5, 2022 Image-text matching Language Modeling
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks Dec 21, 2023 Image Retrieval Image-to-Text Retrieval
Clover: Towards A Unified Video-Language Alignment and Fusion Model Jul 16, 2022 Language Modeling Language Modelling
Coarse-to-Fine Reasoning for Visual Question Answering Oct 6, 2021 Question Answering Visual Question Answering
Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone Jun 15, 2022 Described Object Detection Image Captioning
Interpreting Chest X-rays Like a Radiologist: A Benchmark with Clinical Reasoning May 29, 2025 Diagnostic Question Answering
End-to-end Document Recognition and Understanding with Dessurt Mar 30, 2022 document understanding Visual Question Answering (VQA)
CAT-ViL: Co-Attention Gated Vision-Language Embedding for Visual Question Localized-Answering in Robotic Surgery Jul 11, 2023 Question Answering Scene Understanding
Efficient Vision-Language Pretraining with Visual Concepts and Hierarchical Alignment Aug 29, 2022 cross-modal alignment Image-text Retrieval
COBRA: Contrastive Bi-Modal Representation Algorithm May 7, 2020 Cross-Modal Retrieval Image Captioning
CoCa: Contrastive Captioners are Image-Text Foundation Models May 4, 2022 Action Classification Decoder
MAPL: Parameter-Efficient Adaptation of Unimodal Pre-Trained Models for Vision-Language Few-Shot Prompting Oct 13, 2022 Image Captioning Question Answering
Debiased Visual Question Answering from Feature and Sample Perspectives Dec 1, 2021 Bias Detection Question Answering