eP-ALM: Efficient Perceptual Augmentation of Language Models Mar 20, 2023 In-Context Learning Visual Question Answering (VQA)
Large Scale Multimodal Classification Using an Ensemble of Transformer Models and Co-Attention Nov 23, 2020 Classification General Classification
MicroVQA: A Multimodal Reasoning Benchmark for Microscopy-Based Scientific Research Mar 17, 2025 Articles Benchmarking
Large-Scale Adversarial Training for Vision-and-Language Representation Learning Jun 11, 2020 Image-text Retrieval Question Answering
End-to-end Knowledge Retrieval with Multi-modal Queries Jun 1, 2023 Benchmarking Cross-Modal Retrieval
Deep Multimodal Neural Architecture Search Apr 25, 2020 Decoder Image-text matching
Can I Trust Your Answer? Visually Grounded Video Question Answering Sep 4, 2023 Grounded Video Question Answering Question Answering
LaTr: Layout-Aware Transformer for Scene-Text VQA Dec 23, 2021 Optical Character Recognition (OCR) Question Answering
Benchmarking Vision Language Model Unlearning via Fictitious Facial Identity Dataset Nov 5, 2024 Benchmarking Language Modeling
LIVE: Learnable In-Context Vector for Visual Question Answering Jun 19, 2024 In-Context Learning Question Answering
Skipping Computations in Multimodal LLMs Oct 12, 2024 Question Answering Visual Question Answering
Describe Anything Model for Visual Question Answering on Text-rich Images Jul 16, 2025 Descriptive Language Modeling
End-to-end Document Recognition and Understanding with Dessurt Mar 30, 2022 document understanding Visual Question Answering (VQA)
ERNIE-Layout: Layout Knowledge Enhanced Pre-training for Visually-rich Document Understanding Oct 12, 2022 document-image-classification Document Image Classification
Sparse Continuous Distributions and Fenchel-Young Losses Aug 4, 2021 Audio Classification Question Answering
Detecting and Preventing Hallucinations in Large Vision Language Models Aug 11, 2023 16k Hallucination
MediConfusion: Can you trust your AI radiologist? Probing the reliability of multimodal medical foundation models Sep 23, 2024 Medical Visual Question Answering Question Answering
An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling Sep 4, 2022 Fill Mask Optical Flow Estimation
DeVLBert: Learning Deconfounded Visio-Linguistic Representations Aug 16, 2020 Image Retrieval Question Answering
StableVQA: A Deep No-Reference Quality Assessment Model for Video Stability Aug 9, 2023 Optical Flow Estimation Video Quality Assessment
An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA Sep 10, 2021 Image Captioning Question Answering
MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models Mar 23, 2023 Auxiliary Learning Multimodal Sentiment Analysis
Learning Situation Hyper-Graphs for Video Question Answering Apr 18, 2023 Decoder Question Answering
Learning to Answer Questions in Dynamic Audio-Visual Scenarios Mar 26, 2022 audio-visual learning Audio-visual Question Answering
Beyond Question-Based Biases: Assessing Multimodal Shortcut Learning in Visual Question Answering Apr 7, 2021 Question Answering Visual Question Answering
Super-CLEVR: A Virtual Benchmark to Diagnose Domain Robustness in Visual Reasoning Dec 1, 2022 Domain Generalization Question Answering
An Empirical Study of Multimodal Model Merging Apr 28, 2023 model Retrieval
Surgical-VQA: Visual Question Answering in Surgical Scenes using Transformer Jun 22, 2022 Question Answering Sentence
Mimic In-Context Learning for Multimodal Tasks Apr 11, 2025 In-Context Learning Visual Question Answering (VQA)
Learning to Discretely Compose Reasoning Module Networks for Video Captioning Jul 17, 2020 Decoder Question Answering
An Empirical Study of Training End-to-End Vision-and-Language Transformers Nov 3, 2021 Cross-Modal Retrieval Decoder
LiveXiv -- A Multi-Modal Live Benchmark Based on Arxiv Papers Content Oct 14, 2024 Visual Question Answering (VQA) World Knowledge
Calibrating Concepts and Operations: Towards Symbolic Reasoning on Real Images Oct 1, 2021 Question Answering Visual Question Answering
EHRXQA: A Multi-Modal Question Answering Dataset for Electronic Health Records with Chest X-ray Images Oct 28, 2023 Decision Making Medical Visual Question Answering
MedBLIP: Bootstrapping Language-Image Pre-training from 3D Medical Images and Texts May 18, 2023 Medical Visual Question Answering Question Answering
Bilateral Cross-Modality Graph Matching Attention for Feature Fusion in Visual Question Answering Dec 14, 2021 Graph Matching Question Answering
Efficient Vision-Language Pretraining with Visual Concepts and Hierarchical Alignment Aug 29, 2022 cross-modal alignment Image-text Retrieval
Answer Mining from a Pool of Images: Towards Retrieval-Based Visual Question Answering Jun 29, 2023 Answer Generation Question Answering
MedAgentBoard: Benchmarking Multi-Agent Collaboration with Conventional Methods for Diverse Medical Tasks May 18, 2025 Benchmarking Medical Visual Question Answering
Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling Feb 11, 2021 Question Answering Retrieval
MedCoT: Medical Chain of Thought via Hierarchical Expert Dec 18, 2024 Diagnostic Medical Visual Question Answering
Light-VQA+: A Video Quality Assessment Model for Exposure Correction with Vision-Language Guidance May 6, 2024 Exposure Correction Video Enhancement
DocFormerv2: Local Features for Document Understanding Jun 2, 2023 Decoder document understanding
TextCoT: Zoom In for Enhanced Multimodal Text-Rich Image Understanding Apr 15, 2024 Question Answering Visual Question Answering (VQA)
MD-VQA: Multi-Dimensional Quality Assessment for UGC Live Videos Mar 27, 2023 Video Quality Assessment Visual Question Answering (VQA)
EarthVQA: Towards Queryable Earth via Relational Reasoning-Based Remote Sensing Visual Question Answering Dec 19, 2023 Object Object Counting
3DMIT: 3D Multi-modal Instruction Tuning for Scene Understanding Jan 6, 2024 Scene Understanding Visual Question Answering (VQA)
DocVQA: A Dataset for VQA on Document Images Jul 1, 2020 Question Answering Reading Comprehension
MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding Apr 26, 2021 Generalized Referring Expression Comprehension Phrase Grounding
Mind Your Outliers! Investigating the Negative Impact of Outliers on Active Learning for Visual Question Answering Jul 6, 2021 Active Learning Object Recognition