Meta-Learning via Classifier(-free) Diffusion Guidance | Oct 17, 2022 | Few-Shot Learning; Image Generation | Code available
MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models | Mar 23, 2023 | Auxiliary Learning; Multimodal Sentiment Analysis | Code available
NLX-GPT: A Model for Natural Language Explanations in Vision and Vision-Language Tasks | Mar 9, 2022 | Decision Making; Explainable artificial intelligence | Code available
Multi-Modal Answer Validation for Knowledge-Based VQA | Mar 23, 2021 | Question Answering; Retrieval | Code available
Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts | Nov 16, 2021 | Cross-Modal Retrieval; Image Captioning | Code available
Multi-modal Auto-regressive Modeling via Visual Words | Mar 12, 2024 | Visual Question Answering (VQA) | Code available
MuKEA: Multimodal Knowledge Extraction and Accumulation for Knowledge-based Visual Question Answering | Mar 17, 2022 | Implicit Relations; Question Answering | Code available
Fast Prompt Alignment for Text-to-Image Generation | Dec 11, 2024 | Image Generation; In-Context Learning | Code available
Multi-Agent VQA: Exploring Multi-Agent Foundation Models in Zero-Shot Visual Question Answering | Mar 21, 2024 | Object Detection | Code available
Multimodal ChatGPT for Medical Applications: an Experimental Study of GPT-4V | Oct 29, 2023 | Diagnostic; Language Modeling | Code available
mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections | May 24, 2022 | Computational Efficiency; cross-modal alignment | Code available
EarthVQA: Towards Queryable Earth via Relational Reasoning-Based Remote Sensing Visual Question Answering | Dec 19, 2023 | Object; Object Counting | Code available
Explaining Autonomous Driving Actions with Visual Question Answering | Jul 19, 2023 | Autonomous Driving; Autonomous Vehicles | Code available
FAVER: Blind Quality Prediction of Variable Frame Rate Videos | Jan 5, 2022 | Cloud Computing; Video Quality Assessment | Code available
Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations | Nov 21, 2022 | Contrastive Learning; Representation Learning | Code available
Dynamic Language Binding in Relational Visual Reasoning | Apr 30, 2020 | Object; Question Answering | Code available
GPT-4V-AD: Exploring Grounding Potential of VQA-oriented GPT-4V for Zero-shot Anomaly Detection | Nov 5, 2023 | Anomaly Detection; Question Answering | Code available
Multimodal Co-Attention Transformer for Survival Prediction in Gigapixel Whole Slide Images | Jan 1, 2021 | Attribute; Multiple Instance Learning | Code available
ERNIE-Layout: Layout Knowledge Enhanced Pre-training for Visually-rich Document Understanding | Oct 12, 2022 | Document Image Classification | Code available
eP-ALM: Efficient Perceptual Augmentation of Language Models | Mar 20, 2023 | In-Context Learning; Visual Question Answering (VQA) | Code available
MMUnlearner: Reformulating Multimodal Machine Unlearning in the Era of Multimodal Large Language Models | Feb 16, 2025 | Language Modeling | Code available
Enhancing Vision-Language Pre-Training with Jointly Learned Questioner and Dense Captioner | May 19, 2023 | Dense Captioning; Image Captioning | Code available
Enhancing Visual Question Answering through Question-Driven Image Captions as Prompts | Apr 12, 2024 | Image Captioning; Question Answering | Code available
MM-Skin: Enhancing Dermatology Vision-Language Model with an Image-Text Dataset Derived from Textbooks | May 9, 2025 | Diagnostic; Instruction Following | Code available
Modular Visual Question Answering via Code Generation | Jun 8, 2023 | Code Generation; In-Context Learning | Code available
MMFT-BERT: Multimodal Fusion Transformer with BERT Encodings for Visual Question Answering | Oct 27, 2020 | Diagnostic; Question Answering | Code available
Are Bias Mitigation Techniques for Deep Learning Effective? | Apr 1, 2021 | Deep Learning; Question Answering | Code available
MMNeuron: Discovering Neuron-Level Domain-Specific Interpretation in Multimodal Large Language Model | Jun 17, 2024 | Language Modeling | Code available
Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering | Jul 25, 2017 | Image Captioning; Visual Question Answering | Code available
Evaluating Multimodal Representations on Visual Semantic Textual Similarity | Apr 4, 2020 | Benchmarking; Image Captioning | Code available
DualVGR: A Dual-Visual Graph Reasoning Unit for Video Question Answering | Jul 10, 2021 | Graph Attention; Question Answering | Code available
Break It Down: A Question Understanding Benchmark | Jan 31, 2020 | Open-Domain Question Answering; Question Answering | Code available
Expert Knowledge-Aware Image Difference Graph Representation Learning for Difference-Aware Medical Visual Question Answering | Jul 22, 2023 | Graph Representation Learning; Language Modeling | Code available
MLP Architectures for Vision-and-Language Modeling: An Empirical Study | Dec 8, 2021 | Language Modeling | Code available
MixPHM: Redundancy-Aware Parameter-Efficient Tuning for Low-Resource Visual Question Answering | Mar 2, 2023 | Mixture-of-Experts; Question Answering | Code available
MMBERT: Multimodal BERT Pretraining for Improved Medical VQA | Apr 3, 2021 | Language Modeling | Code available
Exploring Opinion-unaware Video Quality Assessment with Semantic Affinity Criterion | Feb 26, 2023 | Video Quality Assessment; Visual Question Answering (VQA) | Code available
Boosting the Power of Small Multimodal Reasoning Models to Match Larger Models with Self-Consistency Training | Nov 23, 2023 | Multimodal Reasoning; Science Question Answering | Code available
Bridging the Gap between 2D and 3D Visual Question Answering: A Fusion Approach for 3D VQA | Feb 24, 2024 | 3D Question Answering (3D-QA); Question Answering | Code available
Faithful Multimodal Explanation for Visual Question Answering | Sep 8, 2018 | Explanatory Visual Question Answering; Question Answering | Code available
Dual-Key Multimodal Backdoors for Visual Question Answering | Dec 14, 2021 | Question Answering; Visual Question Answering | Code available
FaceBench: A Multi-View Multi-Level Facial Attribute VQA Dataset for Benchmarking Face Perception MLLMs | Mar 27, 2025 | Attribute; Benchmarking | Code available
Efficient Vision-Language Pretraining with Visual Concepts and Hierarchical Alignment | Aug 29, 2022 | cross-modal alignment; Image-text Retrieval | Code available
EHRXQA: A Multi-Modal Question Answering Dataset for Electronic Health Records with Chest X-ray Images | Oct 28, 2023 | Decision Making; Medical Visual Question Answering | Code available
FloodNet: A High Resolution Aerial Imagery Dataset for Post Flood Scene Understanding | Dec 5, 2020 | Image Classification | Code available
FFAA: Multimodal Large Language Model based Explainable Open-World Face Forgery Analysis Assistant | Aug 19, 2024 | Descriptive; Face Swapping | Code available
FiLM: Visual Reasoning with a General Conditioning Layer | Sep 22, 2017 | Image Retrieval with Multi-Modal Query; Visual Question Answering (VQA) | Code available
End-to-end Knowledge Retrieval with Multi-modal Queries | Jun 1, 2023 | Benchmarking; Cross-Modal Retrieval | Code available
HIDRO-VQA: High Dynamic Range Oracle for Video Quality Assessment | Nov 18, 2023 | Video Quality Assessment; Visual Question Answering (VQA) | Code available
End-to-end Document Recognition and Understanding with Dessurt | Mar 30, 2022 | document understanding; Visual Question Answering (VQA) | Code available