IMPACT: A Large-scale Integrated Multimodal Patent Analysis and Creation Dataset for Design Patents Dec 10, 2024 Cross-Modal Retrieval Image Classification
Instruction-Guided Visual Masking May 30, 2024 Instruction Following Visual Grounding
How to Configure Good In-Context Sequence for Visual Question Answering Dec 4, 2023 In-Context Learning Question Answering
Meta-Learning via Classifier(-free) Diffusion Guidance Oct 17, 2022 Few-Shot Learning Image Generation
How Much Can CLIP Benefit Vision-and-Language Tasks? Jul 13, 2021 Question Answering Vision and Language Navigation
AssistQ: Affordance-centric Question-driven Task Completion for Egocentric Assistant Mar 8, 2022 Visual Question Answering (VQA)
How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs Nov 27, 2023 Adversarial Robustness Visual Question Answering (VQA)
HYDRA: A Hyper Agent for Dynamic Compositional Visual Reasoning Mar 19, 2024 Reinforcement Learning (RL) Visual Grounding
Hierarchical Conditional Relation Networks for Video Question Answering Feb 25, 2020 Audio-Visual Question Answering (AVQA) Question Answering
Align before Fuse: Vision and Language Representation Learning with Momentum Distillation Jul 16, 2021 Cross-Modal Retrieval Grounded language learning
Hierarchical multimodal transformers for Multi-Page DocVQA Dec 7, 2022 Decoder Question Answering
Closed Loop Neural-Symbolic Learning via Integrating Neural Perception, Grammar Parsing, and Symbolic Reasoning Jun 11, 2020 Question Answering Reinforcement Learning (RL)
HIDRO-VQA: High Dynamic Range Oracle for Video Quality Assessment Nov 18, 2023 Video Quality Assessment Visual Question Answering (VQA)
Hierarchical Question-Image Co-Attention for Visual Question Answering May 31, 2016 Visual Dialog Visual Question Answering
Hypergraph Transformer: Weakly-supervised Multi-hop Reasoning for Knowledge-based Visual Question Answering Apr 22, 2022 Question Answering Visual Question Answering
Inter-GPS: Interpretable Geometry Problem Solving with Formal Language and Symbolic Reasoning May 10, 2021 Arithmetic Reasoning Geometry Problem Solving
CLEVR-Math: A Dataset for Compositional Language, Visual and Mathematical Reasoning Aug 10, 2022 Math Mathematical Reasoning
Align and Prompt: Video-and-Language Pre-training with Entity Prompts Dec 17, 2021 cross-modal alignment Entity Alignment
HAAR: Text-Conditioned Generative Model of 3D Strand-based Human Hairstyles Dec 18, 2023 Question Answering Visual Question Answering
GRIT: General Robust Image Task Benchmark Apr 28, 2022 Instance Segmentation Keypoint Detection
Greedy Gradient Ensemble for Robust Visual Question Answering Jul 27, 2021 Question Answering Visual Question Answering
HallE-Control: Controlling Object Hallucination in Large Multimodal Models Oct 3, 2023 Attribute Decoder
CLEVR-X: A Visual Reasoning Dataset for Natural Language Explanations Apr 5, 2022 Explanation Generation Question Answering
Comprehensive Visual Question Answering on Point Clouds through Compositional Scene Manipulation Dec 22, 2021 Common Sense Reasoning Question Answering
Graphhopper: Multi-Hop Scene Graph Reasoning for Visual Question Answering Jul 13, 2021 Navigate Question Answering
Classification-Regression for Chart Comprehension Nov 29, 2021 Chart Question Answering Classification
AIM 2024 Challenge on Compressed Video Quality Assessment: Methods and Results Aug 21, 2024 Image Manipulation valid
CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning Dec 20, 2016 Diagnostic Question Answering
Graph Optimal Transport for Cross-Domain Alignment Jun 26, 2020 Graph Matching Image Captioning
AIGV-Assessor: Benchmarking and Evaluating the Perceptual Quality of Text-to-Video Generation with LMM Nov 26, 2024 Benchmarking Text-to-Video Generation
GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering Feb 25, 2019 Question Answering Visual Question Answering (VQA)
ActiView: Evaluating Active Perception Ability for Multimodal Large Language Models Oct 7, 2024 Question Answering Visual Question Answering
AI2-THOR: An Interactive 3D Environment for Visual AI Dec 14, 2017 Deep Reinforcement Learning Imitation Learning
Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization Oct 7, 2016 General Classification Image Attribution
GeoLLaVA-8K: Scaling Remote-Sensing Multimodal Large Language Models to 8K Resolution May 27, 2025 8k Avg
Clover: Towards A Unified Video-Language Alignment and Fusion Model Jul 16, 2022 Language Modeling Language Modelling
Going Full-TILT Boogie on Document Understanding with Text-Image-Layout Transformer Feb 18, 2021 Decoder Document Image Classification
GraghVQA: Language-Guided Graph Neural Networks for Graph-based Visual Question Answering Apr 20, 2021 Graph Neural Network Graph Question Answering
Harnessing the Power of Multi-Task Pretraining for Ground-Truth Level Natural Language Explanations Dec 8, 2022 Explanation Generation Visual Entailment
GeneAnnotator: A Semi-automatic Annotation Tool for Visual Scene Graph Sep 6, 2021 Graph Generation Graph Learning
A Hitchhikers Guide to Fine-Grained Face Forgery Detection Using Common Sense Reasoning Oct 1, 2024 Common Sense Reasoning DeepFake Detection
Generalizing from SIMPLE to HARD Visual Reasoning: Can We Mitigate Modality Imbalance in VLMs? Jan 5, 2025 Image Captioning Image to text
FunQA: Towards Surprising Video Comprehension Jun 26, 2023 Question Answering Text Generation
Gemini Goes to Med School: Exploring the Capabilities of Multimodal Large Language Models on Medical Challenge Problems & Hallucinations Feb 10, 2024 Diagnostic Hallucination
Generative Bias for Robust Visual Question Answering Aug 1, 2022 Knowledge Distillation Question Answering
A Good Prompt Is Worth Millions of Parameters: Low-resource Prompt-based Learning for Vision-Language Models Oct 16, 2021 Image Captioning Language Modeling
Are Vision Language Models Ready for Clinical Diagnosis? A 3D Medical Benchmark for Tumor-centric Visual Question Answering May 25, 2025 Anatomy Benchmarking
FoodieQA: A Multimodal Dataset for Fine-Grained Understanding of Chinese Food Culture Jun 16, 2024 Diversity Multiple-choice
Check It Again: Progressive Visual Question Answering via Visual Entailment Aug 1, 2021 Question Answering Visual Entailment
Found a Reason for me? Weakly-supervised Grounded Visual Question Answering using Capsules May 11, 2021 Question Answering Visual Question Answering