Hierarchical multimodal transformers for Multi-Page DocVQA Dec 7, 2022 Decoder Question Answering
Super-CLEVR: A Virtual Benchmark to Diagnose Domain Robustness in Visual Reasoning Dec 1, 2022 Domain Generalization Question Answering
Seeing What You Miss: Vision-Language Pre-training with Semantic Completion Learning Nov 24, 2022 cross-modal alignment Image-text Retrieval
Self-supervised vision-language pretraining for Medical visual question answering Nov 24, 2022 Contrastive Learning Image-text matching
Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations Nov 21, 2022 Contrastive Learning Representation Learning
I Can't Believe There's No Images! Learning Visual Tasks Using only Language Supervision Nov 17, 2022 Image Captioning Question Answering
PromptCap: Prompt-Guided Task-Aware Image Captioning Nov 15, 2022 Image Captioning Language Modelling
MapQA: A Dataset for Question Answering on Choropleth Maps Nov 15, 2022 Articles Question Answering
Visual Named Entity Linking: A New Dataset and A Baseline Nov 9, 2022 Entity Linking Image Retrieval
VLC-BERT: Visual Question Answering with Contextualized Commonsense Knowledge Oct 24, 2022 Question Answering Visual Question Answering
Meta-Learning via Classifier(-free) Diffusion Guidance Oct 17, 2022 Few-Shot Learning Image Generation
MAPL: Parameter-Efficient Adaptation of Unimodal Pre-Trained Models for Vision-Language Few-Shot Prompting Oct 13, 2022 Image Captioning Question Answering
SlotFormer: Unsupervised Visual Dynamics Simulation with Object-Centric Models Oct 12, 2022 Object Question Answering
ERNIE-Layout: Layout Knowledge Enhanced Pre-training for Visually-rich Document Understanding Oct 12, 2022 document-image-classification Document Image Classification
MAP: Multimodal Uncertainty-Aware Vision-Language Pre-training Model Oct 11, 2022 Contrastive Learning Image-text matching
Towards Robust Visual Question Answering: Making the Most of Biased Samples via Contrastive Learning Oct 10, 2022 Contrastive Learning Question Answering
Language Prior Is Not the Only Shortcut: A Benchmark for Shortcut Learning in VQA Oct 10, 2022 Question Answering Visual Question Answering
Linearly Mapping from Image to Text Space Sep 30, 2022 Image Captioning Image to text
TVLT: Textless Vision-Language Transformer Sep 28, 2022 Automatic Speech Recognition (ASR) Image Retrieval
Towards Explainable 3D Grounded Visual Question Answering: A New Benchmark and Strong Baseline Sep 24, 2022 Question Answering Visual Question Answering
Panoramic Vision Transformer for Saliency Detection in 360° Videos Sep 19, 2022 Saliency Detection Saliency Prediction
MaXM: Towards Multilingual Visual Question Answering Sep 12, 2022 Question Answering Translation
An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling Sep 4, 2022 Fill Mask Optical Flow Estimation
2BiVQA: Double Bi-LSTM based Video Quality Assessment of UGC Videos Aug 31, 2022 Video Quality Assessment Visual Question Answering (VQA)
Efficient Vision-Language Pretraining with Visual Concepts and Hierarchical Alignment Aug 29, 2022 cross-modal alignment Image-text Retrieval
Symbolic Replay: Scene Graph as Prompt for Continual Learning on VQA Task Aug 24, 2022 Continual Learning Question Answering
CLEVR-Math: A Dataset for Compositional Language, Visual and Mathematical Reasoning Aug 10, 2022 Math Mathematical Reasoning
ChiQA: A Large Scale Image-based Real-World Question Answering Dataset for Multi-Modal Understanding Aug 5, 2022 Image Retrieval Question Answering
TAG: Boosting Text-VQA via Text-aware Visual Question-answer Generation Aug 3, 2022 Answer Generation Question-Answer-Generation
Generative Bias for Robust Visual Question Answering Aug 1, 2022 Knowledge Distillation Question Answering
Cross-Modal Causal Relational Reasoning for Event-Level Visual Question Answering Jul 26, 2022 Causal Inference Question Answering
LaKo: Knowledge-driven Visual Question Answering via Late Knowledge-to-Text Injection Jul 26, 2022 Decoder Knowledge Graphs
Rethinking Data Augmentation for Robust Visual Question Answering Jul 18, 2022 Data Augmentation Knowledge Distillation
Clover: Towards A Unified Video-Language Alignment and Fusion Model Jul 16, 2022 Language Modeling Language Modelling
Video Graph Transformer for Video Question Answering Jul 12, 2022 Question Answering Relation
ViQuAE, a Dataset for Knowledge-based Visual Question Answering about Named Entities Jul 11, 2022 Articles Few-Shot Learning
Weakly Supervised Grounding for VQA in Vision-Language Transformers Jul 5, 2022 Question Answering Representation Learning
A Unified End-to-End Retriever-Reader Framework for Knowledge-based VQA Jun 30, 2022 Question Answering Retrieval
Consistency-preserving Visual Question Answering in Medical Imaging Jun 27, 2022 Question Answering Visual Question Answering
Surgical-VQA: Visual Question Answering in Surgical Scenes using Transformer Jun 22, 2022 Question Answering Sentence
Zero-Shot Video Question Answering via Frozen Bidirectional Language Models Jun 16, 2022 Fill Mask Language Modeling
MixGen: A New Multi-Modal Data Augmentation Jun 16, 2022 Data Augmentation Image-text Retrieval
Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone Jun 15, 2022 Described Object Detection Image Captioning
A-OKVQA: A Benchmark for Visual Question Answering using World Knowledge Jun 3, 2022 Question Answering Visual Question Answering
REVIVE: Regional Visual Representation Matters in Knowledge-Based Visual Question Answering Jun 2, 2022 Question Answering Retrieval
mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections May 24, 2022 Computational Efficiency cross-modal alignment
PEVL: Position-enhanced Pre-training and Prompt Tuning for Vision-language Models May 23, 2022 Language Modeling Language Modelling
Learning to Answer Visual Questions from Web Videos May 10, 2022 Dataset Generation Question Answering
Declaration-based Prompt Tuning for Visual Question Answering May 5, 2022 Image-text matching Language Modeling
CoCa: Contrastive Captioners are Image-Text Foundation Models May 4, 2022 Action Classification Decoder