| AlignVE: Visual Entailment Recognition Based on Alignment Relations | Nov 16, 2022 | Question Answering, Relation | Unverified | 0 |
| PromptCap: Prompt-Guided Task-Aware Image Captioning | Nov 15, 2022 | Image Captioning, Language Modelling | Code Available | 1 |
| MapQA: A Dataset for Question Answering on Choropleth Maps | Nov 15, 2022 | Articles, Question Answering | Code Available | 1 |
| Visually Grounded VQA by Lattice-based Retrieval | Nov 15, 2022 | Information Retrieval, Question Answering | Code Available | 0 |
| MF2-MVQA: A Multi-stage Feature Fusion method for Medical Visual Question Answering | Nov 11, 2022 | Medical Visual Question Answering, Question Answering | Unverified | 0 |
| Towards Reasoning-Aware Explainable VQA | Nov 9, 2022 | Decoder, Explanation Generation | Unverified | 0 |
| Visual Named Entity Linking: A New Dataset and A Baseline | Nov 9, 2022 | Entity Linking, Image Retrieval | Code Available | 1 |
| ERNIE-UniX2: A Unified Cross-lingual Cross-modal Framework for Understanding and Generation | Nov 9, 2022 | Contrastive Learning, Decoder | Unverified | 0 |
| What's Different between Visual Question Answering for Machine "Understanding" Versus for Accessibility? | Oct 26, 2022 | Benchmarking, Question Answering | Code Available | 0 |
| Generalization Differences between End-to-End and Neuro-Symbolic Vision-Language Reasoning Systems | Oct 26, 2022 | Question Answering, Visual Question Answering | Unverified | 0 |
| Compressing And Debiasing Vision-Language Pre-Trained Models for Visual Question Answering | Oct 26, 2022 | Question Answering, Visual Question Answering | Code Available | 0 |
| Learning by Hallucinating: Vision-Language Pre-training with Weak Supervision | Oct 24, 2022 | cross-modal alignment, Cross-Modal Retrieval | Unverified | 0 |
| VLC-BERT: Visual Question Answering with Contextualized Commonsense Knowledge | Oct 24, 2022 | Question Answering, Visual Question Answering | Code Available | 1 |
| RSVG: Exploring Data and Models for Visual Grounding on Remote Sensing Data | Oct 23, 2022 | Image Captioning, Image-text Retrieval | Unverified | 0 |
| PoseScript: Linking 3D Human Poses and Natural Language | Oct 21, 2022 | Cross-Modal Retrieval, Image Captioning | Code Available | 2 |
| Image Semantic Relation Generation | Oct 19, 2022 | Image Retrieval, Image Segmentation | Unverified | 0 |
| CPL: Counterfactual Prompt Learning for Vision and Language Models | Oct 19, 2022 | counterfactual, image-classification | Unverified | 0 |
| Aligning MAGMA by Few-Shot Learning and Finetuning | Oct 18, 2022 | Few-Shot Learning, Image Captioning | Unverified | 0 |
| Entity-Focused Dense Passage Retrieval for Outside-Knowledge Visual Question Answering | Oct 18, 2022 | Passage Retrieval, Question Answering | Unverified | 0 |
| Plug-and-Play VQA: Zero-shot VQA by Conjoining Large Pretrained Models with Zero Training | Oct 17, 2022 | Image Captioning, Network Interpretation | Code Available | 0 |
| Vision-Language Pre-training: Basics, Recent Advances, and Future Trends | Oct 17, 2022 | Few-Shot Learning, Image Captioning | Code Available | 3 |
| MAPL: Parameter-Efficient Adaptation of Unimodal Pre-Trained Models for Vision-Language Few-Shot Prompting | Oct 13, 2022 | Image Captioning, Question Answering | Code Available | 1 |
| SlotFormer: Unsupervised Visual Dynamics Simulation with Object-Centric Models | Oct 12, 2022 | Object, Question Answering | Code Available | 1 |
| ERNIE-Layout: Layout Knowledge Enhanced Pre-training for Visually-rich Document Understanding | Oct 12, 2022 | document-image-classification, Document Image Classification | Code Available | 1 |
| MAP: Multimodal Uncertainty-Aware Vision-Language Pre-training Model | Oct 11, 2022 | Contrastive Learning, Image-text matching | Code Available | 1 |
| Multi-Modal Fusion Transformer for Visual Question Answering in Remote Sensing | Oct 10, 2022 | Question Answering, Representation Learning | Unverified | 0 |
| Language Prior Is Not the Only Shortcut: A Benchmark for Shortcut Learning in VQA | Oct 10, 2022 | Question Answering, Visual Question Answering | Code Available | 1 |
| Towards Robust Visual Question Answering: Making the Most of Biased Samples via Contrastive Learning | Oct 10, 2022 | Contrastive Learning, Question Answering | Code Available | 1 |
| MAMO: Masked Multimodal Modeling for Fine-Grained Vision-Language Representation Learning | Oct 9, 2022 | Image-text Retrieval, multimodal interaction | Unverified | 0 |
| Retrieval Augmented Visual Question Answering with Outside Knowledge | Oct 7, 2022 | Answer Generation, Diagnostic | Code Available | 2 |
| On the Effects of Video Grounding on Language Models | Oct 1, 2022 | Image Captioning, Question Answering | Unverified | 0 |
| Dual Capsule Attention Mask Network with Mutual Learning for Visual Question Answering | Oct 1, 2022 | Question Answering, Visual Question Answering | Unverified | 0 |
| A Dual-Attention Learning Network with Word and Sentence Embedding for Medical Visual Question Answering | Oct 1, 2022 | Medical Visual Question Answering, Question Answering | Code Available | 0 |
| Task Formulation Matters When Learning Continually: A Case Study in Visual Question Answering | Sep 30, 2022 | Continual Learning, Question Answering | Code Available | 0 |
| Linearly Mapping from Image to Text Space | Sep 30, 2022 | Image Captioning, Image to text | Code Available | 1 |
| TVLT: Textless Vision-Language Transformer | Sep 28, 2022 | Automatic Speech Recognition (ASR), Image Retrieval | Code Available | 1 |
| RepsNet: Combining Vision with Language for Automated Medical Reports | Sep 27, 2022 | Contrastive Learning, Decoder | Unverified | 0 |
| Towards Explainable 3D Grounded Visual Question Answering: A New Benchmark and Strong Baseline | Sep 24, 2022 | Question Answering, Visual Question Answering | Code Available | 1 |
| Exploring Modulated Detection Transformer as a Tool for Action Recognition in Videos | Sep 21, 2022 | Action Detection, Action Recognition | Code Available | 0 |
| Toward 3D Spatial Reasoning for Human-like Text-based Visual Question Answering | Sep 21, 2022 | Image Captioning, Optical Character Recognition (OCR) | Unverified | 0 |
| Continual VQA for Disaster Response Systems | Sep 21, 2022 | Disaster Response, Management | Code Available | 0 |
| Overcoming Language Priors in Visual Question Answering via Distinguishing Superficially Similar Instances | Sep 18, 2022 | Attribute, Question Answering | Code Available | 0 |
| LAVIS: A Library for Language-Vision Intelligence | Sep 15, 2022 | Benchmarking, Image Captioning | Unverified | 0 |
| Correlation Information Bottleneck: Towards Adapting Pretrained Multimodal Models for Robust Visual Question Answering | Sep 14, 2022 | Adversarial Robustness, Question Answering | Unverified | 0 |
| MUST-VQA: MUltilingual Scene-text VQA | Sep 14, 2022 | Question Answering, Visual Question Answering | Unverified | 0 |
| PaLI: A Jointly-Scaled Multilingual Language-Image Model | Sep 14, 2022 | Decoder, Few-Shot Image Classification | Unverified | 0 |
| PreSTU: Pre-Training for Scene-Text Understanding | Sep 12, 2022 | Decoder, Image Captioning | Unverified | 0 |
| MaXM: Towards Multilingual Visual Question Answering | Sep 12, 2022 | Question Answering, Translation | Code Available | 1 |
| Pre-training image-language transformers for open-vocabulary tasks | Sep 9, 2022 | Question Answering, Visual Entailment | Unverified | 0 |
| Improving the Cross-Lingual Generalisation in Visual Question Answering | Sep 7, 2022 | Cross-Lingual Transfer, Question Answering | Code Available | 0 |