| AlignVE: Visual Entailment Recognition Based on Alignment Relations | Nov 16, 2022 | Question Answering, Relation | Unverified | 0 |
| PromptCap: Prompt-Guided Task-Aware Image Captioning | Nov 15, 2022 | Image Captioning, Language Modelling | Code Available | 1 |
| MapQA: A Dataset for Question Answering on Choropleth Maps | Nov 15, 2022 | Articles, Question Answering | Code Available | 1 |
| Visually Grounded VQA by Lattice-based Retrieval | Nov 15, 2022 | Information Retrieval, Question Answering | Code Available | 0 |
| MF2-MVQA: A Multi-stage Feature Fusion method for Medical Visual Question Answering | Nov 11, 2022 | Medical Visual Question Answering, Question Answering | Unverified | 0 |
| Towards Reasoning-Aware Explainable VQA | Nov 9, 2022 | Decoder, Explanation Generation | Unverified | 0 |
| Visual Named Entity Linking: A New Dataset and A Baseline | Nov 9, 2022 | Entity Linking, Image Retrieval | Code Available | 1 |
| ERNIE-UniX2: A Unified Cross-lingual Cross-modal Framework for Understanding and Generation | Nov 9, 2022 | Contrastive Learning, Decoder | Unverified | 0 |
| What's Different between Visual Question Answering for Machine "Understanding" Versus for Accessibility? | Oct 26, 2022 | Benchmarking, Question Answering | Code Available | 0 |
| Generalization Differences between End-to-End and Neuro-Symbolic Vision-Language Reasoning Systems | Oct 26, 2022 | Question Answering, Visual Question Answering | Unverified | 0 |
| Compressing And Debiasing Vision-Language Pre-Trained Models for Visual Question Answering | Oct 26, 2022 | Question Answering, Visual Question Answering | Code Available | 0 |
| Learning by Hallucinating: Vision-Language Pre-training with Weak Supervision | Oct 24, 2022 | cross-modal alignment, Cross-Modal Retrieval | Unverified | 0 |
| VLC-BERT: Visual Question Answering with Contextualized Commonsense Knowledge | Oct 24, 2022 | Question Answering, Visual Question Answering | Code Available | 1 |
| RSVG: Exploring Data and Models for Visual Grounding on Remote Sensing Data | Oct 23, 2022 | Image Captioning, Image-text Retrieval | Unverified | 0 |
| PoseScript: Linking 3D Human Poses and Natural Language | Oct 21, 2022 | Cross-Modal Retrieval, Image Captioning | Code Available | 2 |
| Image Semantic Relation Generation | Oct 19, 2022 | Image Retrieval, Image Segmentation | Unverified | 0 |
| CPL: Counterfactual Prompt Learning for Vision and Language Models | Oct 19, 2022 | counterfactual, image-classification | Unverified | 0 |
| Aligning MAGMA by Few-Shot Learning and Finetuning | Oct 18, 2022 | Few-Shot Learning, Image Captioning | Unverified | 0 |
| Entity-Focused Dense Passage Retrieval for Outside-Knowledge Visual Question Answering | Oct 18, 2022 | Passage Retrieval, Question Answering | Unverified | 0 |
| Plug-and-Play VQA: Zero-shot VQA by Conjoining Large Pretrained Models with Zero Training | Oct 17, 2022 | Image Captioning, Network Interpretation | Code Available | 0 |
| Vision-Language Pre-training: Basics, Recent Advances, and Future Trends | Oct 17, 2022 | Few-Shot Learning, Image Captioning | Code Available | 3 |
| MAPL: Parameter-Efficient Adaptation of Unimodal Pre-Trained Models for Vision-Language Few-Shot Prompting | Oct 13, 2022 | Image Captioning, Question Answering | Code Available | 1 |
| SlotFormer: Unsupervised Visual Dynamics Simulation with Object-Centric Models | Oct 12, 2022 | Object, Question Answering | Code Available | 1 |
| ERNIE-Layout: Layout Knowledge Enhanced Pre-training for Visually-rich Document Understanding | Oct 12, 2022 | document-image-classification, Document Image Classification | Code Available | 1 |
| MAP: Multimodal Uncertainty-Aware Vision-Language Pre-training Model | Oct 11, 2022 | Contrastive Learning, Image-text matching | Code Available | 1 |