| Towards Unifying Medical Vision-and-Language Pre-training via Soft Prompts | Feb 17, 2023 | Image Retrieval, Image-text Classification | Code Available | 1 |
| Multimodal Federated Learning via Contrastive Representation Ensemble | Feb 17, 2023 | Federated Learning, Image-text Retrieval | Code Available | 1 |
| Differentiable Outlier Detection Enable Robust Deep Multimodal Analysis | Feb 11, 2023 | Image-text Retrieval, Knowledge Graphs | Code Available | 0 |
| Is Multimodal Vision Supervision Beneficial to Language? | Feb 10, 2023 | Image Retrieval, Natural Language Understanding | Code Available | 0 |
| Language Quantized AutoEncoders: Towards Unsupervised Text-Image Alignment | Feb 2, 2023 | Attribute, Few-Shot Image Classification | Code Available | 1 |
| mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video | Feb 1, 2023 | Action Classification, Image Classification | Code Available | 4 |
| Multimodality Representation Learning: A Survey on Evolution, Pretraining and Its Applications | Feb 1, 2023 | Question Answering, Representation Learning | Code Available | 1 |
| BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | Jan 30, 2023 | Generative Visual Question Answering, Image Captioning | Code Available | 4 |
| BinaryVQA: A Versatile Test Set to Evaluate the Out-of-Distribution Generalization of VQA Models | Jan 28, 2023 | Out-of-Distribution Generalization, Question Answering | Code Available | 0 |
| Towards a Unified Model for Generating Answers and Explanations in Visual Question Answering | Jan 25, 2023 | Decoder, Explanation Generation | Unverified | 0 |
| HRVQA: A Visual Question Answering Benchmark for High-Resolution Aerial Images | Jan 23, 2023 | Attribute, Question Answering | Unverified | 0 |
| Champion Solution for the WSDM2023 Toloka VQA Challenge | Jan 22, 2023 | Question Answering, Visual Grounding | Code Available | 3 |
| Towards Models that Can See and Read | Jan 18, 2023 | Decoder, Image Captioning | Unverified | 0 |
| Curriculum Script Distillation for Multilingual Visual Question Answering | Jan 17, 2023 | Question Answering, Visual Question Answering | Unverified | 0 |
| SlideVQA: A Dataset for Document Visual Question Answering on Multiple Images | Jan 12, 2023 | Evidence Selection, Question Answering | Code Available | 1 |
| Multimodal Inverse Cloze Task for Knowledge-based Visual Question Answering | Jan 11, 2023 | Question Answering, Reading Comprehension | Code Available | 1 |
| Adaptively Clustering Neighbor Elements for Image-Text Generation | Jan 5, 2023 | Clustering, Decoder | Code Available | 0 |
| PromptCap: Prompt-Guided Image Captioning for VQA with GPT-3 | Jan 1, 2023 | Image Captioning, Question Answering | Unverified | 0 |
| Variational Causal Inference Network for Explanatory Visual Question Answering | Jan 1, 2023 | Explanation Generation, Explanatory Visual Question Answering | Code Available | 1 |
| Decouple Before Interact: Multi-Modal Prompt Learning for Continual Visual Question Answering | Jan 1, 2023 | Continual Learning, Language Modelling | Unverified | 0 |
| Toward Multi-Granularity Decision-Making: Explicit Visual Reasoning with Hierarchical Knowledge | Jan 1, 2023 | Decision Making, Question Answering | Code Available | 0 |
| Image as a Foreign Language: BEiT Pretraining for Vision and Vision-Language Tasks | Jan 1, 2023 | Cross-Modal Retrieval, Image Captioning | Unverified | 0 |
| Exploring the Effect of Primitives for Compositional Generalization in Vision-and-Language | Jan 1, 2023 | Question Answering, Self-Supervised Learning | Code Available | 0 |
| RMLVQA: A Margin Loss Approach for Visual Question Answering With Language Biases | Jan 1, 2023 | Question Answering, Visual Question Answering | Unverified | 0 |
| VQACL: A Novel Visual Question Answering Continual Learning Setting | Jan 1, 2023 | Continual Learning, Question Answering | Code Available | 1 |
| From Images to Textual Prompts: Zero-Shot Visual Question Answering With Frozen Large Language Models | Jan 1, 2023 | Question Answering, Visual Question Answering | Unverified | 0 |
| When are Lemons Purple? The Concept Association Bias of Vision-Language Models | Dec 22, 2022 | Attribute, image-classification | Unverified | 0 |
| UnICLAM:Contrastive Representation Learning with Adversarial Masking for Unified and Interpretable Medical Vision Question Answering | Dec 21, 2022 | Data Augmentation, Decision Making | Unverified | 0 |
| From Images to Textual Prompts: Zero-shot VQA with Frozen Large Language Models | Dec 21, 2022 | Question Answering, Visual Question Answering | Code Available | 0 |
| Towards Unsupervised Visual Reasoning: Do Off-The-Shelf Features Know How to Reason? | Dec 20, 2022 | Question Answering, Representation Learning | Unverified | 0 |
| MatCha: Enhancing Visual Language Pretraining with Math Reasoning and Chart Derendering | Dec 19, 2022 | Chart Question Answering, Data Summarization | Unverified | 0 |
| SceneGATE: Scene-Graph based co-Attention networks for TExt visual question answering | Dec 16, 2022 | Optical Character Recognition, Optical Character Recognition (OCR) | Unverified | 0 |
| CLIPPO: Image-and-Language Understanding from Pixels Only | Dec 15, 2022 | Contrastive Learning, image-classification | Unverified | 0 |
| REVEAL: Retrieval-Augmented Visual-Language Pre-Training with Multi-Source Multimodal Knowledge Memory | Dec 10, 2022 | Image Captioning, Language Modeling | Code Available | 0 |
| ParsVQA-Caps: A Benchmark for Visual Question Answering and Image Captioning in Persian | Dec 7, 2022 | Image Captioning, Question Answering | Unverified | 0 |
| Hierarchical multimodal transformers for Multi-Page DocVQA | Dec 7, 2022 | Decoder, Question Answering | Code Available | 1 |
| Visual Question Answering From Another Perspective: CLEVR Mental Rotation Tests | Dec 3, 2022 | Question Answering, Visual Question Answering | Code Available | 0 |
| Compound Tokens: Channel Fusion for Vision-Language Representation Learning | Dec 2, 2022 | Decoder, Language Modeling | Unverified | 0 |
| Super-CLEVR: A Virtual Benchmark to Diagnose Domain Robustness in Visual Reasoning | Dec 1, 2022 | Domain Generalization, Question Answering | Code Available | 1 |
| Optimizing Explanations by Network Canonization and Hyperparameter Search | Nov 30, 2022 | Explainable Artificial Intelligence (XAI), image-classification | Unverified | 0 |
| PiggyBack: Pretrained Visual Question Answering Environment for Backing up Non-deep Learning Professionals | Nov 29, 2022 | Deep Learning, Question Answering | Unverified | 0 |
| Neuro-Symbolic Spatio-Temporal Reasoning | Nov 28, 2022 | AI Agent, Image Segmentation | Unverified | 0 |
| Seeing What You Miss: Vision-Language Pre-training with Semantic Completion Learning | Nov 24, 2022 | cross-modal alignment, Image-text Retrieval | Code Available | 1 |
| Self-supervised vision-language pretraining for Medical visual question answering | Nov 24, 2022 | Contrastive Learning, Image-text matching | Code Available | 1 |
| Look, Read and Ask: Learning to Ask Questions by Reading Text in Images | Nov 23, 2022 | Optical Character Recognition (OCR), Question Answering | Unverified | 0 |
| Cross-Modal Contrastive Learning for Robust Reasoning in VQA | Nov 21, 2022 | Contrastive Learning, Question Answering | Code Available | 0 |
| CL-CrossVQA: A Continual Learning Benchmark for Cross-Domain Visual Question Answering | Nov 19, 2022 | Continual Learning, Question Answering | Unverified | 0 |
| Visual Programming: Compositional visual reasoning without training | Nov 18, 2022 | In-Context Learning, Question Answering | Code Available | 2 |
| Text-Aware Dual Routing Network for Visual Question Answering | Nov 17, 2022 | Optical Character Recognition, Optical Character Recognition (OCR) | Unverified | 0 |
| I Can't Believe There's No Images! Learning Visual Tasks Using only Language Supervision | Nov 17, 2022 | Image Captioning, Question Answering | Code Available | 1 |