| Towards Unifying Medical Vision-and-Language Pre-training via Soft Prompts | Feb 17, 2023 | Image Retrieval, Image-text Classification | Code Available | 1 |
| Multimodal Federated Learning via Contrastive Representation Ensemble | Feb 17, 2023 | Federated Learning, Image-text Retrieval | Code Available | 1 |
| Differentiable Outlier Detection Enable Robust Deep Multimodal Analysis | Feb 11, 2023 | Image-text Retrieval, Knowledge Graphs | Code Available | 0 |
| Is Multimodal Vision Supervision Beneficial to Language? | Feb 10, 2023 | Image Retrieval, Natural Language Understanding | Code Available | 0 |
| Language Quantized AutoEncoders: Towards Unsupervised Text-Image Alignment | Feb 2, 2023 | Attribute, Few-Shot Image Classification | Code Available | 1 |
| mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video | Feb 1, 2023 | Action Classification, Image Classification | Code Available | 4 |
| Multimodality Representation Learning: A Survey on Evolution, Pretraining and Its Applications | Feb 1, 2023 | Question Answering, Representation Learning | Code Available | 1 |
| BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | Jan 30, 2023 | Generative Visual Question Answering, Image Captioning | Code Available | 4 |
| BinaryVQA: A Versatile Test Set to Evaluate the Out-of-Distribution Generalization of VQA Models | Jan 28, 2023 | Out-of-Distribution Generalization, Question Answering | Code Available | 0 |
| Towards a Unified Model for Generating Answers and Explanations in Visual Question Answering | Jan 25, 2023 | Decoder, Explanation Generation | Unverified | 0 |
| HRVQA: A Visual Question Answering Benchmark for High-Resolution Aerial Images | Jan 23, 2023 | Attribute, Question Answering | Unverified | 0 |
| Champion Solution for the WSDM2023 Toloka VQA Challenge | Jan 22, 2023 | Question Answering, Visual Grounding | Code Available | 3 |
| Towards Models that Can See and Read | Jan 18, 2023 | Decoder, Image Captioning | Unverified | 0 |
| Curriculum Script Distillation for Multilingual Visual Question Answering | Jan 17, 2023 | Question Answering, Visual Question Answering | Unverified | 0 |
| SlideVQA: A Dataset for Document Visual Question Answering on Multiple Images | Jan 12, 2023 | Evidence Selection, Question Answering | Code Available | 1 |
| Multimodal Inverse Cloze Task for Knowledge-based Visual Question Answering | Jan 11, 2023 | Question Answering, Reading Comprehension | Code Available | 1 |
| Adaptively Clustering Neighbor Elements for Image-Text Generation | Jan 5, 2023 | Clustering, Decoder | Code Available | 0 |
| PromptCap: Prompt-Guided Image Captioning for VQA with GPT-3 | Jan 1, 2023 | Image Captioning, Question Answering | Unverified | 0 |
| Variational Causal Inference Network for Explanatory Visual Question Answering | Jan 1, 2023 | Explanation Generation, Explanatory Visual Question Answering | Code Available | 1 |
| Decouple Before Interact: Multi-Modal Prompt Learning for Continual Visual Question Answering | Jan 1, 2023 | Continual Learning, Language Modelling | Unverified | 0 |
| Toward Multi-Granularity Decision-Making: Explicit Visual Reasoning with Hierarchical Knowledge | Jan 1, 2023 | Decision Making, Question Answering | Code Available | 0 |
| Image as a Foreign Language: BEiT Pretraining for Vision and Vision-Language Tasks | Jan 1, 2023 | Cross-Modal Retrieval, Image Captioning | Unverified | 0 |
| Exploring the Effect of Primitives for Compositional Generalization in Vision-and-Language | Jan 1, 2023 | Question Answering, Self-Supervised Learning | Code Available | 0 |
| RMLVQA: A Margin Loss Approach for Visual Question Answering With Language Biases | Jan 1, 2023 | Question Answering, Visual Question Answering | Unverified | 0 |
| VQACL: A Novel Visual Question Answering Continual Learning Setting | Jan 1, 2023 | Continual Learning, Question Answering | Code Available | 1 |