| Learning Situation Hyper-Graphs for Video Question Answering | Apr 18, 2023 | Decoder, Question Answering | Code Available | 1 |
| CLIP-Guided Vision-Language Pre-training for Question Answering in 3D Scenes | Apr 12, 2023 | Question Answering, Visual Question Answering | Code Available | 1 |
| I2I: Initializing Adapters with Improvised Knowledge | Apr 4, 2023 | Continual Learning, Question Answering | Code Available | 1 |
| TIFA: Accurate and Interpretable Text-to-Image Faithfulness Evaluation with Question Answering | Mar 21, 2023 | 4k, Image Generation | Code Available | 1 |
| Location-Free Scene Graph Generation | Mar 20, 2023 | Graph Generation, Image Retrieval | Code Available | 1 |
| Open-Ended Medical Visual Question Answering Through Prefix Tuning of Language Models | Mar 10, 2023 | Language Modeling, Language Modelling | Code Available | 1 |
| BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs | Mar 2, 2023 | Articles, Medical Visual Question Answering | Code Available | 1 |
| ConTEXTual Net: A Multimodal Vision-Language Model for Segmentation of Pneumothorax | Mar 2, 2023 | Descriptive, Image Captioning | Code Available | 1 |
| MixPHM: Redundancy-Aware Parameter-Efficient Tuning for Low-Resource Visual Question Answering | Mar 2, 2023 | Mixture-of-Experts, Question Answering | Code Available | 1 |
| RAMM: Retrieval-augmented Biomedical Visual Question Answering with Multi-modal Pre-training | Mar 1, 2023 | Question Answering, Retrieval | Code Available | 1 |
| Can Pre-trained Vision and Language Models Answer Visual Information-Seeking Questions? | Feb 23, 2023 | Open-Domain Question Answering, Question Answering | Code Available | 1 |
| Towards Unifying Medical Vision-and-Language Pre-training via Soft Prompts | Feb 17, 2023 | Image Retrieval, Image-text Classification | Code Available | 1 |
| Multimodal Federated Learning via Contrastive Representation Ensemble | Feb 17, 2023 | Federated Learning, Image-text Retrieval | Code Available | 1 |
| Language Quantized AutoEncoders: Towards Unsupervised Text-Image Alignment | Feb 2, 2023 | Attribute, Few-Shot Image Classification | Code Available | 1 |
| Multimodality Representation Learning: A Survey on Evolution, Pretraining and Its Applications | Feb 1, 2023 | Question Answering, Representation Learning | Code Available | 1 |
| SlideVQA: A Dataset for Document Visual Question Answering on Multiple Images | Jan 12, 2023 | Evidence Selection, Question Answering | Code Available | 1 |
| Multimodal Inverse Cloze Task for Knowledge-based Visual Question Answering | Jan 11, 2023 | Question Answering, Reading Comprehension | Code Available | 1 |
| Variational Causal Inference Network for Explanatory Visual Question Answering | Jan 1, 2023 | Explanation Generation, Explanatory Visual Question Answering | Code Available | 1 |
| VQACL: A Novel Visual Question Answering Continual Learning Setting | Jan 1, 2023 | Continual Learning, Question Answering | Code Available | 1 |
| Hierarchical multimodal transformers for Multi-Page DocVQA | Dec 7, 2022 | Decoder, Question Answering | Code Available | 1 |
| Super-CLEVR: A Virtual Benchmark to Diagnose Domain Robustness in Visual Reasoning | Dec 1, 2022 | Domain Generalization, Question Answering | Code Available | 1 |
| Seeing What You Miss: Vision-Language Pre-training with Semantic Completion Learning | Nov 24, 2022 | cross-modal alignment, Image-text Retrieval | Code Available | 1 |
| Self-supervised vision-language pretraining for Medical visual question answering | Nov 24, 2022 | Contrastive Learning, Image-text matching | Code Available | 1 |
| I Can't Believe There's No Images! Learning Visual Tasks Using only Language Supervision | Nov 17, 2022 | Image Captioning, Question Answering | Code Available | 1 |
| PromptCap: Prompt-Guided Task-Aware Image Captioning | Nov 15, 2022 | Image Captioning, Language Modelling | Code Available | 1 |