| Improving Selective Visual Question Answering by Learning from Your Peers | Jun 14, 2023 | Question Answering, Visual Question Answering | Code Available | 1 |
| Scalable Neural-Probabilistic Answer Set Programming | Jun 14, 2023 | Probabilistic Programming, Question Answering | Code Available | 1 |
| Global and Local Semantic Completion Learning for Vision-Language Pre-training | Jun 12, 2023 | cross-modal alignment, Image-text Retrieval | Code Available | 1 |
| Multi-modal Pre-training for Medical Vision-language Understanding and Generation: An Empirical Study with A New Benchmark | Jun 10, 2023 | Image-text Retrieval, Medical Report Generation | Code Available | 1 |
| Modular Visual Question Answering via Code Generation | Jun 8, 2023 | Code Generation, In-Context Learning | Code Available | 1 |
| Q: How to Specialize Large Vision-Language Models to Data-Scarce VQA Tasks? A: Self-Train on Unlabeled Images! | Jun 6, 2023 | counterfactual, Data Augmentation | Code Available | 1 |
| An Approach to Solving the Abstraction and Reasoning Corpus (ARC) Challenge | Jun 6, 2023 | ARC, Question Answering | Code Available | 1 |
| Revisiting the Role of Language Priors in Vision-Language Models | Jun 2, 2023 | Image-text matching, Image-text Retrieval | Code Available | 1 |
| Dense and Aligned Captions (DAC) Promote Compositional Reasoning in VL Models | May 31, 2023 | Cross-Modal Retrieval, Question Answering | Code Available | 1 |
| Multi-Scale Attention for Audio Question Answering | May 29, 2023 | Audio Question Answering, Question Answering | Code Available | 1 |
| CrossGET: Cross-Guided Ensemble of Tokens for Accelerating Vision-Language Transformers | May 27, 2023 | Image Captioning, Image Retrieval | Code Available | 1 |
| The Art of SOCRATIC QUESTIONING: Recursive Thinking with Large Language Models | May 24, 2023 | Language Modelling, Math | Code Available | 1 |
| Visually-Situated Natural Language Understanding with Contrastive Reading Model and Frozen Large Language Models | May 24, 2023 | document understanding, Image Captioning | Code Available | 1 |
| MemeCap: A Dataset for Captioning and Interpreting Memes | May 23, 2023 | Image Captioning, Meme Captioning | Code Available | 1 |
| VNHSGE: VietNamese High School Graduation Examination Dataset for Large Language Models | May 20, 2023 | Multiple-choice, Question Answering | Code Available | 1 |
| What Makes for Good Visual Tokenizers for Large Language Models? | May 20, 2023 | Image Captioning, Object Counting | Code Available | 1 |
| Enhancing Vision-Language Pre-Training with Jointly Learned Questioner and Dense Captioner | May 19, 2023 | Dense Captioning, Image Captioning | Code Available | 1 |
| MedBLIP: Bootstrapping Language-Image Pre-training from 3D Medical Images and Texts | May 18, 2023 | Medical Visual Question Answering, Question Answering | Code Available | 1 |
| What You See is What You Read? Improving Text-Image Alignment Evaluation | May 17, 2023 | Image Generation, Image to text | Code Available | 1 |
| PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering | May 17, 2023 | Benchmarking, Diagnostic | Code Available | 1 |
| Combo of Thinking and Observing for Outside-Knowledge VQA | May 10, 2023 | Decoder, Question Answering | Code Available | 1 |
| Vision-Language Models in Remote Sensing: Current Progress and Future Trends | May 9, 2023 | Image Captioning, Image Generation | Code Available | 1 |
| Towards Medical Artificial General Intelligence via Knowledge-Enhanced Multimodal Pretraining | Apr 26, 2023 | cross-modal alignment, Medical Visual Question Answering | Code Available | 1 |
| A Symmetric Dual Encoding Dense Retrieval Framework for Knowledge-Intensive Visual Question Answering | Apr 26, 2023 | Decoder, Knowledge Distillation | Code Available | 1 |
| SurgicalGPT: End-to-End Language-Vision GPT for Visual Question Answering in Surgery | Apr 19, 2023 | Question Answering, Scene Segmentation | Code Available | 1 |
| Learning Situation Hyper-Graphs for Video Question Answering | Apr 18, 2023 | Decoder, Question Answering | Code Available | 1 |
| CLIP-Guided Vision-Language Pre-training for Question Answering in 3D Scenes | Apr 12, 2023 | Question Answering, Visual Question Answering | Code Available | 1 |
| I2I: Initializing Adapters with Improvised Knowledge | Apr 4, 2023 | Continual Learning, Question Answering | Code Available | 1 |
| TIFA: Accurate and Interpretable Text-to-Image Faithfulness Evaluation with Question Answering | Mar 21, 2023 | 4k, Image Generation | Code Available | 1 |
| Location-Free Scene Graph Generation | Mar 20, 2023 | Graph Generation, Image Retrieval | Code Available | 1 |
| Open-Ended Medical Visual Question Answering Through Prefix Tuning of Language Models | Mar 10, 2023 | Language Modeling, Language Modelling | Code Available | 1 |
| BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs | Mar 2, 2023 | Articles, Medical Visual Question Answering | Code Available | 1 |
| ConTEXTual Net: A Multimodal Vision-Language Model for Segmentation of Pneumothorax | Mar 2, 2023 | Descriptive, Image Captioning | Code Available | 1 |
| MixPHM: Redundancy-Aware Parameter-Efficient Tuning for Low-Resource Visual Question Answering | Mar 2, 2023 | Mixture-of-Experts, Question Answering | Code Available | 1 |
| RAMM: Retrieval-augmented Biomedical Visual Question Answering with Multi-modal Pre-training | Mar 1, 2023 | Question Answering, Retrieval | Code Available | 1 |
| Can Pre-trained Vision and Language Models Answer Visual Information-Seeking Questions? | Feb 23, 2023 | Open-Domain Question Answering, Question Answering | Code Available | 1 |
| Towards Unifying Medical Vision-and-Language Pre-training via Soft Prompts | Feb 17, 2023 | Image Retrieval, Image-text Classification | Code Available | 1 |
| Multimodal Federated Learning via Contrastive Representation Ensemble | Feb 17, 2023 | Federated Learning, Image-text Retrieval | Code Available | 1 |
| Language Quantized AutoEncoders: Towards Unsupervised Text-Image Alignment | Feb 2, 2023 | Attribute, Few-Shot Image Classification | Code Available | 1 |
| Multimodality Representation Learning: A Survey on Evolution, Pretraining and Its Applications | Feb 1, 2023 | Question Answering, Representation Learning | Code Available | 1 |
| SlideVQA: A Dataset for Document Visual Question Answering on Multiple Images | Jan 12, 2023 | Evidence Selection, Question Answering | Code Available | 1 |
| Multimodal Inverse Cloze Task for Knowledge-based Visual Question Answering | Jan 11, 2023 | Question Answering, Reading Comprehension | Code Available | 1 |
| Variational Causal Inference Network for Explanatory Visual Question Answering | Jan 1, 2023 | Explanation Generation, Explanatory Visual Question Answering | Code Available | 1 |
| VQACL: A Novel Visual Question Answering Continual Learning Setting | Jan 1, 2023 | Continual Learning, Question Answering | Code Available | 1 |
| Hierarchical multimodal transformers for Multi-Page DocVQA | Dec 7, 2022 | Decoder, Question Answering | Code Available | 1 |
| Super-CLEVR: A Virtual Benchmark to Diagnose Domain Robustness in Visual Reasoning | Dec 1, 2022 | Domain Generalization, Question Answering | Code Available | 1 |
| Seeing What You Miss: Vision-Language Pre-training with Semantic Completion Learning | Nov 24, 2022 | cross-modal alignment, Image-text Retrieval | Code Available | 1 |
| Self-supervised vision-language pretraining for Medical visual question answering | Nov 24, 2022 | Contrastive Learning, Image-text matching | Code Available | 1 |
| I Can't Believe There's No Images! Learning Visual Tasks Using only Language Supervision | Nov 17, 2022 | Image Captioning, Question Answering | Code Available | 1 |
| PromptCap: Prompt-Guided Task-Aware Image Captioning | Nov 15, 2022 | Image Captioning, Language Modelling | Code Available | 1 |