| TaCA: Upgrading Your Visual Foundation Model with Task-agnostic Compatible Adapter | Jun 22, 2023 | Question AnsweringRetrieval | CodeCode Available | 0 |
| Investigating Prompting Techniques for Zero- and Few-Shot Visual Question Answering | Jun 16, 2023 | Image CaptioningQuestion Answering | CodeCode Available | 1 |
| Encyclopedic VQA: Visual questions about detailed properties of fine-grained categories | Jun 15, 2023 | Question AnsweringRetrieval | —Unverified | 0 |
| LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models | Jun 15, 2023 | HallucinationImage Captioning | CodeCode Available | 2 |
| Improving Selective Visual Question Answering by Learning from Your Peers | Jun 14, 2023 | Question AnsweringVisual Question Answering | CodeCode Available | 1 |
| Scalable Neural-Probabilistic Answer Set Programming | Jun 14, 2023 | Probabilistic ProgrammingQuestion Answering | CodeCode Available | 1 |
| Visual Question Answering (VQA) on Images with Superimposed Text | Jun 13, 2023 | Question AnsweringVisual Question Answering | —Unverified | 0 |
| Safeguarding Data in Multimodal AI: A Differentially Private Approach to CLIP Training | Jun 13, 2023 | image-classificationImage Classification | CodeCode Available | 0 |
| AVIS: Autonomous Visual Information Seeking with Large Language Model Agent | Jun 13, 2023 | Decision MakingLanguage Modeling | —Unverified | 0 |
| Global and Local Semantic Completion Learning for Vision-Language Pre-training | Jun 12, 2023 | cross-modal alignmentImage-text Retrieval | CodeCode Available | 1 |
| A Survey of Vision-Language Pre-training from the Lens of Multimodal Machine Translation | Jun 12, 2023 | Image CaptioningMachine Translation | —Unverified | 0 |
| Multi-modal Pre-training for Medical Vision-language Understanding and Generation: An Empirical Study with A New Benchmark | Jun 10, 2023 | Image-text RetrievalMedical Report Generation | CodeCode Available | 1 |
| Knowledge Detection by Relevant Question and Image Attributes in Visual Question Answering | Jun 8, 2023 | Question AnsweringRetrieval | —Unverified | 0 |
| Modular Visual Question Answering via Code Generation | Jun 8, 2023 | Code GenerationIn-Context Learning | CodeCode Available | 1 |
| MIMIC-IT: Multi-Modal In-Context Instruction Tuning | Jun 8, 2023 | In-Context LearningVisual Question Answering | CodeCode Available | 4 |
| Q: How to Specialize Large Vision-Language Models to Data-Scarce VQA Tasks? A: Self-Train on Unlabeled Images! | Jun 6, 2023 | counterfactualData Augmentation | CodeCode Available | 1 |
| Diversifying Joint Vision-Language Tokenization Learning | Jun 6, 2023 | Question AnsweringRepresentation Learning | —Unverified | 0 |
| An Approach to Solving the Abstraction and Reasoning Corpus (ARC) Challenge | Jun 6, 2023 | ARCQuestion Answering | CodeCode Available | 1 |
| Multi-CLIP: Contrastive Vision-Language Pre-training for Question Answering tasks in 3D Scenes | Jun 4, 2023 | Common Sense ReasoningQuestion Answering | —Unverified | 0 |
| Revisiting the Role of Language Priors in Vision-Language Models | Jun 2, 2023 | Image-text matchingImage-text Retrieval | CodeCode Available | 1 |
| LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day | Jun 1, 2023 | Image ClassificationInstruction Following | CodeCode Available | 4 |
| Evaluating the Capabilities of Multi-modal Reasoning Models with Synthetic Task Data | Jun 1, 2023 | Anomaly DetectionImage Generation | —Unverified | 0 |
| Overcoming Language Bias in Remote Sensing Visual Question Answering via Adversarial Training | Jun 1, 2023 | Question AnsweringVisual Question Answering | —Unverified | 0 |
| LiT-4-RSVQA: Lightweight Transformer-based Visual Question Answering in Remote Sensing | Jun 1, 2023 | Question AnsweringVisual Question Answering | —Unverified | 0 |
| Using Visual Cropping to Enhance Fine-Detail Question Answering of BLIP-Family Models | May 31, 2023 | Question AnsweringVisual Question Answering | —Unverified | 0 |