| MedBLIP: Bootstrapping Language-Image Pre-training from 3D Medical Images and Texts | May 18, 2023 | Medical Visual Question Answering; Question Answering | Code Available | 1 |
| Visual Question Answering: A Survey on Techniques and Common Trends in Recent Literature | May 18, 2023 | Question Answering; Visual Question Answering | Unverified | 0 |
| What You See is What You Read? Improving Text-Image Alignment Evaluation | May 17, 2023 | Image Generation; Image to text | Code Available | 1 |
| IMAD: IMage-Augmented multi-modal Dialogue | May 17, 2023 | Dialogue Generation; Question Answering | Code Available | 0 |
| An Empirical Study on the Language Modal in Visual Question Answering | May 17, 2023 | Question Answering; Visual Question Answering | Unverified | 0 |
| Probing the Role of Positional Information in Vision-Language Models | May 17, 2023 | Contrastive Learning; Image-text matching | Unverified | 0 |
| PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering | May 17, 2023 | Benchmarking; Diagnostic | Code Available | 1 |
| Semantic Composition in Visually Grounded Language Models | May 15, 2023 | Image Captioning; Inductive Bias | Unverified | 0 |
| OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models | May 13, 2023 | Key Information Extraction; Nutrition | Code Available | 2 |
| InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning | May 11, 2023 | 1 Image, 2*2 Stitching; Diversity | Code Available | 2 |
| Combo of Thinking and Observing for Outside-Knowledge VQA | May 10, 2023 | Decoder; Question Answering | Code Available | 1 |
| Vision-Language Models in Remote Sensing: Current Progress and Future Trends | May 9, 2023 | Image Captioning; Image Generation | Code Available | 1 |
| OpenViVQA: Task, Dataset, and Multimodal Fusion Models for Visual Question Answering in Vietnamese | May 7, 2023 | Information Retrieval; Question Answering | Code Available | 0 |
| Adaptive loose optimization for robust question answering | May 6, 2023 | Extractive Question-Answering; Machine Reading Comprehension | Code Available | 0 |
| Otter: A Multi-Modal Model with In-Context Instruction Tuning | May 5, 2023 | GPU; In-Context Learning | Code Available | 4 |
| Analysis of Visual Question Answering Algorithms with attention model | May 4, 2023 | Question Answering; Visual Question Answering | Unverified | 0 |
| Making the Most of What You Have: Adapting Pre-trained Visual Language Models in the Low-data Regime | May 3, 2023 | Image Captioning; Question Answering | Unverified | 0 |
| CHIC: Corporate Document for Visual question Answering | May 1, 2023 | Information Retrieval; Question Answering | Unverified | 0 |
| LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model | Apr 28, 2023 | Instruction Following; model | Code Available | 5 |
| Towards Medical Artificial General Intelligence via Knowledge-Enhanced Multimodal Pretraining | Apr 26, 2023 | cross-modal alignment; Medical Visual Question Answering | Code Available | 1 |
| A Symmetric Dual Encoding Dense Retrieval Framework for Knowledge-Intensive Visual Question Answering | Apr 26, 2023 | Decoder; Knowledge Distillation | Code Available | 1 |
| MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models | Apr 20, 2023 | Image Description; Language Modelling | Code Available | 7 |
| SurgicalGPT: End-to-End Language-Vision GPT for Visual Question Answering in Surgery | Apr 19, 2023 | Question Answering; Scene Segmentation | Code Available | 1 |
| Learning Situation Hyper-Graphs for Video Question Answering | Apr 18, 2023 | Decoder; Question Answering | Code Available | 1 |
| Visual Instruction Tuning | Apr 17, 2023 | 1 Image, 2*2 Stitching; 3D Question Answering (3D-QA) | Code Available | 6 |