| Benchmarking Vision Language Models for Cultural Understanding | Jul 15, 2024 | Benchmarking, Question Answering | Unverified | 0 |
| DenseFusion-1M: Merging Vision Experts for Comprehensive Multimodal Perception | Jul 11, 2024 | Visual Question Answering | Code Available | 2 |
| Segmentation-guided Attention for Visual Question Answering from Remote Sensing Images | Jul 11, 2024 | Question Answering, Segmentation | Unverified | 0 |
| Extracting Training Data from Document-Based VQA Models | Jul 11, 2024 | Memorization, Question Answering | Unverified | 0 |
| VQA-Diff: Exploiting VQA and Diffusion for Zero-Shot Image-to-3D Vehicle Asset Generation in Autonomous Driving | Jul 9, 2024 | Autonomous Driving, Image to 3D | Unverified | 0 |
| Large Language Models Understand Layout | Jul 8, 2024 | Question Answering, Visual Question Answering | Code Available | 0 |
| WSI-VQA: Interpreting Whole Slide Images by Generative Visual Question Answering | Jul 8, 2024 | Diagnostic, Generative Visual Question Answering | Code Available | 2 |
| Second Place Solution of WSDM2023 Toloka Visual Question Answering Challenge | Jul 5, 2024 | Cross-Modal Retrieval, Question Answering | Unverified | 0 |
| Rethinking Visual Prompting for Multimodal Large Language Models with External Knowledge | Jul 5, 2024 | Instance Segmentation, Optical Character Recognition (OCR) | Unverified | 0 |
| Black-box Model Ensembling for Textual and Visual Question Answering via Information Fusion | Jul 4, 2024 | Question Answering, Visual Question Answering | Code Available | 0 |
| MiniGPT-Med: Large Language Model as a General Interface for Radiology Diagnosis | Jul 4, 2024 | Diagnostic, Language Modeling | Code Available | 2 |
| Visual Robustness Benchmark for Visual Question Answering (VQA) | Jul 3, 2024 | Visual Question Answering, Visual Question Answering (VQA) | Code Available | 0 |
| InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output | Jul 3, 2024 | Articles, Image Comprehension | Unverified | 0 |
| BACON: Improving Clarity of Image Captions via Bag-of-Concept Graphs | Jul 3, 2024 | Image Captioning, Image Generation | Unverified | 0 |
| MindBench: A Comprehensive Benchmark for Mind Map Structure Recognition and Analysis | Jul 3, 2024 | Position, Question Answering | Unverified | 0 |
| A Bounding Box is Worth One Token: Interleaving Layout and Text in a Large Language Model for Document Understanding | Jul 2, 2024 | document understanding, Key Information Extraction | Code Available | 2 |
| Certainly Uncertain: A Benchmark and Metric for Multimodal Epistemic and Aleatoric Awareness | Jul 2, 2024 | Image Captioning, Question Answering | Unverified | 0 |
| TokenPacker: Efficient Visual Projector for Multimodal LLM | Jul 2, 2024 | Language Modelling, Large Language Model | Code Available | 3 |
| CVLUE: A New Benchmark Dataset for Chinese Vision-Language Understanding Evaluation | Jul 1, 2024 | Image-text Retrieval, Question Answering | Code Available | 1 |
| Assistive Image Annotation Systems with Deep Learning and Natural Language Capabilities: A Review | Jun 28, 2024 | Active Learning, Image Captioning | Unverified | 0 |
| STLLaVA-Med: Self-Training Large Language and Vision Assistant for Medical Question-Answering | Jun 28, 2024 | Medical Diagnosis, Medical Question Answering | Code Available | 1 |
| MM-Instruct: Generated Visual Instructions for Large Multimodal Model Alignment | Jun 28, 2024 | Answer Generation, Image Captioning | Code Available | 1 |
| Efficient Large Multi-modal Models via Visual Context Compression | Jun 28, 2024 | Question Answering, Visual Question Answering | Code Available | 2 |
| FlowVQA: Mapping Multimodal Logic in Visual Question Answering with Flowcharts | Jun 27, 2024 | Decision Making, Logical Reasoning | Unverified | 0 |
| Disentangling Knowledge-based and Visual Reasoning by Question Decomposition in KB-VQA | Jun 27, 2024 | General Knowledge, Question Answering | Unverified | 0 |