| Benchmarking Vision Language Models for Cultural Understanding | Jul 15, 2024 | Benchmarking, Question Answering | Unverified | 0 |
| Segmentation-guided Attention for Visual Question Answering from Remote Sensing Images | Jul 11, 2024 | Question Answering, Segmentation | Unverified | 0 |
| Extracting Training Data from Document-Based VQA Models | Jul 11, 2024 | Memorization, Question Answering | Unverified | 0 |
| VQA-Diff: Exploiting VQA and Diffusion for Zero-Shot Image-to-3D Vehicle Asset Generation in Autonomous Driving | Jul 9, 2024 | Autonomous Driving, Image to 3D | Unverified | 0 |
| Large Language Models Understand Layout | Jul 8, 2024 | Question Answering, Visual Question Answering | Code Available | 0 |
| Rethinking Visual Prompting for Multimodal Large Language Models with External Knowledge | Jul 5, 2024 | Instance Segmentation, Optical Character Recognition (OCR) | Unverified | 0 |
| Second Place Solution of WSDM2023 Toloka Visual Question Answering Challenge | Jul 5, 2024 | Cross-Modal Retrieval, Question Answering | Unverified | 0 |
| Black-box Model Ensembling for Textual and Visual Question Answering via Information Fusion | Jul 4, 2024 | Question Answering, Visual Question Answering | Code Available | 0 |
| BACON: Improving Clarity of Image Captions via Bag-of-Concept Graphs | Jul 3, 2024 | Image Captioning, Image Generation | Unverified | 0 |
| MindBench: A Comprehensive Benchmark for Mind Map Structure Recognition and Analysis | Jul 3, 2024 | Position, Question Answering | Unverified | 0 |
| InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output | Jul 3, 2024 | Articles, Image Comprehension | Unverified | 0 |
| Visual Robustness Benchmark for Visual Question Answering (VQA) | Jul 3, 2024 | Visual Question Answering, Visual Question Answering (VQA) | Code Available | 0 |
| Certainly Uncertain: A Benchmark and Metric for Multimodal Epistemic and Aleatoric Awareness | Jul 2, 2024 | Image Captioning, Question Answering | Unverified | 0 |
| Assistive Image Annotation Systems with Deep Learning and Natural Language Capabilities: A Review | Jun 28, 2024 | Active Learning, Image Captioning | Unverified | 0 |
| The Illusion of Competence: Evaluating the Effect of Explanations on Users' Mental Models of Visual Question Answering Systems | Jun 27, 2024 | Question Answering, Visual Question Answering | Code Available | 0 |
| FlowVQA: Mapping Multimodal Logic in Visual Question Answering with Flowcharts | Jun 27, 2024 | Decision Making, Logical Reasoning | Unverified | 0 |
| Disentangling Knowledge-based and Visual Reasoning by Question Decomposition in KB-VQA | Jun 27, 2024 | General Knowledge, Question Answering | Unverified | 0 |
| Enhancing Continual Learning in Visual Question Answering with Modality-Aware Feature Distillation | Jun 27, 2024 | Continual Learning, Question Answering | Code Available | 0 |
| Evaluating Fairness in Large Vision-Language Models Across Diverse Demographic Attributes and Prompts | Jun 25, 2024 | Fairness, Question Answering | Code Available | 0 |
| Claude 3.5 Sonnet Model Card Addendum | Jun 24, 2024 | Code Generation, MMR total | Unverified | 0 |
| GPT-4V Explorations: Mining Autonomous Driving | Jun 24, 2024 | Autonomous Driving, Decision Making | Unverified | 0 |
| MM-SpuBench: Towards Better Understanding of Spurious Biases in Multimodal LLMs | Jun 24, 2024 | Question Answering, Visual Question Answering | Unverified | 0 |
| MR-MLLM: Mutual Reinforcement of Multimodal Comprehension and Vision Perception | Jun 22, 2024 | Common Sense Reasoning, Language Modelling | Unverified | 0 |
| Tri-VQA: Triangular Reasoning Medical Visual Question Answering for Multi-Attribute Analysis | Jun 21, 2024 | Attribute, Medical Visual Question Answering | Unverified | 0 |
| Does Object Grounding Really Reduce Hallucination of Large Vision-Language Models? | Jun 20, 2024 | Caption Generation, Hallucination | Unverified | 0 |