| Benchmarking Vision Language Models for Cultural Understanding | Jul 15, 2024 | Benchmarking, Question Answering | Unverified | 0 |
| DenseFusion-1M: Merging Vision Experts for Comprehensive Multimodal Perception | Jul 11, 2024 | Visual Question Answering | Code Available | 2 |
| Segmentation-guided Attention for Visual Question Answering from Remote Sensing Images | Jul 11, 2024 | Question Answering, Segmentation | Unverified | 0 |
| Extracting Training Data from Document-Based VQA Models | Jul 11, 2024 | Memorization, Question Answering | Unverified | 0 |
| VQA-Diff: Exploiting VQA and Diffusion for Zero-Shot Image-to-3D Vehicle Asset Generation in Autonomous Driving | Jul 9, 2024 | Autonomous Driving, Image to 3D | Unverified | 0 |
| Large Language Models Understand Layout | Jul 8, 2024 | Question Answering, Visual Question Answering | Code Available | 0 |
| WSI-VQA: Interpreting Whole Slide Images by Generative Visual Question Answering | Jul 8, 2024 | Diagnostic, Generative Visual Question Answering | Code Available | 2 |
| Second Place Solution of WSDM2023 Toloka Visual Question Answering Challenge | Jul 5, 2024 | Cross-Modal Retrieval, Question Answering | Unverified | 0 |
| Rethinking Visual Prompting for Multimodal Large Language Models with External Knowledge | Jul 5, 2024 | Instance Segmentation, Optical Character Recognition (OCR) | Unverified | 0 |
| Black-box Model Ensembling for Textual and Visual Question Answering via Information Fusion | Jul 4, 2024 | Question Answering, Visual Question Answering | Code Available | 0 |
| MiniGPT-Med: Large Language Model as a General Interface for Radiology Diagnosis | Jul 4, 2024 | Diagnostic, Language Modeling | Code Available | 2 |
| Visual Robustness Benchmark for Visual Question Answering (VQA) | Jul 3, 2024 | Visual Question Answering, Visual Question Answering (VQA) | Code Available | 0 |
| InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output | Jul 3, 2024 | Articles, Image Comprehension | Unverified | 0 |
| MindBench: A Comprehensive Benchmark for Mind Map Structure Recognition and Analysis | Jul 3, 2024 | Position, Question Answering | Unverified | 0 |
| BACON: Improving Clarity of Image Captions via Bag-of-Concept Graphs | Jul 3, 2024 | Image Captioning, Image Generation | Unverified | 0 |
| A Bounding Box is Worth One Token: Interleaving Layout and Text in a Large Language Model for Document Understanding | Jul 2, 2024 | document understanding, Key Information Extraction | Code Available | 2 |
| TokenPacker: Efficient Visual Projector for Multimodal LLM | Jul 2, 2024 | Language Modelling, Large Language Model | Code Available | 3 |
| Certainly Uncertain: A Benchmark and Metric for Multimodal Epistemic and Aleatoric Awareness | Jul 2, 2024 | Image Captioning, Question Answering | Unverified | 0 |
| CVLUE: A New Benchmark Dataset for Chinese Vision-Language Understanding Evaluation | Jul 1, 2024 | Image-text Retrieval, Question Answering | Code Available | 1 |
| Assistive Image Annotation Systems with Deep Learning and Natural Language Capabilities: A Review | Jun 28, 2024 | Active Learning, Image Captioning | Unverified | 0 |
| Efficient Large Multi-modal Models via Visual Context Compression | Jun 28, 2024 | Question Answering, Visual Question Answering | Code Available | 2 |
| MM-Instruct: Generated Visual Instructions for Large Multimodal Model Alignment | Jun 28, 2024 | Answer Generation, Image Captioning | Code Available | 1 |
| STLLaVA-Med: Self-Training Large Language and Vision Assistant for Medical Question-Answering | Jun 28, 2024 | Medical Diagnosis, Medical Question Answering | Code Available | 1 |
| Disentangling Knowledge-based and Visual Reasoning by Question Decomposition in KB-VQA | Jun 27, 2024 | General Knowledge, Question Answering | Unverified | 0 |
| FlowVQA: Mapping Multimodal Logic in Visual Question Answering with Flowcharts | Jun 27, 2024 | Decision Making, Logical Reasoning | Unverified | 0 |
| The Illusion of Competence: Evaluating the Effect of Explanations on Users' Mental Models of Visual Question Answering Systems | Jun 27, 2024 | Question Answering, Visual Question Answering | Code Available | 0 |
| Enhancing Continual Learning in Visual Question Answering with Modality-Aware Feature Distillation | Jun 27, 2024 | Continual Learning, Question Answering | Code Available | 0 |
| Evaluating Fairness in Large Vision-Language Models Across Diverse Demographic Attributes and Prompts | Jun 25, 2024 | Fairness, Question Answering | Code Available | 0 |
| MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning | Jun 25, 2024 | Object, Object Recognition | Code Available | 2 |
| Claude 3.5 Sonnet Model Card Addendum | Jun 24, 2024 | Code Generation, MMR total | Unverified | 0 |
| MM-SpuBench: Towards Better Understanding of Spurious Biases in Multimodal LLMs | Jun 24, 2024 | Question Answering, Visual Question Answering | Unverified | 0 |
| GPT-4V Explorations: Mining Autonomous Driving | Jun 24, 2024 | Autonomous Driving, Decision Making | Unverified | 0 |
| MR-MLLM: Mutual Reinforcement of Multimodal Comprehension and Vision Perception | Jun 22, 2024 | Common Sense Reasoning, Language Modelling | Unverified | 0 |
| Tri-VQA: Triangular Reasoning Medical Visual Question Answering for Multi-Attribute Analysis | Jun 21, 2024 | Attribute, Medical Visual Question Answering | Unverified | 0 |
| Does Object Grounding Really Reduce Hallucination of Large Vision-Language Models? | Jun 20, 2024 | Caption Generation, Hallucination | Unverified | 0 |
| LIVE: Learnable In-Context Vector for Visual Question Answering | Jun 19, 2024 | In-Context Learning, Question Answering | Code Available | 1 |
| Enhancing Cross-Prompt Transferability in Vision-Language Models through Contextual Injection of Target Tokens | Jun 19, 2024 | Caption Generation, image-classification | Code Available | 0 |
| Diversify, Rationalize, and Combine: Ensembling Multiple QA Strategies for Zero-shot Knowledge-based VQA | Jun 18, 2024 | Question Answering, Visual Question Answering | Code Available | 0 |
| VRSBench: A Versatile Vision-Language Benchmark Dataset for Remote Sensing Image Understanding | Jun 18, 2024 | Image Captioning, Question Answering | Code Available | 2 |
| TroL: Traversal of Layers for Large Language and Vision Models | Jun 18, 2024 | Visual Question Answering | Code Available | 2 |
| MMNeuron: Discovering Neuron-Level Domain-Specific Interpretation in Multimodal Large Language Model | Jun 17, 2024 | Language Modeling, Language Modelling | Code Available | 1 |
| LLARVA: Vision-Action Instruction Tuning Enhances Robot Learning | Jun 17, 2024 | Image Captioning, Question Answering | Unverified | 0 |
| MFC-Bench: Benchmarking Multimodal Fact-Checking with Large Vision-Language Models | Jun 17, 2024 | Benchmarking, Fact Checking | Code Available | 1 |
| MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs | Jun 17, 2024 | Visual Question Answering | Code Available | 2 |
| Program Synthesis Benchmark for Visual Programming in XLogoOnline Environment | Jun 17, 2024 | Logical Reasoning, Math | Unverified | 0 |
| Mixture-of-Subspaces in Low-Rank Adaptation | Jun 16, 2024 | Common Sense Reasoning, Image Generation | Code Available | 0 |
| Beyond Raw Videos: Understanding Edited Videos with Large Multimodal Model | Jun 15, 2024 | Question Answering, Video Understanding | Code Available | 0 |
| VANE-Bench: Video Anomaly Evaluation Benchmark for Conversational LMMs | Jun 14, 2024 | Anomaly Detection, Benchmarking | Code Available | 1 |
| Precision Empowers, Excess Distracts: Visual Question Answering With Dynamically Infused Knowledge In Language Models | Jun 14, 2024 | Decoder, Knowledge Graphs | Unverified | 0 |
| Detecting and Evaluating Medical Hallucinations in Large Vision Language Models | Jun 14, 2024 | Hallucination, Medical Visual Question Answering | Unverified | 0 |