| IIU: Independent Inference Units for Knowledge-based Visual Question Answering | Aug 15, 2024 | Question AnsweringVisual Question Answering | CodeCode Available | 0 |
| Enhancing Visual Question Answering through Ranking-Based Hybrid Training and Multimodal Fusion | Aug 14, 2024 | Question AnsweringVisual Question Answering | —Unverified | 0 |
| CROME: Cross-Modal Adapters for Efficient Multimodal LLM | Aug 13, 2024 | Instruction FollowingLanguage Modeling | —Unverified | 0 |
| Revisiting Multi-Modal LLM Evaluation | Aug 9, 2024 | Chart UnderstandingOptical Character Recognition | —Unverified | 0 |
| Img-Diff: Contrastive Data Synthesis for Multimodal Large Language Models | Aug 8, 2024 | Contrastive LearningFine-Grained Image Recognition | —Unverified | 0 |
| Optimus: Accelerating Large-Scale Multi-Modal LLM Training by Bubble Exploitation | Aug 7, 2024 | GPUQuestion Answering | —Unverified | 0 |
| Targeted Visual Prompting for Medical Visual Question Answering | Aug 6, 2024 | Medical Visual Question AnsweringQuestion Answering | CodeCode Available | 0 |
| LLaVA-OneVision: Easy Visual Task Transfer | Aug 6, 2024 | 3D Question Answering (3D-QA) | CodeCode Available | 0 |
| MMPKUBase: A Comprehensive and High-quality Chinese Multi-modal Knowledge Graph | Aug 3, 2024 | AttributeContrastive Learning | —Unverified | 0 |
| Towards Flexible Evaluation for Generative Visual Question Answering | Aug 1, 2024 | DecoderGenerative Visual Question Answering | CodeCode Available | 0 |
| Prompting Medical Large Vision-Language Models to Diagnose Pathologies by Visual Question Answering | Jul 31, 2024 | DiagnosticHallucination | —Unverified | 0 |
| SimpleLLM4AD: An End-to-End Vision-Language Model with Graph Visual Question Answering for Autonomous Driving | Jul 31, 2024 | Autonomous DrivingLanguage Modeling | —Unverified | 0 |
| Pyramid Coder: Hierarchical Code Generator for Compositional Visual Question Answering | Jul 30, 2024 | Code GenerationQuestion Answering | —Unverified | 0 |
| VolDoGer: LLM-assisted Datasets for Domain Generalization in Vision-Language Tasks | Jul 29, 2024 | Deep LearningDomain Generalization | —Unverified | 0 |
| Take A Step Back: Rethinking the Two Stages in Visual Reasoning | Jul 29, 2024 | Logical ReasoningQuestion Answering | —Unverified | 0 |
| AdaCoder: Adaptive Prompt Compression for Programmatic Visual Question Answering | Jul 28, 2024 | Question AnsweringVisual Question Answering | —Unverified | 0 |
| VILA^2: VILA Augmented VILA | Jul 24, 2024 | HallucinationOptical Character Recognition (OCR) | —Unverified | 0 |
| Imperfect Vision Encoders: Efficient and Robust Tuning for Vision-Language Models | Jul 23, 2024 | Computational EfficiencyImage Captioning | —Unverified | 0 |
| Knowledge Acquisition Disentanglement for Knowledge-based Visual Question Answering with Large Language Models | Jul 22, 2024 | DisentanglementQuestion Answering | CodeCode Available | 0 |
| Exploring the Effectiveness of Object-Centric Representations in Visual Question Answering: Comparative Insights with Foundation Models | Jul 22, 2024 | Question AnsweringRepresentation Learning | —Unverified | 0 |
| QuIIL at T3 challenge: Towards Automation in Life-Saving Intervention Procedures from First-Person View | Jul 18, 2024 | Action AnticipationAction Recognition | CodeCode Available | 0 |
| ProcTag: Process Tagging for Assessing the Efficacy of Document Instruction Data | Jul 17, 2024 | Question AnsweringVisual Question Answering | —Unverified | 0 |
| Multimodal Reranking for Knowledge-Intensive Visual Question Answering | Jul 17, 2024 | Answer GenerationQuestion Answering | —Unverified | 0 |
| EchoSight: Advancing Visual-Language Models with Wiki Knowledge | Jul 17, 2024 | ArticlesQuestion Answering | —Unverified | 0 |
| TM-PATHVQA:90000+ Textless Multilingual Questions for Medical Visual Question Answering | Jul 16, 2024 | Medical Visual Question AnsweringQuestion Answering | —Unverified | 0 |
| Benchmarking Vision Language Models for Cultural Understanding | Jul 15, 2024 | BenchmarkingQuestion Answering | —Unverified | 0 |
| Segmentation-guided Attention for Visual Question Answering from Remote Sensing Images | Jul 11, 2024 | Question AnsweringSegmentation | —Unverified | 0 |
| Extracting Training Data from Document-Based VQA Models | Jul 11, 2024 | MemorizationQuestion Answering | —Unverified | 0 |
| VQA-Diff: Exploiting VQA and Diffusion for Zero-Shot Image-to-3D Vehicle Asset Generation in Autonomous Driving | Jul 9, 2024 | Autonomous DrivingImage to 3D | —Unverified | 0 |
| Large Language Models Understand Layout | Jul 8, 2024 | Question AnsweringVisual Question Answering | CodeCode Available | 0 |
| Rethinking Visual Prompting for Multimodal Large Language Models with External Knowledge | Jul 5, 2024 | Instance SegmentationOptical Character Recognition (OCR) | —Unverified | 0 |
| Second Place Solution of WSDM2023 Toloka Visual Question Answering Challenge | Jul 5, 2024 | Cross-Modal RetrievalQuestion Answering | —Unverified | 0 |
| Black-box Model Ensembling for Textual and Visual Question Answering via Information Fusion | Jul 4, 2024 | Question AnsweringVisual Question Answering | CodeCode Available | 0 |
| BACON: Improving Clarity of Image Captions via Bag-of-Concept Graphs | Jul 3, 2024 | Image CaptioningImage Generation | —Unverified | 0 |
| MindBench: A Comprehensive Benchmark for Mind Map Structure Recognition and Analysis | Jul 3, 2024 | PositionQuestion Answering | —Unverified | 0 |
| InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output | Jul 3, 2024 | ArticlesImage Comprehension | —Unverified | 0 |
| Visual Robustness Benchmark for Visual Question Answering (VQA) | Jul 3, 2024 | Visual Question AnsweringVisual Question Answering (VQA) | CodeCode Available | 0 |
| Certainly Uncertain: A Benchmark and Metric for Multimodal Epistemic and Aleatoric Awareness | Jul 2, 2024 | Image CaptioningQuestion Answering | —Unverified | 0 |
| Assistive Image Annotation Systems with Deep Learning and Natural Language Capabilities: A Review | Jun 28, 2024 | Active LearningImage Captioning | —Unverified | 0 |
| The Illusion of Competence: Evaluating the Effect of Explanations on Users' Mental Models of Visual Question Answering Systems | Jun 27, 2024 | Question AnsweringVisual Question Answering | CodeCode Available | 0 |
| FlowVQA: Mapping Multimodal Logic in Visual Question Answering with Flowcharts | Jun 27, 2024 | Decision MakingLogical Reasoning | —Unverified | 0 |
| Disentangling Knowledge-based and Visual Reasoning by Question Decomposition in KB-VQA | Jun 27, 2024 | General KnowledgeQuestion Answering | —Unverified | 0 |
| Enhancing Continual Learning in Visual Question Answering with Modality-Aware Feature Distillation | Jun 27, 2024 | Continual LearningQuestion Answering | CodeCode Available | 0 |
| Evaluating Fairness in Large Vision-Language Models Across Diverse Demographic Attributes and Prompts | Jun 25, 2024 | FairnessQuestion Answering | CodeCode Available | 0 |
| Claude 3.5 Sonnet Model Card Addendum | Jun 24, 2024 | Code GenerationMMR total | —Unverified | 0 |
| GPT-4V Explorations: Mining Autonomous Driving | Jun 24, 2024 | Autonomous DrivingDecision Making | —Unverified | 0 |
| MM-SpuBench: Towards Better Understanding of Spurious Biases in Multimodal LLMs | Jun 24, 2024 | Question AnsweringVisual Question Answering | —Unverified | 0 |
| MR-MLLM: Mutual Reinforcement of Multimodal Comprehension and Vision Perception | Jun 22, 2024 | Common Sense ReasoningLanguage Modelling | —Unverified | 0 |
| Tri-VQA: Triangular Reasoning Medical Visual Question Answering for Multi-Attribute Analysis | Jun 21, 2024 | AttributeMedical Visual Question Answering | —Unverified | 0 |
| Does Object Grounding Really Reduce Hallucination of Large Vision-Language Models? | Jun 20, 2024 | Caption GenerationHallucination | —Unverified | 0 |