| MMPKUBase: A Comprehensive and High-quality Chinese Multi-modal Knowledge Graph | Aug 3, 2024 | Attribute, Contrastive Learning | Unverified | 0 |
| Towards Flexible Evaluation for Generative Visual Question Answering | Aug 1, 2024 | Decoder, Generative Visual Question Answering | Code Available | 0 |
| MM-Vet v2: A Challenging Benchmark to Evaluate Large Multimodal Models for Integrated Capabilities | Aug 1, 2024 | Math, MM-Vet | Code Available | 3 |
| SimpleLLM4AD: An End-to-End Vision-Language Model with Graph Visual Question Answering for Autonomous Driving | Jul 31, 2024 | Autonomous Driving, Language Modeling | Unverified | 0 |
| Prompting Medical Large Vision-Language Models to Diagnose Pathologies by Visual Question Answering | Jul 31, 2024 | Diagnostic, Hallucination | Unverified | 0 |
| Boosting Audio Visual Question Answering via Key Semantic-Aware Cues | Jul 30, 2024 | Audio-visual Question Answering, Audio-Visual Question Answering (AVQA) | Code Available | 1 |
| Pyramid Coder: Hierarchical Code Generator for Compositional Visual Question Answering | Jul 30, 2024 | Code Generation, Question Answering | Unverified | 0 |
| Take A Step Back: Rethinking the Two Stages in Visual Reasoning | Jul 29, 2024 | Logical Reasoning, Question Answering | Unverified | 0 |
| VolDoGer: LLM-assisted Datasets for Domain Generalization in Vision-Language Tasks | Jul 29, 2024 | Deep Learning, Domain Generalization | Unverified | 0 |
| AdaCoder: Adaptive Prompt Compression for Programmatic Visual Question Answering | Jul 28, 2024 | Question Answering, Visual Question Answering | Unverified | 0 |
| Towards A Generalizable Pathology Foundation Model via Unified Knowledge Distillation | Jul 26, 2024 | Knowledge Distillation, Question Answering | Code Available | 2 |
| VILA^2: VILA Augmented VILA | Jul 24, 2024 | Hallucination, Optical Character Recognition (OCR) | Unverified | 0 |
| INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model | Jul 23, 2024 | Language Modeling, Language Modelling | Code Available | 1 |
| Imperfect Vision Encoders: Efficient and Robust Tuning for Vision-Language Models | Jul 23, 2024 | Computational Efficiency, Image Captioning | Unverified | 0 |
| Learning Trimodal Relation for AVQA with Missing Modality | Jul 23, 2024 | Audio-visual Question Answering, Audio-Visual Question Answering (AVQA) | Code Available | 1 |
| Exploring the Effectiveness of Object-Centric Representations in Visual Question Answering: Comparative Insights with Foundation Models | Jul 22, 2024 | Question Answering, Representation Learning | Unverified | 0 |
| Knowledge Acquisition Disentanglement for Knowledge-based Visual Question Answering with Large Language Models | Jul 22, 2024 | Disentanglement, Question Answering | Code Available | 0 |
| HaloQuest: A Visual Hallucination Dataset for Advancing Multimodal Reasoning | Jul 22, 2024 | Benchmarking, Hallucination | Code Available | 1 |
| MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity | Jul 22, 2024 | Diversity, Multiple-choice | Code Available | 2 |
| QuIIL at T3 challenge: Towards Automation in Life-Saving Intervention Procedures from First-Person View | Jul 18, 2024 | Action Anticipation, Action Recognition | Code Available | 0 |
| Visual Haystacks: A Vision-Centric Needle-In-A-Haystack Benchmark | Jul 18, 2024 | GPU, Image Retrieval | Code Available | 1 |
| Multimodal Reranking for Knowledge-Intensive Visual Question Answering | Jul 17, 2024 | Answer Generation, Question Answering | Unverified | 0 |
| ProcTag: Process Tagging for Assessing the Efficacy of Document Instruction Data | Jul 17, 2024 | Question Answering, Visual Question Answering | Unverified | 0 |
| EchoSight: Advancing Visual-Language Models with Wiki Knowledge | Jul 17, 2024 | Articles, Question Answering | Unverified | 0 |
| TM-PATHVQA:90000+ Textless Multilingual Questions for Medical Visual Question Answering | Jul 16, 2024 | Medical Visual Question Answering, Question Answering | Unverified | 0 |