| CREPE: Coordinate-Aware End-to-End Document Parser | May 1, 2024 | document understanding, Optical Character Recognition (OCR) | Unverified | 0 |
| Beyond Human Vision: The Role of Large Vision Language Models in Microscope Image Analysis | May 1, 2024 | Image Captioning, Question Answering | Unverified | 0 |
| Multi-Page Document Visual Question Answering using Self-Attention Scoring Mechanism | Apr 29, 2024 | document understanding, GPU | Code Available | 0 |
| Efficiency in Focus: LayerNorm as a Catalyst for Fine-tuning Medical Visual Language Pre-trained Models | Apr 25, 2024 | Medical Visual Question Answering, parameter-efficient fine-tuning | Unverified | 0 |
| How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites | Apr 25, 2024 | 4k, Language Modeling | Unverified | 0 |
| Fusion of Domain-Adapted Vision and Language Models for Medical Visual Question Answering | Apr 24, 2024 | Language Modeling, Language Modelling | Unverified | 0 |
| Grounded Knowledge-Enhanced Medical VLP for Chest X-Ray | Apr 23, 2024 | Medical Visual Question Answering, Question Answering | Unverified | 0 |
| Wiki-LLaVA: Hierarchical Retrieval-Augmented Generation for Multimodal LLMs | Apr 23, 2024 | Question Answering, Retrieval | Unverified | 0 |
| Self-Bootstrapped Visual-Language Model for Knowledge Selection and Question Answering | Apr 22, 2024 | Language Modeling, Language Modelling | Code Available | 0 |
| WangLab at MEDIQA-M3G 2024: Multimodal Medical Answer Generation using Large Language Models | Apr 22, 2024 | Answer Generation, image-classification | Unverified | 0 |
| Lost in Space: Probing Fine-grained Spatial Understanding in Vision and Language Resamplers | Apr 21, 2024 | Diagnostic, Image Captioning | Code Available | 0 |
| Exploring Diverse Methods in Visual Question Answering | Apr 21, 2024 | Question Answering, Visual Question Answering | Unverified | 0 |
| PDF-MVQA: A Dataset for Multimodal Information Retrieval in PDF-based Visual Question Answering | Apr 19, 2024 | Articles, Information Retrieval | Unverified | 0 |
| Look Before You Decide: Prompting Active Deduction of MLLMs for Assumptive Reasoning | Apr 19, 2024 | Benchmarking, counterfactual | Unverified | 0 |
| TextSquare: Scaling up Text-Centric Visual Instruction Tuning | Apr 19, 2024 | Hallucination, Hallucination Evaluation | Unverified | 0 |
| MedThink: Explaining Medical Visual Question Answering via Multimodal Decision-Making Rationale | Apr 18, 2024 | Decision Making, Medical Visual Question Answering | Unverified | 0 |
| Consistency and Uncertainty: Identifying Unreliable Responses From Black-Box Vision-Language Models for Selective Visual Question Answering | Apr 16, 2024 | Language Modelling, Prediction | Unverified | 0 |
| ViTextVQA: A Large-Scale Visual Question Answering Dataset for Evaluating Vietnamese Text Comprehension in Images | Apr 16, 2024 | Multimodal Deep Learning, Optical Character Recognition (OCR) | Code Available | 0 |
| Find The Gap: Knowledge Base Reasoning For Visual Question Answering | Apr 16, 2024 | Question Answering, Retrieval | Unverified | 0 |
| HOI-Ref: Hand-Object Interaction Referral in Egocentric Vision | Apr 15, 2024 | Object, Question Answering | Unverified | 0 |
| Bridging Vision and Language Spaces with Assignment Prediction | Apr 15, 2024 | Cross-Modal Retrieval, Image Captioning | Code Available | 0 |
| Language Models Meet Anomaly Detection for Better Interpretability and Generalizability | Apr 11, 2024 | Anomaly Detection, Language Modelling | Code Available | 0 |
| Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs | Apr 11, 2024 | Descriptive, Hallucination | Code Available | 0 |
| InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD | Apr 9, 2024 | 4k, Language Modeling | Code Available | 0 |
| OmniFusion Technical Report | Apr 9, 2024 | MM-Vet, TextVQA | Code Available | 0 |