| Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning | Dec 4, 2024 | Multimodal Large Language Model, Video Understanding | Code Available | 1 |
| Explainable and Interpretable Multimodal Large Language Models: A Comprehensive Survey | Dec 3, 2024 | Cross-Modal Retrieval, Natural Language Understanding | Unverified | 0 |
| Copy-Move Forgery Detection and Question Answering for Remote Sensing Image | Dec 3, 2024 | Question Answering, Visual Question Answering | Code Available | 0 |
| CEGI: Measuring the trade-off between efficiency and carbon emissions for SLMs and VLMs | Dec 3, 2024 | Image Captioning, Quantization | Unverified | 0 |
| Understanding the World's Museums through Vision-Language Reasoning | Dec 2, 2024 | Benchmarking, Question Answering | Code Available | 0 |
| Dynamic-LLaVA: Efficient Multimodal Large Language Models via Dynamic Vision-language Context Sparsification | Dec 1, 2024 | GPU, Visual Question Answering | Code Available | 2 |
| DLaVA: Document Language and Vision Assistant for Answer Localization with Enhanced Interpretability and Trustworthiness | Nov 29, 2024 | Optical Character Recognition (OCR), Question Answering | Code Available | 0 |
| SURE-VQA: Systematic Understanding of Robustness Evaluation in Medical VQA Tasks | Nov 29, 2024 | Question Answering, Visual Question Answering | Code Available | 0 |
| Sparse Attention Vectors: Generative Multimodal Model Features Are Discriminative Vision-Language Classifiers | Nov 28, 2024 | Image Captioning, image-classification | Unverified | 0 |
| Beyond Logit Lens: Contextual Embeddings for Robust Hallucination Detection & Grounding in VLMs | Nov 28, 2024 | Attribute, Hallucination | Unverified | 0 |