| Efficient Bilinear Attention-based Fusion for Medical Visual Question Answering | Oct 28, 2024 | Computational Efficiency, Decision Making | Unverified | 0 |
| R-LLaVA: Improving Med-VQA Understanding through Visual Region of Interest | Oct 27, 2024 | Medical Visual Question Answering, Multiple-choice | Unverified | 0 |
| Sensor2Text: Enabling Natural Language Interactions for Daily Activity Tracking Using Wearable Sensors | Oct 26, 2024 | Question Answering, Transfer Learning | Unverified | 0 |
| GiVE: Guiding Visual Encoder to Perceive Overlooked Information | Oct 26, 2024 | Object, Question Answering | Unverified | 0 |
| Interpretable Bilingual Multimodal Large Language Model for Diverse Biomedical Tasks | Oct 24, 2024 | image-classification, Image Classification | Unverified | 0 |
| Visual Text Matters: Improving Text-KVQA with Visual Text Entity Knowledge-aware Large Multimodal Assistant | Oct 24, 2024 | Entity Linking, Question Answering | Code Available | 0 |
| Which Client is Reliable?: A Reliable and Personalized Prompt-based Federated Learning for Medical Image Question Answering | Oct 23, 2024 | Federated Learning, Medical Visual Question Answering | Unverified | 0 |
| Order Matters: Exploring Order Sensitivity in Multimodal Large Language Models | Oct 22, 2024 | In-Context Learning, Question Answering | Unverified | 0 |
| Visual Question Answering in Ophthalmology: A Progressive and Practical Perspective | Oct 22, 2024 | Question Answering, Visual Question Answering | Unverified | 0 |
| Griffon-G: Bridging Vision-Language and Vision-Centric Tasks via Large Multimodal Models | Oct 21, 2024 | Instruction Following, object-detection | Unverified | 0 |
| Object-Centric Temporal Consistency via Conditional Autoregressive Inductive Biases | Oct 21, 2024 | Object, Question Answering | Unverified | 0 |
| CROPE: Evaluating In-Context Adaptation of Vision and Language Models to Culture-Specific Concepts | Oct 20, 2024 | Question Answering, Visual Question Answering | Code Available | 0 |
| LLaVA-Ultra: Large Chinese Language and Vision Assistant for Ultrasound | Oct 19, 2024 | Instruction Following, Knowledge Distillation | Unverified | 0 |
| ChitroJera: A Regionally Relevant Visual Question Answering Dataset for Bangla | Oct 19, 2024 | Question Answering, Visual Question Answering | Unverified | 0 |
| ViConsFormer: Constituting Meaningful Phrases of Scene Texts using Transformer-based Method in Vietnamese Text-based Visual Question Answering | Oct 18, 2024 | Question Answering, Visual Question Answering | Code Available | 0 |
| Zero-shot Action Localization via the Confidence of Large Vision-Language Models | Oct 18, 2024 | Action Localization, Language Modelling | Unverified | 0 |
| NaturalBench: Evaluating Vision-Language Models on Natural Adversarial Samples | Oct 18, 2024 | Attribute, Question Answering | Unverified | 0 |
| E3D-GPT: Enhanced 3D Visual Foundation for Medical Vision-Language Model | Oct 18, 2024 | Language Modeling, Language Modelling | Unverified | 0 |
| RescueADI: Adaptive Disaster Interpretation in Remote Sensing Images with Autonomous Agents | Oct 17, 2024 | Question Answering, Task Planning | Unverified | 0 |
| Improving Multi-modal Large Language Model through Boosting Vision Capabilities | Oct 17, 2024 | Decoder, Language Modeling | Unverified | 0 |
| Help Me Identify: Is an LLM+VQA System All We Need to Identify Visual Concepts? | Oct 17, 2024 | All, Language Modeling | Code Available | 0 |
| H2OVL-Mississippi Vision Language Models Technical Report | Oct 17, 2024 | Document AI, Visual Question Answering | Unverified | 0 |
| γ-MoD: Exploring Mixture-of-Depth Adaptation for Multimodal Large Language Models | Oct 17, 2024 | Visual Question Answering | Unverified | 0 |
| Cross-Modal Safety Mechanism Transfer in Large Vision-Language Models | Oct 16, 2024 | Visual Question Answering | Unverified | 0 |
| Difficult Task Yes but Simple Task No: Unveiling the Laziness in Multimodal LLMs | Oct 15, 2024 | Image Description, Multiple-choice | Code Available | 0 |
| OMCAT: Omni Context Aware Transformer | Oct 15, 2024 | Audio-visual Question Answering, Audio-Visual Question Answering (AVQA) | Unverified | 0 |
| MMAR: Towards Lossless Multi-Modal Auto-Regressive Probabilistic Modeling | Oct 14, 2024 | Denoising, Image Generation | Unverified | 0 |
| Eliminating the Language Bias for Visual Question Answering with fine-grained Causal Intervention | Oct 14, 2024 | Contrastive Learning, counterfactual | Unverified | 0 |
| Surgical-LLaVA: Toward Surgical Scenario Understanding via Large Language and Vision Models | Oct 13, 2024 | Instruction Following, Question Answering | Unverified | 0 |
| MMCOMPOSITION: Revisiting the Compositionality of Pre-trained Vision-Language Models | Oct 13, 2024 | Cross-Modal Retrieval, Question Answering | Unverified | 0 |
| VLFeedback: A Large-Scale AI Feedback Dataset for Large Vision-Language Models Alignment | Oct 12, 2024 | Diversity, Hallucination | Unverified | 0 |
| Declarative Knowledge Distillation from Large Language Models for Visual Question Answering Datasets | Oct 12, 2024 | Knowledge Distillation, Question Answering | Code Available | 0 |
| Zero-shot Commonsense Reasoning over Machine Imagination | Oct 12, 2024 | Question Answering, Visual Question Answering | Code Available | 0 |
| ViT3D Alignment of LLaMA3: 3D Medical Image Report Generation | Oct 11, 2024 | Diagnostic, Language Modeling | Unverified | 0 |
| Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training | Oct 10, 2024 | Mixture-of-Experts, Visual Question Answering | Unverified | 0 |
| Emerging Pixel Grounding in Large Multimodal Models Without Grounding Supervision | Oct 10, 2024 | Question Answering, Visual Question Answering | Unverified | 0 |
| PAR: Prompt-Aware Token Reduction Method for Efficient Large Multimodal Models | Oct 9, 2024 | Question Answering, Retrieval | Unverified | 0 |
| Beyond Captioning: Task-Specific Prompting for Improved VLM Performance in Mathematical Reasoning | Oct 8, 2024 | Image Retrieval, Math | Unverified | 0 |
| Multimodal Large Language Models and Tunings: Vision, Language, Sensors, Audio, and Beyond | Oct 8, 2024 | Question Answering, Visual Question Answering | Code Available | 0 |
| Core Tokensets for Data-efficient Sequential Training of Transformers | Oct 8, 2024 | Image Captioning, image-classification | Code Available | 0 |
| ERVQA: A Dataset to Benchmark the Readiness of Large Vision Language Models in Hospital Environments | Oct 8, 2024 | Decoder, Question Answering | Code Available | 0 |
| VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks | Oct 7, 2024 | Information Retrieval, Language Modeling | Unverified | 0 |
| MM-R^3: On (In-)Consistency of Multi-modal Large Language Models (MLLMs) | Oct 7, 2024 | Question Answering, Visual Question Answering | Unverified | 0 |
| TUBench: Benchmarking Large Vision-Language Models on Trustworthiness with Unanswerable Questions | Oct 5, 2024 | Benchmarking, Hallucination | Code Available | 0 |
| Gamified crowd-sourcing of high-quality data for visual fine-tuning | Oct 5, 2024 | Visual Question Answering | Unverified | 0 |
| Backdooring Vision-Language Models with Out-Of-Distribution Data | Oct 2, 2024 | Image Captioning, Image to text | Unverified | 0 |
| Why context matters in VQA and Reasoning: Semantic interventions for VLM input modalities | Oct 2, 2024 | Question Answering, Visual Question Answering | Unverified | 0 |
| BabelBench: An Omni Benchmark for Code-Driven Analysis of Multimodal and Multistructured Data | Oct 1, 2024 | Code Generation, Logical Reasoning | Code Available | 0 |
| FMBench: Benchmarking Fairness in Multimodal Large Language Models on Medical Tasks | Oct 1, 2024 | Benchmarking, Fairness | Unverified | 0 |
| Unleashing the Potentials of Likelihood Composition for Multi-modal Language Models | Oct 1, 2024 | Question Answering, Visual Question Answering | Code Available | 0 |