| Aligned Vector Quantization for Edge-Cloud Collabrative Vision-Language Models | Nov 8, 2024 | Quantization, Question Answering | Unverified | 0 |
| Seeing is Deceiving: Exploitation of Visual Pathways in Multi-Modal Language Models | Nov 7, 2024 | Adversarial Attack, Image Captioning | Unverified | 0 |
| SaSR-Net: Source-Aware Semantic Representation Network for Enhancing Audio-Visual Question Answering | Nov 7, 2024 | Audio-visual Question Answering, Audio-Visual Question Answering (AVQA) | Unverified | 0 |
| M3DocRAG: Multi-modal Retrieval is What You Need for Multi-page Multi-document Understanding | Nov 7, 2024 | document understanding, Optical Character Recognition | Unverified | 0 |
| NeurIPS 2023 Competition: Privacy Preserving Federated Learning Document VQA | Nov 6, 2024 | Federated Learning, Language Modelling | Unverified | 0 |
| VQA^2: Visual Question Answering for Video Quality Assessment | Nov 6, 2024 | Question Answering, Video Quality Assessment | Code Available | 2 |
| Select2Plan: Training-Free ICL-Based Planning through VQA and Memory Retrieval | Nov 6, 2024 | Autonomous Navigation, In-Context Learning | Unverified | 0 |
| From Pixels to Prose: Advancing Multi-Modal Language Models for Remote Sensing | Nov 5, 2024 | Change Detection, Contrastive Learning | Unverified | 0 |
| Benchmarking Multimodal Retrieval Augmented Generation with Dynamic VQA Dataset and Self-adaptive Planning Agent | Nov 5, 2024 | Benchmarking, Hallucination | Code Available | 3 |
| MME-Finance: A Multimodal Finance Benchmark for Expert-level Understanding and Reasoning | Nov 5, 2024 | MME, Question Answering | Unverified | 0 |
| Multimodal Commonsense Knowledge Distillation for Visual Question Answering | Nov 5, 2024 | Knowledge Distillation, Question Answering | Unverified | 0 |
| One VLM to Keep it Learning: Generation and Balancing for Data-free Continual Visual Question Answering | Nov 4, 2024 | Continual Learning, Question Answering | Unverified | 0 |
| Goal-Oriented Semantic Communication for Wireless Visual Question Answering | Nov 3, 2024 | Edge-computing, Question Answering | Unverified | 0 |
| A Visual Question Answering Method for SAR Ship: Breaking the Requirement for Multimodal Dataset Construction and Model Fine-Tuning | Nov 3, 2024 | object-detection, Object Detection | Unverified | 0 |
| RS-MoE: Mixture of Experts for Remote Sensing Image Captioning and Visual Question Answering | Nov 3, 2024 | Descriptive, Image Captioning | Unverified | 0 |
| Designing a Robust Radiology Report Generation System | Nov 2, 2024 | Decision Making, Diagnostic | Unverified | 0 |
| Right this way: Can VLMs Guide Us to See More to Answer Questions? | Nov 1, 2024 | Question Answering, Visual Question Answering | Code Available | 0 |
| Show Me What and Where has Changed? Question Answering and Grounding for Remote Sensing Change Detection | Oct 31, 2024 | Change Detection, Question Answering | Code Available | 1 |
| Nearest Neighbor Normalization Improves Multimodal Retrieval | Oct 31, 2024 | Cross-Modal Retrieval, Image Captioning | Code Available | 1 |
| SimpsonsVQA: Enhancing Inquiry-Based Learning with a Tailored Dataset | Oct 30, 2024 | Question Answering, Visual Question Answering | Unverified | 0 |
| GRADE: Quantifying Sample Diversity in Text-to-Image Models | Oct 29, 2024 | Attribute, Diversity | Unverified | 0 |
| Are VLMs Really Blind | Oct 29, 2024 | Language Modeling, Language Modelling | Code Available | 0 |
| Few-Shot Multimodal Explanation for Visual Question Answering | Oct 28, 2024 | Explainable artificial intelligence, Explainable Artificial Intelligence (XAI) | Code Available | 0 |
| Attention Overlap Is Responsible for The Entity Missing Problem in Text-to-image Diffusion Models! | Oct 28, 2024 | Denoising, Question Answering | Unverified | 0 |
| Face-MLLM: A Large Face Perception Model | Oct 28, 2024 | Attribute, model | Unverified | 0 |
| Efficient Bilinear Attention-based Fusion for Medical Visual Question Answering | Oct 28, 2024 | Computational Efficiency, Decision Making | Unverified | 0 |
| AutoBench-V: Can Large Vision-Language Models Benchmark Themselves? | Oct 28, 2024 | Benchmarking, Question Answering | Code Available | 0 |
| R-LLaVA: Improving Med-VQA Understanding through Visual Region of Interest | Oct 27, 2024 | Medical Visual Question Answering, Multiple-choice | Unverified | 0 |
| Sensor2Text: Enabling Natural Language Interactions for Daily Activity Tracking Using Wearable Sensors | Oct 26, 2024 | Question Answering, Transfer Learning | Unverified | 0 |
| GiVE: Guiding Visual Encoder to Perceive Overlooked Information | Oct 26, 2024 | Object, Question Answering | Unverified | 0 |
| Visual Text Matters: Improving Text-KVQA with Visual Text Entity Knowledge-aware Large Multimodal Assistant | Oct 24, 2024 | Entity Linking, Question Answering | Code Available | 0 |
| Interpretable Bilingual Multimodal Large Language Model for Diverse Biomedical Tasks | Oct 24, 2024 | image-classification, Image Classification | Unverified | 0 |
| Which Client is Reliable?: A Reliable and Personalized Prompt-based Federated Learning for Medical Image Question Answering | Oct 23, 2024 | Federated Learning, Medical Visual Question Answering | Unverified | 0 |
| ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning | Oct 23, 2024 | Image Captioning, Instruction Following | Code Available | 1 |
| Progressive Compositionality In Text-to-Image Generative Models | Oct 22, 2024 | Attribute, Contrastive Learning | Code Available | 1 |
| Order Matters: Exploring Order Sensitivity in Multimodal Large Language Models | Oct 22, 2024 | In-Context Learning, Question Answering | Unverified | 0 |
| Visual Question Answering in Ophthalmology: A Progressive and Practical Perspective | Oct 22, 2024 | Question Answering, Visual Question Answering | Unverified | 0 |
| Object-Centric Temporal Consistency via Conditional Autoregressive Inductive Biases | Oct 21, 2024 | Object, Question Answering | Unverified | 0 |
| Griffon-G: Bridging Vision-Language and Vision-Centric Tasks via Large Multimodal Models | Oct 21, 2024 | Instruction Following, object-detection | Unverified | 0 |
| CROPE: Evaluating In-Context Adaptation of Vision and Language Models to Culture-Specific Concepts | Oct 20, 2024 | Question Answering, Visual Question Answering | Code Available | 0 |
| ChitroJera: A Regionally Relevant Visual Question Answering Dataset for Bangla | Oct 19, 2024 | Question Answering, Visual Question Answering | Unverified | 0 |
| LLaVA-Ultra: Large Chinese Language and Vision Assistant for Ultrasound | Oct 19, 2024 | Instruction Following, Knowledge Distillation | Unverified | 0 |
| E3D-GPT: Enhanced 3D Visual Foundation for Medical Vision-Language Model | Oct 18, 2024 | Language Modeling, Language Modelling | Unverified | 0 |
| Zero-shot Action Localization via the Confidence of Large Vision-Language Models | Oct 18, 2024 | Action Localization, Language Modelling | Unverified | 0 |
| NaturalBench: Evaluating Vision-Language Models on Natural Adversarial Samples | Oct 18, 2024 | Attribute, Question Answering | Unverified | 0 |
| ViConsFormer: Constituting Meaningful Phrases of Scene Texts using Transformer-based Method in Vietnamese Text-based Visual Question Answering | Oct 18, 2024 | Question Answering, Visual Question Answering | Code Available | 0 |
| MultiChartQA: Benchmarking Vision-Language Models on Multi-Chart Problems | Oct 18, 2024 | Benchmarking, Question Answering | Code Available | 1 |
| Help Me Identify: Is an LLM+VQA System All We Need to Identify Visual Concepts? | Oct 17, 2024 | All, Language Modeling | Code Available | 0 |
| Improving Multi-modal Large Language Model through Boosting Vision Capabilities | Oct 17, 2024 | Decoder, Language Modeling | Unverified | 0 |
| H2OVL-Mississippi Vision Language Models Technical Report | Oct 17, 2024 | Document AI, Visual Question Answering | Unverified | 0 |