| Aligned Vector Quantization for Edge-Cloud Collabrative Vision-Language Models | Nov 8, 2024 | Quantization, Question Answering | Unverified | 0 |
| Seeing is Deceiving: Exploitation of Visual Pathways in Multi-Modal Language Models | Nov 7, 2024 | Adversarial Attack, Image Captioning | Unverified | 0 |
| SaSR-Net: Source-Aware Semantic Representation Network for Enhancing Audio-Visual Question Answering | Nov 7, 2024 | Audio-visual Question Answering, Audio-Visual Question Answering (AVQA) | Unverified | 0 |
| M3DocRAG: Multi-modal Retrieval is What You Need for Multi-page Multi-document Understanding | Nov 7, 2024 | document understanding, Optical Character Recognition | Unverified | 0 |
| NeurIPS 2023 Competition: Privacy Preserving Federated Learning Document VQA | Nov 6, 2024 | Federated Learning, Language Modelling | Unverified | 0 |
| VQA^2: Visual Question Answering for Video Quality Assessment | Nov 6, 2024 | Question Answering, Video Quality Assessment | Code Available | 2 |
| Select2Plan: Training-Free ICL-Based Planning through VQA and Memory Retrieval | Nov 6, 2024 | Autonomous Navigation, In-Context Learning | Unverified | 0 |
| From Pixels to Prose: Advancing Multi-Modal Language Models for Remote Sensing | Nov 5, 2024 | Change Detection, Contrastive Learning | Unverified | 0 |
| Benchmarking Multimodal Retrieval Augmented Generation with Dynamic VQA Dataset and Self-adaptive Planning Agent | Nov 5, 2024 | Benchmarking, Hallucination | Code Available | 3 |
| Multimodal Commonsense Knowledge Distillation for Visual Question Answering | Nov 5, 2024 | Knowledge Distillation, Question Answering | Unverified | 0 |
| MME-Finance: A Multimodal Finance Benchmark for Expert-level Understanding and Reasoning | Nov 5, 2024 | MME, Question Answering | Unverified | 0 |
| One VLM to Keep it Learning: Generation and Balancing for Data-free Continual Visual Question Answering | Nov 4, 2024 | Continual Learning, Question Answering | Unverified | 0 |
| Goal-Oriented Semantic Communication for Wireless Visual Question Answering | Nov 3, 2024 | Edge-computing, Question Answering | Unverified | 0 |
| A Visual Question Answering Method for SAR Ship: Breaking the Requirement for Multimodal Dataset Construction and Model Fine-Tuning | Nov 3, 2024 | object-detection, Object Detection | Unverified | 0 |
| RS-MoE: Mixture of Experts for Remote Sensing Image Captioning and Visual Question Answering | Nov 3, 2024 | Descriptive, Image Captioning | Unverified | 0 |
| Designing a Robust Radiology Report Generation System | Nov 2, 2024 | Decision Making, Diagnostic | Unverified | 0 |
| Right this way: Can VLMs Guide Us to See More to Answer Questions? | Nov 1, 2024 | Question Answering, Visual Question Answering | Code Available | 0 |
| Show Me What and Where has Changed? Question Answering and Grounding for Remote Sensing Change Detection | Oct 31, 2024 | Change Detection, Question Answering | Code Available | 1 |
| Nearest Neighbor Normalization Improves Multimodal Retrieval | Oct 31, 2024 | Cross-Modal Retrieval, Image Captioning | Code Available | 1 |
| SimpsonsVQA: Enhancing Inquiry-Based Learning with a Tailored Dataset | Oct 30, 2024 | Question Answering, Visual Question Answering | Unverified | 0 |
| GRADE: Quantifying Sample Diversity in Text-to-Image Models | Oct 29, 2024 | Attribute, Diversity | Unverified | 0 |
| Are VLMs Really Blind | Oct 29, 2024 | Language Modeling, Language Modelling | Code Available | 0 |
| Few-Shot Multimodal Explanation for Visual Question Answering | Oct 28, 2024 | Explainable artificial intelligence, Explainable Artificial Intelligence (XAI) | Code Available | 0 |
| Attention Overlap Is Responsible for The Entity Missing Problem in Text-to-image Diffusion Models! | Oct 28, 2024 | Denoising, Question Answering | Unverified | 0 |
| Face-MLLM: A Large Face Perception Model | Oct 28, 2024 | Attribute, model | Unverified | 0 |