| Enhancing Cross-Prompt Transferability in Vision-Language Models through Contextual Injection of Target Tokens | Jun 19, 2024 | Caption Generation, image-classification | Code Available | 0 |
| Diversify, Rationalize, and Combine: Ensembling Multiple QA Strategies for Zero-shot Knowledge-based VQA | Jun 18, 2024 | Question Answering, Visual Question Answering | Code Available | 0 |
| Program Synthesis Benchmark for Visual Programming in XLogoOnline Environment | Jun 17, 2024 | Logical Reasoning, Math | Unverified | 0 |
| LLARVA: Vision-Action Instruction Tuning Enhances Robot Learning | Jun 17, 2024 | Image Captioning, Question Answering | Unverified | 0 |
| Mixture-of-Subspaces in Low-Rank Adaptation | Jun 16, 2024 | Common Sense Reasoning, Image Generation | Code Available | 0 |
| Beyond Raw Videos: Understanding Edited Videos with Large Multimodal Model | Jun 15, 2024 | Question Answering, Video Understanding | Code Available | 0 |
| Precision Empowers, Excess Distracts: Visual Question Answering With Dynamically Infused Knowledge In Language Models | Jun 14, 2024 | Decoder, Knowledge Graphs | Unverified | 0 |
| SHMamba: Structured Hyperbolic State Space Model for Audio-Visual Question Answering | Jun 14, 2024 | Audio-visual Question Answering, Audio-Visual Question Answering (AVQA) | Unverified | 0 |
| Detecting and Evaluating Medical Hallucinations in Large Vision Language Models | Jun 14, 2024 | Hallucination, Medical Visual Question Answering | Unverified | 0 |
| Optimizing Visual Question Answering Models for Driving: Bridging the Gap Between Human and Machine Attention Patterns | Jun 13, 2024 | Autonomous Driving, Question Answering | Unverified | 0 |
| Towards Multilingual Audio-Visual Question Answering | Jun 13, 2024 | Audio-visual Question Answering, Audio-Visual Question Answering (AVQA) | Code Available | 0 |
| DistilDoc: Knowledge Distillation for Visually-Rich Document Applications | Jun 12, 2024 | document-image-classification, Document Image Classification | Unverified | 0 |
| What If We Recaption Billions of Web Images with LLaMA-3? | Jun 12, 2024 | Cross-Modal Retrieval, Image Generation | Unverified | 0 |
| Benchmarking Vision-Language Contrastive Methods for Medical Representation Learning | Jun 11, 2024 | Benchmarking, Contrastive Learning | Code Available | 0 |
| Solution for SMART-101 Challenge of CVPR Multi-modal Algorithmic Reasoning Task 2024 | Jun 10, 2024 | Language Modelling, object-detection | Unverified | 0 |
| CVQA: Culturally-diverse Multilingual Visual Question Answering Benchmark | Jun 10, 2024 | Diversity, Question Answering | Unverified | 0 |
| Towards Semantic Equivalence of Tokenization in Multimodal LLM | Jun 7, 2024 | Visual Question Answering | Unverified | 0 |
| DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effective for LMMs | Jun 6, 2024 | Language Modelling, Large Language Model | Unverified | 0 |
| Understanding Information Storage and Transfer in Multi-modal Large Language Models | Jun 6, 2024 | Factual Visual Question Answering, Model Editing | Unverified | 0 |
| RoboMamba: Efficient Vision-Language-Action Model for Robotic Reasoning and Manipulation | Jun 6, 2024 | Common Sense Reasoning, Mamba | Unverified | 0 |
| Balancing Performance and Efficiency in Zero-shot Robotic Navigation | Jun 5, 2024 | Computational Efficiency, Question Answering | Unverified | 0 |
| Translation Deserves Better: Analyzing Translation Artifacts in Cross-lingual Visual Question Answering | Jun 4, 2024 | Data Augmentation, Machine Translation | Unverified | 0 |
| Story Generation from Visual Inputs: Techniques, Related Tasks, and Challenges | Jun 4, 2024 | Question Answering, Story Generation | Unverified | 0 |
| Diffusion-Refined VQA Annotations for Semi-Supervised Gaze Following | Jun 4, 2024 | Question Answering, Visual Question Answering | Code Available | 0 |
| Mixture of Rationale: Multi-Modal Reasoning Mixture for Visual Question Answering | Jun 3, 2024 | Diversity, Question Answering | Unverified | 0 |
| Selectively Answering Visual Questions | Jun 3, 2024 | Avg, In-Context Learning | Unverified | 0 |
| Video Question Answering for People with Visual Impairments Using an Egocentric 360-Degree Camera | May 30, 2024 | Question Answering, Video Question Answering | Unverified | 0 |
| VQA Training Sets are Self-play Environments for Generating Few-shot Pools | May 30, 2024 | Question Answering, Visual Question Answering | Unverified | 0 |
| Uncovering Bias in Large Vision-Language Models at Scale with Counterfactuals | May 30, 2024 | counterfactual, Question Answering | Unverified | 0 |
| MetaToken: Detecting Hallucination in Image Descriptions by Meta Classification | May 29, 2024 | Hallucination, Image Captioning | Unverified | 0 |
| Evaluating Zero-Shot GPT-4V Performance on 3D Visual Question Answering Benchmarks | May 29, 2024 | Question Answering, Visual Question Answering | Unverified | 0 |
| Data-augmented phrase-level alignment for mitigating object hallucination | May 28, 2024 | Data Augmentation, Hallucination | Unverified | 0 |
| MMCTAgent: Multi-modal Critical Thinking Agent Framework for Complex Visual Reasoning | May 28, 2024 | Decision Making, Video Understanding | Unverified | 0 |
| Prompt-Aware Adapter: Towards Learning Adaptive Visual Tokens for Multimodal Large Language Models | May 24, 2024 | Question Answering, Visual Question Answering | Unverified | 0 |
| Reframing Spatial Reasoning Evaluation in Language Models: A Real-World Simulation Benchmark for Qualitative Reasoning | May 23, 2024 | Logical Reasoning Question Answering, Spatial Reasoning | Code Available | 0 |
| AlignGPT: Multi-modal Large Language Models with Adaptive Alignment Capability | May 23, 2024 | cross-modal alignment, Language Modelling | Unverified | 0 |
| SearchLVLMs: A Plug-and-Play Framework for Augmenting Large Vision-Language Models by Searching Up-to-Date Internet Knowledge | May 23, 2024 | Question Answering, RAG | Unverified | 0 |
| Image-of-Thought Prompting for Visual Reasoning Refinement in Multimodal Large Language Models | May 22, 2024 | Multimodal Reasoning, Visual Question Answering | Unverified | 0 |
| Dataset and Benchmark for Urdu Natural Scenes Text Detection, Recognition and Visual Question Answering | May 21, 2024 | Diversity, Information Retrieval | Code Available | 0 |
| Inquire, Interact, and Integrate: A Proactive Agent Collaborative Framework for Zero-Shot Multimodal Medical Reasoning | May 19, 2024 | Multimodal Reasoning, Question Answering | Unverified | 0 |
| EyeFound: A Multimodal Generalist Foundation Model for Ophthalmic Imaging | May 18, 2024 | Question Answering, Visual Question Answering | Unverified | 0 |
| StackOverflowVQA: Stack Overflow Visual Question Answering Dataset | May 17, 2024 | Question Answering, Sentence | Unverified | 0 |
| CLIP-Powered TASS: Target-Aware Single-Stream Network for Audio-Visual Question Answering | May 13, 2024 | Audio-visual Question Answering, Audio-Visual Question Answering (AVQA) | Unverified | 0 |
| Realizing Visual Question Answering for Education: GPT-4V as a Multimodal AI | May 12, 2024 | Question Answering, Visual Question Answering | Unverified | 0 |
| Federated Document Visual Question Answering: A Pilot Study | May 10, 2024 | Federated Learning, Question Answering | Code Available | 0 |
| Is the House Ready For Sleeptime? Generating and Evaluating Situational Queries for Embodied Question Answering | May 8, 2024 | 2k, Embodied Question Answering | Unverified | 0 |
| Language-Image Models with 3D Understanding | May 6, 2024 | Question Answering, Visual Question Answering | Unverified | 0 |
| VSA4VQA: Scaling a Vector Symbolic Architecture to Visual Question Answering on Natural Images | May 6, 2024 | Attribute, Language Modeling | Unverified | 0 |
| Advancing Multimodal Medical Capabilities of Gemini | May 6, 2024 | Computed Tomography (CT), image-classification | Unverified | 0 |
| Enhanced Textual Feature Extraction for Visual Question Answering: A Simple Convolutional Approach | May 1, 2024 | Computational Efficiency, Question Answering | Unverified | 0 |