| Attribute Diversity Determines the Systematicity Gap in VQA | Nov 15, 2023 | Attribute, Diagnostic | Code Available | 0 |
| Asking More Informative Questions for Grounded Retrieval | Nov 14, 2023 | Question Answering, Question Selection | Unverified | 0 |
| What Large Language Models Bring to Text-rich VQA? | Nov 13, 2023 | Image Comprehension, Optical Character Recognition (OCR) | Unverified | 0 |
| Visual Commonsense based Heterogeneous Graph Contrastive Learning | Nov 11, 2023 | Contrastive Learning, Question Answering | Unverified | 0 |
| Zero-shot Translation of Attention Patterns in VQA Models to Natural Language | Nov 8, 2023 | Image Captioning, Language Modeling | Code Available | 0 |
| CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding | Nov 6, 2023 | CoLA, Question Answering | Unverified | 0 |
| From Image to Language: A Critical Analysis of Visual Question Answering (VQA) Approaches, Challenges, and Opportunities | Nov 1, 2023 | Navigate, Question Answering | Unverified | 0 |
| VQA-GEN: A Visual Question Answering Benchmark for Domain Generalization | Nov 1, 2023 | Domain Generalization, Question Answering | Unverified | 0 |
| A Systematic Evaluation of GPT-4V's Multimodal Capability for Medical Image Analysis | Oct 31, 2023 | Descriptive, Medical Image Analysis | Unverified | 0 |
| Learning to Follow Object-Centric Image Editing Instructions Faithfully | Oct 29, 2023 | Object, Question Answering | Code Available | 0 |
| Dynamic Task and Weight Prioritization Curriculum Learning for Multimodal Imagery | Oct 29, 2023 | Deep Learning, Multimodal Deep Learning | Code Available | 0 |
| ViCLEVR: A Visual Reasoning Dataset and Hybrid Multimodal Fusion Model for Visual Question Answering in Vietnamese | Oct 27, 2023 | Information Retrieval, Natural Language Queries | Code Available | 0 |
| Davidsonian Scene Graph: Improving Reliability in Fine-grained Evaluation for Text-to-Image Generation | Oct 27, 2023 | Image Generation, Question Answering | Unverified | 0 |
| Incorporating Probing Signals into Multimodal Machine Translation via Visual Question-Answering Pairs | Oct 26, 2023 | Attribute, Machine Translation | Code Available | 0 |
| CAD -- Contextual Multi-modal Alignment for Dynamic AVQA | Oct 25, 2023 | Audio-visual Question Answering, Audio-Visual Question Answering (AVQA) | Unverified | 0 |
| Exploring Question Decomposition for Zero-Shot VQA | Oct 25, 2023 | Question Answering, Visual Question Answering | Unverified | 0 |
| Enhancing Document Information Analysis with Multi-Task Pre-training: A Robust Approach for Information Extraction in Visually-Rich Documents | Oct 25, 2023 | All, Document Classification | Unverified | 0 |
| Multimodal Representations for Teacher-Guided Compositional Visual Reasoning | Oct 24, 2023 | Question Answering, Visual Question Answering | Unverified | 0 |
| Dataset Bias Mitigation in Multiple-Choice Visual Question Answering and Beyond | Oct 23, 2023 | counterfactual, Multiple-choice | Unverified | 0 |
| LXMERT Model Compression for Visual Question Answering | Oct 23, 2023 | model, Model Compression | Code Available | 0 |
| SILC: Improving Vision Language Pretraining with Self-Distillation | Oct 20, 2023 | Classification, Contrastive Learning | Unverified | 0 |
| A Simple Baseline for Knowledge-Based Visual Question Answering | Oct 20, 2023 | In-Context Learning, Question Answering | Code Available | 0 |
| RSAdapter: Adapting Multimodal Models for Remote Sensing Visual Question Answering | Oct 19, 2023 | Image Captioning, Question Answering | Code Available | 0 |
| UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models | Oct 17, 2023 | Attribute, Question Answering | Code Available | 0 |
| Exploring Sparse Spatial Relation in Graph Inference for Text-Based VQA | Oct 13, 2023 | Graph Learning, Object | Unverified | 0 |
| Enhancing BERT-Based Visual Question Answering through Keyword-Driven Sentence Selection | Oct 13, 2023 | Language Modeling, Language Modelling | Unverified | 0 |
| Ziya-Visual: Bilingual Large Vision-Language Model via Multi-Task Instruction Tuning | Oct 12, 2023 | Image Captioning, Image-text Retrieval | Unverified | 0 |
| Open-Set Knowledge-Based Visual Question Answering with Inference Paths | Oct 12, 2023 | Knowledge Graphs, Multi-class Classification | Code Available | 0 |
| Improving mitosis detection on histopathology images using large vision-language models | Oct 11, 2023 | Domain Generalization, Image Captioning | Unverified | 0 |
| Uncovering Hidden Connections: Iterative Search and Reasoning for Video-grounded Dialog | Oct 11, 2023 | Question Answering, Response Generation | Code Available | 0 |
| Jaeger: A Concatenation-Based Multi-Transformer VQA Model | Oct 11, 2023 | Dimensionality Reduction, model | Unverified | 0 |
| Solution for SMART-101 Challenge of ICCV Multi-modal Algorithmic Reasoning Task 2023 | Oct 10, 2023 | Decoder, object-detection | Unverified | 0 |
| Negative Object Presence Evaluation (NOPE) to Measure Object Hallucination in Vision-Language Models | Oct 9, 2023 | Hallucination, Object | Unverified | 0 |
| Causal Reasoning through Two Layers of Cognition for Improving Generalization in Visual Question Answering | Oct 9, 2023 | Answer Generation, Question Answering | Unverified | 0 |
| Lightweight In-Context Tuning for Multimodal Unified Models | Oct 8, 2023 | Image Captioning, In-Context Learning | Unverified | 0 |
| Improving Automatic VQA Evaluation Using Large Language Models | Oct 4, 2023 | In-Context Learning, Question Answering | Unverified | 0 |
| On the Cognition of Visual Question Answering Models and Human Intelligence: A Comparative Study | Oct 4, 2023 | Question Answering, Visual Question Answering | Unverified | 0 |
| SelfGraphVQA: A Self-Supervised Graph Neural Network for Scene-based Question Answering | Oct 3, 2023 | Graph Neural Network, Question Answering | Unverified | 0 |
| Human Mobility Question Answering (Vision Paper) | Oct 2, 2023 | Management, Question Answering | Unverified | 0 |
| Tackling VQA with Pretrained Foundation Models without Further Training | Sep 27, 2023 | Question Answering, Visual Question Answering | Unverified | 0 |
| KOSMOS-2.5: A Multimodal Literate Model | Sep 20, 2023 | document understanding, model | Unverified | 0 |
| Visual Question Answering in the Medical Domain | Sep 20, 2023 | Contrastive Learning, Medical Visual Question Answering | Unverified | 0 |
| Sentence Attention Blocks for Answer Grounding | Sep 20, 2023 | Question Answering, Sentence | Unverified | 0 |
| Syntax Tree Constrained Graph Network for Visual Question Answering | Sep 17, 2023 | Question Answering, Visual Question Answering | Unverified | 0 |
| D3: Data Diversity Design for Systematic Generalization in Visual Question Answering | Sep 15, 2023 | Diversity, Question Answering | Code Available | 0 |
| Rank2Tell: A Multimodal Driving Dataset for Joint Importance Ranking and Reasoning | Sep 12, 2023 | Autonomous Vehicles, Question Answering | Unverified | 0 |
| Interpretable Visual Question Answering via Reasoning Supervision | Sep 7, 2023 | Common Sense Reasoning, Question Answering | Unverified | 0 |
| Evaluation and Enhancement of Semantic Grounding in Large Vision-Language Models | Sep 7, 2023 | Question Answering, Visual Question Answering | Unverified | 0 |
| Physically Grounded Vision-Language Models for Robotic Manipulation | Sep 5, 2023 | Image Captioning, Language Modelling | Unverified | 0 |
| Towards Addressing the Misalignment of Object Proposal Evaluation for Vision-Language Tasks via Semantic Grounding | Sep 1, 2023 | Graph Generation, Image Captioning | Code Available | 0 |