| CogVLM: Visual Expert for Pretrained Language Models | Nov 6, 2023 | 1 Image, 2*2 Stitching; FS-MEVQA | Code Available | 5 |
| GPT-4V-AD: Exploring Grounding Potential of VQA-oriented GPT-4V for Zero-shot Anomaly Detection | Nov 5, 2023 | Anomaly Detection; Question Answering | Code Available | 1 |
| VQA-GEN: A Visual Question Answering Benchmark for Domain Generalization | Nov 1, 2023 | Domain Generalization; Question Answering | Unverified | 0 |
| From Image to Language: A Critical Analysis of Visual Question Answering (VQA) Approaches, Challenges, and Opportunities | Nov 1, 2023 | Navigate; Question Answering | Unverified | 0 |
| Making Large Language Models Better Data Creators | Oct 31, 2023 | Instruction Following; Prompt Engineering | Code Available | 1 |
| Language Guided Visual Question Answering: Elevate Your Multimodal Language Model Using Knowledge-Enriched Prompts | Oct 31, 2023 | Image Captioning; Language Modeling | Code Available | 1 |
| A Systematic Evaluation of GPT-4V's Multimodal Capability for Medical Image Analysis | Oct 31, 2023 | Descriptive; Medical Image Analysis | Unverified | 0 |
| Learning to Follow Object-Centric Image Editing Instructions Faithfully | Oct 29, 2023 | Object; Question Answering | Code Available | 0 |
| Multimodal ChatGPT for Medical Applications: an Experimental Study of GPT-4V | Oct 29, 2023 | Diagnostic; Language Modeling | Code Available | 1 |
| Dynamic Task and Weight Prioritization Curriculum Learning for Multimodal Imagery | Oct 29, 2023 | Deep Learning; Multimodal Deep Learning | Code Available | 0 |
| EHRXQA: A Multi-Modal Question Answering Dataset for Electronic Health Records with Chest X-ray Images | Oct 28, 2023 | Decision Making; Medical Visual Question Answering | Code Available | 1 |
| 3D-Aware Visual Question Answering about Parts, Poses and Occlusions | Oct 27, 2023 | Question Answering; Visual Question Answering | Code Available | 1 |
| ViCLEVR: A Visual Reasoning Dataset and Hybrid Multimodal Fusion Model for Visual Question Answering in Vietnamese | Oct 27, 2023 | Information Retrieval; Natural Language Queries | Code Available | 0 |
| Davidsonian Scene Graph: Improving Reliability in Fine-grained Evaluation for Text-to-Image Generation | Oct 27, 2023 | Image Generation; Question Answering | Unverified | 0 |
| AntifakePrompt: Prompt-Tuned Vision-Language Models are Fake Image Detectors | Oct 26, 2023 | DeepFake Detection; Face Swapping | Code Available | 1 |
| Incorporating Probing Signals into Multimodal Machine Translation via Visual Question-Answering Pairs | Oct 26, 2023 | Attribute; Machine Translation | Code Available | 0 |
| Exploring Question Decomposition for Zero-Shot VQA | Oct 25, 2023 | Question Answering; Visual Question Answering | Unverified | 0 |
| Enhancing Document Information Analysis with Multi-Task Pre-training: A Robust Approach for Information Extraction in Visually-Rich Documents | Oct 25, 2023 | All; Document Classification | Unverified | 0 |
| CAD -- Contextual Multi-modal Alignment for Dynamic AVQA | Oct 25, 2023 | Audio-visual Question Answering; Audio-Visual Question Answering (AVQA) | Unverified | 0 |
| Towards Perceiving Small Visual Details in Zero-shot Visual Question Answering with Multimodal LLMs | Oct 24, 2023 | Question Answering; Visual Question Answering | Code Available | 1 |
| Multimodal Representations for Teacher-Guided Compositional Visual Reasoning | Oct 24, 2023 | Question Answering; Visual Question Answering | Unverified | 0 |
| LXMERT Model Compression for Visual Question Answering | Oct 23, 2023 | model; Model Compression | Code Available | 0 |
| Dataset Bias Mitigation in Multiple-Choice Visual Question Answering and Beyond | Oct 23, 2023 | counterfactual; Multiple-choice | Unverified | 0 |
| SILC: Improving Vision Language Pretraining with Self-Distillation | Oct 20, 2023 | Classification; Contrastive Learning | Unverified | 0 |
| A Simple Baseline for Knowledge-Based Visual Question Answering | Oct 20, 2023 | In-Context Learning; Question Answering | Code Available | 0 |
| RSAdapter: Adapting Multimodal Models for Remote Sensing Visual Question Answering | Oct 19, 2023 | Image Captioning; Question Answering | Code Available | 0 |
| Frozen Transformers in Language Models Are Effective Visual Encoder Layers | Oct 19, 2023 | Action Recognition; Image-text Retrieval | Code Available | 2 |
| UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models | Oct 17, 2023 | Attribute; Question Answering | Code Available | 0 |
| MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning | Oct 14, 2023 | Image Classification; Image Description | Code Available | 7 |
| Enhancing BERT-Based Visual Question Answering through Keyword-Driven Sentence Selection | Oct 13, 2023 | Language Modeling; Language Modelling | Unverified | 0 |
| From CLIP to DINO: Visual Encoders Shout in Multi-modal Large Language Models | Oct 13, 2023 | Hallucination; Image Captioning | Code Available | 2 |
| Exploring Sparse Spatial Relation in Graph Inference for Text-Based VQA | Oct 13, 2023 | Graph Learning; Object | Unverified | 0 |
| Open-Set Knowledge-Based Visual Question Answering with Inference Paths | Oct 12, 2023 | Knowledge Graphs; Multi-class Classification | Code Available | 0 |
| Ziya-Visual: Bilingual Large Vision-Language Model via Multi-Task Instruction Tuning | Oct 12, 2023 | Image Captioning; Image-text Retrieval | Unverified | 0 |
| Jaeger: A Concatenation-Based Multi-Transformer VQA Model | Oct 11, 2023 | Dimensionality Reduction; model | Unverified | 0 |
| Improving mitosis detection on histopathology images using large vision-language models | Oct 11, 2023 | Domain Generalization; Image Captioning | Unverified | 0 |
| Uncovering Hidden Connections: Iterative Search and Reasoning for Video-grounded Dialog | Oct 11, 2023 | Question Answering; Response Generation | Code Available | 0 |
| Solution for SMART-101 Challenge of ICCV Multi-modal Algorithmic Reasoning Task 2023 | Oct 10, 2023 | Decoder; object-detection | Unverified | 0 |
| Negative Object Presence Evaluation (NOPE) to Measure Object Hallucination in Vision-Language Models | Oct 9, 2023 | Hallucination; Object | Unverified | 0 |
| Rephrase, Augment, Reason: Visual Grounding of Questions for Vision-Language Models | Oct 9, 2023 | Language Modelling; Question Answering | Code Available | 1 |
| Causal Reasoning through Two Layers of Cognition for Improving Generalization in Visual Question Answering | Oct 9, 2023 | Answer Generation; Question Answering | Unverified | 0 |
| Lightweight In-Context Tuning for Multimodal Unified Models | Oct 8, 2023 | Image Captioning; In-Context Learning | Unverified | 0 |
| Improved Baselines with Visual Instruction Tuning | Oct 5, 2023 | Factual Inconsistency Detection in Chart Captioning; Image Classification | Code Available | 6 |
| On the Cognition of Visual Question Answering Models and Human Intelligence: A Comparative Study | Oct 4, 2023 | Question Answering; Visual Question Answering | Unverified | 0 |
| Improving Automatic VQA Evaluation Using Large Language Models | Oct 4, 2023 | In-Context Learning; Question Answering | Unverified | 0 |
| MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts | Oct 3, 2023 | Chatbot; Image Captioning | Code Available | 2 |
| SelfGraphVQA: A Self-Supervised Graph Neural Network for Scene-based Question Answering | Oct 3, 2023 | Graph Neural Network; Question Answering | Unverified | 0 |
| Human Mobility Question Answering (Vision Paper) | Oct 2, 2023 | Management; Question Answering | Unverified | 0 |
| Fine-grained Late-interaction Multi-modal Retrieval for Retrieval Augmented Visual Question Answering | Sep 29, 2023 | Image to text; Passage Retrieval | Code Available | 2 |
| Toloka Visual Question Answering Benchmark | Sep 28, 2023 | Question Answering; Visual Question Answering | Code Available | 1 |