| Privacy-Aware Document Visual Question Answering | Dec 15, 2023 | document understanding, Federated Learning | Code Available | 1 |
| VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation | Dec 14, 2023 | Image Captioning, Image Generation | Code Available | 1 |
| ViLA: Efficient Video-Language Alignment for Video Question Answering | Dec 13, 2023 | cross-modal alignment, Language Modeling | Code Available | 1 |
| Hallucination Augmented Contrastive Learning for Multimodal Large Language Model | Dec 12, 2023 | Contrastive Learning, Hallucination | Code Available | 1 |
| Genixer: Empowering Multimodal Large Language Models as a Powerful Data Generator | Dec 11, 2023 | Image Captioning, Question Answering | Code Available | 1 |
| NuScenes-MQA: Integrated Evaluation of Captions and QA for Autonomous Driving Datasets using Markup Annotations | Dec 11, 2023 | Autonomous Driving, Descriptive | Code Available | 1 |
| Language-Informed Visual Concept Learning | Dec 6, 2023 | Disentanglement, Novel Concepts | Code Available | 1 |
| BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models | Dec 5, 2023 | Benchmarking, Visual Question Answering | Code Available | 1 |
| Good Questions Help Zero-Shot Image Reasoning | Dec 4, 2023 | Fine-Grained Image Classification, Question Answering | Code Available | 1 |
| How to Configure Good In-Context Sequence for Visual Question Answering | Dec 4, 2023 | In-Context Learning, Question Answering | Code Available | 1 |
| Recursive Visual Programming | Dec 4, 2023 | Code Generation, Question Answering | Code Available | 1 |
| Emergent Open-Vocabulary Semantic Segmentation from Off-the-shelf Vision-Language Models | Nov 28, 2023 | Image Captioning, Image-text matching | Code Available | 1 |
| EgoThink: Evaluating First-Person Perspective Thinking Capability of Vision-Language Models | Nov 27, 2023 | Attribute, Question Answering | Code Available | 1 |
| A Comprehensive Evaluation of GPT-4V on Knowledge-Intensive Visual Question Answering | Nov 13, 2023 | Decision Making, Explanation Generation | Code Available | 1 |
| Volcano: Mitigating Multimodal Hallucination through Self-Feedback Guided Revision | Nov 13, 2023 | Hallucination, MM-Vet | Code Available | 1 |
| InfMLLM: A Unified Framework for Visual-Language Tasks | Nov 12, 2023 | GPU, Image Captioning | Code Available | 1 |
| GENOME: GenerativE Neuro-symbOlic visual reasoning by growing and reusing ModulEs | Nov 8, 2023 | Question Answering, Referring Expression | Code Available | 1 |
| GPT-4V-AD: Exploring Grounding Potential of VQA-oriented GPT-4V for Zero-shot Anomaly Detection | Nov 5, 2023 | Anomaly Detection, Question Answering | Code Available | 1 |
| Language Guided Visual Question Answering: Elevate Your Multimodal Language Model Using Knowledge-Enriched Prompts | Oct 31, 2023 | Image Captioning, Language Modeling | Code Available | 1 |
| Making Large Language Models Better Data Creators | Oct 31, 2023 | Instruction Following, Prompt Engineering | Code Available | 1 |
| Multimodal ChatGPT for Medical Applications: an Experimental Study of GPT-4V | Oct 29, 2023 | Diagnostic, Language Modeling | Code Available | 1 |
| EHRXQA: A Multi-Modal Question Answering Dataset for Electronic Health Records with Chest X-ray Images | Oct 28, 2023 | Decision Making, Medical Visual Question Answering | Code Available | 1 |
| 3D-Aware Visual Question Answering about Parts, Poses and Occlusions | Oct 27, 2023 | Question Answering, Visual Question Answering | Code Available | 1 |
| AntifakePrompt: Prompt-Tuned Vision-Language Models are Fake Image Detectors | Oct 26, 2023 | DeepFake Detection, Face Swapping | Code Available | 1 |
| Towards Perceiving Small Visual Details in Zero-shot Visual Question Answering with Multimodal LLMs | Oct 24, 2023 | Question Answering, Visual Question Answering | Code Available | 1 |
| Rephrase, Augment, Reason: Visual Grounding of Questions for Vision-Language Models | Oct 9, 2023 | Language Modelling, Question Answering | Code Available | 1 |
| VDC: Versatile Data Cleanser based on Visual-Linguistic Inconsistency by Multimodal Large Language Models | Sep 28, 2023 | Backdoor Attack, cross-modal alignment | Code Available | 1 |
| Toloka Visual Question Answering Benchmark | Sep 28, 2023 | Question Answering, Visual Question Answering | Code Available | 1 |
| TextBind: Multi-turn Interleaved Multimodal Instruction-following in the Wild | Sep 14, 2023 | Decoder, Instruction Following | Code Available | 1 |
| A Survey on Interpretable Cross-modal Reasoning | Sep 5, 2023 | Cross-Modal Retrieval, Decision Making | Code Available | 1 |
| UniPT: Universal Parallel Tuning for Transfer Learning with Efficient Parameter and Memory | Aug 28, 2023 | Question Answering, Retrieval | Code Available | 1 |
| Towards Vision-Language Mechanistic Interpretability: A Causal Tracing Tool for BLIP | Aug 27, 2023 | Question Answering, Text Generation | Code Available | 1 |
| InstructionGPT-4: A 200-Instruction Paradigm for Fine-Tuning MiniGPT-4 | Aug 23, 2023 | Instruction Following, Question Answering | Code Available | 1 |
| StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data | Aug 20, 2023 | Visual Question Answering | Code Available | 1 |
| Uni-NLX: Unifying Textual Explanations for Vision and Vision-Language Tasks | Aug 17, 2023 | Question Answering, Text Generation | Code Available | 1 |
| Pro-Cap: Leveraging a Frozen Vision-Language Model for Hateful Meme Detection | Aug 16, 2023 | Image Captioning, Language Modeling | Code Available | 1 |
| Detecting and Preventing Hallucinations in Large Vision Language Models | Aug 11, 2023 | 16k, Hallucination | Code Available | 1 |
| Foundation Model is Efficient Multimodal Multitask Model Selector | Aug 11, 2023 | model, Model Selection | Code Available | 1 |
| Progressive Spatio-temporal Perception for Audio-Visual Question Answering | Aug 10, 2023 | Audio-visual Question Answering, Audio-Visual Question Answering (AVQA) | Code Available | 1 |
| SciGraphQA: A Large-Scale Synthetic Multi-Turn Question-Answering Dataset for Scientific Graphs | Aug 7, 2023 | Question Answering, Visual Question Answering | Code Available | 1 |
| Expert Knowledge-Aware Image Difference Graph Representation Learning for Difference-Aware Medical Visual Question Answering | Jul 22, 2023 | Graph Representation Learning, Language Modeling | Code Available | 1 |
| Explaining Autonomous Driving Actions with Visual Question Answering | Jul 19, 2023 | Autonomous Driving, Autonomous Vehicles | Code Available | 1 |
| CAT-ViL: Co-Attention Gated Vision-Language Embedding for Visual Question Localized-Answering in Robotic Surgery | Jul 11, 2023 | Question Answering, Scene Understanding | Code Available | 1 |
| Rad-ReStruct: A Novel VQA Benchmark and Method for Structured Radiology Reporting | Jul 11, 2023 | Medical Visual Question Answering, Question Answering | Code Available | 1 |
| Self-Adaptive Sampling for Efficient Video Question-Answering on Image--Text Models | Jul 9, 2023 | Question Answering, TGIF-Frame | Code Available | 1 |
| Localized Questions in Medical Visual Question Answering | Jul 3, 2023 | Medical Visual Question Answering, Question Answering | Code Available | 1 |
| Multimodal Prompt Retrieval for Generative Visual Question Answering | Jun 30, 2023 | Domain Adaptation, Generative Visual Question Answering | Code Available | 1 |
| Answer Mining from a Pool of Images: Towards Retrieval-Based Visual Question Answering | Jun 29, 2023 | Answer Generation, Question Answering | Code Available | 1 |
| Kosmos-2: Grounding Multimodal Large Language Models to the World | Jun 26, 2023 | Image Captioning, In-Context Learning | Code Available | 1 |
| Investigating Prompting Techniques for Zero- and Few-Shot Visual Question Answering | Jun 16, 2023 | Image Captioning, Question Answering | Code Available | 1 |