| LIVE: Learnable In-Context Vector for Visual Question Answering | Jun 19, 2024 | In-Context Learning; Question Answering | Code Available | 1 |
| MMNeuron: Discovering Neuron-Level Domain-Specific Interpretation in Multimodal Large Language Model | Jun 17, 2024 | Language Modeling; Language Modelling | Code Available | 1 |
| MFC-Bench: Benchmarking Multimodal Fact-Checking with Large Vision-Language Models | Jun 17, 2024 | Benchmarking; Fact Checking | Code Available | 1 |
| VANE-Bench: Video Anomaly Evaluation Benchmark for Conversational LMMs | Jun 14, 2024 | Anomaly Detection; Benchmarking | Code Available | 1 |
| Vision-Language Models Meet Meteorology: Developing Models for Extreme Weather Events Detection with Heatmaps | Jun 14, 2024 | Question Answering; Visual Question Answering | Code Available | 1 |
| Advancing High Resolution Vision-Language Models in Biomedicine | Jun 12, 2024 | Language Modeling; Language Modelling | Code Available | 1 |
| VCR: A Task for Pixel-Level Complex Reasoning in Vision Language Models via Restoring Occluded Text | Jun 10, 2024 | Language Modeling; Language Modelling | Code Available | 1 |
| Re-ReST: Reflection-Reinforced Self-Training for Language Agents | Jun 3, 2024 | Code Generation; Image Generation | Code Available | 1 |
| Instruction-Guided Visual Masking | May 30, 2024 | Instruction Following; Visual Grounding | Code Available | 1 |
| Worse than Random? An Embarrassingly Simple Probing Evaluation of Large Multimodal Models in Medical VQA | May 30, 2024 | Diagnostic; Medical Diagnosis | Code Available | 1 |
| Reverse Image Retrieval Cues Parametric Memory in Multimodal LLMs | May 29, 2024 | Image Retrieval; Question Answering | Code Available | 1 |
| PitVQA: Image-grounded Text Embedding LLM for Visual Question Answering in Pituitary Surgery | May 22, 2024 | Question Answering; Visual Question Answering | Code Available | 1 |
| UniRAG: Universal Retrieval Augmentation for Large Vision Language Models | May 16, 2024 | Image Captioning; Image Generation | Code Available | 1 |
| Understanding Figurative Meaning through Explainable Visual Entailment | May 2, 2024 | Question Answering; Visual Entailment | Code Available | 1 |
| TableVQA-Bench: A Visual Question Answering Benchmark on Multiple Table Domains | Apr 30, 2024 | Language Modelling; Large Language Model | Code Available | 1 |
| ViOCRVQA: Novel Benchmark Dataset and Vision Reader for Visual Question Answering by Understanding Vietnamese Text in Images | Apr 29, 2024 | Optical Character Recognition; Optical Character Recognition (OCR) | Code Available | 1 |
| LaPA: Latent Prompt Assist Model For Medical Visual Question Answering | Apr 19, 2024 | Medical Visual Question Answering; Question Answering | Code Available | 1 |
| Look, Listen, and Answer: Overcoming Biases for Audio-Visual Question Answering | Apr 18, 2024 | Audio-visual Question Answering; Audio-Visual Question Answering (AVQA) | Code Available | 1 |
| Enhancing Visual Question Answering through Question-Driven Image Captions as Prompts | Apr 12, 2024 | Image Captioning; Question Answering | Code Available | 1 |
| CausalChaos! Dataset for Comprehensive Causal Action Question Answering Over Longer Causal Chains Grounded in Dynamic Visual Scenes | Apr 1, 2024 | Causal Discovery; Causal Discovery in Video Reasoning | Code Available | 1 |
| JDocQA: Japanese Document Question Answering Dataset for Generative Language Models | Mar 28, 2024 | Hallucination; Question Answering | Code Available | 1 |
| Beyond Embeddings: The Promise of Visual Table in Visual Reasoning | Mar 27, 2024 | Representation Learning; Visual Question Answering | Code Available | 1 |
| Quantifying and Mitigating Unimodal Biases in Multimodal Large Language Models: A Causal Perspective | Mar 27, 2024 | Question Answering; Visual Question Answering | Code Available | 1 |
| IllusionVQA: A Challenging Optical Illusion Dataset for Vision Language Models | Mar 23, 2024 | Common Sense Reasoning; In-Context Learning | Code Available | 1 |
| Multi-Agent VQA: Exploring Multi-Agent Foundation Models in Zero-Shot Visual Question Answering | Mar 21, 2024 | object-detection; Object Detection | Code Available | 1 |
| Language Repository for Long Video Understanding | Mar 21, 2024 | EgoSchema; Question Answering | Code Available | 1 |
| HyperLLaVA: Dynamic Visual and Language Expert Tuning for Multimodal Large Language Models | Mar 20, 2024 | MME; Visual Question Answering | Code Available | 1 |
| SQ-LLaVA: Self-Questioning for Large Vision-Language Assistant | Mar 17, 2024 | Language Modelling; Question Answering | Code Available | 1 |
| Can We Talk Models Into Seeing the World Differently? | Mar 14, 2024 | Image Captioning; Image Classification | Code Available | 1 |
| Multi-modal Auto-regressive Modeling via Visual Words | Mar 12, 2024 | Visual Question Answering; Visual Question Answering (VQA) | Code Available | 1 |
| Bridging the Gap between 2D and 3D Visual Question Answering: A Fusion Approach for 3D VQA | Feb 24, 2024 | 3D Question Answering (3D-QA); Question Answering | Code Available | 1 |
| Uncertainty-Aware Evaluation for Vision-Language Models | Feb 22, 2024 | Conformal Prediction; Language Modeling | Code Available | 1 |
| Visual Hallucinations of Multi-modal Large Language Models | Feb 22, 2024 | Diversity; Hallucination | Code Available | 1 |
| Cognitive Visual-Language Mapper: Advancing Multimodal Comprehension with Enhanced Visual Knowledge Alignment | Feb 21, 2024 | Language Modelling; Question Answering | Code Available | 1 |
| Multi-modal Preference Alignment Remedies Degradation of Visual Instruction Tuning on Language Models | Feb 16, 2024 | Diversity; Instruction Following | Code Available | 1 |
| Open-ended VQA benchmarking of Vision-Language models by exploiting Classification datasets and their semantic hierarchy | Feb 11, 2024 | Language Modeling; Open Vocabulary Attribute Detection | Code Available | 1 |
| Gemini Goes to Med School: Exploring the Capabilities of Multimodal Large Language Models on Medical Challenge Problems & Hallucinations | Feb 10, 2024 | Diagnostic; Hallucination | Code Available | 1 |
| Text-Guided Image Clustering | Feb 5, 2024 | Clustering; Image Captioning | Code Available | 1 |
| Q&A Prompts: Discovering Rich Visual Clues through Mining Question-Answer Prompts for VQA requiring Diverse World Knowledge | Jan 19, 2024 | Question Answering; Question Generation | Code Available | 1 |
| Veagle: Advancements in Multimodal Representation Learning | Jan 18, 2024 | Image Captioning; Language Modelling | Code Available | 1 |
| Question-Answer Cross Language Image Matching for Weakly Supervised Semantic Segmentation | Jan 18, 2024 | Contrastive Learning; Prompt Engineering | Code Available | 1 |
| Cross-modal Retrieval for Knowledge-based Visual Question Answering | Jan 11, 2024 | Cross-Modal Retrieval; Question Answering | Code Available | 1 |
| MISS: A Generative Pretraining and Finetuning Approach for Med-VQA | Jan 10, 2024 | Medical Visual Question Answering; Multi-Task Learning | Code Available | 1 |
| CaMML: Context-Aware Multimodal Learner for Large Models | Jan 6, 2024 | Visual Question Answering | Code Available | 1 |
| InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks | Dec 21, 2023 | Image Retrieval; Image-to-Text Retrieval | Code Available | 1 |
| EarthVQA: Towards Queryable Earth via Relational Reasoning-Based Remote Sensing Visual Question Answering | Dec 19, 2023 | Object; Object Counting | Code Available | 1 |
| Gemini: A Family of Highly Capable Multimodal Models | Dec 19, 2023 | 1 Image, 2*2 Stitching; Arithmetic Reasoning | Code Available | 1 |
| HAAR: Text-Conditioned Generative Model of 3D Strand-based Human Hairstyles | Dec 18, 2023 | Question Answering; Visual Question Answering | Code Available | 1 |
| Privacy-Aware Document Visual Question Answering | Dec 15, 2023 | document understanding; Federated Learning | Code Available | 1 |
| WordScape: a Pipeline to extract multilingual, visually rich Documents with Layout Annotations from Web Crawl Data | Dec 15, 2023 | document understanding; Question Answering | Code Available | 1 |