| ViLA: Efficient Video-Language Alignment for Video Question Answering | Dec 13, 2023 | cross-modal alignmentLanguage Modeling | CodeCode Available | 1 |
| Remote Sensing Vision-Language Foundation Models without Annotations via Ground Remote Alignment | Dec 12, 2023 | image-classificationImage Classification | —Unverified | 0 |
| Image Content Generation with Causal Reasoning | Dec 12, 2023 | Image GenerationQuestion Answering | CodeCode Available | 0 |
| VILA: On Pre-training for Visual Language Models | Dec 12, 2023 | In-Context LearningLanguage Modelling | CodeCode Available | 4 |
| Hallucination Augmented Contrastive Learning for Multimodal Large Language Model | Dec 12, 2023 | Contrastive LearningHallucination | CodeCode Available | 1 |
| Genixer: Empowering Multimodal Large Language Models as a Powerful Data Generator | Dec 11, 2023 | Image CaptioningQuestion Answering | CodeCode Available | 1 |
| Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models | Dec 11, 2023 | Chart UnderstandingDecoder | CodeCode Available | 3 |
| NuScenes-MQA: Integrated Evaluation of Captions and QA for Autonomous Driving Datasets using Markup Annotations | Dec 11, 2023 | Autonomous DrivingDescriptive | CodeCode Available | 1 |
| Causal-CoG: A Causal-Effect Look at Context Generation for Boosting Multi-modal Language Models | Dec 9, 2023 | Question AnsweringVisual Question Answering | —Unverified | 0 |
| Language-Informed Visual Concept Learning | Dec 6, 2023 | DisentanglementNovel Concepts | CodeCode Available | 1 |
| OneLLM: One Framework to Align All Modalities with Language | Dec 6, 2023 | AllQuestion Answering | CodeCode Available | 2 |
| BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models | Dec 5, 2023 | BenchmarkingVisual Question Answering | CodeCode Available | 1 |
| Recursive Visual Programming | Dec 4, 2023 | Code GenerationQuestion Answering | CodeCode Available | 1 |
| MedXChat: A Unified Multimodal Large Language Model Framework towards CXRs Understanding and Generation | Dec 4, 2023 | Instruction FollowingLanguage Modeling | —Unverified | 0 |
| Good Questions Help Zero-Shot Image Reasoning | Dec 4, 2023 | Fine-Grained Image ClassificationQuestion Answering | CodeCode Available | 1 |
| CLAMP: Contrastive LAnguage Model Prompt-tuning | Dec 4, 2023 | Contrastive LearningImage Captioning | —Unverified | 0 |
| Unleashing the Potential of Large Language Model: Zero-shot VQA for Flood Disaster Scenario | Dec 4, 2023 | Language ModelingLanguage Modelling | —Unverified | 0 |
| How to Configure Good In-Context Sequence for Visual Question Answering | Dec 4, 2023 | In-Context LearningQuestion Answering | CodeCode Available | 1 |
| RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback | Dec 1, 2023 | HallucinationImage Captioning | CodeCode Available | 6 |
| Merlin:Empowering Multimodal LLMs with Foresight Minds | Nov 30, 2023 | Visual Question Answering | —Unverified | 0 |
| DreamSync: Aligning Text-to-Image Generation with Image Understanding Feedback | Nov 29, 2023 | Image GenerationQuestion Answering | —Unverified | 0 |
| Towards Top-Down Reasoning: An Explainable Multi-Agent Approach for Visual Question Answering | Nov 29, 2023 | Common Sense ReasoningQuestion Answering | —Unverified | 0 |
| Emergent Open-Vocabulary Semantic Segmentation from Off-the-shelf Vision-Language Models | Nov 28, 2023 | Image CaptioningImage-text matching | CodeCode Available | 1 |
| LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models | Nov 28, 2023 | Image CaptioningQuestion Answering | CodeCode Available | 2 |
| The curse of language biases in remote sensing VQA: the role of spatial attributes, language diversity, and the need for clear evaluation | Nov 28, 2023 | DiversityQuestion Answering | —Unverified | 0 |
| LLMGA: Multimodal Large Language Model based Generation Assistant | Nov 27, 2023 | Image GenerationLanguage Modeling | CodeCode Available | 2 |
| Fully Authentic Visual Question Answering Dataset from Online Communities | Nov 27, 2023 | Question AnsweringVisual Question Answering | CodeCode Available | 0 |
| EgoThink: Evaluating First-Person Perspective Thinking Capability of Vision-Language Models | Nov 27, 2023 | AttributeQuestion Answering | CodeCode Available | 1 |
| GeoChat: Grounded Large Vision-Language Model for Remote Sensing | Nov 24, 2023 | Instruction FollowingLanguage Modeling | CodeCode Available | 2 |
| Lego: Learning to Disentangle and Invert Personalized Concepts Beyond Object Appearance in Text-to-Image Diffusion Models | Nov 23, 2023 | Language ModellingLarge Language Model | —Unverified | 0 |
| ShareGPT4V: Improving Large Multi-Modal Models with Better Captions | Nov 21, 2023 | DescriptiveMME | CodeCode Available | 0 |
| Filling the Image Information Gap for VQA: Prompting Large Language Models to Proactively Ask Questions | Nov 20, 2023 | Question AnsweringVisual Question Answering | CodeCode Available | 0 |
| Video-LLaVA: Learning United Visual Representation by Alignment Before Projection | Nov 16, 2023 | Language ModelingLanguage Modelling | CodeCode Available | 4 |
| Improving Zero-shot Visual Question Answering via Large Language Models with Reasoning Question Prompts | Nov 15, 2023 | Question AnsweringSentence | CodeCode Available | 0 |
| Attribute Diversity Determines the Systematicity Gap in VQA | Nov 15, 2023 | AttributeDiagnostic | CodeCode Available | 0 |
| Asking More Informative Questions for Grounded Retrieval | Nov 14, 2023 | Question AnsweringQuestion Selection | —Unverified | 0 |
| A Comprehensive Evaluation of GPT-4V on Knowledge-Intensive Visual Question Answering | Nov 13, 2023 | Decision MakingExplanation Generation | CodeCode Available | 1 |
| Volcano: Mitigating Multimodal Hallucination through Self-Feedback Guided Revision | Nov 13, 2023 | HallucinationMM-Vet | CodeCode Available | 1 |
| What Large Language Models Bring to Text-rich VQA? | Nov 13, 2023 | Image ComprehensionOptical Character Recognition (OCR) | —Unverified | 0 |
| SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models | Nov 13, 2023 | Described Object DetectionLanguage Modeling | CodeCode Available | 4 |
| To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning | Nov 13, 2023 | Instruction FollowingMM-Vet | CodeCode Available | 2 |
| InfMLLM: A Unified Framework for Visual-Language Tasks | Nov 12, 2023 | GPUImage Captioning | CodeCode Available | 1 |
| Visual Commonsense based Heterogeneous Graph Contrastive Learning | Nov 11, 2023 | Contrastive LearningQuestion Answering | —Unverified | 0 |
| Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models | Nov 11, 2023 | Image CaptioningMMR total | CodeCode Available | 3 |
| LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents | Nov 9, 2023 | Instruction FollowingLLM real-life tasks | CodeCode Available | 2 |
| Zero-shot Translation of Attention Patterns in VQA Models to Natural Language | Nov 8, 2023 | Image CaptioningLanguage Modeling | CodeCode Available | 0 |
| GENOME: GenerativE Neuro-symbOlic visual reasoning by growing and reusing ModulEs | Nov 8, 2023 | Question AnsweringReferring Expression | CodeCode Available | 1 |
| mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration | Nov 7, 2023 | 1 Image, 2*2 StitchingDecoder | CodeCode Available | 4 |
| OtterHD: A High-Resolution Multi-modality Model | Nov 7, 2023 | modelVisual Question Answering | CodeCode Available | 4 |
| CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding | Nov 6, 2023 | CoLAQuestion Answering | —Unverified | 0 |