| ViLA: Efficient Video-Language Alignment for Video Question Answering | Dec 13, 2023 | cross-modal alignment, Language Modeling | Code Available | 1 |
| Image Content Generation with Causal Reasoning | Dec 12, 2023 | Image Generation, Question Answering | Code Available | 0 |
| Hallucination Augmented Contrastive Learning for Multimodal Large Language Model | Dec 12, 2023 | Contrastive Learning, Hallucination | Code Available | 1 |
| Remote Sensing Vision-Language Foundation Models without Annotations via Ground Remote Alignment | Dec 12, 2023 | image-classification, Image Classification | Unverified | 0 |
| VILA: On Pre-training for Visual Language Models | Dec 12, 2023 | In-Context Learning, Language Modelling | Code Available | 4 |
| Genixer: Empowering Multimodal Large Language Models as a Powerful Data Generator | Dec 11, 2023 | Image Captioning, Question Answering | Code Available | 1 |
| NuScenes-MQA: Integrated Evaluation of Captions and QA for Autonomous Driving Datasets using Markup Annotations | Dec 11, 2023 | Autonomous Driving, Descriptive | Code Available | 1 |
| Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models | Dec 11, 2023 | Chart Understanding, Decoder | Code Available | 3 |
| Causal-CoG: A Causal-Effect Look at Context Generation for Boosting Multi-modal Language Models | Dec 9, 2023 | Question Answering, Visual Question Answering | Unverified | 0 |
| Language-Informed Visual Concept Learning | Dec 6, 2023 | Disentanglement, Novel Concepts | Code Available | 1 |
| OneLLM: One Framework to Align All Modalities with Language | Dec 6, 2023 | All, Question Answering | Code Available | 2 |
| BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models | Dec 5, 2023 | Benchmarking, Visual Question Answering | Code Available | 1 |
| Recursive Visual Programming | Dec 4, 2023 | Code Generation, Question Answering | Code Available | 1 |
| Good Questions Help Zero-Shot Image Reasoning | Dec 4, 2023 | Fine-Grained Image Classification, Question Answering | Code Available | 1 |
| CLAMP: Contrastive LAnguage Model Prompt-tuning | Dec 4, 2023 | Contrastive Learning, Image Captioning | Unverified | 0 |
| MedXChat: A Unified Multimodal Large Language Model Framework towards CXRs Understanding and Generation | Dec 4, 2023 | Instruction Following, Language Modeling | Unverified | 0 |
| How to Configure Good In-Context Sequence for Visual Question Answering | Dec 4, 2023 | In-Context Learning, Question Answering | Code Available | 1 |
| Unleashing the Potential of Large Language Model: Zero-shot VQA for Flood Disaster Scenario | Dec 4, 2023 | Language Modeling, Language Modelling | Unverified | 0 |
| RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback | Dec 1, 2023 | Hallucination, Image Captioning | Code Available | 6 |
| Merlin:Empowering Multimodal LLMs with Foresight Minds | Nov 30, 2023 | Visual Question Answering | Unverified | 0 |
| DreamSync: Aligning Text-to-Image Generation with Image Understanding Feedback | Nov 29, 2023 | Image Generation, Question Answering | Unverified | 0 |
| Towards Top-Down Reasoning: An Explainable Multi-Agent Approach for Visual Question Answering | Nov 29, 2023 | Common Sense Reasoning, Question Answering | Unverified | 0 |
| Emergent Open-Vocabulary Semantic Segmentation from Off-the-shelf Vision-Language Models | Nov 28, 2023 | Image Captioning, Image-text matching | Code Available | 1 |
| The curse of language biases in remote sensing VQA: the role of spatial attributes, language diversity, and the need for clear evaluation | Nov 28, 2023 | Diversity, Question Answering | Unverified | 0 |
| LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models | Nov 28, 2023 | Image Captioning, Question Answering | Code Available | 2 |