| RS-Agent: Automating Remote Sensing Tasks through Intelligent Agent | Jun 11, 2024 | AI Agent, Descriptive | Code Available | 2 |
| From Redundancy to Relevance: Information Flow in LVLMs Across Reasoning Tasks | Jun 4, 2024 | Image Captioning, Language Modelling | Code Available | 2 |
| Dragonfly: Multi-Resolution Zoom-In Encoding Enhances Vision-Language Models | Jun 3, 2024 | Image Captioning, Language Modelling | Code Available | 2 |
| Enhancing Large Vision Language Models with Self-Training on Image Comprehension | May 30, 2024 | Image Comprehension, Visual Question Answering | Code Available | 2 |
| ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models | May 24, 2024 | Visual Question Answering | Code Available | 2 |
| Enhancing Visual-Language Modality Alignment in Large Vision Language Models via Self-Improvement | May 24, 2024 | Hallucination, Image Comprehension | Code Available | 2 |
| Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models | May 24, 2024 | Common Sense Reasoning, Language Modelling | Code Available | 2 |
| Calibrated Self-Rewarding Vision Language Models | May 23, 2024 | Hallucination, Language Modelling | Code Available | 2 |
| LOVA3: Learning to Visual Question Answering, Asking and Assessment | May 23, 2024 | Question Answering, Visual Question Answering | Code Available | 2 |
| Dynamic Mixture of Experts: An Auto-Tuning Approach for Efficient Transformer Models | May 23, 2024 | Mixture-of-Experts, Visual Question Answering | Code Available | 2 |
| Imp: Highly Capable Large Multimodal Models for Mobile Devices | May 20, 2024 | Quantization, Visual Question Answering | Code Available | 2 |
| MTVQA: Benchmarking Multilingual Text-Centric Visual Question Answering | May 20, 2024 | Benchmarking, Question Answering | Code Available | 2 |
| Xmodel-VLM: A Simple Baseline for Multimodal Vision Language Model | May 15, 2024 | GPU, Language Modeling | Code Available | 2 |
| CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts | May 9, 2024 | Image Captioning, Instruction Following | Code Available | 2 |
| List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs | Apr 25, 2024 | Visual Grounding, Visual Question Answering | Code Available | 2 |
| GSCo: Towards Generalizable AI in Medicine via Generalist-Specialist Collaboration | Apr 23, 2024 | Collaborative Inference, In-Context Learning | Code Available | 2 |
| Self-Supervised Visual Preference Alignment | Apr 16, 2024 | 8k, MM-Vet | Code Available | 2 |
| VHM: Versatile and Honest Vision Language Model for Remote Sensing Image Analysis | Mar 29, 2024 | Hallucination, Image Captioning | Code Available | 2 |
| Unsolvable Problem Detection: Evaluating Trustworthiness of Vision Language Models | Mar 29, 2024 | Question Answering, Visual Question Answering | Code Available | 2 |
| Multi-Frame, Lightweight & Efficient Vision-Language Models for Question Answering in Autonomous Driving | Mar 28, 2024 | Autonomous Driving, Language Modeling | Code Available | 2 |
| LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models | Mar 22, 2024 | Language Modelling, Large Language Model | Code Available | 2 |
| MedPromptX: Grounded Multimodal Prompting for Chest X-ray Diagnosis | Mar 22, 2024 | Medical Diagnosis, Medical Visual Question Answering | Code Available | 2 |
| Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models | Mar 19, 2024 | Instruction Following, visual instruction following | Code Available | 2 |
| VL-ICL Bench: The Devil in the Details of Multimodal In-Context Learning | Mar 19, 2024 | Benchmarking, Image Captioning | Code Available | 2 |
| Beyond Text: Frozen Large Language Models in Visual Signal Comprehension | Mar 12, 2024 | Deblurring, Decoder | Code Available | 2 |
| CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios | Mar 7, 2024 | Audio-visual Question Answering, Audio-Visual Question Answering (AVQA) | Code Available | 2 |
| Are Language Models Puzzle Prodigies? Algorithmic Puzzles Unveil Serious Challenges in Multimodal Reasoning | Mar 6, 2024 | Multimodal Reasoning, Question Answering | Code Available | 2 |
| Aligning Modalities in Vision Large Language Models via Preference Fine-tuning | Feb 18, 2024 | Hallucination, Instruction Following | Code Available | 2 |
| CoLLaVO: Crayon Large Language and Vision mOdel | Feb 17, 2024 | Large Language Model, model | Code Available | 2 |
| GeReA: Question-Aware Prompt Captions for Knowledge-based Visual Question Answering | Feb 4, 2024 | Language Modeling, Language Modelling | Code Available | 2 |
| MouSi: Poly-Visual-Expert Vision-Language Models | Jan 30, 2024 | Image Segmentation, Image-text matching | Code Available | 2 |
| PeFoMed: Parameter Efficient Fine-tuning of Multimodal Large Language Models for Medical Imaging | Jan 5, 2024 | Medical Report Generation, Medical Visual Question Answering | Code Available | 2 |
| LingoQA: Visual Question Answering for Autonomous Driving | Dec 21, 2023 | Autonomous Driving, Decision Making | Code Available | 2 |
| V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs | Dec 21, 2023 | Visual Question Answering, World Knowledge | Code Available | 2 |
| VCoder: Versatile Vision Encoders for Multimodal Large Language Models | Dec 21, 2023 | Image Captioning, Image Generation | Code Available | 2 |
| OneLLM: One Framework to Align All Modalities with Language | Dec 6, 2023 | All, Question Answering | Code Available | 2 |
| LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models | Nov 28, 2023 | Image Captioning, Question Answering | Code Available | 2 |
| LLMGA: Multimodal Large Language Model based Generation Assistant | Nov 27, 2023 | Image Generation, Language Modeling | Code Available | 2 |
| GeoChat: Grounded Large Vision-Language Model for Remote Sensing | Nov 24, 2023 | Instruction Following, Language Modeling | Code Available | 2 |
| To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning | Nov 13, 2023 | Instruction Following, MM-Vet | Code Available | 2 |
| LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents | Nov 9, 2023 | Instruction Following, LLM real-life tasks | Code Available | 2 |
| Frozen Transformers in Language Models Are Effective Visual Encoder Layers | Oct 19, 2023 | Action Recognition, Image-text Retrieval | Code Available | 2 |
| From CLIP to DINO: Visual Encoders Shout in Multi-modal Large Language Models | Oct 13, 2023 | Hallucination, Image Captioning | Code Available | 2 |
| MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts | Oct 3, 2023 | Chatbot, Image Captioning | Code Available | 2 |
| Fine-grained Late-interaction Multi-modal Retrieval for Retrieval Augmented Visual Question Answering | Sep 29, 2023 | Image to text, Passage Retrieval | Code Available | 2 |
| DreamLLM: Synergistic Multimodal Comprehension and Creation | Sep 20, 2023 | multimodal generation, Visual Question Answering | Code Available | 2 |
| BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions | Aug 19, 2023 | MME, Optical Character Recognition (OCR) | Code Available | 2 |
| TeCH: Text-guided Reconstruction of Lifelike Clothed Humans | Aug 16, 2023 | Descriptive, Question Answering | Code Available | 2 |
| Towards Generalist Foundation Model for Radiology by Leveraging Web-scale 2D&3D Medical Data | Aug 4, 2023 | Question Answering, Visual Question Answering | Code Available | 2 |
| RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control | Jul 28, 2023 | Object, Question Answering | Code Available | 2 |