| MMMModal -- Multi-Images Multi-Audio Multi-turn Multi-Modal | Feb 17, 2024 | Language ModelingLanguage Modelling | —Unverified | 0 |
| Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast | Feb 13, 2024 | Language ModellingLarge Language Model | CodeCode Available | 2 |
| Visual Question Answering Instruction: Unlocking Multimodal Large Language Model To Domain-Specific Visual Multitasks | Feb 13, 2024 | Language ModelingLanguage Modelling | —Unverified | 0 |
| Lumos : Empowering Multimodal LLMs with Scene Text Recognition | Feb 12, 2024 | Language ModelingLanguage Modelling | —Unverified | 0 |
| LLaVA-Docent: Instruction Tuning with Multimodal Large Language Model to Support Art Appreciation Education | Feb 9, 2024 | BenchmarkingChatbot | —Unverified | 0 |
| Jailbreaking Attack against Multimodal Large Language Model | Feb 4, 2024 | Language ModelingLanguage Modelling | CodeCode Available | 2 |
| GeReA: Question-Aware Prompt Captions for Knowledge-based Visual Question Answering | Feb 4, 2024 | Language ModelingLanguage Modelling | CodeCode Available | 2 |
| LLaVA-MoLE: Sparse Mixture of LoRA Experts for Mitigating Data Conflicts in Instruction Finetuning MLLMs | Jan 29, 2024 | Language ModellingLarge Language Model | —Unverified | 0 |
| UNIMO-G: Unified Image Generation through Multimodal Conditional Diffusion | Jan 24, 2024 | Conditional Image GenerationDenoising | —Unverified | 0 |
| MLLMReID: Multimodal Large Language Model-based Person Re-identification | Jan 24, 2024 | Language ModelingLanguage Modelling | —Unverified | 0 |
| Mementos: A Comprehensive Benchmark for Multimodal Large Language Model Reasoning over Image Sequences | Jan 19, 2024 | Language ModelingLanguage Modelling | CodeCode Available | 1 |
| MLLM-Tool: A Multimodal Large Language Model For Tool Agent Learning | Jan 19, 2024 | Language ModelingLanguage Modelling | CodeCode Available | 2 |
| CoDi-2: In-Context Interleaved and Interactive Any-to-Any Generation | Jan 1, 2024 | Image GenerationLanguage Modeling | —Unverified | 0 |
| LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge | Jan 1, 2024 | Language ModelingLanguage Modelling | CodeCode Available | 2 |
| AllSpark: A Multimodal Spatio-Temporal General Intelligence Model with Ten Modalities via Language as a Reference Framework | Dec 31, 2023 | Large Language ModelMultimodal Large Language Model | CodeCode Available | 1 |
| TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones | Dec 28, 2023 | Computational EfficiencyImage Captioning | CodeCode Available | 3 |
| ManipLLM: Embodied Multimodal Large Language Model for Object-Centric Robotic Manipulation | Dec 24, 2023 | Common Sense ReasoningLanguage Modeling | —Unverified | 0 |
| StarVector: Generating Scalable Vector Graphics Code from Images and Text | Dec 17, 2023 | Code GenerationLanguage Modeling | CodeCode Available | 5 |
| Hallucination Augmented Contrastive Learning for Multimodal Large Language Model | Dec 12, 2023 | Contrastive LearningHallucination | CodeCode Available | 1 |
| Audio-Visual LLM for Video Understanding | Dec 11, 2023 | AudioCapsLanguage Modeling | —Unverified | 0 |
| EtC: Temporal Boundary Expand then Clarify for Weakly Supervised Video Grounding with Multimodal Large Language Model | Dec 5, 2023 | Boundary DetectionLanguage Modeling | —Unverified | 0 |
| MedXChat: A Unified Multimodal Large Language Model Framework towards CXRs Understanding and Generation | Dec 4, 2023 | Instruction FollowingLanguage Modeling | —Unverified | 0 |
| TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding | Dec 4, 2023 | Dense CaptioningHighlight Detection | CodeCode Available | 2 |
| mPLUG-PaperOwl: Scientific Diagram Analysis with the Multimodal Large Language Model | Nov 30, 2023 | Language ModelingLanguage Modelling | CodeCode Available | 0 |
| CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation | Nov 30, 2023 | Image GenerationIn-Context Learning | —Unverified | 0 |
| LLMGA: Multimodal Large Language Model based Generation Assistant | Nov 27, 2023 | Image GenerationLanguage Modeling | CodeCode Available | 2 |
| GPT4Video: A Unified Multimodal Large Language Model for lnstruction-Followed Understanding and Safety-Aware Generation | Nov 25, 2023 | Instruction FollowingLanguage Modeling | —Unverified | 0 |
| LION : Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge | Nov 20, 2023 | Language ModelingLanguage Modelling | CodeCode Available | 1 |
| How to Bridge the Gap between Modalities: Survey on Multimodal Large Language Model | Nov 10, 2023 | Image CaptioningLanguage Modeling | —Unverified | 0 |
| Chain of Images for Intuitively Reasoning | Nov 9, 2023 | Common Sense ReasoningLanguage Modelling | CodeCode Available | 1 |
| Multimodal ChatGPT for Medical Applications: an Experimental Study of GPT-4V | Oct 29, 2023 | DiagnosticLanguage Modeling | CodeCode Available | 1 |
| CXR-LLAVA: a multimodal large language model for interpreting chest X-ray images | Oct 22, 2023 | DiagnosticLanguage Modeling | CodeCode Available | 1 |
| Multimodal Large Language Model for Visual Navigation | Oct 12, 2023 | Language ModelingLanguage Modelling | —Unverified | 0 |
| Ferret: Refer and Ground Anything Anywhere at Any Granularity | Oct 11, 2023 | HallucinationLanguage Modeling | CodeCode Available | 5 |
| UReader: Universal OCR-free Visually-situated Language Understanding with Multimodal Large Language Model | Oct 8, 2023 | DecoderLanguage Modeling | CodeCode Available | 1 |
| Comics for Everyone: Generating Accessible Text Descriptions for Comic Strips | Oct 1, 2023 | Language ModelingLanguage Modelling | —Unverified | 0 |
| Investigating the Catastrophic Forgetting in Multimodal Large Language Models | Sep 19, 2023 | image-classificationImage Classification | —Unverified | 0 |
| Imaginations of WALL-E : Reconstructing Experiences with an Imagination-Inspired Module for Advanced AI Systems | Aug 20, 2023 | Emotion RecognitionLanguage Modelling | —Unverified | 0 |
| FinVis-GPT: A Multimodal Large Language Model for Financial Chart Analysis | Jul 31, 2023 | Language ModelingLanguage Modelling | CodeCode Available | 1 |
| ChatSpot: Bootstrapping Multimodal LLMs via Precise Referring Instruction Tuning | Jul 18, 2023 | Instruction FollowingLanguage Modeling | —Unverified | 0 |
| mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding | Jul 4, 2023 | document understandingLanguage Modeling | CodeCode Available | 0 |
| Kosmos-2: Grounding Multimodal Large Language Models to the World | Jun 26, 2023 | Image CaptioningIn-Context Learning | CodeCode Available | 1 |
| MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models | Jun 23, 2023 | BenchmarkingLanguage Modeling | CodeCode Available | 2 |
| A Survey on Multimodal Large Language Models | Jun 23, 2023 | HallucinationIn-Context Learning | CodeCode Available | 0 |
| Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks | Jun 7, 2023 | Cross-Modal RetrievalLanguage Modelling | CodeCode Available | 2 |
| LMEye: An Interactive Perception Network for Large Language Models | May 5, 2023 | Language ModellingLarge Language Model | CodeCode Available | 1 |
| Language Is Not All You Need: Aligning Perception with Language Models | Feb 27, 2023 | AllImage Captioning | CodeCode Available | 0 |