| VITA: Towards Open-Source Interactive Omni Multimodal LLM | Aug 9, 2024 | Language ModelingLanguage Modelling | CodeCode Available | 7 |
| Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding | May 14, 2024 | Image GenerationLanguage Modeling | CodeCode Available | 7 |
| MagicQuill: An Intelligent Interactive Image Editing System | Nov 14, 2024 | Language ModelingLanguage Modelling | CodeCode Available | 7 |
| StarVector: Generating Scalable Vector Graphics Code from Images and Text | Dec 17, 2023 | Code GenerationLanguage Modeling | CodeCode Available | 5 |
| R1-Omni: Explainable Omni-Multimodal Emotion Recognition with Reinforcement Learning | Mar 7, 2025 | Emotion RecognitionLanguage Modeling | CodeCode Available | 5 |
| Ovis: Structural Embedding Alignment for Multimodal Large Language Model | May 31, 2024 | Language ModelingMultimodal Large Language Model | CodeCode Available | 5 |
| Ferret: Refer and Ground Anything Anywhere at Any Granularity | Oct 11, 2023 | HallucinationLanguage Modeling | CodeCode Available | 5 |
| VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks | Jun 12, 2024 | Image GenerationLanguage Modeling | CodeCode Available | 5 |
| ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing | Jun 26, 2025 | Audio GenerationLarge Language Model | CodeCode Available | 5 |
| SEED-Data-Edit Technical Report: A Hybrid Dataset for Instructional Image Editing | May 7, 2024 | Image ManipulationLanguage Modeling | CodeCode Available | 4 |
| R1-Onevision:An Open-Source Multimodal Large Language Model Capable of Deep Reasoning | Feb 24, 2025 | Language ModelingLanguage Modelling | CodeCode Available | 4 |
| MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens | Apr 4, 2024 | Language ModelingLanguage Modelling | CodeCode Available | 4 |
| Liquid: Language Models are Scalable Multi-modal Generators | Dec 5, 2024 | Language ModelingLanguage Modelling | CodeCode Available | 4 |
| SEED-Story: Multimodal Long Story Generation with Large Language Model | Jul 11, 2024 | Image GenerationLanguage Modeling | CodeCode Available | 4 |
| Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models | Apr 19, 2024 | Language ModelingLanguage Modelling | CodeCode Available | 4 |
| Baichuan-Omni Technical Report | Oct 11, 2024 | Language ModelingLanguage Modelling | CodeCode Available | 3 |
| Multimodal Table Understanding | Jun 12, 2024 | Language ModelingLanguage Modelling | CodeCode Available | 3 |
| GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing | Mar 13, 2025 | Image GenerationLanguage Modeling | CodeCode Available | 3 |
| AsymLoRA: Harmonizing Data Conflicts and Commonalities in MLLMs | Feb 27, 2025 | Language ModelingLanguage Modelling | CodeCode Available | 3 |
| Remote Sensing Temporal Vision-Language Models: A Comprehensive Survey | Dec 3, 2024 | Change DetectionDescriptive | CodeCode Available | 3 |
| VARGPT: Unified Understanding and Generation in a Visual Autoregressive Multimodal Large Language Model | Jan 21, 2025 | Image GenerationInstruction Following | CodeCode Available | 3 |
| MoMA: Multimodal LLM Adapter for Fast Personalized Image Generation | Apr 8, 2024 | Image GenerationImage-to-Image Translation | CodeCode Available | 3 |
| TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones | Dec 28, 2023 | Computational EfficiencyImage Captioning | CodeCode Available | 3 |
| Deep Learning and LLM-based Methods Applied to Stellar Lightcurve Classification | Apr 16, 2024 | Feature EngineeringLanguage Modeling | CodeCode Available | 3 |
| ShareGPT-4o-Image: Aligning Multimodal Models with GPT-4o-Level Image Generation | Jun 22, 2025 | GPUImage Generation | CodeCode Available | 3 |
| Valley2: Exploring Multimodal Models with Scalable Vision-Language Design | Jan 10, 2025 | Image CaptioningLanguage Modeling | CodeCode Available | 3 |
| ShapeLLM: Universal 3D Object Understanding for Embodied Interaction | Feb 27, 2024 | 3D geometry3D Object Captioning | CodeCode Available | 3 |
| LaVy: Vietnamese Multimodal Large Language Model | Apr 11, 2024 | Language ModelingLanguage Modelling | CodeCode Available | 2 |
| Jailbreaking Attack against Multimodal Large Language Model | Feb 4, 2024 | Language ModelingLanguage Modelling | CodeCode Available | 2 |
| Keeping Yourself is Important in Downstream Tuning Multimodal Large Language Model | Mar 6, 2025 | General KnowledgeImage Captioning | CodeCode Available | 2 |
| LHRS-Bot-Nova: Improved Multimodal Large Language Model for Remote Sensing Vision-Language Interpretation | Nov 14, 2024 | Earth ObservationInstruction Following | CodeCode Available | 2 |
| ChemVLM: Exploring the Power of Multimodal Large Language Models in Chemistry Area | Aug 14, 2024 | Language ModelingLanguage Modelling | CodeCode Available | 2 |
| Referring to Any Person | Mar 11, 2025 | Large Language ModelMultimodal Large Language Model | CodeCode Available | 2 |
| Holmes-VAD: Towards Unbiased and Explainable Video Anomaly Detection via Multi-modal LLM | Jun 18, 2024 | Anomaly DetectionAnomaly Localization | CodeCode Available | 2 |
| Parameter-Inverted Image Pyramid Networks for Visual Perception and Multimodal Understanding | Jan 14, 2025 | image-classificationImage Classification | CodeCode Available | 2 |
| Introducing Visual Perception Token into Multimodal Large Language Model | Feb 24, 2025 | Language ModelingLanguage Modelling | CodeCode Available | 2 |
| PDF-WuKong: A Large Multimodal Model for Efficient Long PDF Reading with End-to-End Sparse Sampling | Oct 8, 2024 | document understandingLanguage Modeling | CodeCode Available | 2 |
| Protecting Privacy in Multimodal Large Language Models with MLLMU-Bench | Oct 29, 2024 | Language ModelingLanguage Modelling | CodeCode Available | 2 |
| One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos | Sep 29, 2024 | AllImage Segmentation | CodeCode Available | 2 |
| Next Token Is Enough: Realistic Image Quality and Aesthetic Scoring with Multimodal Large Language Model | Mar 8, 2025 | Image Quality AssessmentLanguage Modeling | CodeCode Available | 2 |
| OpenAD: Open-World Autonomous Driving Benchmark for 3D Object Detection | Nov 26, 2024 | 3D Object DetectionAutonomous Driving | CodeCode Available | 2 |
| MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models | Jun 23, 2023 | BenchmarkingLanguage Modeling | CodeCode Available | 2 |
| Explore the Limits of Omni-modal Pretraining at Scale | Jun 13, 2024 | Language ModelingLanguage Modelling | CodeCode Available | 2 |
| Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast | Feb 13, 2024 | Language ModellingLarge Language Model | CodeCode Available | 2 |
| CLIP-MoE: Towards Building Mixture of Experts for CLIP with Diversified Multiplet Upcycling | Sep 28, 2024 | image-classificationImage Classification | CodeCode Available | 2 |
| GeReA: Question-Aware Prompt Captions for Knowledge-based Visual Question Answering | Feb 4, 2024 | Language ModelingLanguage Modelling | CodeCode Available | 2 |
| Dimple: Discrete Diffusion Multimodal Large Language Model with Parallel Decoding | May 22, 2025 | Language ModelingLanguage Modelling | CodeCode Available | 2 |
| MedTsLLM: Leveraging LLMs for Multimodal Medical Time Series Analysis | Aug 14, 2024 | Anomaly DetectionBoundary Detection | CodeCode Available | 2 |
| ChartCoder: Advancing Multimodal Large Language Model for Chart-to-Code Generation | Jan 11, 2025 | Chart UnderstandingCode Generation | CodeCode Available | 2 |
| Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want | Mar 29, 2024 | Instruction FollowingLanguage Modelling | CodeCode Available | 2 |