| MagicQuill: An Intelligent Interactive Image Editing System | Nov 14, 2024 | Language Modeling | Code Available | 7 |
| VITA: Towards Open-Source Interactive Omni Multimodal LLM | Aug 9, 2024 | Language Modeling | Code Available | 7 |
| Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding | May 14, 2024 | Image Generation, Language Modeling | Code Available | 7 |
| Ovis: Structural Embedding Alignment for Multimodal Large Language Model | May 31, 2024 | Language Modeling, Multimodal Large Language Model | Code Available | 5 |
| R1-Omni: Explainable Omni-Multimodal Emotion Recognition with Reinforcement Learning | Mar 7, 2025 | Emotion Recognition, Language Modeling | Code Available | 5 |
| VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks | Jun 12, 2024 | Image Generation, Language Modeling | Code Available | 5 |
| ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing | Jun 26, 2025 | Audio Generation, Large Language Model | Code Available | 5 |
| Ferret: Refer and Ground Anything Anywhere at Any Granularity | Oct 11, 2023 | Hallucination, Language Modeling | Code Available | 5 |
| StarVector: Generating Scalable Vector Graphics Code from Images and Text | Dec 17, 2023 | Code Generation, Language Modeling | Code Available | 5 |
| SEED-Story: Multimodal Long Story Generation with Large Language Model | Jul 11, 2024 | Image Generation, Language Modeling | Code Available | 4 |
| Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models | Apr 19, 2024 | Language Modeling | Code Available | 4 |
| MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens | Apr 4, 2024 | Language Modeling | Code Available | 4 |
| SEED-Data-Edit Technical Report: A Hybrid Dataset for Instructional Image Editing | May 7, 2024 | Image Manipulation, Language Modeling | Code Available | 4 |
| Liquid: Language Models are Scalable Multi-modal Generators | Dec 5, 2024 | Language Modeling | Code Available | 4 |
| R1-Onevision: An Open-Source Multimodal Large Language Model Capable of Deep Reasoning | Feb 24, 2025 | Language Modeling | Code Available | 4 |
| TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones | Dec 28, 2023 | Computational Efficiency, Image Captioning | Code Available | 3 |
| ShapeLLM: Universal 3D Object Understanding for Embodied Interaction | Feb 27, 2024 | 3D geometry, 3D Object Captioning | Code Available | 3 |
| Deep Learning and LLM-based Methods Applied to Stellar Lightcurve Classification | Apr 16, 2024 | Feature Engineering, Language Modeling | Code Available | 3 |
| ShareGPT-4o-Image: Aligning Multimodal Models with GPT-4o-Level Image Generation | Jun 22, 2025 | GPU, Image Generation | Code Available | 3 |
| Valley2: Exploring Multimodal Models with Scalable Vision-Language Design | Jan 10, 2025 | Image Captioning, Language Modeling | Code Available | 3 |
| Baichuan-Omni Technical Report | Oct 11, 2024 | Language Modeling | Code Available | 3 |
| Remote Sensing Temporal Vision-Language Models: A Comprehensive Survey | Dec 3, 2024 | Change Detection, Descriptive | Code Available | 3 |
| AsymLoRA: Harmonizing Data Conflicts and Commonalities in MLLMs | Feb 27, 2025 | Language Modeling | Code Available | 3 |
| MoMA: Multimodal LLM Adapter for Fast Personalized Image Generation | Apr 8, 2024 | Image Generation, Image-to-Image Translation | Code Available | 3 |
| GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing | Mar 13, 2025 | Image Generation, Language Modeling | Code Available | 3 |