Multimodal Large Language Model

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 1–50 of 347 papers

Title	Date	Tasks	Status	Hype
VITA: Towards Open-Source Interactive Omni Multimodal LLM	Aug 9, 2024	Language ModelingLanguage Modelling	CodeCode Available	7
Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding	May 14, 2024	Image GenerationLanguage Modeling	CodeCode Available	7
MagicQuill: An Intelligent Interactive Image Editing System	Nov 14, 2024	Language ModelingLanguage Modelling	CodeCode Available	7
StarVector: Generating Scalable Vector Graphics Code from Images and Text	Dec 17, 2023	Code GenerationLanguage Modeling	CodeCode Available	5
R1-Omni: Explainable Omni-Multimodal Emotion Recognition with Reinforcement Learning	Mar 7, 2025	Emotion RecognitionLanguage Modeling	CodeCode Available	5
Ovis: Structural Embedding Alignment for Multimodal Large Language Model	May 31, 2024	Language ModelingMultimodal Large Language Model	CodeCode Available	5
Ferret: Refer and Ground Anything Anywhere at Any Granularity	Oct 11, 2023	HallucinationLanguage Modeling	CodeCode Available	5
VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks	Jun 12, 2024	Image GenerationLanguage Modeling	CodeCode Available	5
ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing	Jun 26, 2025	Audio GenerationLarge Language Model	CodeCode Available	5
SEED-Data-Edit Technical Report: A Hybrid Dataset for Instructional Image Editing	May 7, 2024	Image ManipulationLanguage Modeling	CodeCode Available	4
R1-Onevision：An Open-Source Multimodal Large Language Model Capable of Deep Reasoning	Feb 24, 2025	Language ModelingLanguage Modelling	CodeCode Available	4
MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens	Apr 4, 2024	Language ModelingLanguage Modelling	CodeCode Available	4
Liquid: Language Models are Scalable Multi-modal Generators	Dec 5, 2024	Language ModelingLanguage Modelling	CodeCode Available	4
SEED-Story: Multimodal Long Story Generation with Large Language Model	Jul 11, 2024	Image GenerationLanguage Modeling	CodeCode Available	4
Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models	Apr 19, 2024	Language ModelingLanguage Modelling	CodeCode Available	4
Baichuan-Omni Technical Report	Oct 11, 2024	Language ModelingLanguage Modelling	CodeCode Available	3
Multimodal Table Understanding	Jun 12, 2024	Language ModelingLanguage Modelling	CodeCode Available	3
GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing	Mar 13, 2025	Image GenerationLanguage Modeling	CodeCode Available	3
AsymLoRA: Harmonizing Data Conflicts and Commonalities in MLLMs	Feb 27, 2025	Language ModelingLanguage Modelling	CodeCode Available	3
Remote Sensing Temporal Vision-Language Models: A Comprehensive Survey	Dec 3, 2024	Change DetectionDescriptive	CodeCode Available	3
VARGPT: Unified Understanding and Generation in a Visual Autoregressive Multimodal Large Language Model	Jan 21, 2025	Image GenerationInstruction Following	CodeCode Available	3
MoMA: Multimodal LLM Adapter for Fast Personalized Image Generation	Apr 8, 2024	Image GenerationImage-to-Image Translation	CodeCode Available	3
TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones	Dec 28, 2023	Computational EfficiencyImage Captioning	CodeCode Available	3
Deep Learning and LLM-based Methods Applied to Stellar Lightcurve Classification	Apr 16, 2024	Feature EngineeringLanguage Modeling	CodeCode Available	3
ShareGPT-4o-Image: Aligning Multimodal Models with GPT-4o-Level Image Generation	Jun 22, 2025	GPUImage Generation	CodeCode Available	3
Valley2: Exploring Multimodal Models with Scalable Vision-Language Design	Jan 10, 2025	Image CaptioningLanguage Modeling	CodeCode Available	3
ShapeLLM: Universal 3D Object Understanding for Embodied Interaction	Feb 27, 2024	3D geometry3D Object Captioning	CodeCode Available	3
LaVy: Vietnamese Multimodal Large Language Model	Apr 11, 2024	Language ModelingLanguage Modelling	CodeCode Available	2
Jailbreaking Attack against Multimodal Large Language Model	Feb 4, 2024	Language ModelingLanguage Modelling	CodeCode Available	2
Keeping Yourself is Important in Downstream Tuning Multimodal Large Language Model	Mar 6, 2025	General KnowledgeImage Captioning	CodeCode Available	2
LHRS-Bot-Nova: Improved Multimodal Large Language Model for Remote Sensing Vision-Language Interpretation	Nov 14, 2024	Earth ObservationInstruction Following	CodeCode Available	2
ChemVLM: Exploring the Power of Multimodal Large Language Models in Chemistry Area	Aug 14, 2024	Language ModelingLanguage Modelling	CodeCode Available	2
Referring to Any Person	Mar 11, 2025	Large Language ModelMultimodal Large Language Model	CodeCode Available	2
Holmes-VAD: Towards Unbiased and Explainable Video Anomaly Detection via Multi-modal LLM	Jun 18, 2024	Anomaly DetectionAnomaly Localization	CodeCode Available	2
Parameter-Inverted Image Pyramid Networks for Visual Perception and Multimodal Understanding	Jan 14, 2025	image-classificationImage Classification	CodeCode Available	2
Introducing Visual Perception Token into Multimodal Large Language Model	Feb 24, 2025	Language ModelingLanguage Modelling	CodeCode Available	2
PDF-WuKong: A Large Multimodal Model for Efficient Long PDF Reading with End-to-End Sparse Sampling	Oct 8, 2024	document understandingLanguage Modeling	CodeCode Available	2
Protecting Privacy in Multimodal Large Language Models with MLLMU-Bench	Oct 29, 2024	Language ModelingLanguage Modelling	CodeCode Available	2
One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos	Sep 29, 2024	AllImage Segmentation	CodeCode Available	2
Next Token Is Enough: Realistic Image Quality and Aesthetic Scoring with Multimodal Large Language Model	Mar 8, 2025	Image Quality AssessmentLanguage Modeling	CodeCode Available	2
OpenAD: Open-World Autonomous Driving Benchmark for 3D Object Detection	Nov 26, 2024	3D Object DetectionAutonomous Driving	CodeCode Available	2
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models	Jun 23, 2023	BenchmarkingLanguage Modeling	CodeCode Available	2
Explore the Limits of Omni-modal Pretraining at Scale	Jun 13, 2024	Language ModelingLanguage Modelling	CodeCode Available	2
Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast	Feb 13, 2024	Language ModellingLarge Language Model	CodeCode Available	2
CLIP-MoE: Towards Building Mixture of Experts for CLIP with Diversified Multiplet Upcycling	Sep 28, 2024	image-classificationImage Classification	CodeCode Available	2
GeReA: Question-Aware Prompt Captions for Knowledge-based Visual Question Answering	Feb 4, 2024	Language ModelingLanguage Modelling	CodeCode Available	2
Dimple: Discrete Diffusion Multimodal Large Language Model with Parallel Decoding	May 22, 2025	Language ModelingLanguage Modelling	CodeCode Available	2
MedTsLLM: Leveraging LLMs for Multimodal Medical Time Series Analysis	Aug 14, 2024	Anomaly DetectionBoundary Detection	CodeCode Available	2
ChartCoder: Advancing Multimodal Large Language Model for Chart-to-Code Generation	Jan 11, 2025	Chart UnderstandingCode Generation	CodeCode Available	2
Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want	Mar 29, 2024	Instruction FollowingLanguage Modelling	CodeCode Available	2

Show:10 25 50

← PrevPage 1 of 7Next →

No leaderboard results yet.