SOTAVerified

Multimodal Large Language Model

Papers

Showing 150 of 347 papers

| Title | Status | Hype |
|---|---|---|
| Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding | Code | 7 |
| VITA: Towards Open-Source Interactive Omni Multimodal LLM | Code | 7 |
| MagicQuill: An Intelligent Interactive Image Editing System | Code | 7 |
| R1-Omni: Explainable Omni-Multimodal Emotion Recognition with Reinforcement Learning | Code | 5 |
| Ovis: Structural Embedding Alignment for Multimodal Large Language Model | Code | 5 |
| VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks | Code | 5 |
| StarVector: Generating Scalable Vector Graphics Code from Images and Text | Code | 5 |
| ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing | Code | 5 |
| Ferret: Refer and Ground Anything Anywhere at Any Granularity | Code | 5 |
| SEED-Data-Edit Technical Report: A Hybrid Dataset for Instructional Image Editing | Code | 4 |
| Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models | Code | 4 |
| Liquid: Language Models are Scalable Multi-modal Generators | Code | 4 |
| MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens | Code | 4 |
| SEED-Story: Multimodal Long Story Generation with Large Language Model | Code | 4 |
| R1-Onevision: An Open-Source Multimodal Large Language Model Capable of Deep Reasoning | Code | 4 |
| Remote Sensing Temporal Vision-Language Models: A Comprehensive Survey | Code | 3 |
| GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing | Code | 3 |
| Baichuan-Omni Technical Report | Code | 3 |
| MoMA: Multimodal LLM Adapter for Fast Personalized Image Generation | Code | 3 |
| Multimodal Table Understanding | Code | 3 |
| Deep Learning and LLM-based Methods Applied to Stellar Lightcurve Classification | Code | 3 |
| AsymLoRA: Harmonizing Data Conflicts and Commonalities in MLLMs | Code | 3 |
| VARGPT: Unified Understanding and Generation in a Visual Autoregressive Multimodal Large Language Model | Code | 3 |
| TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones | Code | 3 |
| ShareGPT-4o-Image: Aligning Multimodal Models with GPT-4o-Level Image Generation | Code | 3 |
| Valley2: Exploring Multimodal Models with Scalable Vision-Language Design | Code | 3 |
| ShapeLLM: Universal 3D Object Understanding for Embodied Interaction | Code | 3 |
| Holmes-VAD: Towards Unbiased and Explainable Video Anomaly Detection via Multi-modal LLM | Code | 2 |
| LHRS-Bot-Nova: Improved Multimodal Large Language Model for Remote Sensing Vision-Language Interpretation | Code | 2 |
| ChemVLM: Exploring the Power of Multimodal Large Language Models in Chemistry Area | Code | 2 |
| Protecting Privacy in Multimodal Large Language Models with MLLMU-Bench | Code | 2 |
| Parameter-Inverted Image Pyramid Networks for Visual Perception and Multimodal Understanding | Code | 2 |
| Paint by Inpaint: Learning to Add Image Objects by Removing Them First | Code | 2 |
| PDF-WuKong: A Large Multimodal Model for Efficient Long PDF Reading with End-to-End Sparse Sampling | Code | 2 |
| Referring to Any Person | Code | 2 |
| Next Token Is Enough: Realistic Image Quality and Aesthetic Scoring with Multimodal Large Language Model | Code | 2 |
| Keeping Yourself is Important in Downstream Tuning Multimodal Large Language Model | Code | 2 |
| One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos | Code | 2 |
| Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want | Code | 2 |
| MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models | Code | 2 |
| Dimple: Discrete Diffusion Multimodal Large Language Model with Parallel Decoding | Code | 2 |
| Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast | Code | 2 |
| Jailbreaking Attack against Multimodal Large Language Model | Code | 2 |
| OpenAD: Open-World Autonomous Driving Benchmark for 3D Object Detection | Code | 2 |
| CLIP-MoE: Towards Building Mixture of Experts for CLIP with Diversified Multiplet Upcycling | Code | 2 |
| Explore the Limits of Omni-modal Pretraining at Scale | Code | 2 |
| MedTsLLM: Leveraging LLMs for Multimodal Medical Time Series Analysis | Code | 2 |
| ChartCoder: Advancing Multimodal Large Language Model for Chart-to-Code Generation | Code | 2 |
| GeReA: Question-Aware Prompt Captions for Knowledge-based Visual Question Answering | Code | 2 |
| LLMGA: Multimodal Large Language Model based Generation Assistant | Code | 2 |
Page 1 of 7