SOTAVerified

Multimodal Large Language Model

Papers

Showing 1-50 of 347 papers

Title | Status | Hype
MagicQuill: An Intelligent Interactive Image Editing System | Code | 7
VITA: Towards Open-Source Interactive Omni Multimodal LLM | Code | 7
Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding | Code | 7
ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing | Code | 5
R1-Omni: Explainable Omni-Multimodal Emotion Recognition with Reinforcement Learning | Code | 5
VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks | Code | 5
Ovis: Structural Embedding Alignment for Multimodal Large Language Model | Code | 5
StarVector: Generating Scalable Vector Graphics Code from Images and Text | Code | 5
Ferret: Refer and Ground Anything Anywhere at Any Granularity | Code | 5
R1-Onevision: An Open-Source Multimodal Large Language Model Capable of Deep Reasoning | Code | 4
Liquid: Language Models are Scalable Multi-modal Generators | Code | 4
SEED-Story: Multimodal Long Story Generation with Large Language Model | Code | 4
SEED-Data-Edit Technical Report: A Hybrid Dataset for Instructional Image Editing | Code | 4
Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models | Code | 4
MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens | Code | 4
ShareGPT-4o-Image: Aligning Multimodal Models with GPT-4o-Level Image Generation | Code | 3
GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing | Code | 3
AsymLoRA: Harmonizing Data Conflicts and Commonalities in MLLMs | Code | 3
VARGPT: Unified Understanding and Generation in a Visual Autoregressive Multimodal Large Language Model | Code | 3
Valley2: Exploring Multimodal Models with Scalable Vision-Language Design | Code | 3
Remote Sensing Temporal Vision-Language Models: A Comprehensive Survey | Code | 3
Baichuan-Omni Technical Report | Code | 3
Multimodal Table Understanding | Code | 3
Deep Learning and LLM-based Methods Applied to Stellar Lightcurve Classification | Code | 3
MoMA: Multimodal LLM Adapter for Fast Personalized Image Generation | Code | 3
ShapeLLM: Universal 3D Object Understanding for Embodied Interaction | Code | 3
TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones | Code | 3
Dimple: Discrete Diffusion Multimodal Large Language Model with Parallel Decoding | Code | 2
Web-Shepherd: Advancing PRMs for Reinforcing Web Agents | Code | 2
The Scalability of Simplicity: Empirical Analysis of Vision-Language Learning with a Single Transformer | Code | 2
Referring to Any Person | Code | 2
Next Token Is Enough: Realistic Image Quality and Aesthetic Scoring with Multimodal Large Language Model | Code | 2
Keeping Yourself is Important in Downstream Tuning Multimodal Large Language Model | Code | 2
Introducing Visual Perception Token into Multimodal Large Language Model | Code | 2
mmE5: Improving Multimodal Multilingual Embeddings via High-quality Synthetic Data | Code | 2
Parameter-Inverted Image Pyramid Networks for Visual Perception and Multimodal Understanding | Code | 2
LLaVA-ST: A Multimodal Large Language Model for Fine-Grained Spatial-Temporal Understanding | Code | 2
ChartCoder: Advancing Multimodal Large Language Model for Chart-to-Code Generation | Code | 2
Towards a Multimodal Large Language Model with Pixel-Level Insight for Biomedicine | Code | 2
OpenAD: Open-World Autonomous Driving Benchmark for 3D Object Detection | Code | 2
LHRS-Bot-Nova: Improved Multimodal Large Language Model for Remote Sensing Vision-Language Interpretation | Code | 2
StoryTeller: Improving Long Video Description through Global Audio-Visual Character Identification | Code | 2
Protecting Privacy in Multimodal Large Language Models with MLLMU-Bench | Code | 2
SPA-Bench: A Comprehensive Benchmark for SmartPhone Agent Evaluation | Code | 2
PDF-WuKong: A Large Multimodal Model for Efficient Long PDF Reading with End-to-End Sparse Sampling | Code | 2
One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos | Code | 2
CLIP-MoE: Towards Building Mixture of Experts for CLIP with Diversified Multiplet Upcycling | Code | 2
ChemVLM: Exploring the Power of Multimodal Large Language Models in Chemistry Area | Code | 2
MedTsLLM: Leveraging LLMs for Multimodal Medical Time Series Analysis | Code | 2
T2V-CompBench: A Comprehensive Benchmark for Compositional Text-to-video Generation | Code | 2

No leaderboard results yet.