SOTAVerified

Multimodal Large Language Model

Papers

Showing 1-25 of 347 papers

Title | Status | Hype
MagicQuill: An Intelligent Interactive Image Editing System | Code | 7
VITA: Towards Open-Source Interactive Omni Multimodal LLM | Code | 7
Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding | Code | 7
Ovis: Structural Embedding Alignment for Multimodal Large Language Model | Code | 5
R1-Omni: Explainable Omni-Multimodal Emotion Recognition with Reinforcement Learning | Code | 5
VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks | Code | 5
ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing | Code | 5
Ferret: Refer and Ground Anything Anywhere at Any Granularity | Code | 5
StarVector: Generating Scalable Vector Graphics Code from Images and Text | Code | 5
SEED-Story: Multimodal Long Story Generation with Large Language Model | Code | 4
Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models | Code | 4
MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens | Code | 4
SEED-Data-Edit Technical Report: A Hybrid Dataset for Instructional Image Editing | Code | 4
Liquid: Language Models are Scalable Multi-modal Generators | Code | 4
R1-Onevision: An Open-Source Multimodal Large Language Model Capable of Deep Reasoning | Code | 4
TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones | Code | 3
ShapeLLM: Universal 3D Object Understanding for Embodied Interaction | Code | 3
Deep Learning and LLM-based Methods Applied to Stellar Lightcurve Classification | Code | 3
ShareGPT-4o-Image: Aligning Multimodal Models with GPT-4o-Level Image Generation | Code | 3
Valley2: Exploring Multimodal Models with Scalable Vision-Language Design | Code | 3
Baichuan-Omni Technical Report | Code | 3
Remote Sensing Temporal Vision-Language Models: A Comprehensive Survey | Code | 3
AsymLoRA: Harmonizing Data Conflicts and Commonalities in MLLMs | Code | 3
MoMA: Multimodal LLM Adapter for Fast Personalized Image Generation | Code | 3
GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing | Code | 3

No leaderboard results yet.