SOTAVerified

Multimodal Large Language Model

Papers

Showing 150 of 347 papers

| Title | Status | Hype |
|---|---|---|
| Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding | Code | 7 |
| VITA: Towards Open-Source Interactive Omni Multimodal LLM | Code | 7 |
| MagicQuill: An Intelligent Interactive Image Editing System | Code | 7 |
| R1-Omni: Explainable Omni-Multimodal Emotion Recognition with Reinforcement Learning | Code | 5 |
| Ovis: Structural Embedding Alignment for Multimodal Large Language Model | Code | 5 |
| VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks | Code | 5 |
| StarVector: Generating Scalable Vector Graphics Code from Images and Text | Code | 5 |
| ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing | Code | 5 |
| Ferret: Refer and Ground Anything Anywhere at Any Granularity | Code | 5 |
| SEED-Data-Edit Technical Report: A Hybrid Dataset for Instructional Image Editing | Code | 4 |
| Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models | Code | 4 |
| Liquid: Language Models are Scalable Multi-modal Generators | Code | 4 |
| MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens | Code | 4 |
| SEED-Story: Multimodal Long Story Generation with Large Language Model | Code | 4 |
| R1-Onevision: An Open-Source Multimodal Large Language Model Capable of Deep Reasoning | Code | 4 |
| Remote Sensing Temporal Vision-Language Models: A Comprehensive Survey | Code | 3 |
| GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing | Code | 3 |
| Baichuan-Omni Technical Report | Code | 3 |
| MoMA: Multimodal LLM Adapter for Fast Personalized Image Generation | Code | 3 |
| Multimodal Table Understanding | Code | 3 |
| Deep Learning and LLM-based Methods Applied to Stellar Lightcurve Classification | Code | 3 |
| AsymLoRA: Harmonizing Data Conflicts and Commonalities in MLLMs | Code | 3 |
| VARGPT: Unified Understanding and Generation in a Visual Autoregressive Multimodal Large Language Model | Code | 3 |
| TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones | Code | 3 |
| ShareGPT-4o-Image: Aligning Multimodal Models with GPT-4o-Level Image Generation | Code | 3 |
| Valley2: Exploring Multimodal Models with Scalable Vision-Language Design | Code | 3 |
| ShapeLLM: Universal 3D Object Understanding for Embodied Interaction | Code | 3 |
| Holmes-VAD: Towards Unbiased and Explainable Video Anomaly Detection via Multi-modal LLM | Code | 2 |
| LHRS-Bot-Nova: Improved Multimodal Large Language Model for Remote Sensing Vision-Language Interpretation | Code | 2 |
| ChemVLM: Exploring the Power of Multimodal Large Language Models in Chemistry Area | Code | 2 |
| Protecting Privacy in Multimodal Large Language Models with MLLMU-Bench | Code | 2 |
| Parameter-Inverted Image Pyramid Networks for Visual Perception and Multimodal Understanding | Code | 2 |
| Paint by Inpaint: Learning to Add Image Objects by Removing Them First | Code | 2 |
| PDF-WuKong: A Large Multimodal Model for Efficient Long PDF Reading with End-to-End Sparse Sampling | Code | 2 |
| Referring to Any Person | Code | 2 |
| Next Token Is Enough: Realistic Image Quality and Aesthetic Scoring with Multimodal Large Language Model | Code | 2 |
| Keeping Yourself is Important in Downstream Tuning Multimodal Large Language Model | Code | 2 |
| One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos | Code | 2 |
| Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want | Code | 2 |
| MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models | Code | 2 |
| Dimple: Discrete Diffusion Multimodal Large Language Model with Parallel Decoding | Code | 2 |
| Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast | Code | 2 |
| Jailbreaking Attack against Multimodal Large Language Model | Code | 2 |
| OpenAD: Open-World Autonomous Driving Benchmark for 3D Object Detection | Code | 2 |
| CLIP-MoE: Towards Building Mixture of Experts for CLIP with Diversified Multiplet Upcycling | Code | 2 |
| Explore the Limits of Omni-modal Pretraining at Scale | Code | 2 |
| MedTsLLM: Leveraging LLMs for Multimodal Medical Time Series Analysis | Code | 2 |
| ChartCoder: Advancing Multimodal Large Language Model for Chart-to-Code Generation | Code | 2 |
| GeReA: Question-Aware Prompt Captions for Knowledge-based Visual Question Answering | Code | 2 |
| LLMGA: Multimodal Large Language Model based Generation Assistant | Code | 2 |
Page 1 of 7