SOTAVerified

Multimodal Large Language Model

Papers

Showing 101–125 of 347 papers

Title | Status | Hype
Optimus-2: Multimodal Minecraft Agent with Goal-Observation-Action Conditioned Policy | - | 0
AsymLoRA: Harmonizing Data Conflicts and Commonalities in MLLMs | Code | 3
Introducing Visual Perception Token into Multimodal Large Language Model | Code | 2
R1-Onevision: An Open-Source Multimodal Large Language Model Capable of Deep Reasoning | Code | 4
OmniParser V2: Structured-Points-of-Thought for Unified Visual Text Parsing and Its Generality to Multimodal Large Language Models | Code | 0
Gesture-Aware Zero-Shot Speech Recognition for Patients with Language Disorders | - | 0
Towards Text-Image Interleaved Retrieval | Code | 1
MMRC: A Large-Scale Benchmark for Understanding Multimodal Large Language Model in Real-World Conversation | - | 0
Leveraging Multimodal-LLMs Assisted by Instance Segmentation for Intelligent Traffic Monitoring | - | 0
Distraction is All You Need for Multimodal Large Language Model Jailbreaking | - | 0
mmE5: Improving Multimodal Multilingual Embeddings via High-quality Synthetic Data | Code | 2
On Fairness of Unified Multimodal Large Language Model for Image Generation | - | 0
MPIC: Position-Independent Multimodal Context Caching System for Efficient MLLM Serving | - | 0
Leveraging Multimodal LLM for Inspirational User Interface Search | Code | 0
Learning Free Token Reduction for Multi-Modal Large Language Models | - | 0
HumanOmni: A Large Vision-Speech Language Model for Human-Centric Video Understanding | - | 0
PatentLMM: Large Multimodal Model for Generating Descriptions for Patent Figures | Code | 1
EventVL: Understand Event Streams via Multimodal Large Language Model | - | 0
VARGPT: Unified Understanding and Generation in a Visual Autoregressive Multimodal Large Language Model | Code | 3
EndoChat: Grounded Multimodal Large Language Model for Endoscopic Surgery | Code | 1
When language and vision meet road safety: leveraging multimodal large language models for video-based traffic accident analysis | Code | 1
Interpretable Droplet Digital PCR Assay for Trustworthy Molecular Diagnostics | - | 0
LLaVA-ST: A Multimodal Large Language Model for Fine-Grained Spatial-Temporal Understanding | Code | 2
3UR-LLM: An End-to-End Multimodal Large Language Model for 3D Scene Understanding | Code | 1
Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks | - | 0

No leaderboard results yet.