SOTAVerified

Multimodal Large Language Model

Papers

Showing 101–150 of 347 papers

| Title | Status | Hype |
| --- | --- | --- |
| Optimus-2: Multimodal Minecraft Agent with Goal-Observation-Action Conditioned Policy | | 0 |
| AsymLoRA: Harmonizing Data Conflicts and Commonalities in MLLMs | Code | 3 |
| Introducing Visual Perception Token into Multimodal Large Language Model | Code | 2 |
| R1-Onevision: An Open-Source Multimodal Large Language Model Capable of Deep Reasoning | Code | 4 |
| OmniParser V2: Structured-Points-of-Thought for Unified Visual Text Parsing and Its Generality to Multimodal Large Language Models | Code | 0 |
| Towards Text-Image Interleaved Retrieval | Code | 1 |
| Gesture-Aware Zero-Shot Speech Recognition for Patients with Language Disorders | | 0 |
| MMRC: A Large-Scale Benchmark for Understanding Multimodal Large Language Model in Real-World Conversation | | 0 |
| Leveraging Multimodal-LLMs Assisted by Instance Segmentation for Intelligent Traffic Monitoring | | 0 |
| Distraction is All You Need for Multimodal Large Language Model Jailbreaking | | 0 |
| mmE5: Improving Multimodal Multilingual Embeddings via High-quality Synthetic Data | Code | 2 |
| On Fairness of Unified Multimodal Large Language Model for Image Generation | | 0 |
| MPIC: Position-Independent Multimodal Context Caching System for Efficient MLLM Serving | | 0 |
| Leveraging Multimodal LLM for Inspirational User Interface Search | Code | 0 |
| Learning Free Token Reduction for Multi-Modal Large Language Models | | 0 |
| PatentLMM: Large Multimodal Model for Generating Descriptions for Patent Figures | Code | 1 |
| HumanOmni: A Large Vision-Speech Language Model for Human-Centric Video Understanding | | 0 |
| EventVL: Understand Event Streams via Multimodal Large Language Model | | 0 |
| VARGPT: Unified Understanding and Generation in a Visual Autoregressive Multimodal Large Language Model | Code | 3 |
| EndoChat: Grounded Multimodal Large Language Model for Endoscopic Surgery | Code | 1 |
| When language and vision meet road safety: leveraging multimodal large language models for video-based traffic accident analysis | Code | 1 |
| Interpretable Droplet Digital PCR Assay for Trustworthy Molecular Diagnostics | | 0 |
| LLaVA-ST: A Multimodal Large Language Model for Fine-Grained Spatial-Temporal Understanding | Code | 2 |
| 3UR-LLM: An End-to-End Multimodal Large Language Model for 3D Scene Understanding | Code | 1 |
| Parameter-Inverted Image Pyramid Networks for Visual Perception and Multimodal Understanding | Code | 2 |
| Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks | | 0 |
| ChartCoder: Advancing Multimodal Large Language Model for Chart-to-Code Generation | Code | 2 |
| Valley2: Exploring Multimodal Models with Scalable Vision-Language Design | Code | 3 |
| MinMo: A Multimodal Large Language Model for Seamless Voice Interaction | | 0 |
| LLaVA-Octopus: Unlocking Instruction-Driven Adaptive Projector Fusion for Video Understanding | | 0 |
| Interpretable Face Anti-Spoofing: Enhancing Generalization with Multimodal Large Language Models | | 0 |
| GroundingFace: Fine-grained Face Understanding via Pixel Grounding Multimodal Large Language Model | | 0 |
| Notes-guided MLLM Reasoning: Enhancing MLLM with Knowledge and Visual Notes for Visual Question Answering | Code | 1 |
| S4-Driver: Scalable Self-Supervised Driving Multimodal Large Language Model with Spatio-Temporal Visual Representation | | 0 |
| Beyond Text: Implementing Multimodal Large Language Model-Powered Multi-Agent Systems Using a No-Code Platform | | 0 |
| ST^3: Accelerating Multimodal Large Language Model by Spatial-Temporal Visual Token Trimming | | 0 |
| MLLM-SUL: Multimodal Large Language Model for Semantic Scene Understanding and Localization in Traffic Scenarios | Code | 0 |
| A Large-scale Interpretable Multi-modality Benchmark for Facial Image Forgery Localization | | 0 |
| SubstationAI: Multimodal Large Model-Based Approaches for Analyzing Substation Equipment Faults | | 0 |
| MiniGPT-Pancreas: Multimodal Large Language Model for Pancreas Cancer Classification and Detection | Code | 1 |
| J-EDI QA: Benchmark for deep-sea organism-specific multimodal LLM | | 0 |
| Multimodal Hypothetical Summary for Retrieval-based Multi-image Question Answering | Code | 0 |
| Make Imagination Clearer! Stable Diffusion-based Visual Imagination for Multimodal Machine Translation | | 0 |
| A Survey of Mathematical Reasoning in the Era of Multimodal Large Language Model: Benchmark, Method & Challenges | | 0 |
| IDEA-Bench: How Far are Generative Models from Professional Designing? | Code | 1 |
| MERaLiON-SpeechEncoder: Towards a Speech Foundation Model for Singapore and Beyond | | 0 |
| EasyRef: Omni-Generalized Group Image Reference for Diffusion Models via Multimodal LLM | | 0 |
| Towards a Multimodal Large Language Model with Pixel-Level Insight for Biomedicine | Code | 2 |
| COEF-VQ: Cost-Efficient Video Quality Understanding through a Cascaded Multimodal LLM Framework | | 0 |
| DiffSensei: Bridging Multi-Modal LLMs and Diffusion Models for Customized Manga Generation | | 0 |
Page 3 of 7

No leaderboard results yet.