SOTAVerified

Multimodal Large Language Model

Papers

Showing 1-50 of 347 papers

| Title | Status | Hype |
| --- | --- | --- |
| KptLLM++: Towards Generic Keypoint Comprehension with Large Language Model |  | 0 |
| MFGDiffusion: Mask-Guided Smoke Synthesis for Enhanced Forest Fire Detection | Code | 0 |
| LRMR: LLM-Driven Relational Multi-node Ranking for Lymph Node Metastasis Assessment in Rectal Cancer |  | 0 |
| Chat with AI: The Surprising Turn of Real-time Video Communication from Human to AI |  | 0 |
| TalkFashion: Intelligent Virtual Try-On Assistant Based on Multimodal Large Language Model |  | 0 |
| BlueLM-2.5-3B Technical Report |  | 0 |
| CoT-lized Diffusion: Let's Reinforce T2I Generation Step-by-step |  | 0 |
| Mask-aware Text-to-Image Retrieval: Referring Expression Segmentation Meets Cross-modal Retrieval |  | 0 |
| OracleFusion: Assisting the Decipherment of Oracle Bone Script with Structurally Constrained Semantic Typography | Code | 0 |
| ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing | Code | 5 |
| MedTVT-R1: A Multimodal LLM Empowering Medical Reasoning and Diagnosis | Code | 1 |
| ShareGPT-4o-Image: Aligning Multimodal Models with GPT-4o-Level Image Generation | Code | 3 |
| DreamJourney: Perpetual View Generation with Video Diffusion Models |  | 0 |
| The Condition Number as a Scale-Invariant Proxy for Information Encoding in Neural Units | Code | 1 |
| ASCD: Attention-Steerable Contrastive Decoding for Reducing Hallucination in MLLM |  | 0 |
| CFBenchmark-MM: Chinese Financial Assistant Benchmark for Multimodal Large Language Model |  | 0 |
| VIS-Shepherd: Constructing Critic for LLM-based Data Visualization Generation | Code | 0 |
| VGR: Visual Grounded Reasoning |  | 0 |
| PHRASED: Phrase Dictionary Biasing for Speech Translation |  | 0 |
| Parking, Perception, and Retail: Street-Level Determinants of Community Vitality in Harbin |  | 0 |
| The NTNU System at the S&I Challenge 2025 SLA Open Track |  | 0 |
| Towards LLM-Centric Multimodal Fusion: A Survey on Integration Strategies and Techniques |  | 0 |
| A Novel Data Augmentation Approach for Automatic Speaking Assessment on Opinion Expressions |  | 0 |
| From Street Views to Urban Science: Discovering Road Safety Factors with Multimodal Large Language Models |  | 0 |
| Period-LLM: Extending the Periodic Capability of Multimodal Large Language Model | Code | 1 |
| un^2CLIP: Improving CLIP's Visual Detail Capturing Ability via Inverting unCLIP | Code | 1 |
| S4-Driver: Scalable Self-Supervised Driving Multimodal Large Language Model with Spatio-Temporal Visual Representation |  | 0 |
| Cross-modal RAG: Sub-dimensional Retrieval-Augmented Text-to-Image Generation | Code | 0 |
| Think Before You Diffuse: LLMs-Guided Physics-Aware Video Generation |  | 0 |
| GeoLLaVA-8K: Scaling Remote-Sensing Multimodal Large Language Models to 8K Resolution | Code | 1 |
| OmniResponse: Online Multimodal Conversational Response Generation in Dyadic Interactions |  | 0 |
| What Changed? Detecting and Evaluating Instruction-Guided Image Edits with Multimodal Large Language Models |  | 0 |
| Unifying Multimodal Large Language Model Capabilities and Modalities via Model Merging | Code | 1 |
| Diagnosing and Mitigating Modality Interference in Multimodal Large Language Models | Code | 0 |
| Guard Me If You Know Me: Protecting Specific Face-Identity from Deepfakes |  | 0 |
| MLLM-Guided VLM Fine-Tuning with Joint Inference for Zero-Shot Composed Image Retrieval |  | 0 |
| Multimodal LLM-Guided Semantic Correction in Text-to-Image Diffusion | Code | 1 |
| OpenHOI: Open-World Hand-Object Interaction Synthesis with Multimodal Large Language Model |  | 0 |
| HoloLLM: Multisensory Foundation Model for Language-Grounded Human Sensing and Reasoning |  | 0 |
| LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning |  | 0 |
| ChemMLLM: Chemical Multimodal Large Language Model | Code | 1 |
| Dimple: Discrete Diffusion Multimodal Large Language Model with Parallel Decoding | Code | 2 |
| Human-centered Interactive Learning via MLLMs for Text-to-Image Person Re-identification |  | 0 |
| Highlighting What Matters: Promptable Embeddings for Attribute-Focused Image Retrieval |  | 0 |
| Web-Shepherd: Advancing PRMs for Reinforcing Web Agents | Code | 2 |
| MIKU-PAL: An Automated and Standardized Multi-Modal Method for Speech Paralinguistic and Affect Labeling |  | 0 |
| CAFES: A Collaborative Multi-Agent Framework for Multi-Granular Multimodal Essay Scoring |  | 0 |
| UniGen: Enhanced Training & Test-Time Strategies for Unified Multimodal Understanding and Generation |  | 0 |
| UniVG-R1: Reasoning Guided Universal Visual Grounding with Reinforcement Learning |  | 0 |
| BusterX: MLLM-Powered AI-Generated Video Forgery Detection and Explanation | Code | 1 |

No leaderboard results yet.