SOTAVerified

Multimodal Large Language Model

Papers

Showing 201–250 of 347 papers

Title | Status | Hype
CFBenchmark-MM: Chinese Financial Assistant Benchmark for Multimodal Large Language Model | | 0
ChatEXAONEPath: An Expert-level Multimodal Large Language Model for Histopathology Using Whole Slide Images | | 0
ChatGPT Meets Iris Biometrics | | 0
ChatSpot: Bootstrapping Multimodal LLMs via Precise Referring Instruction Tuning | | 0
ChatTracker: Enhancing Visual Tracking Performance via Chatting with Multimodal Large Language Model | | 0
Chat with AI: The Surprising Turn of Real-time Video Communication from Human to AI | | 0
CINEMA: Coherent Multi-Subject Video Generation via MLLM-Based Guidance | | 0
CleanMAP: Distilling Multimodal LLMs for Confidence-Driven Crowdsourced HD Map Updates | | 0
CL-MoE: Enhancing Multimodal Large Language Model with Dual Momentum Mixture-of-Experts for Continual Visual Question Answering | | 0
CLSP: High-Fidelity Contrastive Language-State Pre-training for Agent State Representation | | 0
CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation | | 0
CoDi-2: In-Context Interleaved and Interactive Any-to-Any Generation | | 0
COEF-VQ: Cost-Efficient Video Quality Understanding through a Cascaded Multimodal LLM Framework | | 0
Comics for Everyone: Generating Accessible Text Descriptions for Comic Strips | | 0
CoT-lized Diffusion: Let's Reinforce T2I Generation Step-by-step | | 0
CUE-M: Contextual Understanding and Enhanced Search with Multimodal Large Language Model | | 0
Dallah: A Dialect-Aware Multimodal Large Language Model for Arabic | | 0
Decoding Style: Efficient Fine-Tuning of LLMs for Image-Guided Outfit Recommendation with Preference | | 0
DiffSensei: Bridging Multi-Modal LLMs and Diffusion Models for Customized Manga Generation | | 0
Distraction is All You Need for Multimodal Large Language Model Jailbreaking | | 0
DPDEdit: Detail-Preserved Diffusion Models for Multimodal Fashion Image Editing | | 0
DreamJourney: Perpetual View Generation with Video Diffusion Models | | 0
DynamicControl: Adaptive Condition Selection for Improved Text-to-Image Generation | | 0
EAGLE: Egocentric AGgregated Language-video Engine | | 0
EasyRef: Omni-Generalized Group Image Reference for Diffusion Models via Multimodal LLM | | 0
EditScout: Locating Forged Regions from Diffusion-based Edited Images with Multimodal LLM | | 0
EE-MLLM: A Data-Efficient and Compute-Efficient Multimodal Large Language Model | | 0
Efficient Indirect LLM Jailbreak via Multimodal-LLM Jailbreak | | 0
EgoPlan-Bench2: A Benchmark for Multimodal Large Language Model Planning in Real-World Scenarios | | 0
EtC: Temporal Boundary Expand then Clarify for Weakly Supervised Video Grounding with Multimodal Large Language Model | | 0
EventVL: Understand Event Streams via Multimodal Large Language Model | | 0
FaceInsight: A Multimodal Large Language Model for Face Perception | | 0
Face-LLaVA: Facial Expression and Attribute Understanding through Instruction Tuning | | 0
Ferret-UI 2: Mastering Universal User Interface Understanding Across Platforms | | 0
ForgeryGPT: Multimodal Large Language Model For Explainable Image Forgery Detection and Localization | | 0
From Street Views to Urban Science: Discovering Road Safety Factors with Multimodal Large Language Models | | 0
GenArtist: Multimodal LLM as an Agent for Unified Image Generation and Editing | | 0
GeoRSMLLM: A Multimodal Large Language Model for Vision-Language Tasks in Geoscience and Remote Sensing | | 0
Gesture-Aware Zero-Shot Speech Recognition for Patients with Language Disorders | | 0
GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation | | 0
Graph-based Unsupervised Disentangled Representation Learning via Multimodal Large Language Models | | 0
GroundingFace: Fine-grained Face Understanding via Pixel Grounding Multimodal Large Language Model | | 0
Guard Me If You Know Me: Protecting Specific Face-Identity from Deepfakes | | 0
Guardrails for avoiding harmful medical product recommendations and off-label promotion in generative AI models | | 0
GUIDE: Graphical User Interface Data for Execution | | 0
Hear Me, See Me, Understand Me: Audio-Visual Autism Behavior Recognition | | 0
HiDe-LLaVA: Hierarchical Decoupling for Continual Instruction Tuning of Multimodal Large Language Model | | 0
Highlighting What Matters: Promptable Embeddings for Attribute-Focused Image Retrieval | | 0
HoloLLM: Multisensory Foundation Model for Language-Grounded Human Sensing and Reasoning | | 0
How to Bridge the Gap between Modalities: Survey on Multimodal Large Language Model | | 0
Page 5 of 7

No leaderboard results yet.