SOTAVerified

Multimodal Large Language Model

Papers

Showing 101–150 of 347 papers

Title | Status | Hype
TextToucher: Fine-Grained Text-to-Touch Generation | Code | 1
MultiMath: Bridging Visual and Mathematical Reasoning for Large Language Models | Code | 1
ProteinGPT: Multimodal LLM for Protein Property Prediction and Structure Understanding | Code | 1
Harnessing Multimodal Large Language Models for Multimodal Sequential Recommendation | Code | 1
FFAA: Multimodal Large Language Model based Explainable Open-World Face Forgery Analysis Assistant | Code | 1
Caution for the Environment: Multimodal Agents are Susceptible to Environmental Distractions | Code | 1
INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model | Code | 1
A Refer-and-Ground Multimodal Large Language Model for Biomedicine | Code | 1
DaLPSR: Leverage Degradation-Aligned Language Prompt for Real-World Image Super-Resolution | Code | 1
LLaSA: A Multimodal LLM for Human Activity Analysis Through Wearable and Smartphone Sensors | Code | 1
MMNeuron: Discovering Neuron-Level Domain-Specific Interpretation in Multimodal Large Language Model | Code | 1
VIP: Versatile Image Outpainting Empowered by Multimodal Large Language Model | Code | 1
Voice Jailbreak Attacks Against GPT-4o | Code | 1
From Text to Pixel: Advancing Long-Context Understanding in MLLMs | Code | 1
LITE: Modeling Environmental Ecosystems with Multimodal Large Language Models | Code | 1
Multi-modal Instruction Tuned LLMs with Fine-grained Visual Perception | Code | 1
Mementos: A Comprehensive Benchmark for Multimodal Large Language Model Reasoning over Image Sequences | Code | 1
AllSpark: A Multimodal Spatio-Temporal General Intelligence Model with Ten Modalities via Language as a Reference Framework | Code | 1
Hallucination Augmented Contrastive Learning for Multimodal Large Language Model | Code | 1
LION : Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge | Code | 1
Chain of Images for Intuitively Reasoning | Code | 1
Multimodal ChatGPT for Medical Applications: an Experimental Study of GPT-4V | Code | 1
CXR-LLAVA: a multimodal large language model for interpreting chest X-ray images | Code | 1
UReader: Universal OCR-free Visually-situated Language Understanding with Multimodal Large Language Model | Code | 1
FinVis-GPT: A Multimodal Large Language Model for Financial Chart Analysis | Code | 1
Kosmos-2: Grounding Multimodal Large Language Models to the World | Code | 1
LMEye: An Interactive Perception Network for Large Language Models | Code | 1
LRMR: LLM-Driven Relational Multi-node Ranking for Lymph Node Metastasis Assessment in Rectal Cancer | - | 0
MFGDiffusion: Mask-Guided Smoke Synthesis for Enhanced Forest Fire Detection | Code | 0
KptLLM++: Towards Generic Keypoint Comprehension with Large Language Model | - | 0
Chat with AI: The Surprising Turn of Real-time Video Communication from Human to AI | - | 0
TalkFashion: Intelligent Virtual Try-On Assistant Based on Multimodal Large Language Model | - | 0
BlueLM-2.5-3B Technical Report | - | 0
CoT-lized Diffusion: Let's Reinforce T2I Generation Step-by-step | - | 0
Mask-aware Text-to-Image Retrieval: Referring Expression Segmentation Meets Cross-modal Retrieval | - | 0
OracleFusion: Assisting the Decipherment of Oracle Bone Script with Structurally Constrained Semantic Typography | Code | 0
DreamJourney: Perpetual View Generation with Video Diffusion Models | - | 0
ASCD: Attention-Steerable Contrastive Decoding for Reducing Hallucination in MLLM | - | 0
VIS-Shepherd: Constructing Critic for LLM-based Data Visualization Generation | Code | 0
CFBenchmark-MM: Chinese Financial Assistant Benchmark for Multimodal Large Language Model | - | 0
VGR: Visual Grounded Reasoning | - | 0
PHRASED: Phrase Dictionary Biasing for Speech Translation | - | 0
Parking, Perception, and Retail: Street-Level Determinants of Community Vitality in Harbin | - | 0
Towards LLM-Centric Multimodal Fusion: A Survey on Integration Strategies and Techniques | - | 0
The NTNU System at the S&I Challenge 2025 SLA Open Track | - | 0
A Novel Data Augmentation Approach for Automatic Speaking Assessment on Opinion Expressions | - | 0
From Street Views to Urban Science: Discovering Road Safety Factors with Multimodal Large Language Models | - | 0
S4-Driver: Scalable Self-Supervised Driving Multimodal Large Language Model with Spatio-Temporal Visual Representation | - | 0
Cross-modal RAG: Sub-dimensional Retrieval-Augmented Text-to-Image Generation | Code | 0
Think Before You Diffuse: LLMs-Guided Physics-Aware Video Generation | - | 0

No leaderboard results yet.