SOTAVerified

Multimodal Large Language Model

Papers

Showing 101–150 of 347 papers

| Title | Status | Hype |
| --- | --- | --- |
| Multi-modal Instruction Tuned LLMs with Fine-grained Visual Perception | Code | 1 |
| MultiMath: Bridging Visual and Mathematical Reasoning for Large Language Models | Code | 1 |
| Multimodal ChatGPT for Medical Applications: an Experimental Study of GPT-4V | Code | 1 |
| Multi-Stage Vision Token Dropping: Towards Efficient Multimodal Large Language Model | Code | 1 |
| Enhancing Time Series Forecasting via Multi-Level Text Alignment with LLMs | Code | 1 |
| EndoChat: Grounded Multimodal Large Language Model for Endoscopic Surgery | Code | 1 |
| MobA: Multifaceted Memory-Enhanced Adaptive Planning for Efficient Mobile Task Automation | Code | 1 |
| Unifying Segment Anything in Microscopy with Multimodal Large Language Model | Code | 1 |
| MMNeuron: Discovering Neuron-Level Domain-Specific Interpretation in Multimodal Large Language Model | Code | 1 |
| When language and vision meet road safety: leveraging multimodal large language models for video-based traffic accident analysis | Code | 1 |
| LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge | Code | 1 |
| Voice Jailbreak Attacks Against GPT-4o | Code | 1 |
| Chain of Images for Intuitively Reasoning | Code | 1 |
| MiniGPT-Pancreas: Multimodal Large Language Model for Pancreas Cancer Classification and Detection | Code | 1 |
| MedTVT-R1: A Multimodal LLM Empowering Medical Reasoning and Diagnosis | Code | 1 |
| AllSpark: A Multimodal Spatio-Temporal General Intelligence Model with Ten Modalities via Language as a Reference Framework | Code | 1 |
| UReader: Universal OCR-free Visually-situated Language Understanding with Multimodal Large Language Model | Code | 1 |
| Mementos: A Comprehensive Benchmark for Multimodal Large Language Model Reasoning over Image Sequences | Code | 1 |
| Meaning Typed Prompting: A Technique for Efficient, Reliable Structured Output Generation | Code | 1 |
| VIP: Versatile Image Outpainting Empowered by Multimodal Large Language Model | Code | 1 |
| Caution for the Environment: Multimodal Agents are Susceptible to Environmental Distractions | Code | 1 |
| Distributed LLMs and Multimodal Large Language Models: A Survey on Advances, Challenges, and Future Directions | Code | 1 |
| LMEye: An Interactive Perception Network for Large Language Models | Code | 1 |
| LLaSA: A Multimodal LLM for Human Activity Analysis Through Wearable and Smartphone Sensors | Code | 1 |
| LLaVA-SpaceSGG: Visual Instruct Tuning for Open-vocabulary Scene Graph Generation with Enhanced Spatial Relations | Code | 1 |
| LITE: Modeling Environmental Ecosystems with Multimodal Large Language Models | Code | 1 |
| Leveraging MLLM Embeddings and Attribute Smoothing for Compositional Zero-Shot Learning | Code | 1 |
| Towards Real Zero-Shot Camouflaged Object Segmentation without Camouflaged Annotations | Code | 0 |
| Towards Unified Referring Expression Segmentation Across Omni-Level Visual Target Granularities | Code | 0 |
| InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models | Code | 0 |
| Diagnosing and Mitigating Modality Interference in Multimodal Large Language Models | Code | 0 |
| TRINS: Towards Multimodal Language Models that Can Read | Code | 0 |
| TourSynbio-Search: A Large Language Model Driven Agent Framework for Unified Search Method for Protein Engineering | Code | 0 |
| AdaptVision: Dynamic Input Scaling in MLLMs for Versatile Scene Understanding | Code | 0 |
| How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites | Code | 0 |
| Cross-modal RAG: Sub-dimensional Retrieval-Augmented Text-to-Image Generation | Code | 0 |
| Consistency-aware Fake Videos Detection on Short Video Platforms | Code | 0 |
| SCA: Improve Semantic Consistent in Unrestricted Adversarial Attacks via DDPM Inversion | Code | 0 |
| Value-Spectrum: Quantifying Preferences of Vision-Language Models via Value Decomposition in Social Media Contexts | Code | 0 |
| Batch Augmentation with Unimodal Fine-tuning for Multimodal Learning | Code | 0 |
| PP-DocBee: Improving Multimodal Document Understanding Through a Bag of Tricks | Code | 0 |
| OracleFusion: Assisting the Decipherment of Oracle Bone Script with Structurally Constrained Semantic Typography | Code | 0 |
| OmniParser V2: Structured-Points-of-Thought for Unified Visual Text Parsing and Its Generality to Multimodal Large Language Models | Code | 0 |
| Multimodal Hypothetical Summary for Retrieval-based Multi-image Question Answering | Code | 0 |
| Automatically Generating Visual Hallucination Test Cases for Multimodal Large Language Models | Code | 0 |
| Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling | Code | 0 |
| mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding | Code | 0 |
| mPLUG-PaperOwl: Scientific Diagram Analysis with the Multimodal Large Language Model | Code | 0 |
| A Survey on Multimodal Large Language Models | Code | 0 |
| MIP-GAF: A MLLM-annotated Benchmark for Most Important Person Localization and Group Context Understanding | Code | 0 |
Page 3 of 7

No leaderboard results yet.