SOTAVerified

Multimodal Large Language Model

Papers

Showing 51-100 of 347 papers

Title | Status | Hype
ORQA: A Benchmark and Foundation Model for Holistic Operating Room Modeling | - | 0
MindOmni: Unleashing Reasoning Generation in Vision Language Models with RGPO | Code | 0
Beyond Retrieval: Joint Supervision and Multimodal Document Ranking for Textbook Question Answering | - | 0
Unifying Segment Anything in Microscopy with Multimodal Large Language Model | Code | 1
Batch Augmentation with Unimodal Fine-tuning for Multimodal Learning | Code | 0
Is your multimodal large language model a good science tutor? | - | 0
MonetGPT: Solving Puzzles Enhances MLLMs' Image Retouching Skills | - | 0
On Path to Multimodal Generalist: General-Level and General-Bench | - | 0
Consistency-aware Fake Videos Detection on Short Video Platforms | Code | 0
TimeSoccer: An End-to-End Multimodal Large Language Model for Soccer Commentary Generation | - | 0
FaceInsight: A Multimodal Large Language Model for Face Perception | - | 0
ChatEXAONEPath: An Expert-level Multimodal Large Language Model for Histopathology Using Whole Slide Images | - | 0
SmartFreeEdit: Mask-Free Spatial-Aware Image Editing with Complex Instruction Understanding | Code | 1
AnomalyR1: A GRPO-based End-to-end MLLM for Industrial Anomaly Detection | Code | 1
Mavors: Multi-granularity Video Representation for Multimodal Large Language Model | - | 0
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models | Code | 0
The Scalability of Simplicity: Empirical Analysis of Vision-Language Learning with a Single Transformer | Code | 2
CleanMAP: Distilling Multimodal LLMs for Confidence-Driven Crowdsourced HD Map Updates | - | 0
Marmot: Multi-Agent Reasoning for Multi-Object Self-Correcting in Improving Image-Text Alignment | - | 0
Enhancing Time Series Forecasting via Multi-Level Text Alignment with LLMs | Code | 1
MovSAM: A Single-image Moving Object Segmentation Framework Based on Deep Thinking | Code | 0
Face-LLaVA: Facial Expression and Attribute Understanding through Instruction Tuning | - | 0
Q-Agent: Quality-Driven Chain-of-Thought Image Restoration Agent through Robust Multimodal Large Language Model | - | 0
Towards Visual Text Grounding of Multimodal Large Language Model | - | 0
Universal Item Tokenization for Transferable Generative Recommendation | - | 0
Towards Unified Referring Expression Segmentation Across Omni-Level Visual Target Granularities | Code | 0
Open-Qwen2VL: Compute-Efficient Pre-Training of Fully-Open Multimodal LLMs on Academic Resources | - | 0
Orchestrate Multimodal Data with Batch Post-Balancing to Accelerate Multimodal Large Language Model Training | - | 0
Dynamic Pyramid Network for Efficient Multimodal Large Language Model | Code | 0
MLLM-For3D: Adapting Multimodal Large Language Model for 3D Reasoning Segmentation | - | 0
Distributed LLMs and Multimodal Large Language Models: A Survey on Advances, Challenges, and Future Directions | Code | 1
LEGION: Learning to Ground and Explain for Synthetic Image Detection | - | 0
UPME: An Unsupervised Peer Review Framework for Multimodal Large Language Model Evaluation | - | 0
SpaceVLLM: Endowing Multimodal Large Language Model with Spatio-Temporal Video Grounding Capability | - | 0
HiDe-LLaVA: Hierarchical Decoupling for Continual Instruction Tuning of Multimodal Large Language Model | - | 0
GeoRSMLLM: A Multimodal Large Language Model for Vision-Language Tasks in Geoscience and Remote Sensing | - | 0
When neural implant meets multimodal LLM: A dual-loop system for neuromodulation and naturalistic neuralbehavioral research | - | 0
Open3DVQA: A Benchmark for Comprehensive Spatial Reasoning with Multimodal Large Language Model in Open Space | Code | 1
OmniDiff: A Comprehensive Benchmark for Fine-grained Image Difference Captioning | - | 0
GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing | Code | 3
CINEMA: Coherent Multi-Subject Video Generation via MLLM-Based Guidance | - | 0
Hybrid Agents for Image Restoration | - | 0
Referring to Any Person | Code | 2
Lightweight Multimodal Artificial Intelligence Framework for Maritime Multi-Scene Recognition | - | 0
Next Token Is Enough: Realistic Image Quality and Aesthetic Scoring with Multimodal Large Language Model | Code | 2
R1-Omni: Explainable Omni-Multimodal Emotion Recognition with Reinforcement Learning | Code | 5
Keeping Yourself is Important in Downstream Tuning Multimodal Large Language Model | Code | 2
PP-DocBee: Improving Multimodal Document Understanding Through a Bag of Tricks | Code | 0
CL-MoE: Enhancing Multimodal Large Language Model with Dual Momentum Mixture-of-Experts for Continual Visual Question Answering | - | 0
Towards General Visual-Linguistic Face Forgery Detection(V2) | Code | 1

No leaderboard results yet.