SOTAVerified

MME

MME is a comprehensive evaluation benchmark for multimodal large language models. It measures both perception and cognition abilities across 14 subtasks: existence, count, position, color, poster, celebrity, scene, landmark, artwork, OCR, commonsense reasoning, numerical calculation, text translation, and code reasoning.

Papers

Showing 26-50 of 95 papers

Title | Status | Hype
Dynamic Multimodal Evaluation with Flexible Complexity by Vision-Language Bootstrapping | Code | 1
To Preserve or To Compress: An In-Depth Study of Connector Selection in Multimodal Large Language Models | Code | 1
ParGo: Bridging Vision-Language with Partial and Global Views | Code | 1
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis | Code | 1
Pensieve: Retrospect-then-Compare Mitigates Visual Hallucination | Code | 1
HyperLLaVA: Dynamic Visual and Language Expert Tuning for Multimodal Large Language Models | Code | 1
Prompt Highlighter: Interactive Control for Multi-Modal LLMs | Code | 1
Beyond Hallucinations: Enhancing LVLMs through Hallucination-Aware Direct Preference Optimization | Code | 1
What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning | Code | 1
Masked Motion Encoding for Self-Supervised Video Representation Learning | Code | 1
Semi-supervised Domain Adaptation via Minimax Entropy | Code | 1
Q-Frame: Query-aware Frame Selection and Multi-Resolution Adaptation for Video-LLMs | | 0
Language-Vision Planner and Executor for Text-to-Visual Reasoning | | 0
DynTok: Dynamic Compression of Visual Tokens for Efficient and Effective Video Understanding | | 0
Fast or Slow? Integrating Fast Intuition and Deliberate Thinking for Enhancing Visual Question Answering | | 0
EvoMoE: Expert Evolution in Mixture of Experts for Multimodal Large Language Models | | 0
MME-Reasoning: A Comprehensive Benchmark for Logical Reasoning in MLLMs | | 0
Enhancing Visual Reliance in Text Generation: A Bayesian Perspective on Mitigating Hallucination in Large Vision-Language Models | | 0
Mitigating Hallucinations via Inter-Layer Consistency Aggregation in Large Vision-Language Models | | 0
VISTA: Enhancing Vision-Text Alignment in MLLMs via Cross-Modal Mutual Information Maximization | | 0
Visual Instruction Tuning with Chain of Region-of-Interest | | 0
An LMM for Efficient Video Understanding via Reinforced Compression of Video Cubes | | 0
MME-Unify: A Comprehensive Benchmark for Unified Multimodal Understanding and Generation Models | | 0
Instruction-Aligned Visual Attention for Mitigating Hallucinations in Large Vision-Language Models | Code | 0
Improving LLM Video Understanding with 16 Frames Per Second | | 0

No leaderboard results yet.