SOTAVerified

MME

MME is a comprehensive evaluation benchmark for multimodal large language models (MLLMs). It measures both perception and cognition abilities across 14 subtasks: existence, count, position, color, poster, celebrity, scene, landmark, artwork, OCR, commonsense reasoning, numerical calculation, text translation, and code reasoning.
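MME poses two yes/no questions per image and reports both accuracy (per question) and "accuracy+" (both questions for an image answered correctly), so each subtask score ranges from 0 to 200. A minimal sketch of that scoring rule, assuming this question/metric setup (an illustrative reimplementation with a hypothetical function name, not the official evaluation script):

```python
# Illustrative sketch of MME-style subtask scoring (not the official code).
# Each image carries two yes/no questions; the subtask score is
# accuracy + accuracy+, where accuracy+ requires both answers correct.

def mme_subtask_score(results):
    """results: list of (q1_correct, q2_correct) booleans, one pair per image."""
    n_images = len(results)
    n_questions = 2 * n_images
    n_correct = sum(a + b for a, b in results)          # True counts as 1
    acc = 100.0 * n_correct / n_questions                # per-question accuracy
    acc_plus = 100.0 * sum(1 for a, b in results if a and b) / n_images
    return acc + acc_plus                                # max 200 per subtask

# Example: 3 images - one fully correct, one half correct, one wrong.
score = mme_subtask_score([(True, True), (True, False), (False, False)])
```

Under this convention, summing the 10 perception subtasks and the 4 cognition subtasks yields the benchmark's overall perception and cognition scores.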

Papers

Showing 1-50 of 95 papers

| Title | Status | Hype |
| --- | --- | --- |
| High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning | Code | 2 |
| Flash-VStream: Efficient Real-Time Understanding for Long Video Streams | Code | 3 |
| Q-Frame: Query-aware Frame Selection and Multi-Resolution Adaptation for Video-LLMs | | 0 |
| VideoDeepResearch: Long Video Understanding With Agentic Tool Using | Code | 2 |
| Language-Vision Planner and Executor for Text-to-Visual Reasoning | | 0 |
| DynTok: Dynamic Compression of Visual Tokens for Efficient and Effective Video Understanding | | 0 |
| Fast or Slow? Integrating Fast Intuition and Deliberate Thinking for Enhancing Visual Question Answering | | 0 |
| SiLVR: A Simple Language-based Video Reasoning Framework | Code | 1 |
| EvoMoE: Expert Evolution in Mixture of Experts for Multimodal Large Language Models | | 0 |
| MME-Reasoning: A Comprehensive Benchmark for Logical Reasoning in MLLMs | | 0 |
| Enhancing Visual Reliance in Text Generation: A Bayesian Perspective on Mitigating Hallucination in Large Vision-Language Models | | 0 |
| VideoEval-Pro: Robust and Realistic Long Video Understanding Evaluation | Code | 4 |
| Mitigating Hallucinations via Inter-Layer Consistency Aggregation in Large Vision-Language Models | | 0 |
| VISTA: Enhancing Vision-Text Alignment in MLLMs via Cross-Modal Mutual Information Maximization | | 0 |
| Visual Instruction Tuning with Chain of Region-of-Interest | | 0 |
| TimeChat-Online: 80% Visual Tokens are Naturally Redundant in Streaming Videos | Code | 3 |
| FRAG: Frame Selection Augmented Generation for Long Video and Long Document Understanding | Code | 1 |
| Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models | Code | 4 |
| An LMM for Efficient Video Understanding via Reinforced Compression of Video Cubes | | 0 |
| MME-Unify: A Comprehensive Benchmark for Unified Multimodal Understanding and Generation Models | | 0 |
| SpaceR: Reinforcing MLLMs in Video Spatial Reasoning | Code | 2 |
| BOLT: Boost Large Vision-Language Model Without Training for Long-form Video Understanding | Code | 1 |
| Instruction-Aligned Visual Attention for Mitigating Hallucinations in Large Vision-Language Models | Code | 0 |
| Improving LLM Video Understanding with 16 Frames Per Second | | 0 |
| Logic-in-Frames: Dynamic Keyframe Search via Visual Semantic-Logical Verification for Long Video Understanding | | 0 |
| QuoTA: Query-oriented Token Assignment via CoT Query Decouple for Long Video Comprehension | Code | 2 |
| Re-Imagining Multimodal Instruction Tuning: A Representation View | Code | 0 |
| Ultra-High-Frequency Harmony: mmWave Radar and Event Camera Orchestrate Accurate Drone Landing | | 0 |
| Towards Text-Image Interleaved Retrieval | Code | 1 |
| MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency | | 0 |
| AIDE: Agentically Improve Visual Language Model with Domain Experts | | 0 |
| Hummingbird: High Fidelity Image Generation via Multimodal Context Alignment | | 0 |
| Mitigating Hallucinations in Large Vision-Language Models with Internal Fact-based Contrastive Decoding | | 0 |
| MME-Industry: A Cross-Industry Multimodal Evaluation Benchmark | | 0 |
| Temporal Preference Optimization for Long-Form Video Understanding | | 0 |
| Expand VSR Benchmark for VLLM to Expertize in Spatial Rules | Code | 0 |
| GFormer: Accelerating Large Language Models with Optimized Transformers on Gaudi Processors | | 0 |
| Apollo: An Exploration of Video Understanding in Large Multimodal Models | | 0 |
| Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition | Code | 3 |
| EACO: Enhancing Alignment in Multimodal LLMs via Critical Observation | | 0 |
| Orthus: Autoregressive Interleaved Image-Text Generation with Modality-Specific Heads | | 0 |
| SAVEn-Vid: Synergistic Audio-Visual Integration for Enhanced Understanding in Long Video Context | | 0 |
| Enhancing Instruction-Following Capability of Visual-Language Models by Reducing Image Redundancy | | 0 |
| MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs | Code | 3 |
| Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension | Code | 3 |
| The economic value of empowering older patients transitioning from hospital to home: Evidence from the 'Your Care Needs You' intervention | | 0 |
| MME-Finance: A Multimodal Finance Benchmark for Expert-level Understanding and Reasoning | | 0 |
| ZipVL: Efficient Large Vision-Language Models with Dynamic Token Sparsification | | 0 |
| Dynamic Multimodal Evaluation with Flexible Complexity by Vision-Language Bootstrapping | Code | 1 |
| To Preserve or To Compress: An In-Depth Study of Connector Selection in Multimodal Large Language Models | Code | 1 |

No leaderboard results yet.