SOTAVerified

MME

MME is a comprehensive evaluation benchmark for multimodal large language models. It measures perception and cognition abilities across 14 subtasks. The perception subtasks are existence, count, position, color, poster, celebrity, scene, landmark, artwork, and OCR; the cognition subtasks are commonsense reasoning, numerical calculation, text translation, and code reasoning.

Papers

Showing 51-95 of 95 papers

Title | Status | Hype
An LMM for Efficient Video Understanding via Reinforced Compression of Video Cubes | | 0
Apollo: An Exploration of Video Understanding in Large Multimodal Models | | 0
Benchmarking and In-depth Performance Study of Large Language Models on Habana Gaudi Processors | | 0
DAMRO: Dive into the Attention Mechanism of LVLM to Reduce Object Hallucination | | 0
Deep Learning for Hybrid 5G Services in Mobile Edge Computing Systems: Learn from a Digital Twin | | 0
Domain Adaptation via Minimax Entropy for Real/Bogus Classification of Astronomical Alerts | | 0
Don't Miss the Forest for the Trees: Attentional Vision Calibration for Large Vision Language Models | | 0
DrVideo: Document Retrieval Based Long Video Understanding | | 0
DynTok: Dynamic Compression of Visual Tokens for Efficient and Effective Video Understanding | | 0
EACO: Enhancing Alignment in Multimodal LLMs via Critical Observation | | 0
The economic value of empowering older patients transitioning from hospital to home: Evidence from the 'Your Care Needs You' intervention | | 0
Enhancing Instruction-Following Capability of Visual-Language Models by Reducing Image Redundancy | | 0
Enhancing the Spatial Awareness Capability of Multi-Modal Large Language Model | | 0
Enhancing Visual Reliance in Text Generation: A Bayesian Perspective on Mitigating Hallucination in Large Vision-Language Models | | 0
EvoMoE: Expert Evolution in Mixture of Experts for Multimodal Large Language Models | | 0
Fast or Slow? Integrating Fast Intuition and Deliberate Thinking for Enhancing Visual Question Answering | | 0
GFormer: Accelerating Large Language Models with Optimized Transformers on Gaudi Processors | | 0
Hummingbird: High Fidelity Image Generation via Multimodal Context Alignment | | 0
Improving LLM Video Understanding with 16 Frames Per Second | | 0
Language-Vision Planner and Executor for Text-to-Visual Reasoning | | 0
Learning Multilingual Meta-Embeddings for Code-Switching Named Entity Recognition | | 0
Logic-in-Frames: Dynamic Keyframe Search via Visual Semantic-Logical Verification for Long Video Understanding | | 0
Machine Learning Methods for Inferring the Number of UAV Emitters via Massive MIMO Receive Array | | 0
Mitigating Hallucinations in Large Vision-Language Models with Internal Fact-based Contrastive Decoding | | 0
Mitigating Hallucinations via Inter-Layer Consistency Aggregation in Large Vision-Language Models | | 0
MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency | | 0
MME-CRS: Multi-Metric Evaluation Based on Correlation Re-Scaling for Evaluating Open-Domain Dialogue | | 0
MME-Finance: A Multimodal Finance Benchmark for Expert-level Understanding and Reasoning | | 0
MME-Industry: A Cross-Industry Multimodal Evaluation Benchmark | | 0
MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans? | | 0
MME-Reasoning: A Comprehensive Benchmark for Logical Reasoning in MLLMs | | 0
MME-Unify: A Comprehensive Benchmark for Unified Multimodal Understanding and Generation Models | | 0
Multi-Modal Evaluation Approach for Medical Image Segmentation | | 0
Online Meta-Learning for Multi-Source and Semi-Supervised Domain Adaptation | | 0
Orthus: Autoregressive Interleaved Image-Text Generation with Modality-Specific Heads | | 0
Q-Frame: Query-aware Frame Selection and Multi-Resolution Adaptation for Video-LLMs | | 0
RITUAL: Random Image Transformations as a Universal Anti-hallucination Lever in Large Vision Language Models | | 0
SAVEn-Vid: Synergistic Audio-Visual Integration for Enhanced Understanding in Long Video Context | | 0
Scalable K-Medoids via True Error Bound and Familywise Bandits | | 0
Silkie: Preference Distillation for Large Visual Language Models | | 0
Temporal Preference Optimization for Long-Form Video Understanding | | 0
Temporal Reasoning Transfer from Text to Video | | 0
The Use of Symmetry for Models with Variable-size Variables | | 0
Ultra-High-Frequency Harmony: mmWave Radar and Event Camera Orchestrate Accurate Drone Landing | | 0
Visual Instruction Tuning with Chain of Region-of-Interest | | 0

No leaderboard results yet.