SOTAVerified

MME

MME is a comprehensive evaluation benchmark for multimodal large language models. It measures both perception and cognition abilities across 14 subtasks: existence, count, position, color, poster, celebrity, scene, landmark, artwork, OCR, commonsense reasoning, numerical calculation, text translation, and code reasoning.

Papers

Showing 26–50 of 95 papers

Title | Status | Hype
Dynamic Multimodal Evaluation with Flexible Complexity by Vision-Language Bootstrapping | Code | 1
FRAG: Frame Selection Augmented Generation for Long Video and Long Document Understanding | Code | 1
BOLT: Boost Large Vision-Language Model Without Training for Long-form Video Understanding | Code | 1
Semi-supervised Domain Adaptation via Minimax Entropy | Code | 1
HyperLLaVA: Dynamic Visual and Language Expert Tuning for Multimodal Large Language Models | Code | 1
Beyond Hallucinations: Enhancing LVLMs through Hallucination-Aware Direct Preference Optimization | Code | 1
Towards Text-Image Interleaved Retrieval | Code | 1
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis | Code | 1
What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning | Code | 1
ParGo: Bridging Vision-Language with Partial and Global Views | Code | 1
Pensieve: Retrospect-then-Compare Mitigates Visual Hallucination | Code | 1
Online Meta-Learning for Multi-Source and Semi-Supervised Domain Adaptation | | 0
Orthus: Autoregressive Interleaved Image-Text Generation with Modality-Specific Heads | | 0
Q-Frame: Query-aware Frame Selection and Multi-Resolution Adaptation for Video-LLMs | | 0
RITUAL: Random Image Transformations as a Universal Anti-hallucination Lever in Large Vision Language Models | | 0
SAVEn-Vid: Synergistic Audio-Visual Integration for Enhanced Understanding in Long Video Context | | 0
Scalable K-Medoids via True Error Bound and Familywise Bandits | | 0
Silkie: Preference Distillation for Large Visual Language Models | | 0
Temporal Preference Optimization for Long-Form Video Understanding | | 0
Temporal Reasoning Transfer from Text to Video | | 0
The Use of Symmetry for Models with Variable-size Variables | | 0
ZipVL: Efficient Large Vision-Language Models with Dynamic Token Sparsification | | 0
A Challenger to GPT-4V? Early Explorations of Gemini in Visual Expertise | | 0
AIDE: Agentically Improve Visual Language Model with Domain Experts | | 0
An LMM for Efficient Video Understanding via Reinforced Compression of Video Cubes | | 0

No leaderboard results yet.