SOTA

MME

MME is a comprehensive evaluation benchmark for multimodal large language models. It measures both perception and cognition abilities on a total of 14 subtasks, including existence, count, position, color, poster, celebrity, scene, landmark, artwork, OCR, commonsense reasoning, numerical calculation, text translation, and code reasoning.
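MME's scoring is simple enough to sketch: each image carries two yes/no questions, and each subtask reports plain accuracy (`acc`, per question) plus strict accuracy (`acc+`, both questions on an image correct), summed into a per-subtask score out of 200. The following is a minimal illustrative sketch of that scoring rule; the function name and input format are assumptions, not the benchmark's official evaluation code.

```python
# Illustrative sketch of MME-style subtask scoring (not the official
# evaluation script). Assumed input format: one (q1_correct, q2_correct)
# boolean pair per image, reflecting MME's two yes/no questions per image.

def mme_subtask_score(image_results):
    """Return the subtask score: (acc + acc+) * 100, max 200."""
    n_images = len(image_results)
    n_questions = 2 * n_images
    # acc: fraction of individual questions answered correctly
    correct = sum(int(a) + int(b) for a, b in image_results)
    acc = correct / n_questions
    # acc+: fraction of images with BOTH questions answered correctly
    both = sum(1 for a, b in image_results if a and b)
    acc_plus = both / n_images
    return (acc + acc_plus) * 100

# Example: two images fully correct, one image half correct.
score = mme_subtask_score([(True, True), (True, True), (True, False)])
# acc = 5/6, acc+ = 2/3, so score = 150.0
```

With 10 perception subtasks and 4 cognition subtasks at up to 200 points each, the perception and cognition totals max out at 2000 and 800 respectively.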

Papers

Showing 1–10 of 95 papers

| Title | Status | Hype |
| --- | --- | --- |
| Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models | Code | 4 |
| VideoEval-Pro: Robust and Realistic Long Video Understanding Evaluation | Code | 4 |
| Long Context Transfer from Language to Vision | Code | 4 |
| Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition | Code | 3 |
| TimeChat-Online: 80% Visual Tokens are Naturally Redundant in Streaming Videos | Code | 3 |
| Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension | Code | 3 |
| MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs | Code | 3 |
| Flash-VStream: Efficient Real-Time Understanding for Long Video Streams | Code | 3 |
| High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning | Code | 2 |
| Fine-tuning Multimodal LLMs to Follow Zero-shot Demonstrative Instructions | Code | 2 |

No leaderboard results yet.