SOTAVerified

MME

MME is a comprehensive evaluation benchmark for multimodal large language models. It measures both perception and cognition abilities on a total of 14 subtasks, including existence, count, position, color, poster, celebrity, scene, landmark, artwork, OCR, commonsense reasoning, numerical calculation, text translation, and code reasoning.
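
As a rough illustration, the 14 subtasks listed above can be grouped by the ability they measure. The split below (ten perception subtasks, four cognition subtasks) follows the MME paper. The names `MME_SUBTASKS` and `category_of` are hypothetical, chosen for this sketch only, and are not part of any official MME tooling.

```python
# Illustrative grouping of the 14 MME subtasks by ability.
# MME_SUBTASKS and category_of are hypothetical names for this sketch only.
MME_SUBTASKS = {
    "perception": [
        "existence", "count", "position", "color",
        "poster", "celebrity", "scene", "landmark", "artwork",
        "OCR",
    ],
    "cognition": [
        "commonsense reasoning", "numerical calculation",
        "text translation", "code reasoning",
    ],
}


def category_of(subtask: str) -> str:
    """Return the ability group ("perception" or "cognition") of a subtask."""
    for category, names in MME_SUBTASKS.items():
        if subtask in names:
            return category
    raise KeyError(f"unknown MME subtask: {subtask}")


# Total number of subtasks across both groups.
total = sum(len(names) for names in MME_SUBTASKS.values())
```

A lookup such as `category_of("OCR")` could be used to aggregate per-subtask scores into the two ability groups.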

Papers

Showing 1-25 of 95 papers

Title | Status | Hype
VideoEval-Pro: Robust and Realistic Long Video Understanding Evaluation | Code | 4
Long Context Transfer from Language to Vision | Code | 4
Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models | Code | 4
Flash-VStream: Efficient Real-Time Understanding for Long Video Streams | Code | 3
Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension | Code | 3
Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition | Code | 3
MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs | Code | 3
TimeChat-Online: 80% Visual Tokens are Naturally Redundant in Streaming Videos | Code | 3
Mitigating Modality Prior-Induced Hallucinations in Multimodal Large Language Models via Deciphering Attention Causality | Code | 2
L4DR: LiDAR-4DRadar Fusion for Weather-Robust 3D Object Detection | Code | 2
QuoTA: Query-oriented Token Assignment via CoT Query Decouple for Long Video Comprehension | Code | 2
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models | Code | 2
MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning | Code | 2
Mitigating Hallucinations in Large Vision-Language Models with Instruction Contrastive Decoding | Code | 2
Honeybee: Locality-enhanced Projector for Multimodal LLM | Code | 2
VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos | Code | 2
BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions | Code | 2
SpaceR: Reinforcing MLLMs in Video Spatial Reasoning | Code | 2
High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning | Code | 2
Fine-tuning Multimodal LLMs to Follow Zero-shot Demonstrative Instructions | Code | 2
VideoDeepResearch: Long Video Understanding With Agentic Tool Using | Code | 2
Beyond Hallucinations: Enhancing LVLMs through Hallucination-Aware Direct Preference Optimization | Code | 1
Masked Motion Encoding for Self-Supervised Video Representation Learning | Code | 1
HyperLLaVA: Dynamic Visual and Language Expert Tuning for Multimodal Large Language Models | Code | 1
ParGo: Bridging Vision-Language with Partial and Global Views | Code | 1

No leaderboard results yet.