SOTAVerified

MME

MME is a comprehensive evaluation benchmark for multimodal large language models. It measures both perception and cognition abilities across 14 subtasks: existence, count, position, color, poster, celebrity, scene, landmark, artwork, OCR, commonsense reasoning, numerical calculation, text translation, and code reasoning.
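
As a rough illustration, the 14 subtasks can be grouped by the ability they target. The sketch below assumes the perception/cognition split used in the original MME paper (ten perception subtasks, four cognition subtasks); this page does not state the split itself. The names `MME_SUBTASKS` and `category_of` are hypothetical and not part of any official MME tooling.

```python
# Assumption: this perception/cognition grouping follows the original MME
# paper; the page itself only lists the 14 subtask names.
MME_SUBTASKS = {
    "perception": [
        "existence", "count", "position", "color", "poster",
        "celebrity", "scene", "landmark", "artwork", "OCR",
    ],
    "cognition": [
        "commonsense reasoning", "numerical calculation",
        "text translation", "code reasoning",
    ],
}


def category_of(subtask):
    """Return the ability category for a subtask name (case-insensitive)."""
    wanted = subtask.strip().lower()
    for category, names in MME_SUBTASKS.items():
        if any(name.lower() == wanted for name in names):
            return category
    raise KeyError(f"unknown MME subtask: {subtask!r}")


# Total number of subtasks across both categories.
total_subtasks = sum(len(names) for names in MME_SUBTASKS.values())
```

For example, `category_of("code reasoning")` looks up the category of a single subtask, and `total_subtasks` can be checked against the count of 14 given in the description.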

Papers

Showing 1-50 of 95 papers

Title | Status | Hype
Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models | Code | 4
VideoEval-Pro: Robust and Realistic Long Video Understanding Evaluation | Code | 4
Long Context Transfer from Language to Vision | Code | 4
Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition | Code | 3
Flash-VStream: Efficient Real-Time Understanding for Long Video Streams | Code | 3
TimeChat-Online: 80% Visual Tokens are Naturally Redundant in Streaming Videos | Code | 3
MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs | Code | 3
Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension | Code | 3
L4DR: LiDAR-4DRadar Fusion for Weather-Robust 3D Object Detection | Code | 2
VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos | Code | 2
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models | Code | 2
High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning | Code | 2
SpaceR: Reinforcing MLLMs in Video Spatial Reasoning | Code | 2
Honeybee: Locality-enhanced Projector for Multimodal LLM | Code | 2
Mitigating Modality Prior-Induced Hallucinations in Multimodal Large Language Models via Deciphering Attention Causality | Code | 2
BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions | Code | 2
MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning | Code | 2
Mitigating Hallucinations in Large Vision-Language Models with Instruction Contrastive Decoding | Code | 2
VideoDeepResearch: Long Video Understanding With Agentic Tool Using | Code | 2
QuoTA: Query-oriented Token Assignment via CoT Query Decouple for Long Video Comprehension | Code | 2
Fine-tuning Multimodal LLMs to Follow Zero-shot Demonstrative Instructions | Code | 2
SiLVR: A Simple Language-based Video Reasoning Framework | Code | 1
Masked Motion Encoding for Self-Supervised Video Representation Learning | Code | 1
Towards Text-Image Interleaved Retrieval | Code | 1
Beyond Hallucinations: Enhancing LVLMs through Hallucination-Aware Direct Preference Optimization | Code | 1
FRAG: Frame Selection Augmented Generation for Long Video and Long Document Understanding | Code | 1
HyperLLaVA: Dynamic Visual and Language Expert Tuning for Multimodal Large Language Models | Code | 1
To Preserve or To Compress: An In-Depth Study of Connector Selection in Multimodal Large Language Models | Code | 1
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis | Code | 1
ParGo: Bridging Vision-Language with Partial and Global Views | Code | 1
Pensieve: Retrospect-then-Compare Mitigates Visual Hallucination | Code | 1
Prompt Highlighter: Interactive Control for Multi-Modal LLMs | Code | 1
BOLT: Boost Large Vision-Language Model Without Training for Long-form Video Understanding | Code | 1
What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning | Code | 1
Semi-supervised Domain Adaptation via Minimax Entropy | Code | 1
Dynamic Multimodal Evaluation with Flexible Complexity by Vision-Language Bootstrapping | Code | 1
Orthus: Autoregressive Interleaved Image-Text Generation with Modality-Specific Heads | - | 0
Q-Frame: Query-aware Frame Selection and Multi-Resolution Adaptation for Video-LLMs | - | 0
RITUAL: Random Image Transformations as a Universal Anti-hallucination Lever in Large Vision Language Models | - | 0
SAVEn-Vid: Synergistic Audio-Visual Integration for Enhanced Understanding in Long Video Context | - | 0
Scalable K-Medoids via True Error Bound and Familywise Bandits | - | 0
Silkie: Preference Distillation for Large Visual Language Models | - | 0
ZipVL: Efficient Large Vision-Language Models with Dynamic Token Sparsification | - | 0
AIDE: Agentically Improve Visual Language Model with Domain Experts | - | 0
An LMM for Efficient Video Understanding via Reinforced Compression of Video Cubes | - | 0
Apollo: An Exploration of Video Understanding in Large Multimodal Models | - | 0
Benchmarking and In-depth Performance Study of Large Language Models on Habana Gaudi Processors | - | 0
DAMRO: Dive into the Attention Mechanism of LVLM to Reduce Object Hallucination | - | 0
Deep Learning for Hybrid 5G Services in Mobile Edge Computing Systems: Learn from a Digital Twin | - | 0
Domain Adaptation via Minimax Entropy for Real/Bogus Classification of Astronomical Alerts | - | 0

No leaderboard results yet.