
MME

MME is a comprehensive evaluation benchmark for multimodal large language models. It measures both perception and cognition abilities across 14 subtasks: existence, count, position, color, poster, celebrity, scene, landmark, artwork, and OCR (perception), plus commonsense reasoning, numerical calculation, text translation, and code reasoning (cognition).
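
For context on how MME numbers are typically computed: under the benchmark's standard protocol, every test image is paired with two yes/no questions, and each subtask is scored as accuracy (question level) plus accuracy+ (the fraction of images with both questions answered correctly), for a 200-point maximum per subtask; the perception and cognition totals then max out at 2000 and 800. Below is a minimal Python sketch of this scoring, assuming per-question correctness records; the mme_scores helper and its record format are illustrative, not official MME tooling.

```python
from collections import defaultdict

# Subtask grouping as listed above (10 perception + 4 cognition subtasks).
PERCEPTION = {"existence", "count", "position", "color", "poster", "celebrity",
              "scene", "landmark", "artwork", "OCR"}
COGNITION = {"commonsense reasoning", "numerical calculation",
             "text translation", "code reasoning"}

def mme_scores(records):
    """records: iterable of (subtask, image_id, is_correct) triples,
    one per yes/no question (MME asks two questions per image)."""
    # Collect per-image answer lists within each subtask.
    by_image = defaultdict(list)
    for subtask, image_id, ok in records:
        by_image[(subtask, image_id)].append(bool(ok))

    grouped = defaultdict(list)
    for (subtask, _), answers in by_image.items():
        grouped[subtask].append(answers)

    subtask_scores = {}
    for subtask, images in grouped.items():
        total_q = sum(len(a) for a in images)
        acc = sum(sum(a) for a in images) / total_q           # question-level accuracy
        acc_plus = sum(all(a) for a in images) / len(images)  # both questions correct
        subtask_scores[subtask] = 100 * (acc + acc_plus)      # max 200 per subtask

    perception = sum(v for k, v in subtask_scores.items() if k in PERCEPTION)
    cognition = sum(v for k, v in subtask_scores.items() if k in COGNITION)
    return subtask_scores, perception, cognition              # maxima: 2000 / 800
```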

Papers

Showing 26–50 of 95 papers

Title | Status | Hype
ParGo: Bridging Vision-Language with Partial and Global Views | Code | 1
Pensieve: Retrospect-then-Compare Mitigates Visual Hallucination | Code | 1
Prompt Highlighter: Interactive Control for Multi-Modal LLMs | Code | 1
BOLT: Boost Large Vision-Language Model Without Training for Long-form Video Understanding | Code | 1
Semi-supervised Domain Adaptation via Minimax Entropy | Code | 1
Dynamic Multimodal Evaluation with Flexible Complexity by Vision-Language Bootstrapping | Code | 1
SiLVR: A Simple Language-based Video Reasoning Framework | Code | 1
Masked Motion Encoding for Self-Supervised Video Representation Learning | Code | 1
To Preserve or To Compress: An In-Depth Study of Connector Selection in Multimodal Large Language Models | Code | 1
Towards Text-Image Interleaved Retrieval | Code | 1
What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning | Code | 1
MM-GNN: Mix-Moment Graph Neural Network towards Modeling Neighborhood Feature Distribution | Code | 0
TUBench: Benchmarking Large Vision-Language Models on Trustworthiness with Unanswerable Questions | Code | 0
MAAL: Multimodality-Aware Autoencoder-Based Affordance Learning for 3D Articulated Objects | Code | 0
Re-Imagining Multimodal Instruction Tuning: A Representation View | Code | 0
Joint Visual and Text Prompting for Improved Object-Centric Perception with Multimodal Large Language Models | Code | 0
Expand VSR Benchmark for VLLM to Expertize in Spatial Rules | Code | 0
InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition | Code | 0
ShareGPT4V: Improving Large Multi-Modal Models with Better Captions | Code | 0
Instruction-Aligned Visual Attention for Mitigating Hallucinations in Large Vision-Language Models | Code | 0
A Challenger to GPT-4V? Early Explorations of Gemini in Visual Expertise | Code | 0
Decoding Multilingual Moral Preferences: Unveiling LLM's Biases Through the Moral Machine Experiment | Code | 0
VISTA: Enhancing Vision-Text Alignment in MLLMs via Cross-Modal Mutual Information Maximization | - | 0
ZipVL: Efficient Large Vision-Language Models with Dynamic Token Sparsification | - | 0
AIDE: Agentically Improve Visual Language Model with Domain Experts | - | 0
Page 2 of 4

No leaderboard results yet.