SOTAVerified

Benchmarking

Papers

Showing 3376-3400 of 5548 papers

Title | Status | Hype
Microvasculature Segmentation in Human BioMolecular Atlas Program (HuBMAP) | | 0
MileBench: Benchmarking MLLMs in Long Context | | 0
MiLQ: Benchmarking IR Models for Bilingual Web Search with Mixed Language Queries | | 0
Mind2Web 2: Evaluating Agentic Search with Agent-as-a-Judge | | 0
Are Machines Better at Complex Reasoning? Unveiling Human-Machine Inference Gaps in Entailment Verification | | 0
Mind the Retrosynthesis Gap: Bridging the divide between Single-step and Multi-step Retrosynthesis Prediction | | 0
Mind Your Theory: Theory of Mind Goes Deeper Than Reasoning | | 0
MIRAI: Evaluating LLM Agents for Event Forecasting | | 0
MIR-Bench: Can Your LLM Recognize Complicated Patterns via Many-Shot In-Context Reasoning? | | 0
Mitigating severe over-parameterization in deep convolutional neural networks through forced feature abstraction and compression with an entropy-based heuristic | | 0
Mixed-Precision Quantization for Federated Learning on Resource-Constrained Heterogeneous Devices | | 0
MJ-VIDEO: Fine-Grained Benchmarking and Rewarding Video Preferences in Video Generation | | 0
MLAR: Multi-layer Large Language Model-based Robotic Process Automation Applicant Tracking | | 0
MLHarness: A Scalable Benchmarking System for MLCommons | | 0
MLModelScope: A Distributed Platform for ML Model Evaluation and Benchmarking at Scale | | 0
MLModelScope: A Distributed Platform for Model Evaluation and Benchmarking at Scale | | 0
MLPerf HPC: A Holistic Benchmark Suite for Scientific Machine Learning on HPC Systems | | 0
mlr3proba: An R Package for Machine Learning in Survival Analysis | | 0
ML-SUPERB 2.0: Benchmarking Multilingual Speech Models Across Modeling Constraints, Languages, and Datasets | | 0
MMDocBench: Benchmarking Large Vision-Language Models for Fine-Grained Visual Document Understanding | | 0
MMDocIR: Benchmarking Multi-Modal Retrieval for Long Documents | | 0
MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency | | 0
MME-Unify: A Comprehensive Benchmark for Unified Multimodal Understanding and Generation Models | | 0
MMInA: Benchmarking Multihop Multimodal Internet Agents | | 0
MMMG: a Comprehensive and Reliable Evaluation Suite for Multitask Multimodal Generation | | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified