SOTAVerified

Benchmarking

Papers

Showing 26-50 of 5548 papers

Title | Status | Hype
Building reliable sim driving agents by scaling self-play | Code | 4
Recent Advances in Large Langauge Model Benchmarks against Data Contamination: From Static to Dynamic Evaluation | Code | 4
BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions | Code | 4
OpenUnlearning: Accelerating LLM Unlearning via Unified Benchmarking of Methods and Metrics | Code | 4
OpenAGI: When LLM Meets Domain Experts | Code | 4
Pearl: A Production-ready Reinforcement Learning Agent | Code | 4
OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning | Code | 4
Benchmarking Neural Network Training Algorithms | Code | 4
MTEB: Massive Text Embedding Benchmark | Code | 4
Rankify: A Comprehensive Python Toolkit for Retrieval, Re-Ranking, and Retrieval-Augmented Generation | Code | 4
Benchmarking Graphormer on Large-Scale Molecular Modeling Datasets | Code | 4
LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit | Code | 4
Meta Audiobox Aesthetics: Unified Automatic Quality Assessment for Speech, Music, and Sound | Code | 4
Aequitas Flow: Streamlining Fair ML Experimentation | Code | 4
Benchmarking Retrieval-Augmented Generation for Medicine | Code | 4
Accelerating Data Processing and Benchmarking of AI Models for Pathology | Code | 4
I Think, Therefore I am: Benchmarking Awareness of Large Language Models Using AwareBench | Code | 4
Benchopt: Reproducible, efficient and collaborative optimization benchmarks | Code | 4
AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents | Code | 4
MLPerf Power: Benchmarking the Energy Efficiency of Machine Learning Systems from Microwatts to Megawatts for Sustainable AI | Code | 4
Dora: Sampling and Benchmarking for 3D Shape Variational Auto-Encoders | Code | 4
Bench2Drive: Towards Multi-Ability Benchmarking of Closed-Loop End-To-End Autonomous Driving | Code | 4
A deep learning framework for efficient pathology image analysis | Code | 4
Enabling more efficient and cost-effective AI/ML systems with Collective Mind, virtualized MLOps, MLPerf, Collective Knowledge Playground and reproducible optimization tournaments | Code | 4
Molecular-driven Foundation Model for Oncologic Pathology | Code | 4

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified