Benchmarking

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 26–50 of 5548 papers

Title	Date	Tasks	Status	Hype	Score
Building reliable sim driving agents by scaling self-play	Feb 20, 2025	Autonomous VehiclesBenchmarking	CodeCode Available	4	5
Recent Advances in Large Langauge Model Benchmarks against Data Contamination: From Static to Dynamic Evaluation	Feb 23, 2025	Benchmarking	CodeCode Available	4	5
BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions	Jun 22, 2024	BenchmarkingCode Generation	CodeCode Available	4	5
OpenUnlearning: Accelerating LLM Unlearning via Unified Benchmarking of Methods and Metrics	Jun 14, 2025	Benchmarking	CodeCode Available	4	5
OpenAGI: When LLM Meets Domain Experts	Apr 10, 2023	BenchmarkingNatural Language Queries	CodeCode Available	4	5
Pearl: A Production-ready Reinforcement Learning Agent	Dec 6, 2023	Benchmarkingreinforcement-learning	CodeCode Available	4	5
OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning	Dec 31, 2024	BenchmarkingLogical Reasoning	CodeCode Available	4	5
Benchmarking Neural Network Training Algorithms	Jun 12, 2023	Benchmarking	CodeCode Available	4	5
MTEB: Massive Text Embedding Benchmark	Oct 13, 2022	BenchmarkingInformation Retrieval	CodeCode Available	4	5
Rankify: A Comprehensive Python Toolkit for Retrieval, Re-Ranking, and Retrieval-Augmented Generation	Feb 4, 2025	BenchmarkingInformation Retrieval	CodeCode Available	4	5
Benchmarking Graphormer on Large-Scale Molecular Modeling Datasets	Mar 9, 2022	BenchmarkingGraph Regression	CodeCode Available	4	5
LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit	May 9, 2024	BenchmarkingComputational Efficiency	CodeCode Available	4	5
Meta Audiobox Aesthetics: Unified Automatic Quality Assessment for Speech, Music, and Sound	Feb 7, 2025	Benchmarking	CodeCode Available	4	5
Aequitas Flow: Streamlining Fair ML Experimentation	May 9, 2024	BenchmarkingFairness	CodeCode Available	4	5
Benchmarking Retrieval-Augmented Generation for Medicine	Feb 20, 2024	BenchmarkingInformation Retrieval	CodeCode Available	4	5
Accelerating Data Processing and Benchmarking of AI Models for Pathology	Feb 10, 2025	Benchmarking	CodeCode Available	4	5
I Think, Therefore I am: Benchmarking Awareness of Large Language Models Using AwareBench	Jan 31, 2024	BenchmarkingMultiple-choice	CodeCode Available	4	5
Benchopt: Reproducible, efficient and collaborative optimization benchmarks	Jun 27, 2022	Benchmarkingimage-classification	CodeCode Available	4	5
AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents	May 23, 2024	Benchmarking	CodeCode Available	4	5
MLPerf Power: Benchmarking the Energy Efficiency of Machine Learning Systems from Microwatts to Megawatts for Sustainable AI	Oct 15, 2024	Benchmarking	CodeCode Available	4	5
Dora: Sampling and Benchmarking for 3D Shape Variational Auto-Encoders	Dec 23, 2024	3D Shape ModelingBenchmarking	CodeCode Available	4	5
Bench2Drive: Towards Multi-Ability Benchmarking of Closed-Loop End-To-End Autonomous Driving	Jun 6, 2024	Autonomous DrivingBench2Drive	CodeCode Available	4	5
A deep learning framework for efficient pathology image analysis	Feb 18, 2025	BenchmarkingDeep Learning	CodeCode Available	4	5
Enabling more efficient and cost-effective AI/ML systems with Collective Mind, virtualized MLOps, MLPerf, Collective Knowledge Playground and reproducible optimization tournaments	Jun 24, 2024	Benchmarking	CodeCode Available	4	5
Molecular-driven Foundation Model for Oncologic Pathology	Jan 28, 2025	BenchmarkingDiagnostic	CodeCode Available	4	5

Show:10 25 50

← PrevPage 2 of 222Next →

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified