Benchmarking

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 1–50 of 5548 papers

Title	Date	Tasks	Status	Hype	Score
WebWalker: Benchmarking LLMs in Web Traversal	Jan 13, 2025	BenchmarkingOpen-Domain Question Answering	CodeCode Available	11	5
StableToolBench: Towards Stable Large-Scale Benchmarking on Tool Learning of Large Language Models	Mar 12, 2024	Benchmarking	CodeCode Available	9	5
NAVSIM: Data-Driven Non-Reactive Autonomous Vehicle Simulation and Benchmarking	Jun 21, 2024	Autonomous DrivingBenchmarking	CodeCode Available	7	5
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments	Apr 11, 2024	Benchmarking	CodeCode Available	7	5
SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models	Feb 8, 2024	BenchmarkingDiversity	CodeCode Available	7	5
Segment Anything in Medical Images and Videos: Benchmark and Deployment	Aug 6, 2024	BenchmarkingSegmentation	CodeCode Available	7	5
EvoRL: A GPU-accelerated Framework for Evolutionary Reinforcement Learning	Jan 25, 2025	BenchmarkingEvolutionary Algorithms	CodeCode Available	7	5
DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference	Jan 9, 2024	BenchmarkingText Generation	CodeCode Available	7	5
ECCO: Can We Improve Model-Generated Code Efficiency Without Sacrificing Functional Correctness?	Jul 19, 2024	BenchmarkingCode Generation	CodeCode Available	7	5
Better than classical? The subtle art of benchmarking quantum machine learning models	Mar 11, 2024	BenchmarkingBinary Classification	CodeCode Available	7	5
CALE: Continuous Arcade Learning Environment	Oct 31, 2024	Atari GamesBenchmarking	CodeCode Available	7	5
TaskBench: Benchmarking Large Language Models for Task Automation	Nov 30, 2023	BenchmarkingParameter Prediction	CodeCode Available	6	5
The BrowserGym Ecosystem for Web Agent Research	Dec 6, 2024	Benchmarking	CodeCode Available	5	5
SMPLest-X: Ultimate Scaling for Expressive Human Pose and Shape Estimation	Jan 16, 2025	Benchmarking	CodeCode Available	5	5
Segment Anything Model for Medical Image Segmentation: Current Applications and Future Directions	Jan 7, 2024	BenchmarkingImage Segmentation	CodeCode Available	5	5
TFB: Towards Comprehensive and Fair Benchmarking of Time Series Forecasting Methods	Mar 29, 2024	BenchmarkingMultivariate Time Series Forecasting	CodeCode Available	5	5
Benchmarking the Myopic Trap: Positional Bias in Information Retrieval	May 20, 2025	BenchmarkingInformation Retrieval	CodeCode Available	5	5
OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations	Dec 10, 2024	AttributeBenchmarking	CodeCode Available	5	5
VBench++: Comprehensive and Versatile Benchmark Suite for Video Generative Models	Nov 20, 2024	BenchmarkingImage Generation	CodeCode Available	5	5
AssetOpsBench: Benchmarking AI Agents for Task Automation in Industrial Asset Operations and Maintenance	Jun 4, 2025	BenchmarkingScheduling	CodeCode Available	5	5
CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Benchmarking on HumanEval-X	Mar 30, 2023	BenchmarkingCode Generation	CodeCode Available	5	5
TerraTorch: The Geospatial Foundation Models Toolkit	Mar 26, 2025	BenchmarkingDecoder	CodeCode Available	4	5
Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models	Mar 20, 2025	BenchmarkingReinforcement Learning (RL)	CodeCode Available	4	5
TableGPT2: A Large Multimodal Model with Tabular Data Integration	Nov 4, 2024	BenchmarkingData Integration	CodeCode Available	4	5
shapiq: Shapley Interactions for Machine Learning	Oct 2, 2024	BenchmarkingData Valuation	CodeCode Available	4	5
Building reliable sim driving agents by scaling self-play	Feb 20, 2025	Autonomous VehiclesBenchmarking	CodeCode Available	4	5
Recent Advances in Large Langauge Model Benchmarks against Data Contamination: From Static to Dynamic Evaluation	Feb 23, 2025	Benchmarking	CodeCode Available	4	5
BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions	Jun 22, 2024	BenchmarkingCode Generation	CodeCode Available	4	5
OpenUnlearning: Accelerating LLM Unlearning via Unified Benchmarking of Methods and Metrics	Jun 14, 2025	Benchmarking	CodeCode Available	4	5
OpenAGI: When LLM Meets Domain Experts	Apr 10, 2023	BenchmarkingNatural Language Queries	CodeCode Available	4	5
Pearl: A Production-ready Reinforcement Learning Agent	Dec 6, 2023	Benchmarkingreinforcement-learning	CodeCode Available	4	5
OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning	Dec 31, 2024	BenchmarkingLogical Reasoning	CodeCode Available	4	5
Benchmarking Neural Network Training Algorithms	Jun 12, 2023	Benchmarking	CodeCode Available	4	5
MTEB: Massive Text Embedding Benchmark	Oct 13, 2022	BenchmarkingInformation Retrieval	CodeCode Available	4	5
Rankify: A Comprehensive Python Toolkit for Retrieval, Re-Ranking, and Retrieval-Augmented Generation	Feb 4, 2025	BenchmarkingInformation Retrieval	CodeCode Available	4	5
Benchmarking Graphormer on Large-Scale Molecular Modeling Datasets	Mar 9, 2022	BenchmarkingGraph Regression	CodeCode Available	4	5
LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit	May 9, 2024	BenchmarkingComputational Efficiency	CodeCode Available	4	5
Meta Audiobox Aesthetics: Unified Automatic Quality Assessment for Speech, Music, and Sound	Feb 7, 2025	Benchmarking	CodeCode Available	4	5
Aequitas Flow: Streamlining Fair ML Experimentation	May 9, 2024	BenchmarkingFairness	CodeCode Available	4	5
Benchmarking Retrieval-Augmented Generation for Medicine	Feb 20, 2024	BenchmarkingInformation Retrieval	CodeCode Available	4	5
Accelerating Data Processing and Benchmarking of AI Models for Pathology	Feb 10, 2025	Benchmarking	CodeCode Available	4	5
I Think, Therefore I am: Benchmarking Awareness of Large Language Models Using AwareBench	Jan 31, 2024	BenchmarkingMultiple-choice	CodeCode Available	4	5
Benchopt: Reproducible, efficient and collaborative optimization benchmarks	Jun 27, 2022	Benchmarkingimage-classification	CodeCode Available	4	5
AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents	May 23, 2024	Benchmarking	CodeCode Available	4	5
MLPerf Power: Benchmarking the Energy Efficiency of Machine Learning Systems from Microwatts to Megawatts for Sustainable AI	Oct 15, 2024	Benchmarking	CodeCode Available	4	5
Dora: Sampling and Benchmarking for 3D Shape Variational Auto-Encoders	Dec 23, 2024	3D Shape ModelingBenchmarking	CodeCode Available	4	5
Bench2Drive: Towards Multi-Ability Benchmarking of Closed-Loop End-To-End Autonomous Driving	Jun 6, 2024	Autonomous DrivingBench2Drive	CodeCode Available	4	5
A deep learning framework for efficient pathology image analysis	Feb 18, 2025	BenchmarkingDeep Learning	CodeCode Available	4	5
Enabling more efficient and cost-effective AI/ML systems with Collective Mind, virtualized MLOps, MLPerf, Collective Knowledge Playground and reproducible optimization tournaments	Jun 24, 2024	Benchmarking	CodeCode Available	4	5
Molecular-driven Foundation Model for Oncologic Pathology	Jan 28, 2025	BenchmarkingDiagnostic	CodeCode Available	4	5

Show:10 25 50

← PrevPage 1 of 111Next →

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified