Benchmarking

Papers


Showing 51–75 of 5548 papers

Title	Date	Tasks	Status	Hype
Benchopt: Reproducible, efficient and collaborative optimization benchmarks	Jun 27, 2022	Benchmarking, image-classification	Code Available	4
RecBole 2.0: Towards a More Up-to-Date Recommendation Library	Jun 15, 2022	Benchmarking, Data Augmentation	Code Available	4
Benchmarking Graphormer on Large-Scale Molecular Modeling Datasets	Mar 9, 2022	Benchmarking, Graph Regression	Code Available	4
TabArena: A Living Benchmark for Machine Learning on Tabular Data	Jun 20, 2025	Benchmarking	Code Available	3
ANIRA: An Architecture for Neural Network Inference in Real-Time Audio Applications	Jun 14, 2025	Benchmarking	Code Available	3
ChartGalaxy: A Dataset for Infographic Chart Understanding and Generation	May 24, 2025	Benchmarking, Chart Understanding	Code Available	3
IFEval-Audio: Benchmarking Instruction-Following Capability in Audio-based Large Language Models	May 22, 2025	Benchmarking, Instruction Following	Code Available	3
AudioTrust: Benchmarking the Multifaceted Trustworthiness of Audio Large Language Models	May 22, 2025	Benchmarking, Fairness	Code Available	3
OmniGenBench: A Modular Platform for Reproducible Genomic Foundation Models Benchmarking	May 20, 2025	Benchmarking	Code Available	3
Time Travel is Cheating: Going Live with DeepFund for Real-Time Fund Investment Benchmarking	May 16, 2025	Benchmarking, Management	Code Available	3
The ML.ENERGY Benchmark: Toward Automated Inference Energy Measurement and Optimization	May 9, 2025	Benchmarking	Code Available	3
REAL: Benchmarking Autonomous Agents on Deterministic Simulations of Real Websites	Apr 15, 2025	Autonomous Web Navigation, Benchmarking	Code Available	3
StableToolBench-MirrorAPI: Modeling Tool Environments as Mirrors of 7,000+ Real-World APIs	Mar 26, 2025	Benchmarking	Code Available	3
SceneSplat: Gaussian Splatting-based Scene Understanding with Vision-Language Pretraining	Mar 23, 2025	3DGS, Benchmarking	Code Available	3
Robust Latent Matters: Boosting Image Generation with Sampling Error	Mar 11, 2025	Benchmarking, Image Generation	Code Available	3
nnInteractive: Redefining 3D Promptable Segmentation	Mar 11, 2025	Benchmarking, Interactive Segmentation	Code Available	3
OpenTAD: A Unified Framework and Comprehensive Study of Temporal Action Detection	Feb 27, 2025	Action Detection, Benchmarking	Code Available	3
BatteryLife: A Comprehensive Dataset and Benchmark for Battery Life Prediction	Feb 26, 2025	Benchmarking, Time Series	Code Available	3
ITBench: Evaluating AI Agents across Diverse Real-World IT Automation Tasks	Feb 7, 2025	Benchmarking	Code Available	3
MedAgentBench: A Realistic Virtual EHR Environment to Benchmark Medical LLM Agents	Jan 24, 2025	Benchmarking	Code Available	3
Prithvi-EO-2.0: A Versatile Multi-Temporal Foundation Model for Earth Observation Applications	Dec 3, 2024	Benchmarking, Disaster Response	Code Available	3
Caravan MultiMet: Extending Caravan with Multiple Weather Nowcasts and Forecasts	Nov 14, 2024	Benchmarking	Code Available	3
General Geospatial Inference with a Population Dynamics Foundation Model	Nov 11, 2024	Benchmarking, Graph Neural Network	Code Available	3
Benchmarking Multimodal Retrieval Augmented Generation with Dynamic VQA Dataset and Self-adaptive Planning Agent	Nov 5, 2024	Benchmarking, Hallucination	Code Available	3
AndroidLab: Training and Systematic Benchmarking of Android Autonomous Agents	Oct 31, 2024	Benchmarking	Code Available	3


Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified