SOTAVerified

Benchmarking

Papers

Showing 4771–4780 of 5548 papers

Title	Date	Tasks	Status	Hype
MineRL: A Large-Scale Dataset of Minecraft Demonstrations	Jul 29, 2019	Benchmarking, Deep Reinforcement Learning	Code Available	0
GenCeption: Evaluate Multimodal LLMs with Unlabeled Unimodal Data	Feb 22, 2024	Benchmarking	Code Available	0
GECOBench: A Gender-Controlled Text Dataset and Benchmark for Quantifying Biases in Explanations	Jun 17, 2024	Benchmarking, Dataset Generation	Code Available	0
Mining-Gym: A Configurable RL Benchmarking Environment for Truck Dispatch Scheduling	Mar 24, 2025	Benchmarking, OpenAI Gym	Code Available	0
Fully Automatic Segmentation of Gross Target Volume and Organs-at-Risk for Radiotherapy Planning of Nasopharyngeal Carcinoma	Oct 4, 2023	Benchmarking, Segmentation	Code Available	0
MIP-GAF: A MLLM-annotated Benchmark for Most Important Person Localization and Group Context Understanding	Sep 10, 2024	Benchmarking, Language Modeling	Code Available	0
Mirage: Model-Agnostic Graph Distillation for Graph Classification	Oct 14, 2023	Benchmarking, Classification	Code Available	0
Benchmarking Subset Selection from Large Candidate Solution Sets in Evolutionary Multi-objective Optimization	Jan 18, 2022	Benchmarking	Code Available	0
Sanity Simulations for Saliency Methods	May 13, 2021	Benchmarking	Code Available	0
From Variability to Stability: Advancing RecSys Benchmarking Practices	Feb 15, 2024	Benchmarking, Collaborative Filtering	Code Available	0

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified