SOTAVerified|Agents Browse Leaderboard About

Benchmarking

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 501–510 of 5548 papers

Title	Date	Tasks	Status	Hype	Score
CodeIF: Benchmarking the Instruction-Following Capabilities of Large Language Models for Code Generation	Feb 26, 2025	BenchmarkingCode Generation	CodeCode Available	1	5
Benchmarking Large Multimodal Models against Common Corruptions	Jan 22, 2024	BenchmarkingImage to text	CodeCode Available	1	5
CODEMENV: Benchmarking Large Language Models on Code Migration	Jun 1, 2025	Benchmarking	CodeCode Available	1	5
Benchmarking Llama2, Mistral, Gemma and GPT for Factuality, Toxicity, Bias and Propensity for Hallucinations	Apr 15, 2024	BenchmarkingBias Detection	CodeCode Available	1	5
CODEBench: A Neural Architecture and Hardware Accelerator Co-Design Framework	Dec 7, 2022	Benchmarking	CodeCode Available	1	5
Application-Oriented Benchmarking of Quantum Generative Learning Using QUARK	Aug 8, 2023	BenchmarkingGPU	CodeCode Available	1	5
Benchmarking Differential Privacy and Federated Learning for BERT Models	Jun 26, 2021	BenchmarkingFederated Learning	CodeCode Available	1	5
OptiBench Meets ReSocratic: Measure and Improve LLMs for Optimization Modeling	Jul 13, 2024	BenchmarkingMath	CodeCode Available	1	5
CodeReef: an open platform for portable MLOps, reusable automation actions and reproducible benchmarking	Jan 22, 2020	Benchmarkingobject-detection	CodeCode Available	1	5
Benchmarking Deep Reinforcement Learning for Navigation in Denied Sensor Environments	Oct 18, 2024	Autonomous NavigationBenchmarking	CodeCode Available	1	5

Show:10 25 50

← PrevPage 51 of 555Next →

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified