Benchmarking

Papers

Showing 681–690 of 5548 papers

Title	Date	Tasks	Status	Hype	Score
Decentralized Arena: Towards Democratic and Scalable Automatic Evaluation of Language Models	May 19, 2025	Benchmarking, Chatbot	Code Available	1	5
Depth-Driven Geometric Prompt Learning for Laparoscopic Liver Landmark Detection	Jun 25, 2024	Benchmarking, Prompt Learning	Code Available	1	5
Benchmarking Language Model Creativity: A Case Study on Code Generation	Jul 12, 2024	Benchmarking, Code Generation	Code Available	1	5
Benchmarking Large Language Models on CMExam -- A Comprehensive Chinese Medical Exam Dataset	Jun 5, 2023	Benchmarking, Multiple-choice	Code Available	1	5
Descending through a Crowded Valley - Benchmarking Deep Learning Optimizers	Jan 1, 2021	Benchmarking, Deep Learning	Code Available	1	5
AutoDetect: Towards a Unified Framework for Automated Weakness Detection in Large Language Models	Jun 24, 2024	Benchmarking, Data Augmentation	Code Available	1	5
A Comprehensive Study on Large-Scale Graph Training: Benchmarking and Rethinking	Oct 14, 2022	Benchmarking, GPU	Code Available	1	5
Benchmarking Llama2, Mistral, Gemma and GPT for Factuality, Toxicity, Bias and Propensity for Hallucinations	Apr 15, 2024	Benchmarking, Bias Detection	Code Available	1	5
DFGC 2021: A DeepFake Game Competition	Jun 2, 2021	Benchmarking, DeepFake Detection	Code Available	1	5
Data Splits and Metrics for Method Benchmarking on Surgical Action Triplet Datasets	Apr 11, 2022	Action Triplet Recognition, Benchmarking	Code Available	1	5

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified