SOTAVerified

Benchmarking

Papers

Showing 1121–1130 of 5548 papers

Title	Date	Tasks	Status	Hype	Score
Benchmarking Large Language Models on Answering and Explaining Challenging Medical Questions	Feb 28, 2024	Benchmarking, Multiple-choice	Code Available	1	5
Benchmarking the Combinatorial Generalizability of Complex Query Answering on Knowledge Graphs	Sep 18, 2021	Benchmarking, Complex Query Answering	Code Available	1	5
Are Vision Language Models Ready for Clinical Diagnosis? A 3D Medical Benchmark for Tumor-centric Visual Question Answering	May 25, 2025	Anatomy, Benchmarking	Code Available	1	5
Graphs, Constraints, and Search for the Abstraction and Reasoning Corpus	Oct 18, 2022	ARC, Benchmarking	Code Available	1	5
Should we be going MAD? A Look at Multi-Agent Debate Strategies for LLMs	Nov 29, 2023	Benchmarking	Code Available	1	5
Benchmarking LLMs' Swarm intelligence	May 7, 2025	Benchmarking	Code Available	1	5
Benchmarking Relief-Based Feature Selection Methods for Bioinformatics Data Mining	Nov 22, 2017	Benchmarking, feature selection	Code Available	1	5
Benchmarking Large Language Models on Controllable Generation under Diversified Instructions	Jan 1, 2024	Benchmarking, Instruction Following	Code Available	1	5
Benchmarking the CoW with the TopCoW Challenge: Topology-Aware Anatomical Segmentation of the Circle of Willis for CTA and MRA	Dec 29, 2023	Anatomy, Benchmarking	Code Available	1	5
Benchmarking Robustness to Adversarial Image Obfuscations	Jan 30, 2023	Benchmarking	Code Available	1	5


Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified