SOTAVerified

Benchmarking

Papers

Title	Date	Tasks	Status	Hype
Towards Ground-truth-free Evaluation of Any Segmentation in Medical Images	Sep 23, 2024	Benchmarking, Segmentation	Code Available	0
Style Outweighs Substance: Failure Modes of LLM Judges in Alignment Benchmarking	Sep 23, 2024	Benchmarking, Diversity	Code Available	0
Benchmarking Edge AI Platforms for High-Performance ML Inference	Sep 23, 2024	Benchmarking, CPU	Unverified	0
Building a continuous benchmarking ecosystem in bioinformatics	Sep 23, 2024	Benchmarking	Unverified	0
AlphaZip: Neural Network-Enhanced Lossless Text Compression	Sep 23, 2024	Benchmarking, Data Compression	Code Available	0
The Ability of Large Language Models to Evaluate Constraint-satisfaction in Agent Responses to Open-ended Requests	Sep 22, 2024	Benchmarking	Unverified	0
Investigating the Impact of Hard Samples on Accuracy Reveals In-class Data Imbalance	Sep 22, 2024	AutoML, Benchmarking	Code Available	0
Margin-bounded Confidence Scores for Out-of-Distribution Detection	Sep 22, 2024	Autonomous Driving, Benchmarking	Code Available	0
Sketch 'n Solve: An Efficient Python Package for Large-Scale Least Squares Using Randomized Numerical Linear Algebra	Sep 22, 2024	Benchmarking	Unverified	0
Can LLMs replace Neil deGrasse Tyson? Evaluating the Reliability of LLMs as Science Communicators	Sep 21, 2024	Benchmarking	Code Available	0

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified