Benchmarking

Papers

Showing 1671–1680 of 5548 papers

Title	Date	Tasks	Status	Hype	Score
Is Your Model Fairly Certain? Uncertainty-Aware Fairness Evaluation for LLMs	May 29, 2025	Benchmarking, Fairness	Code Available	0	5
JALMBench: Benchmarking Jailbreak Vulnerabilities in Audio Language Models	May 23, 2025	Benchmarking, Diversity	Code Available	0	5
A Benchmark on Extremely Weakly Supervised Text Classification: Reconcile Seed Matching and Prompting Approaches	May 22, 2023	Benchmarking, Classification	Code Available	0	5
Certifiable Black-Box Attacks with Randomized Adversarial Examples: Breaking Defenses with Provable Confidence	Apr 10, 2023	Benchmarking, speech-recognition	Code Available	0	5
CEBench: A Benchmarking Toolkit for the Cost-Effectiveness of LLM Pipelines	Jun 20, 2024	Benchmarking, Decision Making	Code Available	0	5
Benchmarking and Improving Compositional Generalization of Multi-aspect Controllable Text Generation	Apr 5, 2024	Attribute, Benchmarking	Code Available	0	5
ISImed: A Framework for Self-Supervised Learning using Intrinsic Spatial Information in Medical Images	Oct 22, 2024	Benchmarking, Self-Supervised Learning	Code Available	0	5
DyKnow: Dynamically Verifying Time-Sensitive Factual Knowledge in LLMs	Apr 10, 2024	Benchmarking, knowledge editing	Code Available	0	5
Joint Multi-Scale Tone Mapping and Denoising for HDR Image Enhancement	Mar 16, 2023	Benchmarking, Demosaicking	Code Available	0	5
Keep Security! Benchmarking Security Policy Preservation in Large Language Model Contexts Against Indirect Attacks in Question Answering	May 21, 2025	Benchmarking, Language Modeling	Code Available	0	5

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified