
Benchmarking

Papers


Showing 2081–2090 of 5548 papers

Title	Date	Tasks	Status	Hype
An Analysis of Model Robustness across Concurrent Distribution Shifts	Jan 8, 2025	Benchmarking	Unverified	0
Can LLMs Deceive CLIP? Benchmarking Adversarial Compositionality of Pre-trained Multimodal Representation via Text Updates	May 28, 2025	Benchmarking, Diversity	Unverified	0
Can LLMs Be Trusted for Evaluating RAG Systems? A Survey of Methods and Datasets	Apr 28, 2025	Articles, Benchmarking	Unverified	0
Benchmarking a (μ+λ) Genetic Algorithm with Configurable Crossover Probability	Jun 10, 2020	Benchmarking	Unverified	0
Can Large Multimodal Models Understand Agricultural Scenes? Benchmarking with AgroMind	May 18, 2025	Benchmarking, Scene Understanding	Unverified	0
Can Language Models Serve as Text-Based World Simulators?	Jun 10, 2024	Benchmarking, Decision Making	Unverified	0
Benchmarking AlphaFold3's protein-protein complex accuracy and machine learning prediction reliability for binding free energy changes upon mutation	Jun 6, 2024	Benchmarking, Drug Discovery	Unverified	0
Evaluating Nuanced Bias in Large Language Model Free Response Answers	Jul 11, 2024	Benchmarking, Language Modeling	Unverified	0
Benchmarking Algorithms from Machine Learning for Low-Budget Black-Box Optimization	Sep 29, 2021	Bayesian Optimization, Benchmarking	Unverified	0
Can humans help BERT gain "confidence"?	Aug 31, 2023	Benchmarking, EEG	Unverified	0


Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified