Benchmarking

Papers

Showing 2111–2120 of 5548 papers

Title	Date	Tasks	Status	Hype
Scenarios and Approaches for Situated Natural Language Explanations	Jun 7, 2024	Benchmarking, In-Context Learning	Unverified	0
WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild	Jun 7, 2024	Benchmarking, Chatbot	Code Available	3
Time Sensitive Knowledge Editing through Efficient Finetuning	Jun 6, 2024	Benchmarking, knowledge editing	Unverified	0
Omni6DPose: A Benchmark and Model for Universal 6D Object Pose Estimation and Tracking	Jun 6, 2024	6D Pose Estimation using RGB, Benchmarking	Unverified	0
NATURAL PLAN: Benchmarking LLMs on Natural Language Planning	Jun 6, 2024	Benchmarking, Scheduling	Unverified	0
Benchmarking AlphaFold3's protein-protein complex accuracy and machine learning prediction reliability for binding free energy changes upon mutation	Jun 6, 2024	Benchmarking, Drug Discovery	Unverified	0
Performance of large language models in numerical vs. semantic medical knowledge: Benchmarking on evidence-based Q&As	Jun 6, 2024	Articles, Benchmarking	Unverified	0
Bench2Drive: Towards Multi-Ability Benchmarking of Closed-Loop End-To-End Autonomous Driving	Jun 6, 2024	Autonomous Driving, Bench2Drive	Code Available	4
Statistical Multicriteria Benchmarking via the GSD-Front	Jun 6, 2024	Benchmarking	Unverified	0
Better Late Than Never: Formulating and Benchmarking Recommendation Editing	Jun 6, 2024	Benchmarking, Recommendation Systems	Code Available	0

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified