Benchmarking

Papers

Showing 1701–1710 of 5548 papers

Title	Date	Tasks	Status	Hype
From Grounding to Planning: Benchmarking Bottlenecks in Web Agents	Sep 3, 2024	Benchmarking	Unverified	0
A practical generalization metric for deep networks benchmarking	Sep 2, 2024	Benchmarking, Diversity	Unverified	0
Landscape-Aware Automated Algorithm Configuration using Multi-output Mixed Regression and Classification	Sep 2, 2024	Benchmarking	Unverified	0
Towards Student Actions in Classroom Scenes: New Dataset and Baseline	Sep 2, 2024	Action Detection, Benchmarking	Code Available	1
Revisiting Safe Exploration in Safe Reinforcement learning	Sep 2, 2024	Benchmarking, reinforcement-learning	Unverified	0
ComfyBench: Benchmarking LLM-based Agents in ComfyUI for Autonomously Designing Collaborative AI Systems	Sep 2, 2024	Benchmarking, Instruction Following	Code Available	3
Benchmarking LLM Code Generation for Audio Programming with Visual Dataflow Languages	Sep 1, 2024	Benchmarking, Code Generation	Unverified	0
Accelerating the discovery of steady-states of planetary interior dynamics with machine learning	Aug 30, 2024	Benchmarking	Unverified	0
Understanding the User: An Intent-Based Ranking Dataset	Aug 30, 2024	Benchmarking, Information Retrieval	Unverified	0
SYNTHEVAL: Hybrid Behavioral Testing of NLP Models with Synthetic CheckLists	Aug 30, 2024	Benchmarking, Sentiment Analysis	Code Available	0

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified