SOTAVerified

Benchmarking

Papers

Showing 1701–1710 of 5548 papers

Title | Status | Hype
From Grounding to Planning: Benchmarking Bottlenecks in Web Agents | - | 0
A practical generalization metric for deep networks benchmarking | - | 0
Landscape-Aware Automated Algorithm Configuration using Multi-output Mixed Regression and Classification | - | 0
Towards Student Actions in Classroom Scenes: New Dataset and Baseline | Code | 1
Revisiting Safe Exploration in Safe Reinforcement learning | - | 0
ComfyBench: Benchmarking LLM-based Agents in ComfyUI for Autonomously Designing Collaborative AI Systems | Code | 3
Benchmarking LLM Code Generation for Audio Programming with Visual Dataflow Languages | - | 0
Accelerating the discovery of steady-states of planetary interior dynamics with machine learning | - | 0
Understanding the User: An Intent-Based Ranking Dataset | - | 0
SYNTHEVAL: Hybrid Behavioral Testing of NLP Models with Synthetic CheckLists | Code | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | - | Unverified