SOTAVerified

Benchmarking

Papers

Showing 81–90 of 5548 papers

Title | Status | Hype
Embodied Agent Interface: Benchmarking LLMs for Embodied Decision Making | Code | 3
ComfyBench: Benchmarking LLM-based Agents in ComfyUI for Autonomously Designing Collaborative AI Systems | Code | 3
HEST-1k: A Dataset for Spatial Transcriptomics and Histology Image Analysis | Code | 3
AER: Auto-Encoder with Regression for Time Series Anomaly Detection | Code | 3
Language Model Council: Democratically Benchmarking Foundation Models on Highly Subjective Tasks | Code | 3
Benchmarking and Improving Bird's Eye View Perception Robustness in Autonomous Driving | Code | 3
Automatic Intrinsic Reward Shaping for Exploration in Deep Reinforcement Learning | Code | 3
Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents | Code | 3
AbdomenAtlas: A Large-Scale, Detailed-Annotated, & Multi-Center Dataset for Efficient Transfer Learning and Open Algorithmic Benchmarking | Code | 3
DISCOVERYWORLD: A Virtual Environment for Developing and Evaluating Automated Scientific Discovery Agents | Code | 3
Page 9 of 555

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified