SOTAVerified

Benchmarking

Papers

Showing 201-210 of 5548 papers

Title | Status | Hype
ODRL: A Benchmark for Off-Dynamics Reinforcement Learning | Code | 2
CoqPilot, a plugin for LLM-based generation of proofs | Code | 2
Open6DOR: Benchmarking Open-instruction 6-DoF Object Rearrangement and A VLM-based Approach | Code | 2
RM-Bench: Benchmarking Reward Models of Language Models with Subtlety and Style | Code | 2
Multi-IF: Benchmarking LLMs on Multi-Turn and Multilingual Instructions Following | Code | 2
IntersectionZoo: Eco-driving for Benchmarking Multi-Agent Contextual Reinforcement Learning | Code | 2
SPA-Bench: A Comprehensive Benchmark for SmartPhone Agent Evaluation | Code | 2
LLM-Based Multi-Agent Systems are Scalable Graph Generative Models | Code | 2
Benchmarking Agentic Workflow Generation | Code | 2
COMPL-AI Framework: A Technical Interpretation and LLM Benchmarking Suite for the EU Artificial Intelligence Act | Code | 2

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified