SOTAVerified

Benchmarking

Papers

Showing 881-890 of 5548 papers

Title | Status | Hype
DialogueLLM: Context and Emotion Knowledge-Tuned Large Language Models for Emotion Recognition in Conversations | Code | 1
EvalCrafter: Benchmarking and Evaluating Large Video Generation Models | Code | 1
3DYoga90: A Hierarchical Video Dataset for Yoga Pose Understanding | Code | 1
pose-format: Library for Viewing, Augmenting, and Handling .pose Files | Code | 1
Welfare Diplomacy: Benchmarking Language Model Cooperation | Code | 1
"Kelly is a Warm Person, Joseph is a Role Model": Gender Biases in LLM-Generated Reference Letters | Code | 1
GeSS: Benchmarking Geometric Deep Learning under Scientific Applications with Distribution Shifts | Code | 1
MetaBox: A Benchmark Platform for Meta-Black-Box Optimization with Reinforcement Learning | Code | 1
Towards Evaluating Generalist Agents: An Automated Benchmark in Open World | Code | 1
Benchmarking and Explaining Large Language Model-based Code Generation: A Causality-Centric Approach | Code | 1

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 |  | Unverified