SOTAVerified

Benchmarking

Papers

Showing 876–900 of 5548 papers

| Title | Status | Hype |
| --- | --- | --- |
| Fast hyperboloid decision tree algorithms | Code | 1 |
| OODRobustBench: a Benchmark and Large-Scale Analysis of Adversarial Robustness under Distribution Shift | Code | 1 |
| To Generate or Not? Safety-Driven Unlearned Diffusion Models Are Still Easy To Generate Unsafe Images ... For Now | Code | 1 |
| FactCHD: Benchmarking Fact-Conflicting Hallucination Detection | Code | 1 |
| Object-aware Inversion and Reassembly for Image Editing | Code | 1 |
| DialogueLLM: Context and Emotion Knowledge-Tuned Large Language Models for Emotion Recognition in Conversations | Code | 1 |
| EvalCrafter: Benchmarking and Evaluating Large Video Generation Models | Code | 1 |
| 3DYoga90: A Hierarchical Video Dataset for Yoga Pose Understanding | Code | 1 |
| Welfare Diplomacy: Benchmarking Language Model Cooperation | Code | 1 |
| pose-format: Library for Viewing, Augmenting, and Handling .pose Files | Code | 1 |
| "Kelly is a Warm Person, Joseph is a Role Model": Gender Biases in LLM-Generated Reference Letters | Code | 1 |
| Towards Evaluating Generalist Agents: An Automated Benchmark in Open World | Code | 1 |
| GeSS: Benchmarking Geometric Deep Learning under Scientific Applications with Distribution Shifts | Code | 1 |
| MetaBox: A Benchmark Platform for Meta-Black-Box Optimization with Reinforcement Learning | Code | 1 |
| What If the TV Was Off? Examining Counterfactual Reasoning Abilities of Multi-modal Language Models | Code | 1 |
| Benchmarking and Explaining Large Language Model-based Code Generation: A Causality-Centric Approach | Code | 1 |
| PepMLM: Target Sequence-Conditioned Generation of Therapeutic Peptide Binders via Span Masked Language Modeling | Code | 1 |
| Can Language Models Employ the Socratic Method? Experiments with Code Debugging | Code | 1 |
| GNNX-BENCH: Unravelling the Utility of Perturbation-based GNN Explainers through In-depth Benchmarking | Code | 1 |
| CausalTime: Realistically Generated Time-series for Benchmarking of Causal Discovery | Code | 1 |
| PGDQN: Preference-Guided Deep Q-Network | Code | 1 |
| Who is ChatGPT? Benchmarking LLMs' Psychological Portrayal Using PsychoBench | Code | 1 |
| NewsRecLib: A PyTorch-Lightning Library for Neural News Recommendation | Code | 1 |
| FELM: Benchmarking Factuality Evaluation of Large Language Models | Code | 1 |
| Benchmarking Cognitive Biases in Large Language Models as Evaluators | Code | 1 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | GPT-4 Turbo | ACC | 0.56 | | Unverified |