SOTAVerified

Benchmarking

Papers

Showing 891–900 of 5548 papers

Title | Status | Hype
Benchmarking and Explaining Large Language Model-based Code Generation: A Causality-Centric Approach | Code | 1
PepMLM: Target Sequence-Conditioned Generation of Therapeutic Peptide Binders via Span Masked Language Modeling | Code | 1
Can Language Models Employ the Socratic Method? Experiments with Code Debugging | Code | 1
GNNX-BENCH: Unravelling the Utility of Perturbation-based GNN Explainers through In-depth Benchmarking | Code | 1
CausalTime: Realistically Generated Time-series for Benchmarking of Causal Discovery | Code | 1
PGDQN: Preference-Guided Deep Q-Network | Code | 1
Who is ChatGPT? Benchmarking LLMs' Psychological Portrayal Using PsychoBench | Code | 1
NewsRecLib: A PyTorch-Lightning Library for Neural News Recommendation | Code | 1
FELM: Benchmarking Factuality Evaluation of Large Language Models | Code | 1
Benchmarking Cognitive Biases in Large Language Models as Evaluators | Code | 1

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified