SOTAVerified

Benchmarking

Papers

Showing 891-900 of 5548 papers

Title | Status | Hype
Benchmarking Large Language Models on CMExam -- A Comprehensive Chinese Medical Exam Dataset | Code | 1
JOBSKAPE: A Framework for Generating Synthetic Job Postings to Enhance Skill Matching | Code | 1
An Exploration of Embodied Visual Exploration | Code | 1
Benchmarking Cognitive Biases in Large Language Models as Evaluators | Code | 1
GEOM-Drugs Revisited: Toward More Chemically Accurate Benchmarks for 3D Molecule Generation | Code | 1
Coarse-to-Fine Q-attention with Learned Path Ranking | Code | 1
CodeUpdateArena: Benchmarking Knowledge Editing on API Updates | Code | 1
An Extended Benchmarking of Multi-Agent Reinforcement Learning Algorithms in Complex Fully Cooperative Tasks | Code | 1
Benchmarking Large Vision-Language Models via Directed Scene Graph for Comprehensive Image Captioning | Code | 1
CIPCaD-Bench: Continuous Industrial Process datasets for benchmarking Causal Discovery methods | Code | 1

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified