SOTAVerified

Benchmarking

Papers

Showing 371-380 of 5548 papers

Title | Status | Hype
CoqPilot, a plugin for LLM-based generation of proofs | Code | 2
CoIR: A Comprehensive Benchmark for Code Information Retrieval Models | Code | 2
Commit0: Library Generation from Scratch | Code | 2
CORAL: Benchmarking Multi-turn Conversational Retrieval-Augmentation Generation | Code | 2
Datasets and Benchmarks for Offline Safe Reinforcement Learning | Code | 2
Class-incremental Learning for Time Series: Benchmark and Evaluation | Code | 2
ClimateLearn: Benchmarking Machine Learning for Weather and Climate Modeling | Code | 2
Panoptic Scene Graph Generation | Code | 2
CausalGym: Benchmarking causal interpretability methods on linguistic tasks | Code | 2
MultiPL-E: A Scalable and Extensible Approach to Benchmarking Neural Code Generation | Code | 2
Page 38 of 555

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified