SOTAVerified

Benchmarking

Papers

Showing 1161–1170 of 5548 papers

Title | Status | Hype
Are LLMs Capable of Data-based Statistical and Causal Reasoning? Benchmarking Advanced Quantitative Reasoning with Data | Code | 1
CodeUpdateArena: Benchmarking Knowledge Editing on API Updates | Code | 1
Benchmarking Large Language Models for News Summarization | Code | 1
Benchmarking Multidomain English-Indonesian Machine Translation | Code | 1
CoDEx: A Comprehensive Knowledge Graph Completion Benchmark | Code | 1
Ducho 2.0: Towards a More Up-to-Date Unified Framework for the Extraction of Multimodal Features in Recommendation | Code | 1
CODEMENV: Benchmarking Large Language Models on Code Migration | Code | 1
CodeReef: an open platform for portable MLOps, reusable automation actions and reproducible benchmarking | Code | 1
Benchmarking Neural Network Robustness to Common Corruptions and Surface Variations | Code | 1
CodeIF: Benchmarking the Instruction-Following Capabilities of Large Language Models for Code Generation | Code | 1

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified