SOTAVerified

Benchmarking

Papers

Showing 1181–1190 of 5548 papers

| Title | Status | Hype |
|---|---|---|
| CoDEx: A Comprehensive Knowledge Graph Completion Benchmark | Code | 1 |
| Benchmarking Multi-Scene Fire and Smoke Detection | Code | 1 |
| CodeUpdateArena: Benchmarking Knowledge Editing on API Updates | Code | 1 |
| Collective Knowledge: organizing research projects as a database of reusable components and portable workflows with common APIs | Code | 1 |
| Entering Real Social World! Benchmarking the Social Intelligence of Large Language Models from a First-person Perspective | Code | 1 |
| EntQA: Entity Linking as Question Answering | Code | 1 |
| Benchmarking Natural Language Understanding Services for building Conversational Agents | Code | 1 |
| CodeIF: Benchmarking the Instruction-Following Capabilities of Large Language Models for Code Generation | Code | 1 |
| Are Large Language Models Really Good Logical Reasoners? A Comprehensive Evaluation and Beyond | Code | 1 |
| CODEMENV: Benchmarking Large Language Models on Code Migration | Code | 1 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GPT-4 Turbo | ACC | 0.56 | | Unverified |