SOTAVerified

Benchmarking

Papers

Showing 321-330 of 5548 papers

Title | Status | Hype
GPT-Fathom: Benchmarking Large Language Models to Decipher the Evolutionary Path towards GPT-4 and Beyond | Code | 2
A Content-Driven Micro-Video Recommendation Dataset at Scale | Code | 2
A Toolkit for Reliable Benchmarking and Research in Multi-Objective Reinforcement Learning | Code | 2
VerilogEval: Evaluating Large Language Models for Verilog Code Generation | Code | 2
PyGraft: Configurable Generation of Synthetic Schemas and Knowledge Graphs at Your Fingertips | Code | 2
Benchmarking Large Language Models in Retrieval-Augmented Generation | Code | 2
Orientation-Independent Chinese Text Recognition in Scene Images | Code | 2
Topical-Chat: Towards Knowledge-Grounded Open-Domain Conversations | Code | 2
BOLAA: Benchmarking and Orchestrating LLM-augmented Autonomous Agents | Code | 2
SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension | Code | 2

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified