SOTAVerified

Benchmarking

Papers

Showing 311-320 of 5548 papers

Title | Status | Hype
PG-Video-LLaVA: Pixel Grounding Large Video-Language Models | Code | 2
Exponentially Faster Language Modelling | Code | 2
What's In My Big Data? | Code | 2
Battle of the Backbones: A Large-Scale Comparison of Pretrained Models across Computer Vision Tasks | Code | 2
Formalizing and Benchmarking Prompt Injection Attacks and Defenses | Code | 2
Octopus: Embodied Vision-Language Programmer from Environmental Feedback | Code | 2
ProbTS: Benchmarking Point and Distributional Forecasting across Diverse Prediction Horizons | Code | 2
MLAgentBench: Evaluating Language Agents on Machine Learning Experimentation | Code | 2
RoleLLM: Benchmarking, Eliciting, and Enhancing Role-Playing Abilities of Large Language Models | Code | 2
GPT-Fathom: Benchmarking Large Language Models to Decipher the Evolutionary Path towards GPT-4 and Beyond | Code | 2

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | - | Unverified