SOTAVerified

Benchmarking

Papers

Showing 2391–2400 of 5548 papers

Title | Status | Hype
DQI: Measuring Data Quality in NLP | Code | 0
Benchmarking Large Vision-Language Models on Fine-Grained Image Tasks: A Comprehensive Evaluation | Code | 0
A General Benchmarking Framework for Text Generation | Code | 0
A Closer Look at Temporal Sentence Grounding in Videos: Dataset and Metric | Code | 0
Generative Models for Fast Simulation of Cherenkov Detectors at the Electron-Ion Collider | Code | 0
Benchmarking Large Language Model Uncertainty for Prompt Optimization | Code | 0
Are Personalized Stochastic Parrots More Dangerous? Evaluating Persona Biases in Dialogue Systems | Code | 0
Domain-Expanded ASTE: Rethinking Generalization in Aspect Sentiment Triplet Extraction | Code | 0
Arena-Rosnav 2.0: A Development and Benchmarking Platform for Robot Navigation in Highly Dynamic Environments | Code | 0
GenderBench: Evaluation Suite for Gender Biases in LLMs | Code | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified