SOTAVerified

Benchmarking

Papers

Showing 1621–1630 of 5548 papers

Title | Status | Hype
BSBench: will your LLM find the largest prime number? | Code | 0
Urania: Differentially Private Insights into AI Use | | 0
Debatable Intelligence: Benchmarking LLM Judges via Debate Speech Evaluation | Code | 0
AV-Reasoner: Improving and Benchmarking Clue-Grounded Audio-Visual Counting for MLLMs | | 0
VideoMathQA: Benchmarking Mathematical Reasoning via Multimodal Understanding in Videos | | 0
A Unified Framework for Provably Efficient Algorithms to Estimate Shapley Values | | 0
CzechLynx: A Dataset for Individual Identification and Pose Estimation of the Eurasian Lynx | | 0
DIMCIM: A Quantitative Evaluation Framework for Default-mode Diversity and Generalization in Text-to-Image Generative Models | | 0
Refer to Anything with Vision-Language Prompts | | 0
From Standalone LLMs to Integrated Intelligence: A Survey of Compound AI Systems | | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified