SOTAVerified

Benchmarking

Papers

Showing 1976-2000 of 5548 papers

Title | Status | Hype
CaT-BENCH: Benchmarking Language Model Understanding of Causal and Temporal Dependencies in Plans | - | 0
BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions | Code | 4
Benchmarking Uncertainty Quantification Methods for Large Language Models with LM-Polygraph | Code | 2
GenoTEX: An LLM Agent Benchmark for Automated Gene Expression Data Analysis | Code | 2
Deciphering the Definition of Adversarial Robustness for post-hoc OOD Detectors | - | 0
Six-CD: Benchmarking Concept Removals for Benign Text-to-image Diffusion Models | Code | 1
NAVSIM: Data-Driven Non-Reactive Autonomous Vehicle Simulation and Benchmarking | Code | 7
Benchmarking Retinal Blood Vessel Segmentation Models for Cross-Dataset and Cross-Disease Generalization | Code | 0
Sports Intelligence: Assessing the Sports Understanding Capabilities of Language Models through Question Answering from Text to Video | - | 0
FlowBench: Revisiting and Benchmarking Workflow-Guided Planning for LLM-based Agents | - | 0
CEBench: A Benchmarking Toolkit for the Cost-Effectiveness of LLM Pipelines | Code | 0
Improving Expert Radiology Report Summarization by Prompting Large Language Models with a Layperson Summary | - | 0
QeMFi: A Multifidelity Dataset of Quantum Chemical Properties of Diverse Molecules | Code | 0
Beyond Optimism: Exploration With Partially Observable Rewards | Code | 0
Selected Languages are All You Need for Cross-lingual Truthfulness Transfer | Code | 0
How far are today's time-series models from real-world weather forecasting applications? | Code | 2
The Elusive Pursuit of Reproducing PATE-GAN: Benchmarking, Auditing, Debugging | Code | 0
Benchmarking Monocular 3D Dog Pose Estimation Using In-The-Wild Motion Capture Data | - | 0
African or European Swallow? Benchmarking Large Vision-Language Models for Fine-Grained Object Classification | Code | 1
HoTPP Benchmark: Are We Good at the Long Horizon Events Forecasting? | Code | 2
Resource-efficient Medical Image Analysis with Self-adapting Forward-Forward Networks | - | 0
DASB -- Discrete Audio and Speech Benchmark | - | 0
A Benchmarking Study of Kolmogorov-Arnold Networks on Tabular Data | Code | 1
FairX: A comprehensive benchmarking tool for model analysis using fairness, utility, and explainability | Code | 0
PoseBench: Benchmarking the Robustness of Pose Estimation Models under Corruptions | - | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | - | Unverified