SOTAVerified

Benchmarking

Papers

Showing 681–690 of 5548 papers

Title | Status | Hype
Why Stop at One Error? Benchmarking LLMs as Data Science Code Debuggers for Multi-Hop and Multi-Bug Errors | Code | 0
Assessing Foundation Models for Sea Ice Type Segmentation in Sentinel-1 SAR Imagery | | 0
LIM: Large Interpolator Model for Dynamic Reconstruction | | 0
Benchmarking Deep Learning-Based Methods for Irradiance Nowcasting with Sky Images | | 0
CLAIMCHECK: How Grounded are LLM Critiques of Scientific Papers? | Code | 0
Evaluating Text-to-Image Synthesis with a Conditional Fréchet Distance | | 0
FaceBench: A Multi-View Multi-Level Facial Attribute VQA Dataset for Benchmarking Face Perception MLLMs | Code | 1
A Comprehensive Benchmark for RNA 3D Structure-Function Modeling | Code | 1
GateLens: A Reasoning-Enhanced LLM Agent for Automotive Software Release Analytics | | 0
ResearchBench: Benchmarking LLMs in Scientific Discovery via Inspiration-Based Task Decomposition | | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified