SOTAVerified

Benchmarking

Papers

Showing 1661–1670 of 5548 papers

| Title | Status | Hype |
|---|---|---|
| PhySense: Principle-Based Physics Reasoning Benchmarking for Large Language Models | | 0 |
| GenSpace: Benchmarking Spatially-Aware Image Generation | | 0 |
| CaMMT: Benchmarking Culturally Aware Multimodal Machine Translation | | 0 |
| Progressive Class-level Distillation | | 0 |
| Breakpoint: Scalable evaluation of system-level reasoning in LLM code agents | | 0 |
| Beyond Atomic Geometry Representations in Materials Science: A Human-in-the-Loop Multimodal Framework | Code | 0 |
| Automated Structured Radiology Report Generation | | 0 |
| Geospatial Foundation Models to Enable Progress on Sustainable Development Goals | | 0 |
| Benchmarking Large Language Models for Cryptanalysis and Mismatched-Generalization | | 0 |
| SORCE: Small Object Retrieval in Complex Environments | Code | 0 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GPT-4 Turbo | ACC | 0.56 | | Unverified |