SOTAVerified

Benchmarking

Papers

Showing 2561–2570 of 5548 papers

| Title | Status | Hype |
|---|---|---|
| Machine Translation Meta Evaluation through Translation Accuracy Challenge Sets | Code | 1 |
| Muffin or Chihuahua? Challenging Multimodal Large Language Models with Multipanel VQA | | 0 |
| PPM: Automated Generation of Diverse Programming Problems for Benchmarking Code Generation Models | Code | 0 |
| SAM-based instance segmentation models for the automation of structural damage detection | | 0 |
| Benchmarking with MIMIC-IV, an irregular, spare clinical time series dataset | | 0 |
| MultiHop-RAG: Benchmarking Retrieval-Augmented Generation for Multi-Hop Queries | Code | 3 |
| Biological Valuation Map of Flanders: A Sentinel-2 Imagery Analysis | | 0 |
| Benchmarking Large Language Models in Complex Question Answering Attribution using Knowledge Graphs | | 0 |
| Automated legal reasoning with discretion to act using s(LAW) | | 0 |
| TriSAM: Tri-Plane SAM for zero-shot cortical blood vessel segmentation in VEM images | | 0 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GPT-4 Turbo | ACC | 0.56 | | Unverified |