SOTAVerified

Benchmarking

Papers

Showing 3711–3720 of 5548 papers

| Title | Status | Hype |
|---|---|---|
| Towards Robust Evaluation: A Comprehensive Taxonomy of Datasets and Metrics for Open Domain Question Answering in the Era of Large Language Models | | 0 |
| Benchmarking FedAvg and FedCurv for Image Classification Tasks | | 0 |
| Benchmarking Critical Questions Generation: A Challenging Reasoning Task for Large Language Models | | 0 |
| Muffin or Chihuahua? Challenging Multimodal Large Language Models with Multipanel VQA | | 0 |
| Mukayese: Turkish NLP Strikes Back | | 0 |
| Benchmarking features from different radiomics toolkits / toolboxes using Image Biomarkers Standardization Initiative | | 0 |
| Benchmarking Feature Extractors for Reinforcement Learning-Based Semiconductor Defect Localization | | 0 |
| Benchmarking Expressive Japanese Character Text-to-Speech with VITS and Style-BERT-VITS2 | | 0 |
| Multicalibration for Confidence Scoring in LLMs | | 0 |
| Multi-Camera Action Dataset for Cross-Camera Action Recognition Benchmarking | | 0 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GPT-4 Turbo | ACC | 0.56 | | Unverified |