SOTAVerified

Benchmarking

Papers

Showing 1676-1700 of 5548 papers

Title | Status | Hype
RBoard: A Unified Platform for Reproducible and Reusable Recommender System Benchmarks | - | 0
NeIn: Telling What You Don't Want | - | 0
Benchmarking and Building Zero-Shot Hindi Retrieval Model with Hindi-BEIR and NLLB-E5 | - | 0
Assessing SPARQL capabilities of Large Language Models | Code | 2
DetoxBench: Benchmarking Large Language Models for Multitask Fraud & Abuse Detection | - | 0
CKnowEdit: A New Chinese Knowledge Editing Dataset for Linguistics, Facts, and Logic Error Correction in LLMs | - | 0
A Framework for Evaluating PM2.5 Forecasts from the Perspective of Individual Decision Making | Code | 0
Insights from Benchmarking Frontier Language Models on Web App Code Generation | Code | 1
Benchmarking Estimators for Natural Experiments: A Novel Dataset and a Doubly Robust Algorithm | - | 0
Absolute Ranking: An Essential Normalization for Benchmarking Optimization Algorithms | - | 0
PlantSeg: A Large-Scale In-the-wild Dataset for Plant Disease Segmentation | Code | 2
Quantum Kernel Methods under Scrutiny: A Benchmarking Study | - | 0
Question-Answering Dense Video Events | Code | 0
Shuffle Vision Transformer: Lightweight, Fast and Efficient Recognition of Driver Facial Expression | - | 0
Prediction Accuracy & Reliability: Classification and Object Localization under Distribution Shift | - | 0
LLM Detectors Still Fall Short of Real World: Case of LLM-Generated Short News-Like Posts | Code | 0
InfraLib: Enabling Reinforcement Learning and Decision-Making for Large-Scale Infrastructure Management | - | 0
RTLRewriter: Methodologies for Large Models aided RTL Code Optimization | Code | 1
PUB: Plot Understanding Benchmark and Dataset for Evaluating Large Language Models on Synthetic Visual Data Interpretation | - | 0
NUMOSIM: A Synthetic Mobility Dataset with Anomaly Detection Benchmarks | - | 0
Benchmarking Spurious Bias in Few-Shot Image Classifiers | Code | 0
Multi-Source Knowledge Pruning for Retrieval-Augmented Generation: A Benchmark and Empirical Study | Code | 0
LongGenBench: Benchmarking Long-Form Generation in Long Context LLMs | Code | 1
EgoPressure: A Dataset for Hand Pressure and Pose Estimation in Egocentric Vision | - | 0
Benchmarking Cognitive Domains for LLMs: Insights from Taiwanese Hakka Culture | - | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | - | Unverified