SOTAVerified

Benchmarking

Papers

Showing 2926–2950 of 5548 papers

Title | Status | Hype
DNR Bench: Benchmarking Over-Reasoning in Reasoning LLMs | - | 0
A Sober Look at the Robustness of CLIPs to Spurious Features | - | 0
Does AI for science need another ImageNet Or totally different benchmarks? A case study of machine learning force fields | - | 0
Does imputation matter? Benchmark for predictive models | - | 0
Domain Adaptation for Arabic Machine Translation: The Case of Financial Texts | - | 0
Domain Aligned CLIP for Few-shot Classification | - | 0
Domain Generalization in Computational Pathology: Survey and Guidelines | - | 0
Don't stack layers in graph neural networks, wire them randomly | - | 0
Downsampling and geometric feature methods for EEG classification tasks with CNNs | - | 0
On the Convergence of Differentially Private Federated Learning on Non-Lipschitz Objectives, and with Normalized Client Updates | - | 0
DPO: A Differential and Pointwise Control Approach to Reinforcement Learning | - | 0
DRAC: Diabetic Retinopathy Analysis Challenge with Ultra-Wide Optical Coherence Tomography Angiography Images | - | 0
Drift in a Popular Metal Oxide Sensor Dataset Reveals Limitations for Gas Classification Benchmarks | - | 0
DRIV100: In-The-Wild Multi-Domain Dataset and Evaluation for Real-World Domain Adaptation of Semantic Segmentation | - | 0
DSLOB: A Synthetic Limit Order Book Dataset for Benchmarking Forecasting Algorithms under Distributional Shift | - | 0
Dual Encoder-Decoder based Generative Adversarial Networks for Disentangled Facial Representation Learning | - | 0
Dual Task Framework for Improving Persona-grounded Dialogue Dataset | - | 0
DyFEn: Agent-Based Fee Setting in Payment Channel Networks | - | 0
Dyna-bAbI: unlocking bAbI's potential with dynamic synthetic benchmarking | - | 0
Dyna-bAbI: unlocking bAbI’s potential with dynamic synthetic benchmarking | - | 0
Dynabench: Rethinking Benchmarking in NLP | - | 0
Dynaboard: An Evaluation-As-A-Service Platform for Holistic Next-Generation Benchmarking | - | 0
Dynamic benchmarking framework for LLM-based conversational data capture | - | 0
Dynamic Benchmarking of Masked Language Models on Temporal Concept Drift with Multiple Views | - | 0
Dynamic Benchmarking of Reasoning Capabilities in Code Large Language Models Under Data Contamination | - | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | - | Unverified