SOTAVerified

Benchmarking

Papers

Showing 1751–1775 of 5548 papers

Title | Status | Hype
JavaBench: A Benchmark of Object-Oriented Code Generation for Evaluating Large Language Models | Code | 0
Lost in Benchmarks? Rethinking Large Language Model Benchmarking with Item Response Theory | Code | 0
Analyzing the Feature Extractor Networks for Face Image Synthesis | Code | 0
IoT Data Trust Evaluation via Machine Learning | Code | 0
Introducing SLAMBench, a performance and accuracy benchmarking methodology for SLAM | Code | 0
Benchmarking Multi-dimensional AIGC Video Quality Assessment: A Dataset and Unified Model | Code | 0
Inverse Contextual Bandits: Learning How Behavior Evolves over Time | Code | 0
Calibrating Pre-trained Language Classifiers on LLM-generated Noisy Labels via Iterative Refinement | Code | 0
STEP: A Unified Spiking Transformer Evaluation Platform for Fair and Reproducible Benchmarking | Code | 0
Calibrated Adaptive Probabilistic ODE Solvers | Code | 0
INTERSPEECH 2009 Emotion Challenge Revisited: Benchmarking 15 Years of Progress in Speech Emotion Recognition | Code | 0
Investigating the Impact of Hard Samples on Accuracy Reveals In-class Data Imbalance | Code | 0
IPC: A Benchmark Data Set for Learning with Graph-Structured Data | Code | 0
Machine Learning Cryptanalysis of a Quantum Random Number Generator | Code | 0
Integrating Expert Knowledge into Logical Programs via LLMs | Code | 0
Benchmarking Adversarial Robustness to Bias Elicitation in Large Language Models: Scalable Automated Assessment with LLM-as-a-Judge | Code | 0
Cable Tree Wiring -- Benchmarking Solvers on a Real-World Scheduling Problem with a Variety of Precedence Constraints | Code | 0
Integration of nested cross-validation, automated hyperparameter optimization, high-performance computing to reduce and quantify the variance of test performance estimation of deep learning models | Code | 0
InstaIndoor and Multi-modal Deep Learning for Indoor Scene Recognition | Code | 0
B-XAIC Dataset: Benchmarking Explainable AI for Graph Neural Networks Using Chemical Data | Code | 0
inMOTIFin: a lightweight end-to-end simulation software for regulatory sequences | Code | 0
Multitask learning and benchmarking with clinical time series data | Code | 0
Building Conformal Prediction Intervals with Approximate Message Passing | Code | 0
CodeS: Towards Code Model Generalization Under Distribution Shift | Code | 0
Building and benchmarking an Arabic Speech Commands dataset for small-footprint keyword spotting | Code | 0
Page 71 of 222

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | - | Unverified