SOTAVerified

Benchmarking

Papers

Showing 4501-4550 of 5548 papers

Title | Status | Hype
Beyond MD17: the reactive xxMD dataset | Code | 0
The biglasso Package: A Memory- and Computation-Efficient Solver for Lasso Model Fitting with Big Data in R | Code | 0
Learning to Transfer for Traffic Forecasting via Multi-task Learning | Code | 0
IOLBENCH: Benchmarking LLMs on Linguistic Reasoning | Code | 0
InViG: Benchmarking Interactive Visual Grounding with 500K Human-Robot Interactions | Code | 0
Investigating the Impact of Hard Samples on Accuracy Reveals In-class Data Imbalance | Code | 0
BEARD: Benchmarking the Adversarial Robustness for Dataset Distillation | Code | 0
RerrFact: Reduced Evidence Retrieval Representations for Scientific Claim Verification | Code | 0
Inverse Contextual Bandits: Learning How Behavior Evolves over Time | Code | 0
UCFE: A User-Centric Financial Expertise Benchmark for Large Language Models | Code | 0
Introducing SLAMBench, a performance and accuracy benchmarking methodology for SLAM | Code | 0
INTERSPEECH 2009 Emotion Challenge Revisited: Benchmarking 15 Years of Progress in Speech Emotion Recognition | Code | 0
Integration of nested cross-validation, automated hyperparameter optimization, high-performance computing to reduce and quantify the variance of test performance estimation of deep learning models | Code | 0
BdSLW60: A Word-Level Bangla Sign Language Dataset | Code | 0
The Butterfly Effect of Model Editing: Few Edits Can Trigger Large Language Models Collapse | Code | 0
Integrating Expert Knowledge into Logical Programs via LLMs | Code | 0
The CaLiGraph Ontology as a Challenge for OWL Reasoners | Code | 0
The Catechol Benchmark: Time-series Solvent Selection Data for Few-shot Machine Learning | Code | 0
Strong and Simple Baselines for Multimodal Utterance Embeddings | Code | 0
InstaIndoor and Multi-modal Deep Learning for Indoor Scene Recognition | Code | 0
The Collective Knowledge project: making ML models more portable and reproducible with open APIs, reusable best practices and MLOps | Code | 0
a-DCF: an architecture agnostic metric with application to spoofing-robust speaker verification | Code | 0
Resource Interoperability for Sustainable Benchmarking: The Case of Events | Code | 0
Bayesian Neural Networks with Soft Evidence | Code | 0
BASED: Benchmarking, Analysis, and Structural Estimation of Deblurring | Code | 0
Bugs in the Data: How ImageNet Misrepresents Biodiversity | Code | 0
inMOTIFin: a lightweight end-to-end simulation software for regulatory sequences | Code | 0
LexSumm and LexT5: Benchmarking and Modeling Legal Summarization Tasks in English | Code | 0
InDL: A New Dataset and Benchmark for In-Diagram Logic Interpretation based on Visual Illusion | Code | 0
Individual Fairness Guarantees for Neural Networks | Code | 0
IndiBias: A Benchmark Dataset to Measure Social Biases in Language Models for Indian Context | Code | 0
LibOPT: An Open-Source Platform for Fast Prototyping Soft Optimization Techniques | Code | 0
BubGAN: Bubble Generative Adversarial Networks for Synthesizing Realistic Bubbly Flow Images | Code | 0
bsnsing: A decision tree induction method based on recursive optimal boolean rule composition | Code | 0
Rethinking Empirical Evaluation of Adversarial Robustness Using First-Order Attack Methods | Code | 0
Improving the Perturbation-Based Explanation of Deepfake Detectors Through the Use of Adversarially-Generated Samples | Code | 0
BSBench: will your LLM find the largest prime number? | Code | 0
Light Field Saliency Detection with Deep Convolutional Networks | Code | 0
Improving Pretrained Models for Zero-shot Multi-label Text Classification through Reinforced Label Hierarchy Reasoning | Code | 0
Bridging the Generalisation Gap: Synthetic Data Generation for Multi-Site Clinical Model Validation | Code | 0
An Analyst-Inspector Framework for Evaluating Reproducibility of LLMs in Data Science | Code | 0
Rethinking the Effectiveness of Graph Classification Datasets in Benchmarks for Assessing GNNs | Code | 0
On-orbit model training for satellite imagery with label proportions | Code | 0
LimeSoDa: A Dataset Collection for Benchmarking of Machine Learning Regressors in Digital Soil Mapping | Code | 0
Improving Generalization of Neural Vehicle Routing Problem Solvers Through the Lens of Model Architecture | Code | 0
Rethinking the Reference-based Distinctive Image Captioning | Code | 0
Linear energy storage and flexibility model with ramp rate, ramping, deadline and capacity constraints | Code | 0
BRI3L: A Brightness Illusion Image Dataset for Identification and Localization of Regions of Illusory Perception | Code | 0
BoxingGym: Benchmarking Progress in Automated Experimental Design and Model Discovery | Code | 0
BONES: a Benchmark fOr Neural Estimation of Shapley values | Code | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified