SOTAVerified

Benchmarking

Papers

Showing 1751-1800 of 5548 papers

Title | Status | Hype
Knowledge-Driven Slot Constraints for Goal-Oriented Dialogue Systems | Code | 0
Air Learning: A Deep Reinforcement Learning Gym for Autonomous Aerial Robot Visual Navigation | Code | 0
Can a single neuron learn predictive uncertainty? | Code | 0
JATE 2.0: Java Automatic Term Extraction with Apache Solr | Code | 0
Can AI Validate Science? Benchmarking LLMs for Accurate Scientific Claim Evidence Reasoning | Code | 0
JALMBench: Benchmarking Jailbreak Vulnerabilities in Audio Language Models | Code | 0
Is Your Model Fairly Certain? Uncertainty-Aware Fairness Evaluation for LLMs | Code | 0
COCO: Performance Assessment | Code | 0
DyKnow: Dynamically Verifying Time-Sensitive Factual Knowledge in LLMs | Code | 0
JavaBench: A Benchmark of Object-Oriented Code Generation for Evaluating Large Language Models | Code | 0
Analyzing the Feature Extractor Networks for Face Image Synthesis | Code | 0
Mamba-Based Ensemble learning for White Blood Cell Classification | Code | 0
Benchmarking ChatGPT-4 on ACR Radiation Oncology In-Training (TXIT) Exam and Red Journal Gray Zone Cases: Potentials and Challenges for AI-Assisted Medical Education and Decision Making in Radiation Oncology | Code | 0
JExplore: Design Space Exploration Tool for Nvidia Jetson Boards | Code | 0
Benchmarking Multi-dimensional AIGC Video Quality Assessment: A Dataset and Unified Model | Code | 0
ISImed: A Framework for Self-Supervised Learning using Intrinsic Spatial Information in Medical Images | Code | 0
Calibrating Pre-trained Language Classifiers on LLM-generated Noisy Labels via Iterative Refinement | Code | 0
STEP: A Unified Spiking Transformer Evaluation Platform for Fair and Reproducible Benchmarking | Code | 0
IoT Data Trust Evaluation via Machine Learning | Code | 0
Calibrated Adaptive Probabilistic ODE Solvers | Code | 0
IOLBENCH: Benchmarking LLMs on Linguistic Reasoning | Code | 0
IPC: A Benchmark Data Set for Learning with Graph-Structured Data | Code | 0
Knowledge Enhanced Conditional Imputation for Healthcare Time-series | Code | 0
Investigating the Impact of Hard Samples on Accuracy Reveals In-class Data Imbalance | Code | 0
Benchmarking Adversarial Robustness to Bias Elicitation in Large Language Models: Scalable Automated Assessment with LLM-as-a-Judge | Code | 0
Cable Tree Wiring -- Benchmarking Solvers on a Real-World Scheduling Problem with a Variety of Precedence Constraints | Code | 0
Inverse Contextual Bandits: Learning How Behavior Evolves over Time | Code | 0
Introducing SLAMBench, a performance and accuracy benchmarking methodology for SLAM | Code | 0
InViG: Benchmarking Interactive Visual Grounding with 500K Human-Robot Interactions | Code | 0
B-XAIC Dataset: Benchmarking Explainable AI for Graph Neural Networks Using Chemical Data | Code | 0
INTERSPEECH 2009 Emotion Challenge Revisited: Benchmarking 15 Years of Progress in Speech Emotion Recognition | Code | 0
Multitask learning and benchmarking with clinical time series data | Code | 0
Building Conformal Prediction Intervals with Approximate Message Passing | Code | 0
Building and benchmarking an Arabic Speech Commands dataset for small-footprint keyword spotting | Code | 0
Adaptive Visual Scene Understanding: Incremental Scene Graph Generation | Code | 0
Integrating Expert Knowledge into Logical Programs via LLMs | Code | 0
Building a Large Scale Dataset for Image Emotion Recognition: The Fine Print and The Benchmark | Code | 0
ColorGrid: A Multi-Agent Non-Stationary Environment for Goal Inference and Assistance | Code | 0
Integration of nested cross-validation, automated hyperparameter optimization, high-performance computing to reduce and quantify the variance of test performance estimation of deep learning models | Code | 0
Bugs in the Data: How ImageNet Misrepresents Biodiversity | Code | 0
CleanPatrick: A Benchmark for Image Data Cleaning | Code | 0
BubGAN: Bubble Generative Adversarial Networks for Synthesizing Realistic Bubbly Flow Images | Code | 0
InstaIndoor and Multi-modal Deep Learning for Indoor Scene Recognition | Code | 0
bsnsing: A decision tree induction method based on recursive optimal boolean rule composition | Code | 0
BSBench: will your LLM find the largest prime number? | Code | 0
Adaptive Shrinkage Estimation For Personalized Deep Kernel Regression In Modeling Brain Trajectories | Code | 0
inMOTIFin: a lightweight end-to-end simulation software for regulatory sequences | Code | 0
Towards Learning Universal, Regional, and Local Hydrological Behaviors via Machine-Learning Applied to Large-Sample Datasets | Code | 0
Bridging the Generalisation Gap: Synthetic Data Generation for Multi-Site Clinical Model Validation | Code | 0
Adaptive Power System Emergency Control using Deep Reinforcement Learning | Code | 0
Page 36 of 111

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified