SOTAVerified

Benchmarking

Papers

Showing 1351-1400 of 5548 papers

Title | Status | Hype
Beyond neural scaling laws: beating power law scaling via data pruning | Code | 1
ClinicRealm: Re-evaluating Large Language Models with Conventional Machine Learning for Non-Generative Clinical Prediction Tasks | Code | 1
IOHprofiler: A Benchmarking and Profiling Tool for Iterative Optimization Heuristics | Code | 1
Benchmarking Test-Time Adaptation against Distribution Shifts in Image Classification | Code | 1
A framework for benchmarking clustering algorithms | Code | 1
Benchmarking the Abilities of Large Language Models for RDF Knowledge Graph Creation and Comprehension: How Well Do LLMs Speak Turtle? | Code | 1
ISLES 2022: A multi-center magnetic resonance imaging stroke lesion segmentation dataset | Code | 1
Open Radar Initiative: Large Scale Dataset for Benchmarking of micro-Doppler Recognition Algorithms | Code | 1
Arctique: An artificial histopathological dataset unifying realism and controllability for uncertainty quantification | Code | 1
DNN+NeuroSim V2.0: An End-to-End Benchmarking Framework for Compute-in-Memory Accelerators for On-chip Training | Code | 1
A User-Centric Multi-Intent Benchmark for Evaluating Large Language Models | Code | 1
Benchmarking the Combinatorial Generalizability of Complex Query Answering on Knowledge Graphs | Code | 1
BlenderGym: Benchmarking Foundational Model Systems for Graphics Editing | Code | 1
OPF-Learn: An Open-Source Framework for Creating Representative AC Optimal Power Flow Datasets | Code | 1
Does your model understand genes? A benchmark of gene properties for biological and text models | Code | 1
OPV2V: An Open Benchmark Dataset and Fusion Pipeline for Perception with Vehicle-to-Vehicle Communication | Code | 1
Beyond Correctness: Benchmarking Multi-dimensional Code Generation for Large Language Models | Code | 1
A framework for benchmarking class-out-of-distribution detection and its application to ImageNet | Code | 1
Don’t be Contradicted with Anything! CI-ToD: Towards Benchmarking Consistency for Task-oriented Dialogue System | Code | 1
IOHexperimenter: Benchmarking Platform for Iterative Optimization Heuristics | Code | 1
Is LLM-as-a-Judge Robust? Investigating Universal Adversarial Attacks on Zero-shot LLM Assessment | Code | 1
JoinGym: An Efficient Query Optimization Environment for Reinforcement Learning | Code | 1
Kvasir-Instrument: Diagnostic and therapeutic tool segmentation dataset in gastrointestinal endoscopy | Code | 1
DomainLab: A modular Python package for domain generalization in deep learning | Code | 1
Best practices for constructing, preparing, and evaluating protein-ligand binding affinity benchmarks | Code | 1
Introducing Milabench: Benchmarking Accelerators for AI | Code | 1
Benchpress: A Scalable and Versatile Workflow for Benchmarking Structure Learning Algorithms | Code | 1
BEND: Benchmarking DNA Language Models on biologically meaningful tasks | Code | 1
Introducing the VoicePrivacy Initiative | Code | 1
BenchML: an extensible pipelining framework for benchmarking representations of materials and molecules at scale | Code | 1
Benchmarking the Robustness of Deep Neural Networks to Common Corruptions in Digital Pathology | Code | 1
Benchmarking Implicit Neural Representation and Geometric Rendering in Real-Time RGB-D SLAM | Code | 1
Benchmark on Drug Target Interaction Modeling from a Structure Perspective | Code | 1
Benchmarks for Deep Off-Policy Evaluation | Code | 1
Intrinsic Image Harmonization | Code | 1
Exploiting News Article Structure for Automatic Corpus Generation of Entailment Datasets | Code | 1
Align and Distill: Unifying and Improving Domain Adaptive Object Detection | Code | 1
Event-Free Moving Object Segmentation from Moving Ego Vehicle | Code | 1
Ducho 2.0: Towards a More Up-to-Date Unified Framework for the Extraction of Multimodal Features in Recommendation | Code | 1
Benchmarking the Robustness of Spatial-Temporal Models Against Corruptions | Code | 1
Benchmarking Image Retrieval for Visual Localization | Code | 1
ArabicaQA: A Comprehensive Dataset for Arabic Question Answering | Code | 1
Benchmarking human visual search computational models in natural scenes: models comparison and reference datasets | Code | 1
Interpretable statistical representations of neural population dynamics and geometry | Code | 1
InstructTTSEval: Benchmarking Complex Natural-Language Instruction Following in Text-to-Speech Systems | Code | 1
Dynatask: A Framework for Creating Dynamic AI Benchmark Tasks | Code | 1
Physiology-based simulation of the retinal vasculature enables annotation-free segmentation of OCT angiographs | Code | 1
PIC4rl-gym: a ROS2 modular framework for Robots Autonomous Navigation with Deep Reinforcement Learning | Code | 1
Aquatic Navigation: A Challenging Benchmark for Deep Reinforcement Learning | Code | 1
IntelliGraphs: Datasets for Benchmarking Knowledge Graph Generation | Code | 1
Page 28 of 111

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified