SOTAVerified

Benchmarking

Papers

Showing 351-400 of 5548 papers

Title | Status | Hype
POPGym: Benchmarking Partially Observable Reinforcement Learning | Code | 2
Fortuna: A Library for Uncertainty Quantification in Deep Learning | Code | 2
Desbordante: from benchmarking suite to high-performance science-intensive data profiler (preprint) | Code | 2
Benchmarking the Robustness of LiDAR Semantic Segmentation Models | Code | 2
Ultra-High-Definition Low-Light Image Enhancement: A Benchmark and Transformer-Based Method | Code | 2
PyPop7: A Pure-Python Library for Population-Based Black-Box Optimization | Code | 2
Why do tree-based models still outperform deep learning on typical tabular data? | Code | 2
Immersive Neural Graphics Primitives | Code | 2
LaMAR: Benchmarking Localization and Mapping for Augmented Reality | Code | 2
rPPG-Toolbox: Deep Remote PPG Toolbox | Code | 2
Building Normalizing Flows with Stochastic Interpolants | Code | 2
State-specific protein-ligand complex structure prediction with a multi-scale deep generative model | Code | 2
MultiPL-E: A Scalable and Extensible Approach to Benchmarking Neural Code Generation | Code | 2
Panoptic Scene Graph Generation | Code | 2
Why do tree-based models still outperform deep learning on tabular data? | Code | 2
VMAS: A Vectorized Multi-Agent Simulator for Collective Robot Learning | Code | 2
Understanding Performance of Long-Document Ranking Models through Comprehensive Evaluation and Leaderboarding | Code | 2
The ArtBench Dataset: Benchmarking Generative Models with Artworks | Code | 2
DaisyRec 2.0: Benchmarking Recommendation for Rigorous Evaluation | Code | 2
Challenges and Opportunities in Offline Reinforcement Learning from Visual Observations | Code | 2
Fast Vision Transformers with HiLo Attention | Code | 2
BARS: Towards Open Benchmarking for Recommender Systems | Code | 2
K-LITE: Learning Transferable Visual Models with External Knowledge | Code | 2
Deep Visual Geo-localization Benchmark | Code | 2
Multi-Class Road User Detection With 3+1D Radar in the View-of-Delft Dataset | Code | 2
ADATIME: A Benchmarking Suite for Domain Adaptation on Time Series Data | Code | 2
Benchmarking Robustness of 3D Point Cloud Recognition Against Common Corruptions | Code | 2
AiTLAS: Artificial Intelligence Toolbox for Earth Observation | Code | 2
Investigating Tradeoffs in Real-World Video Super-Resolution | Code | 2
Multitask Prompted Training Enables Zero-Shot Task Generalization | Code | 2
MetaDrive: Composing Diverse Driving Scenarios for Generalizable Reinforcement Learning | Code | 2
Panoptic nuScenes: A Large-Scale Benchmark for LiDAR Panoptic Segmentation and Tracking | Code | 2
BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models | Code | 2
Learning to Fly -- a Gym Environment with PyBullet Physics for Reinforcement Learning of Multi-agent Quadcopter Control | Code | 2
Learning Transferable Visual Models From Natural Language Supervision | Code | 2
Evaluating Large-Vocabulary Object Detectors: The Devil is in the Details | Code | 2
PyHealth: A Python Library for Health Predictive Models | Code | 2
TadGAN: Time Series Anomaly Detection Using Generative Adversarial Networks | Code | 2
Searching for a Search Method: Benchmarking Search Algorithms for Generating NLP Adversarial Examples | Code | 2
Bringing Light Into the Dark: A Large-scale Evaluation of Knowledge Graph Embedding Models Under a Unified Framework | Code | 2
Benchmarking Graph Neural Networks | Code | 2
Benchmarking Zero-shot Text Classification: Datasets, Evaluation and Entailment Approach | Code | 2
Habitat: A Platform for Embodied AI Research | Code | 2
Benchmarking Neural Network Robustness to Common Corruptions and Perturbations | Code | 2
A large annotated medical image dataset for the development and evaluation of segmentation algorithms | Code | 2
Benchmarking Deep Reinforcement Learning for Continuous Control | Code | 2
LLMThinkBench: Towards Basic Math Reasoning and Overthinking in Large Language Models | Code | 1
Latent Thermodynamic Flows: Unified Representation Learning and Generative Modeling of Temperature-Dependent Behaviors from Limited Data | Code | 1
CovDocker: Benchmarking Covalent Drug Design with Tasks, Datasets, and Solutions | Code | 1
WattsOnAI: Measuring, Analyzing, and Visualizing Energy and Carbon Footprint of AI Workloads | Code | 1
Page 8 of 111

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified