SOTAVerified

Benchmarking

Papers

Showing 501–525 of 5548 papers

Title | Status | Hype
Circumventing shortcuts in audio-visual deepfake detection datasets with unsupervised learning | Code | 1
Descending through a Crowded Valley - Benchmarking Deep Learning Optimizers | Code | 1
RADAR: Benchmarking Language Models on Imperfect Tabular Data | Code | 1
DexArt: Benchmarking Generalizable Dexterous Manipulation with Articulated Objects | Code | 1
CIDEr: Consensus-based Image Description Evaluation | Code | 1
Application-Oriented Benchmarking of Quantum Generative Learning Using QUARK | Code | 1
CIBench: Evaluating Your LLMs with a Code Interpreter Plugin | Code | 1
CIPCaD-Bench: Continuous Industrial Process datasets for benchmarking Causal Discovery methods | Code | 1
CheX-GPT: Harnessing Large Language Models for Enhanced Chest X-ray Report Labeling | Code | 1
DIG In: Evaluating Disparities in Image Generations with Indicators for Geographic Diversity | Code | 1
CheXphoto: 10,000+ Photos and Transformations of Chest X-rays for Benchmarking Deep Learning Robustness | Code | 1
An Evaluation Dataset for Intent Classification and Out-of-Scope Prediction | Code | 1
CHILI: Chemically-Informed Large-scale Inorganic Nanomaterials Dataset for Advancing Graph Machine Learning | Code | 1
ClearPose: Large-scale Transparent Object Dataset and Benchmark | Code | 1
CodeReef: an open platform for portable MLOps, reusable automation actions and reproducible benchmarking | Code | 1
Towards Motion Forecasting with Real-World Perception Inputs: Are End-to-End Approaches Competitive? | Code | 1
Chaos as an interpretable benchmark for forecasting and data-driven modelling | Code | 1
An Empirical Study on Google Research Football Multi-agent Scenarios | Code | 1
Addressing the generalization of 3D registration methods with a featureless baseline and an unbiased benchmark | Code | 1
CCTV-Gun: Benchmarking Handgun Detection in CCTV Images | Code | 1
CharacterBench: Benchmarking Character Customization of Large Language Models | Code | 1
Addressing Shortcomings in Fair Graph Learning Datasets: Towards a New Benchmark | Code | 1
An Empirical Study of GPT-4o Image Generation Capabilities | Code | 1
CAVIAR: Co-simulation of 6G Communications, 3D Scenarios and AI for Digital Twins | Code | 1
Causality for Tabular Data Synthesis: A High-Order Structure Causal Benchmark Framework | Code | 1
Page 21 of 222

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified