SOTAVerified

Benchmarking

Papers

Showing 376-400 of 5548 papers

Title | Status | Hype
CORAL: Benchmarking Multi-turn Conversational Retrieval-Augmentation Generation | Code | 2
CoqPilot, a plugin for LLM-based generation of proofs | Code | 2
PerturBench: Benchmarking Machine Learning Models for Cellular Perturbation Analysis | Code | 2
Customizable Perturbation Synthesis for Robust SLAM Benchmarking | Code | 2
Deep Visual Geo-localization Benchmark | Code | 2
PocketVina Enables Scalable and Highly Accurate Physically Valid Docking through Multi-Pocket Conditioning | Code | 2
PolygloToxicityPrompts: Multilingual Evaluation of Neural Toxic Degeneration in Large Language Models | Code | 2
Are large language models superhuman chemists? | Code | 2
CoIR: A Comprehensive Benchmark for Code Information Retrieval Models | Code | 2
COALA: A Practical and Vision-Centric Federated Learning Platform | Code | 2
ClimateLearn: Benchmarking Machine Learning for Weather and Climate Modeling | Code | 2
Commit0: Library Generation from Scratch | Code | 2
Class-incremental Learning for Time Series: Benchmark and Evaluation | Code | 2
Assessing SPARQL capabilities of Large Language Models | Code | 2
Reliable and Efficient Concept Erasure of Text-to-Image Diffusion Models | Code | 2
Challenges and Opportunities in Offline Reinforcement Learning from Visual Observations | Code | 2
COMPL-AI Framework: A Technical Interpretation and LLM Benchmarking Suite for the EU Artificial Intelligence Act | Code | 2
R-Judge: Benchmarking Safety Risk Awareness for LLM Agents | Code | 2
Benchmarking Robustness of 3D Point Cloud Recognition Against Common Corruptions | Code | 2
Desbordante: from benchmarking suite to high-performance science-intensive data profiler (preprint) | Code | 2
Quanda: An Interpretability Toolkit for Training Data Attribution Evaluation and Beyond | Code | 2
COSMOS: Catching Out-of-Context Misinformation with Self-Supervised Learning | Code | 1
Category-wise Fine-Tuning: Resisting Incorrect Pseudo-Labels in Multi-Label Image Classification with Partial Labels | Code | 1
RADAR: Benchmarking Language Models on Imperfect Tabular Data | Code | 1
Application-Oriented Benchmarking of Quantum Generative Learning Using QUARK | Code | 1

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified