SOTAVerified

Benchmarking

Papers

Showing 76-100 of 5548 papers

Title | Status | Hype
Language Model Council: Democratically Benchmarking Foundation Models on Highly Subjective Tasks | Code | 3
ITBench: Evaluating AI Agents across Diverse Real-World IT Automation Tasks | Code | 3
LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning | Code | 3
Highly Accurate Quantum Chemical Property Prediction with Uni-Mol+ | Code | 3
Benchmarking Automatic Machine Learning Frameworks | Code | 3
HumanVid: Demystifying Training Data for Camera-controllable Human Image Animation | Code | 3
General Geospatial Inference with a Population Dynamics Foundation Model | Code | 3
HEST-1k: A Dataset for Spatial Transcriptomics and Histology Image Analysis | Code | 3
Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ NLP Tasks | Code | 3
IFEval-Audio: Benchmarking Instruction-Following Capability in Audio-based Large Language Models | Code | 3
Advancing LLM Reasoning Generalists with Preference Trees | Code | 3
mlpack 3: a fast, flexible machine learning library | Code | 3
BatteryLife: A Comprehensive Dataset and Benchmark for Battery Life Prediction | Code | 3
AbdomenAtlas: A Large-Scale, Detailed-Annotated, & Multi-Center Dataset for Efficient Transfer Learning and Open Algorithmic Benchmarking | Code | 3
LocoMuJoCo: A Comprehensive Imitation Learning Benchmark for Locomotion | Code | 3
Benchmarking Large Language Models on CFLUE -- A Chinese Financial Language Understanding Evaluation Dataset | Code | 3
Benchmarking LLMs via Uncertainty Quantification | Code | 3
Embodied Agent Interface: Benchmarking LLMs for Embodied Decision Making | Code | 3
Benchmarking Multimodal Retrieval Augmented Generation with Dynamic VQA Dataset and Self-adaptive Planning Agent | Code | 3
Exploring Progress in Multivariate Time Series Forecasting: Comprehensive Benchmarking and Heterogeneity Analysis | Code | 3
BERGEN: A Benchmarking Library for Retrieval-Augmented Generation | Code | 3
AndroidLab: Training and Systematic Benchmarking of Android Autonomous Agents | Code | 3
Automatic Intrinsic Reward Shaping for Exploration in Deep Reinforcement Learning | Code | 3
DISCOVERYWORLD: A Virtual Environment for Developing and Evaluating Automated Scientific Discovery Agents | Code | 3
A Unified Framework for Rank-based Evaluation Metrics for Link Prediction in Knowledge Graphs | Code | 3

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | - | Unverified