SOTAVerified

Benchmarking

Papers

Showing 1–25 of 5,548 papers

| Title | Status | Hype |
| --- | --- | --- |
| WebWalker: Benchmarking LLMs in Web Traversal | Code | 11 |
| StableToolBench: Towards Stable Large-Scale Benchmarking on Tool Learning of Large Language Models | Code | 9 |
| Better than classical? The subtle art of benchmarking quantum machine learning models | Code | 7 |
| EvoRL: A GPU-accelerated Framework for Evolutionary Reinforcement Learning | Code | 7 |
| CALE: Continuous Arcade Learning Environment | Code | 7 |
| DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference | Code | 7 |
| OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments | Code | 7 |
| SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models | Code | 7 |
| Segment Anything in Medical Images and Videos: Benchmark and Deployment | Code | 7 |
| NAVSIM: Data-Driven Non-Reactive Autonomous Vehicle Simulation and Benchmarking | Code | 7 |
| ECCO: Can We Improve Model-Generated Code Efficiency Without Sacrificing Functional Correctness? | Code | 7 |
| TaskBench: Benchmarking Large Language Models for Task Automation | Code | 6 |
| CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Benchmarking on HumanEval-X | Code | 5 |
| TFB: Towards Comprehensive and Fair Benchmarking of Time Series Forecasting Methods | Code | 5 |
| The BrowserGym Ecosystem for Web Agent Research | Code | 5 |
| Segment Anything Model for Medical Image Segmentation: Current Applications and Future Directions | Code | 5 |
| SMPLest-X: Ultimate Scaling for Expressive Human Pose and Shape Estimation | Code | 5 |
| AssetOpsBench: Benchmarking AI Agents for Task Automation in Industrial Asset Operations and Maintenance | Code | 5 |
| OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations | Code | 5 |
| VBench++: Comprehensive and Versatile Benchmark Suite for Video Generative Models | Code | 5 |
| Benchmarking the Myopic Trap: Positional Bias in Information Retrieval | Code | 5 |
| Enabling more efficient and cost-effective AI/ML systems with Collective Mind, virtualized MLOps, MLPerf, Collective Knowledge Playground and reproducible optimization tournaments | Code | 4 |
| I Think, Therefore I am: Benchmarking Awareness of Large Language Models Using AwareBench | Code | 4 |
| Dora: Sampling and Benchmarking for 3D Shape Variational Auto-Encoders | Code | 4 |
| LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit | Code | 4 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | GPT-4 Turbo | ACC | 0.56 | | Unverified |