SOTAVerified

Benchmarking

Papers

Showing 1–50 of 5548 papers

| Title | Status | Hype |
|---|---|---|
| WebWalker: Benchmarking LLMs in Web Traversal | Code | 11 |
| StableToolBench: Towards Stable Large-Scale Benchmarking on Tool Learning of Large Language Models | Code | 9 |
| EvoRL: A GPU-accelerated Framework for Evolutionary Reinforcement Learning | Code | 7 |
| CALE: Continuous Arcade Learning Environment | Code | 7 |
| Segment Anything in Medical Images and Videos: Benchmark and Deployment | Code | 7 |
| ECCO: Can We Improve Model-Generated Code Efficiency Without Sacrificing Functional Correctness? | Code | 7 |
| NAVSIM: Data-Driven Non-Reactive Autonomous Vehicle Simulation and Benchmarking | Code | 7 |
| OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments | Code | 7 |
| Better than classical? The subtle art of benchmarking quantum machine learning models | Code | 7 |
| SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models | Code | 7 |
| DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference | Code | 7 |
| TaskBench: Benchmarking Large Language Models for Task Automation | Code | 6 |
| AssetOpsBench: Benchmarking AI Agents for Task Automation in Industrial Asset Operations and Maintenance | Code | 5 |
| Benchmarking the Myopic Trap: Positional Bias in Information Retrieval | Code | 5 |
| SMPLest-X: Ultimate Scaling for Expressive Human Pose and Shape Estimation | Code | 5 |
| OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations | Code | 5 |
| The BrowserGym Ecosystem for Web Agent Research | Code | 5 |
| VBench++: Comprehensive and Versatile Benchmark Suite for Video Generative Models | Code | 5 |
| TFB: Towards Comprehensive and Fair Benchmarking of Time Series Forecasting Methods | Code | 5 |
| Segment Anything Model for Medical Image Segmentation: Current Applications and Future Directions | Code | 5 |
| CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Benchmarking on HumanEval-X | Code | 5 |
| OpenUnlearning: Accelerating LLM Unlearning via Unified Benchmarking of Methods and Metrics | Code | 4 |
| TerraTorch: The Geospatial Foundation Models Toolkit | Code | 4 |
| Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models | Code | 4 |
| Recent Advances in Large Langauge Model Benchmarks against Data Contamination: From Static to Dynamic Evaluation | Code | 4 |
| Building reliable sim driving agents by scaling self-play | Code | 4 |
| A deep learning framework for efficient pathology image analysis | Code | 4 |
| Accelerating Data Processing and Benchmarking of AI Models for Pathology | Code | 4 |
| Meta Audiobox Aesthetics: Unified Automatic Quality Assessment for Speech, Music, and Sound | Code | 4 |
| Rankify: A Comprehensive Python Toolkit for Retrieval, Re-Ranking, and Retrieval-Augmented Generation | Code | 4 |
| Molecular-driven Foundation Model for Oncologic Pathology | Code | 4 |
| Vchitect-2.0: Parallel Transformer for Scaling Up Video Diffusion Models | Code | 4 |
| OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning | Code | 4 |
| Dora: Sampling and Benchmarking for 3D Shape Variational Auto-Encoders | Code | 4 |
| TableGPT2: A Large Multimodal Model with Tabular Data Integration | Code | 4 |
| MLPerf Power: Benchmarking the Energy Efficiency of Machine Learning Systems from Microwatts to Megawatts for Sustainable AI | Code | 4 |
| shapiq: Shapley Interactions for Machine Learning | Code | 4 |
| Enabling more efficient and cost-effective AI/ML systems with Collective Mind, virtualized MLOps, MLPerf, Collective Knowledge Playground and reproducible optimization tournaments | Code | 4 |
| BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions | Code | 4 |
| Bench2Drive: Towards Multi-Ability Benchmarking of Closed-Loop End-To-End Autonomous Driving | Code | 4 |
| AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents | Code | 4 |
| Aequitas Flow: Streamlining Fair ML Experimentation | Code | 4 |
| LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit | Code | 4 |
| Benchmarking Retrieval-Augmented Generation for Medicine | Code | 4 |
| I Think, Therefore I am: Benchmarking Awareness of Large Language Models Using AwareBench | Code | 4 |
| Pearl: A Production-ready Reinforcement Learning Agent | Code | 4 |
| Benchmarking Neural Network Training Algorithms | Code | 4 |
| OpenAGI: When LLM Meets Domain Experts | Code | 4 |
| Vision-Language Models for Vision Tasks: A Survey | Code | 4 |
| MTEB: Massive Text Embedding Benchmark | Code | 4 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GPT-4 Turbo | ACC | 0.56 | | Unverified |