SOTAVerified

Benchmarking

Papers

Showing 101-150 of 5548 papers

Title | Status | Hype
Multi-Head RAG: Solving Multi-Aspect Problems with LLMs | Code | 3
WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild | Code | 3
MLVU: Benchmarking Multi-task Long Video Understanding | Code | 3
Benchmarking and Improving Bird's Eye View Perception Robustness in Autonomous Driving | Code | 3
Benchmarking Large Language Models on CFLUE -- A Chinese Financial Language Understanding Evaluation Dataset | Code | 3
Are EEG-to-Text Models Working? | Code | 3
ACEGEN: Reinforcement learning of generative chemical agents for drug discovery | Code | 3
SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with Text-Rich Visual Comprehension | Code | 3
DeepFake-O-Meter v2.0: An Open Platform for DeepFake Detection | Code | 3
STaRK: Benchmarking LLM Retrieval on Textual and Relational Knowledge Bases | Code | 3
Advancing LLM Reasoning Generalists with Preference Trees | Code | 3
Real-IAD: A Real-World Multi-View Dataset for Benchmarking Versatile Industrial Anomaly Detection | Code | 3
AlphaFin: Benchmarking Financial Analysis with Retrieval-Augmented Stock-Chain Framework | Code | 3
Recurrent Drafter for Fast Speculative Decoding in Large Language Models | Code | 3
MultiHop-RAG: Benchmarking Retrieval-Augmented Generation for Multi-Hop Queries | Code | 3
AgentBoard: An Analytical Evaluation Board of Multi-turn LLM Agents | Code | 3
Benchmarking LLMs via Uncertainty Quantification | Code | 3
A Vision-Language Foundation Model to Enhance Efficiency of Chest X-ray Interpretation | Code | 3
SEED-Bench: Benchmarking Multimodal Large Language Models | Code | 3
AM-RADIO: Agglomerative Vision Foundation Model -- Reduce All Domains Into One | Code | 3
LocoMuJoCo: A Comprehensive Imitation Learning Benchmark for Locomotion | Code | 3
CRITERIA: a New Benchmarking Paradigm for Evaluating Trajectory Prediction Models for Autonomous Driving | Code | 3
Exploring Progress in Multivariate Time Series Forecasting: Comprehensive Benchmarking and Heterogeneity Analysis | Code | 3
T^3Bench: Benchmarking Current Progress in Text-to-3D Generation | Code | 3
SMPLer-X: Scaling Up Expressive Human Pose and Shape Estimation | Code | 3
Matbench Discovery -- A framework to evaluate machine learning crystal stability predictions | Code | 3
LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning | Code | 3
TorchBench: Benchmarking PyTorch with High API Surface Coverage | Code | 3
Highly Accurate Quantum Chemical Property Prediction with Uni-Mol+ | Code | 3
Automatic Intrinsic Reward Shaping for Exploration in Deep Reinforcement Learning | Code | 3
AER: Auto-Encoder with Regression for Time Series Anomaly Detection | Code | 3
CORL: Research-oriented Deep Offline Reinforcement Learning Library | Code | 3
Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ NLP Tasks | Code | 3
A Unified Framework for Rank-based Evaluation Metrics for Link Prediction in Knowledge Graphs | Code | 3
CleanRL: High-quality Single-file Implementations of Deep Reinforcement Learning Algorithms | Code | 3
Personalized Benchmarking with the Ludwig Benchmarking Toolkit | Code | 3
Benchmarking Multimodal AutoML for Tabular Data with Text Fields | Code | 3
A Survey on Performance Metrics for Object-Detection Algorithms | Code | 3
Benchmarking Automatic Machine Learning Frameworks | Code | 3
mlpack 3: a fast, flexible machine learning library | Code | 3
DrafterBench: Benchmarking Large Language Models for Tasks Automation in Civil Engineering | Code | 2
GDGB: A Benchmark for Generative Dynamic Text-Attributed Graph Learning | Code | 2
PocketVina Enables Scalable and Highly Accurate Physically Valid Docking through Multi-Pocket Conditioning | Code | 2
TAB: Unified Benchmarking of Time Series Anomaly Detection Methods | Code | 2
BMFM-RNA: An Open Framework for Building and Evaluating Transcriptomic Foundation Models | Code | 2
SEC-bench: Automated Benchmarking of LLM Agents on Real-World Software Security Tasks | Code | 2
SDialog: A Python Toolkit for Synthetic Dialogue Generation and Analysis | Code | 2
IntPhys 2: Benchmarking Intuitive Physics Understanding In Complex Synthetic Environments | Code | 2
MegaHan97K: A Large-Scale Dataset for Mega-Category Chinese Character Recognition with over 97K Categories | Code | 2
GSCodec Studio: A Modular Framework for Gaussian Splat Compression | Code | 2
Page 3 of 111

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified