SOTAVerified

Benchmarking

Papers

Showing 91-100 of 5548 papers

Title | Status | Hype
Benchmarking Large Language Models on CFLUE -- A Chinese Financial Language Understanding Evaluation Dataset | Code | 3
Benchmarking LLMs via Uncertainty Quantification | Code | 3
Embodied Agent Interface: Benchmarking LLMs for Embodied Decision Making | Code | 3
Benchmarking Multimodal Retrieval Augmented Generation with Dynamic VQA Dataset and Self-adaptive Planning Agent | Code | 3
Exploring Progress in Multivariate Time Series Forecasting: Comprehensive Benchmarking and Heterogeneity Analysis | Code | 3
BERGEN: A Benchmarking Library for Retrieval-Augmented Generation | Code | 3
AndroidLab: Training and Systematic Benchmarking of Android Autonomous Agents | Code | 3
Automatic Intrinsic Reward Shaping for Exploration in Deep Reinforcement Learning | Code | 3
DISCOVERYWORLD: A Virtual Environment for Developing and Evaluating Automated Scientific Discovery Agents | Code | 3
A Unified Framework for Rank-based Evaluation Metrics for Link Prediction in Knowledge Graphs | Code | 3

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified