SOTAVerified|Agents Browse Leaderboard About

Benchmarking

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 91–100 of 5548 papers

Title	Date	Tasks	Status	Hype
Benchmarking Large Language Models on CFLUE -- A Chinese Financial Language Understanding Evaluation Dataset	May 17, 2024	16kBenchmarking	CodeCode Available	3
Benchmarking LLMs via Uncertainty Quantification	Jan 23, 2024	BenchmarkingUncertainty Quantification	CodeCode Available	3
Embodied Agent Interface: Benchmarking LLMs for Embodied Decision Making	Oct 9, 2024	BenchmarkingDecision Making	CodeCode Available	3
Benchmarking Multimodal Retrieval Augmented Generation with Dynamic VQA Dataset and Self-adaptive Planning Agent	Nov 5, 2024	BenchmarkingHallucination	CodeCode Available	3
Exploring Progress in Multivariate Time Series Forecasting: Comprehensive Benchmarking and Heterogeneity Analysis	Oct 9, 2023	BenchmarkingMultivariate Time Series Forecasting	CodeCode Available	3
BERGEN: A Benchmarking Library for Retrieval-Augmented Generation	Jul 1, 2024	BenchmarkingRAG	CodeCode Available	3
AndroidLab: Training and Systematic Benchmarking of Android Autonomous Agents	Oct 31, 2024	Benchmarking	CodeCode Available	3
Automatic Intrinsic Reward Shaping for Exploration in Deep Reinforcement Learning	Jan 26, 2023	BenchmarkingDeep Reinforcement Learning	CodeCode Available	3
DISCOVERYWORLD: A Virtual Environment for Developing and Evaluating Automated Scientific Discovery Agents	Jun 10, 2024	Benchmarkingscientific discovery	CodeCode Available	3
A Unified Framework for Rank-based Evaluation Metrics for Link Prediction in Knowledge Graphs	Mar 14, 2022	BenchmarkingGraph Embedding	CodeCode Available	3

Show:10 25 50

← PrevPage 10 of 555Next →

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified