Benchmarking

Papers

Showing 941–950 of 5548 papers

Title	Date	Tasks	Status	Hype
Benchmarking the rationality of AI decision making using the transitivity axiom	Feb 14, 2025	Benchmarking, Decision Making	Unverified	0
Forecasting time series with constraints	Feb 14, 2025	Additive models, Benchmarking	Code Available	0
A Survey on LLM-based News Recommender Systems	Feb 13, 2025	Benchmarking, Fairness	Unverified	0
AT-Drone: Benchmarking Adaptive Teaming in Multi-Drone Pursuit	Feb 13, 2025	Benchmarking, Edge-computing	Unverified	0
MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency	Feb 13, 2025	Benchmarking, Math	Unverified	0
Do LLMs Recognize Your Preferences? Evaluating Personalized Preference Following in LLMs	Feb 13, 2025	Benchmarking, Retrieval	Code Available	1
Beyond the Singular: The Essential Role of Multiple Generations in Effective Benchmark Evaluation and Analysis	Feb 13, 2025	Benchmarking	Unverified	0
Standardisation of Convex Ultrasound Data Through Geometric Analysis and Augmentation	Feb 13, 2025	Benchmarking	Unverified	0
EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents	Feb 13, 2025	Benchmarking	Unverified	0
Zero-shot generation of synthetic neurosurgical data with large language models	Feb 13, 2025	Benchmarking, De-identification	Code Available	0

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified