SOTAVerified|Agents Browse Leaderboard About

Benchmarking

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 241–250 of 5548 papers

Title	Date	Tasks	Status	Hype
UniGen: A Unified Framework for Textual Dataset Generation Using Large Language Models	Jun 27, 2024	AttributeBenchmarking	CodeCode Available	2
MathOdyssey: Benchmarking Mathematical Problem-Solving Skills in Large Language Models Using Odyssey Math Data	Jun 26, 2024	BenchmarkingMath	CodeCode Available	2
GenRL: Multimodal-foundation world models for generalization in embodied agents	Jun 26, 2024	BenchmarkingReinforcement Learning (RL)	CodeCode Available	2
Leave No Document Behind: Benchmarking Long-Context LLMs with Extended Multi-Doc QA	Jun 25, 2024	BenchmarkingLong-Context Understanding	CodeCode Available	2
From Perfect to Noisy World Simulation: Customizable Embodied Multi-modal Perturbations for SLAM Robustness Benchmarking	Jun 24, 2024	BenchmarkingNeRF	CodeCode Available	2
DreamBench++: A Human-Aligned Benchmark for Personalized Image Generation	Jun 24, 2024	BenchmarkingImage Generation	CodeCode Available	2
FaceScore: Benchmarking and Enhancing Face Quality in Human Generation	Jun 24, 2024	BenchmarkingDenoising	CodeCode Available	2
Towards Open Respiratory Acoustic Foundation Models: Pretraining and Benchmarking	Jun 23, 2024	Benchmarking	CodeCode Available	2
GenoTEX: An LLM Agent Benchmark for Automated Gene Expression Data Analysis	Jun 21, 2024	AI AgentAutoML	CodeCode Available	2
Benchmarking Uncertainty Quantification Methods for Large Language Models with LM-Polygraph	Jun 21, 2024	BenchmarkingText Generation	CodeCode Available	2

Show:10 25 50

← PrevPage 25 of 555Next →

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified