SOTAVerified|Agents Browse Leaderboard About

Benchmarking

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 291–300 of 5548 papers

Title	Date	Tasks	Status	Hype
PEDANTS: Cheap but Effective and Interpretable Answer Equivalence	Feb 17, 2024	BenchmarkingForm	CodeCode Available	2
AI Hospital: Benchmarking Large Language Models in a Multi-agent Medical Interaction Simulator	Feb 15, 2024	BenchmarkingDiagnostic	CodeCode Available	2
MultiMedEval: A Benchmark and a Toolkit for Evaluating Medical Vision-Language Models	Feb 14, 2024	BenchmarkingDiversity	CodeCode Available	2
LoTa-Bench: Benchmarking Language-oriented Task Planners for Embodied Agents	Feb 13, 2024	BenchmarkingModel Selection	CodeCode Available	2
Customizable Perturbation Synthesis for Robust SLAM Benchmarking	Feb 12, 2024	BenchmarkingSimultaneous Localization and Mapping	CodeCode Available	2
AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension	Feb 12, 2024	2kAutomatic Speech Recognition	CodeCode Available	2
InstructScene: Instruction-Driven 3D Indoor Scene Synthesis with Semantic Graph Prior	Feb 7, 2024	BenchmarkingDecoder	CodeCode Available	2
LV-Eval: A Balanced Long-Context Benchmark with 5 Length Levels Up to 256K	Feb 6, 2024	16kBenchmarking	CodeCode Available	2
LtU-ILI: An All-in-One Framework for Implicit Inference in Astrophysics and Cosmology	Feb 6, 2024	AllBenchmarking	CodeCode Available	2
EffiBench: Benchmarking the Efficiency of Automatically Generated Code	Feb 3, 2024	BenchmarkingCode Completion	CodeCode Available	2

Show:10 25 50

← PrevPage 30 of 555Next →

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified