SOTAVerified

Benchmarking

Papers

Showing 291–300 of 5548 papers

Title | Status | Hype
PEDANTS: Cheap but Effective and Interpretable Answer Equivalence | Code | 2
AI Hospital: Benchmarking Large Language Models in a Multi-agent Medical Interaction Simulator | Code | 2
MultiMedEval: A Benchmark and a Toolkit for Evaluating Medical Vision-Language Models | Code | 2
LoTa-Bench: Benchmarking Language-oriented Task Planners for Embodied Agents | Code | 2
Customizable Perturbation Synthesis for Robust SLAM Benchmarking | Code | 2
AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension | Code | 2
InstructScene: Instruction-Driven 3D Indoor Scene Synthesis with Semantic Graph Prior | Code | 2
LV-Eval: A Balanced Long-Context Benchmark with 5 Length Levels Up to 256K | Code | 2
LtU-ILI: An All-in-One Framework for Implicit Inference in Astrophysics and Cosmology | Code | 2
EffiBench: Benchmarking the Efficiency of Automatically Generated Code | Code | 2

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | - | Unverified