SOTAVerified

Benchmarking

Papers

Showing 241–250 of 5548 papers

| Title | Status | Hype |
| --- | --- | --- |
| Evaluating Large-Vocabulary Object Detectors: The Devil is in the Details | Code | 2 |
| MIBench: A Comprehensive Framework for Benchmarking Model Inversion Attack and Defense | Code | 2 |
| FluidLab: A Differentiable Environment for Benchmarking Complex Fluid Manipulation | Code | 2 |
| DreamBench++: A Human-Aligned Benchmark for Personalized Image Generation | Code | 2 |
| LLM-Based Multi-Agent Systems are Scalable Graph Generative Models | Code | 2 |
| MMLongBench-Doc: Benchmarking Long-context Document Understanding with Visualizations | Code | 2 |
| MTVQA: Benchmarking Multilingual Text-Centric Visual Question Answering | Code | 2 |
| AI Hospital: Benchmarking Large Language Models in a Multi-agent Medical Interaction Simulator | Code | 2 |
| A Content-Driven Micro-Video Recommendation Dataset at Scale | Code | 2 |
| DrafterBench: Benchmarking Large Language Models for Tasks Automation in Civil Engineering | Code | 2 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | GPT-4 Turbo | ACC | 0.56 | | Unverified |