SOTAVerified

Benchmarking

Papers

Showing 161-170 of 5548 papers

Title | Status | Hype
MINERVA: Evaluating Complex Video Reasoning | Code | 2
BrowseComp-ZH: Benchmarking Web Browsing Ability of Large Language Models in Chinese | Code | 2
WASP: Benchmarking Web Agent Security Against Prompt Injection Attacks | Code | 2
Know Me, Respond to Me: Benchmarking LLMs for Dynamic User Profiling and Personalized Responses at Scale | Code | 2
HypoBench: Towards Systematic and Principled Benchmarking for Hypothesis Generation | Code | 2
TorchFX: A modern approach to Audio DSP with PyTorch and GPU acceleration | Code | 2
Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing | Code | 2
Benchmarking Synthetic Tabular Data: A Multi-Dimensional Evaluation Framework | Code | 2
Decouple and Track: Benchmarking and Improving Video Diffusion Transformers for Motion Transfer | Code | 2
VenusFactory: A Unified Platform for Protein Engineering Data Retrieval and Language Model Fine-Tuning | Code | 2

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified