SOTAVerified

Benchmarking

Papers

Showing 271-280 of 5548 papers

Title | Status | Hype
AssistedDS: Benchmarking How External Domain Knowledge Assists LLMs in Automated Data Science | | 0
DeepResearchGym: A Free, Transparent, and Reproducible Evaluation Sandbox for Deep Research | | 0
SeePhys: Does Seeing Help Thinking? -- Benchmarking Vision-Based Physics Reasoning | Code | 1
Are Vision Language Models Ready for Clinical Diagnosis? A 3D Medical Benchmark for Tumor-centric Visual Question Answering | Code | 1
Benchmarking Laparoscopic Surgical Image Restoration and Beyond | Code | 2
SpokenNativQA: Multilingual Everyday Spoken Queries for LLMs | | 0
Where Paths Collide: A Comprehensive Survey of Classic and Learning-Based Multi-Agent Pathfinding | | 0
EnvSDD: Benchmarking Environmental Sound Deepfake Detection | | 0
Retrieval-Augmented Generation for Service Discovery: Chunking Strategies and Benchmarking | | 0
Benchmarking Large Language Models for Cyberbullying Detection in Real-World YouTube Comments | | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified