SOTAVerified

Benchmarking

Papers

Showing 301-310 of 5548 papers

Title | Status | Hype
Planning, Creation, Usage: Benchmarking LLMs for Comprehensive Tool Utilization in Real-World Complex Scenarios | Code | 2
R-Judge: Benchmarking Safety Risk Awareness for LLM Agents | Code | 2
WAVES: Benchmarking the Robustness of Image Watermarks | Code | 2
Authorship Obfuscation in Multilingual Machine-Generated Text Detection | Code | 2
InfiAgent-DABench: Evaluating Agents on Data Analysis Tasks | Code | 2
A Call to Reflect on Evaluation Practices for Age Estimation: Comparative Analysis of the State-of-the-Art and a Unified Benchmark | Code | 2
EQ-Bench: An Emotional Intelligence Benchmark for Large Language Models | Code | 2
AlignBench: Benchmarking Chinese Alignment of Large Language Models | Code | 2
Biomedical knowledge graph-optimized prompt generation for large language models | Code | 2
SEED-Bench-2: Benchmarking Multimodal Large Language Models | Code | 2

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified