SOTAVerified

Benchmarking

Papers

Showing 211-220 of 5548 papers

| Title | Status | Hype |
| --- | --- | --- |
| Bench4KE: Benchmarking Automated Competency Question Generation | Code | 1 |
| MetaFaith: Faithful Natural Language Uncertainty Expression in LLMs | Code | 0 |
| Draw ALL Your Imagine: A Holistic Benchmark and Agent Framework for Complex Instruction-based Image Generation | Code | 1 |
| ByzFL: Research Framework for Robust Federated Learning | Code | 1 |
| Is Your Model Fairly Certain? Uncertainty-Aware Fairness Evaluation for LLMs | Code | 0 |
| MSQA: Benchmarking LLMs on Graduate-Level Materials Science Reasoning and Knowledge | | 0 |
| Socratic-PRMBench: Benchmarking Process Reward Models with Systematic Reasoning Patterns | | 0 |
| R2I-Bench: Benchmarking Reasoning-Driven Text-to-Image Generation | | 0 |
| SNS-Bench-VL: Benchmarking Multimodal Large Language Models in Social Networking Services | Code | 0 |
| Diagnosing and Addressing Pitfalls in KG-RAG Datasets: Toward More Reliable Benchmarking | | 0 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | GPT-4 Turbo | ACC | 0.56 | | Unverified |