SOTAVerified

Benchmarking

Papers

Showing 101-110 of 5548 papers

| Title | Status | Hype |
|---|---|---|
| Multi-Head RAG: Solving Multi-Aspect Problems with LLMs | Code | 3 |
| WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild | Code | 3 |
| MLVU: Benchmarking Multi-task Long Video Understanding | Code | 3 |
| Benchmarking and Improving Bird's Eye View Perception Robustness in Autonomous Driving | Code | 3 |
| Benchmarking Large Language Models on CFLUE -- A Chinese Financial Language Understanding Evaluation Dataset | Code | 3 |
| Are EEG-to-Text Models Working? | Code | 3 |
| ACEGEN: Reinforcement learning of generative chemical agents for drug discovery | Code | 3 |
| SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with Text-Rich Visual Comprehension | Code | 3 |
| DeepFake-O-Meter v2.0: An Open Platform for DeepFake Detection | Code | 3 |
| STaRK: Benchmarking LLM Retrieval on Textual and Relational Knowledge Bases | Code | 3 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GPT-4 Turbo | ACC | 0.56 | - | Unverified |