SOTAVerified

Benchmarking

Papers

Showing 851-860 of 5548 papers

Title | Status | Hype
Benchmarking Generation and Evaluation Capabilities of Large Language Models for Instruction Controllable Summarization | Code | 1
MAgIC: Investigation of Large Language Model Powered Multi-Agent in Cognition, Adaptability, Rationality and Collaboration | Code | 1
Combinatorial Optimization with Policy Adaptation using Latent Space Search | Code | 1
Benchmarking PtO and PnO Methods in the Predictive Combinatorial Optimization Regime | Code | 1
WaterBench: Towards Holistic Evaluation of Watermarks for Large Language Models | Code | 1
Flames: Benchmarking Value Alignment of LLMs in Chinese | Code | 1
CloudEval-YAML: A Practical Benchmark for Cloud Configuration Generation | Code | 1
MultiIoT: Benchmarking Machine Learning for the Internet of Things | Code | 1
TencentLLMEval: A Hierarchical Evaluation of Real-World Capabilities for Human-Aligned LLMs | Code | 1
The voraus-AD Dataset for Anomaly Detection in Robot Applications | Code | 1

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified