SOTAVerified

Benchmarking

Papers

Showing 3321-3330 of 5548 papers

Title | Status | Hype
Towards Explainability and Fairness in Swiss Judgement Prediction: Benchmarking on a Multilingual Dataset | | 0
Benchmarking LLMs on the Semantic Overlap Summarization Task | | 0
Partial Rankings of Optimizers | Code | 0
HypoTermQA: Hypothetical Terms Dataset for Benchmarking Hallucination Tendency of LLMs | Code | 0
E(3)-equivariant models cannot learn chirality: Field-based molecular generation | | 0
Decoding Intelligence: A Framework for Certifying Knowledge Comprehension in LLMs | | 0
Benchmarking Observational Studies with Experimental Data under Right-Censoring | | 0
Benchmarking the Robustness of Panoptic Segmentation for Automated Driving | | 0
GenCeption: Evaluate Multimodal LLMs with Unlabeled Unimodal Data | Code | 0
PQA: Zero-shot Protein Question Answering for Free-form Scientific Enquiry with Large Language Models | Code | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified