SOTAVerified

Benchmarking

Papers

Showing 4451-4460 of 5548 papers

Title | Status | Hype
Large-scale Ridesharing DARP Instances Based on Real Travel Demand | Code | 0
Calibrating Pre-trained Language Classifiers on LLM-generated Noisy Labels via Iterative Refinement | Code | 0
JExplore: Design Space Exploration Tool for Nvidia Jetson Boards | Code | 0
Anchor Points: Benchmarking Models with Much Fewer Examples | Code | 0
Laughing Heads: Can Transformers Detect What Makes a Sentence Funny? | Code | 0
THaMES: An End-to-End Tool for Hallucination Mitigation and Evaluation in Large Language Models | Code | 0
JATE 2.0: Java Automatic Term Extraction with Apache Solr | Code | 0
JALMBench: Benchmarking Jailbreak Vulnerabilities in Audio Language Models | Code | 0
Calibrated Adaptive Probabilistic ODE Solvers | Code | 0
Is Your Model Fairly Certain? Uncertainty-Aware Fairness Evaluation for LLMs | Code | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | - | Unverified