SOTAVerified

Benchmarking

Papers

Showing 1671-1680 of 5548 papers

Title | Status | Hype
Is Your Model Fairly Certain? Uncertainty-Aware Fairness Evaluation for LLMs | Code | 0
JALMBench: Benchmarking Jailbreak Vulnerabilities in Audio Language Models | Code | 0
A Benchmark on Extremely Weakly Supervised Text Classification: Reconcile Seed Matching and Prompting Approaches | Code | 0
Certifiable Black-Box Attacks with Randomized Adversarial Examples: Breaking Defenses with Provable Confidence | Code | 0
CEBench: A Benchmarking Toolkit for the Cost-Effectiveness of LLM Pipelines | Code | 0
Benchmarking and Improving Compositional Generalization of Multi-aspect Controllable Text Generation | Code | 0
ISImed: A Framework for Self-Supervised Learning using Intrinsic Spatial Information in Medical Images | Code | 0
DyKnow: Dynamically Verifying Time-Sensitive Factual Knowledge in LLMs | Code | 0
Joint Multi-Scale Tone Mapping and Denoising for HDR Image Enhancement | Code | 0
Keep Security! Benchmarking Security Policy Preservation in Large Language Model Contexts Against Indirect Attacks in Question Answering | Code | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified