SOTAVerified

Benchmarking

Papers

Showing 2481-2490 of 5548 papers

Title | Status | Hype
Is LLM-as-a-Judge Robust? Investigating Universal Adversarial Attacks on Zero-shot LLM Assessment | Code | 1
CODIS: Benchmarking Context-Dependent Visual Comprehension for Multimodal Large Language Models |  | 0
A Unified Framework and Dataset for Assessing Societal Bias in Vision-Language Models |  | 0
KetGPT -- Dataset Augmentation of Quantum Circuits using Transformers |  | 0
CHILI: Chemically-Informed Large-scale Inorganic Nanomaterials Dataset for Advancing Graph Machine Learning | Code | 1
Benchmarking Retrieval-Augmented Generation for Medicine | Code | 4
CausalGym: Benchmarking causal interpretability methods on linguistic tasks | Code | 2
Synthetic location trajectory generation using categorical diffusion models | Code | 0
Event-Based Motion Magnification | Code | 2
FeB4RAG: Evaluating Federated Search in the Context of Retrieval Augmented Generation |  | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 |  | Unverified