SOTAVerified

Benchmarking

Papers

Showing 1631-1640 of 5548 papers

| Title | Status | Hype |
|---|---|---|
| DIMCIM: A Quantitative Evaluation Framework for Default-mode Diversity and Generalization in Text-to-Image Generative Models | | 0 |
| Knowledge-guided Contextual Gene Set Analysis Using Large Language Models | | 0 |
| Seeing in the Dark: Benchmarking Egocentric 3D Vision with the Oxford Day-and-Night Dataset | | 0 |
| MedAgentGym: Training LLM Agents for Code-Based Medical Reasoning at Scale | | 0 |
| Curse of Slicing: Why Sliced Mutual Information is a Deceptive Measure of Statistical Dependence | | 0 |
| MELABenchv1: Benchmarking Large Language Models against Smaller Fine-Tuned Models for Low-Resource Maltese NLP | | 0 |
| HSSBench: Benchmarking Humanities and Social Sciences Ability for Multimodal Large Language Models | Code | 0 |
| A Kernel-Based Approach for Accurate Steady-State Detection in Performance Time Series | Code | 0 |
| Generating Automotive Code: Large Language Models for Software Development and Verification in Safety-Critical Systems | | 0 |
| CETBench: A Novel Dataset constructed via Transformations over Programs for Benchmarking LLMs for Code-Equivalence Checking | | 0 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GPT-4 Turbo | ACC | 0.56 | | Unverified |