SOTAVerified

Benchmarking

Papers

Showing 2621-2630 of 5548 papers

Title | Status | Hype
An Auditing Test To Detect Behavioral Shift in Language Models | Code | 0
FairMT-Bench: Benchmarking Fairness for Multi-turn Dialogue in Conversational LLMs | | 0
A Survey of Small Language Models | | 0
OReole-FM: successes and challenges toward billion-parameter foundation models for high-resolution satellite imagery | | 0
Benchmarking Graph Learning for Drug-Drug Interaction Prediction | | 0
Towards Better Open-Ended Text Generation: A Multicriteria Evaluation Framework | Code | 0
Conditional diffusions for amortized neural posterior estimation | Code | 0
From Blind Solvers to Logical Thinkers: Benchmarking LLMs' Logical Integrity on Faulty Mathematical Problems | | 0
Benchmarking Foundation Models on Exceptional Cases: Dataset Creation and Validation | Code | 0
Benchmarking Floworks against OpenAI & Anthropic: A Novel Framework for Enhanced LLM Function Calling | | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified