SOTAVerified

TinyQA Benchmark++

Ultra-lightweight evaluation suite and Python package designed to expose critical failures in Large Language Model (LLM) systems within seconds.

Benchmark Results

#  Model                  Metric       Claimed  Verified  Status
1  gemma-3-12b            Exact Match  90.4     -         Unverified
2  gemma-3-4b             Exact Match  86.5     -         Unverified
3  llama-3.2-3b-instruct  Exact Match  84.6     -         Unverified
4  mistral-24b-instruct   Exact Match  84.6     -         Unverified
5  ministral-8b           Exact Match  80.8     -         Unverified
6  ministral-3b           Exact Match  76.9     -         Unverified
7  llama-3.2-1b-instruct  Exact Match  53.8     -         Unverified
8  mistral-7b-instruct    Exact Match  50       -         Unverified
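
The Exact Match figures in the results above are percentages of answers that match a reference answer. As a rough illustration of how such a score could be computed, here is a minimal sketch. It is not the TinyQA Benchmark++ package's API: the function names `normalize` and `exact_match` are hypothetical, and the normalization (lowercasing and collapsing whitespace) is an assumption that may differ from what the benchmark actually applies.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (an assumed normalization, not the benchmark's)."""
    return " ".join(text.lower().split())


def exact_match(predictions: list[str], references: list[str]) -> float:
    """Return the percentage of predictions that equal their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return 100.0 * hits / len(references)


if __name__ == "__main__":
    # Hypothetical toy data: two of three answers match their reference.
    preds = ["Paris", " 4 ", "blue"]
    refs = ["paris", "4", "red"]
    print(round(exact_match(preds, refs), 1))
```

Because the score is all-or-nothing per item, a stricter or looser normalization step can shift the reported percentage, which is one reason claimed and verified values may differ.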