TinyQA Benchmark++
An ultra-lightweight evaluation suite and Python package designed to expose critical failures in Large Language Model (LLM) systems within seconds.
Benchmark Results
| # | Model | Metric | Claimed (%) | Verified | Status |
|---|---|---|---|---|---|
| 1 | gemma-3-12b | Exact Match | 90.4 | — | Unverified |
| 2 | gemma-3-4b | Exact Match | 86.5 | — | Unverified |
| 3 | llama-3.2-3b-instruct | Exact Match | 84.6 | — | Unverified |
| 4 | mistral-24b-instruct | Exact Match | 84.6 | — | Unverified |
| 5 | ministral-8b | Exact Match | 80.8 | — | Unverified |
| 6 | ministral-3b | Exact Match | 76.9 | — | Unverified |
| 7 | llama-3.2-1b-instruct | Exact Match | 53.8 | — | Unverified |
| 8 | mistral-7b-instruct | Exact Match | 50.0 | — | Unverified |
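The Exact Match scores above can be reproduced in principle with a simple scorer. The sketch below is illustrative, not the TinyQA Benchmark++ API: the function names are hypothetical, and the normalization step (case-folding and whitespace stripping, a common convention for QA benchmarks) is an assumption about how the suite compares answers.

```python
def exact_match(prediction: str, gold: str) -> bool:
    """Check whether a model answer matches the gold answer.

    Normalization (strip surrounding whitespace, fold case) is an
    assumed convention; the actual suite may normalize differently.
    """
    return prediction.strip().casefold() == gold.strip().casefold()


def exact_match_score(predictions: list[str], golds: list[str]) -> float:
    """Percentage of predictions that exactly match their gold answers."""
    assert len(predictions) == len(golds), "one prediction per gold answer"
    hits = sum(exact_match(p, g) for p, g in zip(predictions, golds))
    return 100.0 * hits / len(golds)


# Example: 1 of 2 answers matches, giving a score of 50.0.
print(exact_match_score(["Paris ", "1066"], ["paris", "1815"]))  # → 50.0
```

A score such as 90.4 above would then be the `exact_match_score` over the full question set for that model.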