TinyQA Benchmark++
An ultra-lightweight evaluation suite and Python package designed to expose critical failures in Large Language Model (LLM) systems within seconds.
Benchmark Results
| # | Model | Metric | Claimed (%) | Verified | Status |
|---|---|---|---|---|---|
| 1 | gemma-3-12b | Exact Match | 90.4 | — | Unverified |
| 2 | gemma-3-4b | Exact Match | 86.5 | — | Unverified |
| 3 | llama-3.2-3b-instruct | Exact Match | 84.6 | — | Unverified |
| 4 | mistral-24b-instruct | Exact Match | 84.6 | — | Unverified |
| 5 | ministral-8b | Exact Match | 80.8 | — | Unverified |
| 6 | ministral-3b | Exact Match | 76.9 | — | Unverified |
| 7 | llama-3.2-1b-instruct | Exact Match | 53.8 | — | Unverified |
| 8 | mistral-7b-instruct | Exact Match | 50.0 | — | Unverified |
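The Exact Match scores above can be reproduced in principle with a simple scorer. The sketch below is illustrative, not the TinyQA Benchmark++ API: the function names are hypothetical, and the normalization step (case-folding and whitespace stripping, a common convention for QA benchmarks) is an assumption about how the suite compares answers.

```python
def exact_match(prediction: str, gold: str) -> bool:
    """Check whether a model answer matches the gold answer.

    Normalization (strip surrounding whitespace, fold case) is an
    assumed convention; the actual suite may normalize differently.
    """
    return prediction.strip().casefold() == gold.strip().casefold()


def exact_match_score(predictions: list[str], golds: list[str]) -> float:
    """Percentage of predictions that exactly match their gold answers."""
    assert len(predictions) == len(golds), "one prediction per gold answer"
    hits = sum(exact_match(p, g) for p, g in zip(predictions, golds))
    return 100.0 * hits / len(golds)


# Example: 1 of 2 answers matches, giving a score of 50.0.
print(exact_match_score(["Paris ", "1066"], ["paris", "1815"]))  # → 50.0
```

A score such as 90.4 above would then be the `exact_match_score` over the full question set for that model.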