Tiny QA Benchmark++: Ultra-Lightweight, Synthetic Multilingual Dataset Generation & Smoke-Tests for Continuous LLM Evaluation
Vincent Koc
Code (official): github.com/vincentkoc/tiny_qa_benchmark_pp
Abstract
Tiny QA Benchmark++ (TQB++) is an ultra-lightweight, multilingual smoke-test suite designed to give large-language-model (LLM) pipelines a unit-test-style safety-net dataset that runs in seconds at minimal cost. It was born out of the tight feedback-loop demands of building the Comet Opik prompt-optimization SDK, where waiting on heavyweight benchmarks breaks developer flow. TQB++ couples a 52-item English gold set (less than 20 kB) with a tiny synthetic-data generator, distributed as a PyPI package and built on the provider-agnostic LiteLLM library. The generator lets practitioners mint their own tiny packs in any language, domain, or difficulty, while ten ready-made packs already cover Arabic, Chinese, French, German, Japanese, Korean, Portuguese, Russian, Spanish, and Turkish. Every dataset ships with Croissant metadata and plug-and-play files for OpenAI Evals, LangChain, and standard CI tools, so teams can drop deterministic micro-benchmarks directly into pull-request gates, prompt-engineering loops, and production dashboards without touching GPU budgets. A complete TQB++ run adds only a few seconds of pipeline latency yet reliably flags prompt-template errors, tokenizer drift, and fine-tuning side effects long before full-scale suites such as MMLU or BIG-Bench would even finish configuring. The entire framework is released to accelerate continuous, resource-efficient quality assurance across the generative-AI ecosystem.
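The unit-test-style workflow the abstract describes can be sketched in a few lines: load a tiny gold pack, collect model outputs, and score them with normalized exact match as a CI gate. This is a minimal illustration, not the TQB++ API; the `question`/`answer` field names and the normalization rules are assumptions, so check the actual pack schema and scorer in the repository.

```python
# Sketch of a TQB++-style smoke test: score model outputs against a tiny
# gold set using normalized exact match. Field names are assumptions.
import string


def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace before comparing."""
    text = text.lower().strip()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def exact_match_score(gold: list[dict], predictions: list[str]) -> float:
    """Return the percentage of predictions that match the gold answers."""
    hits = sum(
        normalize(item["answer"]) == normalize(pred)
        for item, pred in zip(gold, predictions)
    )
    return 100.0 * hits / len(gold)


# Toy two-item pack standing in for the 52-item core-en gold set.
gold = [
    {"question": "What is the capital of France?", "answer": "Paris"},
    {"question": "How many days are in a week?", "answer": "7"},
]
preds = ["Paris.", "seven"]  # e.g. model completions from an LLM call
print(exact_match_score(gold, preds))  # → 50.0
```

In a CI pipeline the score would be compared against a threshold (e.g. fail the pull-request gate if it drops below a baseline), which is what makes the pack act like a unit test rather than a leaderboard benchmark.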
Benchmark Results
| Dataset | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| tinyqabenchmark_core-en | gemma-3-4b | Exact Match | 86.5 | — | Unverified |
| tinyqabenchmark_core-en | mistral-24b-instruct | Exact Match | 84.6 | — | Unverified |
| tinyqabenchmark_core-en | llama-3.2-3b-instruct | Exact Match | 84.6 | — | Unverified |
| tinyqabenchmark_core-en | ministral-8b | Exact Match | 80.8 | — | Unverified |
| tinyqabenchmark_core-en | ministral-3b | Exact Match | 76.9 | — | Unverified |
| tinyqabenchmark_core-en | llama-3.2-1b-instruct | Exact Match | 53.8 | — | Unverified |
| tinyqabenchmark_core-en | mistral-7b-instruct | Exact Match | 50.0 | — | Unverified |
| tinyqabenchmark_core-en | gemma-3-12b | Exact Match | 90.4 | — | Unverified |