LINGOLY: A Benchmark of Olympiad-Level Linguistic Reasoning Puzzles in Low-Resource and Extinct Languages
Andrew M. Bean, Simi Hellsten, Harry Mayne, Jabez Magomere, Ethan A. Chi, Ryan Chi, Scott A. Hale, Hannah Rose Kirk
Code (official): github.com/am-bean/lingOly
Abstract
In this paper, we present LingOly, a novel benchmark for advanced reasoning abilities in large language models. Using challenging Linguistic Olympiad puzzles, we evaluate (i) capabilities for in-context identification and generalisation of linguistic patterns in very low-resource or extinct languages, and (ii) abilities to follow complex task instructions. The LingOly benchmark covers more than 90 mostly low-resource languages, minimising issues of data contamination, and contains 1,133 problems across 6 formats and 5 levels of human difficulty. We assess performance with both direct accuracy and comparison to a no-context baseline to penalise memorisation. Scores from 11 state-of-the-art LLMs show the benchmark to be challenging: models perform poorly on the higher-difficulty problems, where even the top model achieved only 38.7% accuracy, a 24.7% improvement over the no-context baseline. Large closed models typically outperform open models, and in general, the higher-resource the language, the better the scores. These results indicate that, in the absence of memorisation, true multi-step out-of-domain reasoning remains a challenge for current language models.
Benchmark Results
| Dataset | Model | Metric | Claimed (%) | Verified | Status |
|---|---|---|---|---|---|
| LingOly | Claude 3 Opus | Delta_NoContext | 28.8 | — | Unverified |
| LingOly | GPT-4o | Delta_NoContext | 25.1 | — | Unverified |
| LingOly | Gemini 1.5 Pro | Delta_NoContext | 23.4 | — | Unverified |
| LingOly | GPT-4 | Delta_NoContext | 21.5 | — | Unverified |
| LingOly | Command R+ | Delta_NoContext | 11.6 | — | Unverified |
| LingOly | GPT-3.5 | Delta_NoContext | 11.2 | — | Unverified |
| LingOly | Mixtral 8x7B | Delta_NoContext | 6.4 | — | Unverified |
| LingOly | Llama 3 8B | Delta_NoContext | 4.9 | — | Unverified |
| LingOly | Llama 3 70B | Delta_NoContext | 2.9 | — | Unverified |
| LingOly | Gemma 7B | Delta_NoContext | 2.2 | — | Unverified |
| LingOly | Llama 2 70B | Delta_NoContext | 1.1 | — | Unverified |
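The Delta_NoContext metric reported above follows from the abstract's description: score each model once with the puzzle context and once without, then take the difference in exact-match accuracy so that answers recoverable from memorisation alone earn no credit. A minimal sketch of that computation, assuming a simple normalised exact-match scorer (function names here are illustrative, not the paper's code):

```python
def exact_match_accuracy(predictions, references):
    """Percentage of predictions exactly matching the reference answer,
    after trivial whitespace/case normalisation."""
    assert len(predictions) == len(references)
    correct = sum(p.strip().lower() == r.strip().lower()
                  for p, r in zip(predictions, references))
    return 100.0 * correct / len(references)

def delta_no_context(with_context_preds, no_context_preds, references):
    """Improvement over the no-context baseline: accuracy with the puzzle
    context minus accuracy without it, penalising memorised answers."""
    return (exact_match_accuracy(with_context_preds, references)
            - exact_match_accuracy(no_context_preds, references))
```

For example, a model answering both of two items correctly with context but only one without would score `delta_no_context(["dog", "cat"], ["dog", "bird"], ["dog", "cat"])`, i.e. 100.0 − 50.0 = 50.0.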