SOTAVerified

LogicSkills: A Structured Benchmark for Formal Reasoning in Large Language Models

2026-03-17

Brian Rabern, Philipp Mondorf, Barbara Plank


Abstract

Large language models perform well on many logical reasoning benchmarks, but it remains unclear which core logical skills they truly master. To address this, we introduce LogicSkills, a benchmark that isolates three fundamental logical skills: (i) formal symbolization: translating premises into first-order logic; (ii) countermodel construction: showing that an argument is logically invalid by constructing a finite countermodel; and (iii) validity assessment: determining whether a conclusion follows from a set of premises. Items are drawn from the two-variable fragment of first-order logic without identity and are presented in both English and a Carrollian nonce-word language. All instances are solver-verified with Z3 for correctness and non-triviality. Across conventional instruction-tuned LLMs, performance is high on validity assessment but substantially lower on formal symbolization and countermodel construction, highlighting that high task-level accuracy can mask weaknesses in core logical skills. In contrast, recent reasoning-tuned models perform strongly across all three tasks, suggesting a more systematic logical skill profile.
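The countermodel-construction task described in the abstract can be illustrated with a minimal sketch: for a small argument over unary predicates, brute-force search for a finite interpretation under which every premise holds but the conclusion fails. The predicate names (A, B, C), the example argument, and the helper functions below are illustrative assumptions, not the paper's benchmark items or code.

```python
from itertools import product

def all_subsets(domain):
    # Yield every subset of the domain as a frozenset.
    for bits in product([False, True], repeat=len(domain)):
        yield frozenset(d for d, b in zip(domain, bits) if b)

def find_countermodel(premises, conclusion, domain):
    """Search all interpretations of three unary predicates A, B, C
    over a finite domain. Return a model (dict of extensions) where
    every premise is true and the conclusion is false, or None if
    no countermodel of that size exists."""
    for A, B, C in product(all_subsets(domain), repeat=3):
        model = {"A": A, "B": B, "C": C}
        if all(p(model) for p in premises) and not conclusion(model):
            return model
    return None

# Hypothetical argument: "All A are B; some B is C; therefore some A is C."
premises = [
    lambda m: all(x in m["B"] for x in m["A"]),    # forall x (A(x) -> B(x))
    lambda m: any(x in m["C"] for x in m["B"]),    # exists x (B(x) & C(x))
]
conclusion = lambda m: any(x in m["C"] for x in m["A"])  # exists x (A(x) & C(x))

model = find_countermodel(premises, conclusion, domain=[0, 1])
print(model)  # a two-element countermodel exists, so the argument is invalid
```

The same search, run to exhaustion without finding a countermodel up to some bound, is how validity can be refuted but never fully confirmed by finite enumeration; the paper's pipeline instead uses the Z3 solver, which decides these questions for the two-variable fragment.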
