
LLM-BABYBENCH: Understanding and Evaluating Grounded Planning and Reasoning in LLMs

2025-05-17

Omar Choukrani, Idriss Malek, Daniil Orel, Zhuohan Xie, Zangir Iklassov, Martin Takáč, Salem Lahlou


Abstract

Assessing the capacity of Large Language Models (LLMs) to plan and reason within the constraints of interactive environments is crucial for developing capable AI agents. We introduce LLM-BabyBench, a new benchmark suite designed specifically for this purpose. Built upon a textual adaptation of the procedurally generated BabyAI grid world, this suite evaluates LLMs on three fundamental aspects of grounded intelligence: (1) predicting the consequences of actions on the environment state (Predict task), (2) generating sequences of low-level actions to achieve specified objectives (Plan task), and (3) decomposing high-level instructions into coherent subgoal sequences (Decompose task). We detail the methodology for generating the three corresponding datasets (LLM-BabyBench-Predict, -Plan, -Decompose) by extracting structured information from an expert agent operating within the text-based environment. Furthermore, we provide a standardized evaluation harness and metrics, including environment interaction for validating generated plans, to facilitate reproducible assessment of diverse LLMs. Initial baseline results highlight the challenges posed by these grounded reasoning tasks. The benchmark suite, datasets, data generation code, and evaluation code are made publicly available (code: https://github.com/choukrani/llm-babybench, datasets: https://huggingface.co/datasets/salem-mbzuai/LLM-BabyBench).
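The abstract's mention of "environment interaction for validating generated plans" can be illustrated with a minimal sketch. This is not the actual LLM-BabyBench harness; it is a toy grid-world replay loop, with an assumed BabyAI-style action vocabulary (`left`, `right`, `forward`), showing the general idea: execute a generated action sequence in the environment and check whether it reaches the goal.

```python
# Hypothetical sketch of plan validation via environment replay.
# All names and the action vocabulary here are illustrative assumptions,
# not the actual LLM-BabyBench evaluation code.

DIRS = [(0, -1), (1, 0), (0, 1), (-1, 0)]  # up, right, down, left

def validate_plan(start, heading, goal, actions, grid_size=8):
    """Replay low-level actions on a toy grid and report whether the
    agent ends on the goal cell. `heading` indexes into DIRS."""
    x, y = start
    d = heading
    for act in actions:
        if act == "left":
            d = (d - 1) % 4
        elif act == "right":
            d = (d + 1) % 4
        elif act == "forward":
            nx, ny = x + DIRS[d][0], y + DIRS[d][1]
            # stay in place if the move would leave the grid
            if 0 <= nx < grid_size and 0 <= ny < grid_size:
                x, y = nx, ny
        else:
            return False  # malformed action invalidates the plan
    return (x, y) == goal

# Example: facing right from (0, 0), two forward steps reach (2, 0).
print(validate_plan((0, 0), 1, (2, 0), ["forward", "forward"]))  # True
```

Grounding success in executed environment state, rather than string-matching the action text, is what distinguishes this style of evaluation from plain text comparison.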
