PECC: Problem Extraction and Coding Challenges

2024-04-29

Patrick Haller, Jonas Golde, Alan Akbik

Code

github.com/hallerpatrick/pecc
Official · In paper · ★ 14

Abstract

Recent advancements in large language models (LLMs) have showcased their exceptional abilities across various tasks, such as code generation, problem-solving, and reasoning. Existing benchmarks evaluate these tasks in isolation, yet the extent to which LLMs can understand prose-style tasks, identify the underlying problems, and then generate appropriate code solutions is still unexplored. Addressing this gap, we introduce PECC, a novel benchmark of 2396 problems derived from Advent of Code (AoC) challenges and Project Euler. Unlike conventional benchmarks, PECC requires LLMs to interpret narrative-embedded problems, extract requirements, and generate executable code. A key feature of our dataset is the complexity added by natural language prompting in chat-based evaluations, mirroring real-world instruction ambiguities. Results show that model performance varies between narrative and neutral problems, and that the math-based Euler subset is particularly challenging: GPT-3.5-Turbo passes 50% of the AoC challenges but only 8% of the Euler problems. By probing the limits of LLMs' capabilities, our benchmark provides a framework to monitor and assess the subsequent progress of LLMs as universal problem solvers.
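The abstract's requirement that models "generate executable code" implies a check that runs each generated program and compares its answer to a known solution. The following is a minimal sketch of such a check, not the authors' evaluation harness. It assumes the generated solution is a Python script that reads the puzzle input from standard input and prints a single answer; the helper names `run_candidate` and `is_correct` and the 10-second timeout are hypothetical.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Optional


def run_candidate(code: str, puzzle_input: str, timeout_s: float = 10.0) -> Optional[str]:
    """Run model-generated Python code in a subprocess (hypothetical helper).

    Returns the stripped stdout, or None if the script crashes or times out.
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "solution.py"
        script.write_text(code, encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                input=puzzle_input,      # assumption: input arrives on stdin
                capture_output=True,
                text=True,
                timeout=timeout_s,       # assumption: arbitrary time limit
            )
        except subprocess.TimeoutExpired:
            return None
    if proc.returncode != 0:
        return None
    return proc.stdout.strip()


def is_correct(code: str, puzzle_input: str, expected: str) -> bool:
    """A candidate passes if it runs cleanly and prints exactly the expected answer."""
    output = run_candidate(code, puzzle_input)
    return output is not None and output == expected.strip()


# Toy example: a "generated" solution that sums whitespace-separated integers.
candidate = "import sys\nprint(sum(int(x) for x in sys.stdin.read().split()))"
print(is_correct(candidate, "1 2 3\n", "6"))
```

Running untrusted model output directly like this is only acceptable for a sketch; a real harness would isolate the process, for example in a container.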

Tasks

Code Generation, Math, Text Generation

Benchmark Results

Dataset	Model	Metric	Claimed	Verified	Status
PECC	Claude 3 Haiku	Pass@3	27.67	—	Unverified
PECC	GPT-3.5 Turbo	Pass@3	23.75	—	Unverified
PECC	codechat-bison	Pass@3	11.39	—	Unverified
PECC	chat-bison	Pass@3	8.48	—	Unverified
PECC	Mixtral-8x7B-Instruct	Pass@3	8.35	—	Unverified
PECC	Phi-3-mini-128k-instruct	Pass@3	7.18	—	Unverified
PECC	WizardLM-2-7B	Pass@3	3.72	—	Unverified
PECC	Llama-3-8B-Instruct	Pass@3	3.1	—	Unverified
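The table reports Pass@3 as a percentage. As a rough illustration of how such a score could be computed, the sketch below assumes that a problem counts as solved when at least one of its first k attempts passes, and that the score is the share of solved problems. This interpretation, the function name `pass_at_k`, and the problem identifiers are assumptions for illustration and are not taken from the paper.

```python
from typing import Dict, List


def pass_at_k(results: Dict[str, List[bool]], k: int = 3) -> float:
    """Percentage of problems with at least one passing attempt among the first k.

    `results` maps a problem id to the pass/fail outcome of each attempt, in order.
    """
    if not results:
        raise ValueError("no problems to score")
    solved = sum(1 for attempts in results.values() if any(attempts[:k]))
    return 100.0 * solved / len(results)


# Hypothetical outcomes for four problems, three attempts each.
outcomes = {
    "aoc-example-part1": [False, True, False],
    "aoc-example-part2": [False, False, False],
    "euler-example-1": [True, True, True],
    "euler-example-2": [False, False, False],
}

print(f"Pass@1: {pass_at_k(outcomes, k=1):.2f}")
print(f"Pass@3: {pass_at_k(outcomes, k=3):.2f}")
```

Under this definition the score can only stay the same or rise as k grows, since extra attempts can only add solved problems.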
