SOTAVerified

Cumulative Reasoning with Large Language Models

2023-08-08 · Code Available

Yifan Zhang, Jingqin Yang, Yang Yuan, Andrew Chi-Chih Yao


Abstract

Despite the recent advancements in language models (LMs), their ability to solve complex problems remains limited. This paper introduces Cumulative Reasoning (CR), a novel approach that utilizes LMs cumulatively and iteratively, mirroring human thought processes for problem-solving. CR decomposes tasks into smaller, manageable components and leverages previous propositions for effective composition, significantly enhancing problem-solving capabilities. We demonstrate CR's superiority through several complex reasoning tasks: it outperforms existing methods in logical inference tasks with up to a 9.3% improvement, achieving 98.04% accuracy on the curated FOLIO wiki dataset. In the Game of 24, it achieves 98% accuracy, marking a 24% improvement over the prior state-of-the-art. Additionally, CR sets new state-of-the-art on the MATH dataset, achieving a 4.2% increase from previous methods and a 43% relative improvement in the most challenging problems. By extending CR to incorporate a code environment without external aids like retrieval or web browsing, we further harness the computational and logical reasoning capabilities of LMs, achieving a remarkable 72.2% accuracy on the MATH dataset and outperforming the PAL/PoT method by 38.8%. Our work not only sets new state-of-the-art but also paves the way toward more sophisticated AI reasoning methods. The code is available at https://github.com/iiis-ai/cumulative-reasoning.
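The abstract describes CR's core loop: decompose a problem, propose intermediate propositions, keep only those that verify, and stop once a conclusion can be reported. A minimal sketch of that loop is below; the `propose`, `verify`, and `report` callables stand in for the LLM roles, and all names here are illustrative assumptions rather than the authors' API (the real implementation is in the linked repository).

```python
def cumulative_reasoning(premises, propose, verify, report, max_steps=8):
    """Accumulate verified propositions until the reporter can conclude."""
    propositions = list(premises)           # the growing set of accepted facts
    for _ in range(max_steps):
        answer = report(propositions)       # reporter: try to conclude
        if answer is not None:
            return answer, propositions
        candidate = propose(propositions)   # proposer: derive a new proposition
        if candidate and verify(propositions, candidate):
            propositions.append(candidate)  # keep only verified derivations
    return None, propositions

# Toy logical-inference demo: derive "socrates is mortal" from two premises.
def propose(props):
    if "all men are mortal" in props and "socrates is a man" in props:
        return "socrates is mortal"
    return None

def verify(props, candidate):
    # A real verifier would be an LLM (or solver) checking entailment;
    # here we accept the one valid derivation.
    return candidate == "socrates is mortal"

def report(props):
    return "mortal" if "socrates is mortal" in props else None

answer, props = cumulative_reasoning(
    ["all men are mortal", "socrates is a man"], propose, verify, report)
```

The key difference from chain-of-thought is that every accepted proposition remains available for later composition, so the reasoning forms a growing DAG of verified facts rather than a single linear chain.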

Benchmark Results

| Dataset | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| MATH | CR (GPT-4-turbo, w/ code) | Accuracy | 72.2 | — | Unverified |
| MATH | CR (GPT-4, w/o code) | Accuracy | 58.0 | — | Unverified |

Reproductions