SOTAVerified

A Case Study of Web App Coding with OpenAI Reasoning Models

2024-09-19

Yi Cui


Abstract

This paper presents a case study of coding tasks performed by OpenAI's latest reasoning models, o1-preview and o1-mini, in comparison with other frontier models. The o1 models deliver SOTA results on WebApp1K, a single-task benchmark. To probe them further, we introduce WebApp1K-Duo, a harder benchmark that doubles the number of tasks and test cases. On the new benchmark, the o1 models' performance declines significantly, falling behind Claude 3.5. Moreover, they consistently fail when confronted with atypical yet correct test cases, a trap that non-reasoning models occasionally avoid. We hypothesize that this performance variability stems from instruction comprehension: the reasoning mechanism boosts performance when all expectations are captured, but exacerbates errors when key expectations are missed, an effect potentially aggravated by longer inputs. We therefore argue that the coding success of reasoning models hinges on a top-notch base model and SFT that ensure meticulous adherence to instructions.

Benchmark Results

| Dataset | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| WebApp1K-Duo-React | claude-3-5-sonnet | pass@1 | 0.68 | — | Unverified |
| WebApp1K-Duo-React | o1-mini | pass@1 | 0.67 | — | Unverified |
| WebApp1K-Duo-React | o1-preview | pass@1 | 0.65 | — | Unverified |
| WebApp1K-Duo-React | gpt-4o-2024-08-06 | pass@1 | 0.53 | — | Unverified |
| WebApp1K-Duo-React | deepseek-v2.5 | pass@1 | 0.49 | — | Unverified |
| WebApp1K-Duo-React | mistral-large-2 | pass@1 | 0.45 | — | Unverified |
| WebApp1K-React | o1-preview | pass@1 | 0.95 | — | Unverified |
| WebApp1K-React | o1-mini | pass@1 | 0.94 | — | Unverified |
| WebApp1K-React | deepseek-v2.5 | pass@1 | 0.83 | — | Unverified |
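The table above reports pass@1 scores. For readers reproducing these numbers, a minimal sketch of the standard unbiased pass@k estimator (Chen et al., 2021) is shown below; the function name and the sample counts in the example are illustrative, not taken from this paper's evaluation harness.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator.

    n: total samples generated per task
    c: samples that pass all test cases
    k: evaluation budget (k=1 for pass@1)
    """
    if n - c < k:
        # Every draw of k samples contains at least one correct one.
        return 1.0
    # 1 - P(all k drawn samples are incorrect)
    return 1.0 - comb(n - c, k) / comb(n, k)

# For k=1 this reduces to the plain fraction of passing samples:
print(pass_at_k(20, 19, 1))  # 0.95
```

A benchmark-level pass@1, as in the table, is then the mean of this quantity over all tasks.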
