StreamBench: Towards Benchmarking Continuous Improvement of Language Agents

2024-06-13Code Available2· sign in to hype

Cheng-Kuang Wu, Zhi Rui Tam, Chieh-Yen Lin, Yun-Nung Chen, Hung-Yi Lee

Code Available — Be the first to reproduce this paper.

Code

github.com/stream-bench/stream-bench
OfficialIn papernone★ 76

Abstract

Recent works have shown that large language model (LLM) agents are able to improve themselves from experience, which is an important ability for continuous enhancement post-deployment. However, existing benchmarks primarily evaluate their innate capabilities and do not assess their ability to improve over time. To address this gap, we introduce StreamBench, a pioneering benchmark designed to evaluate the continuous improvement of LLM agents over an input-feedback sequence. StreamBench simulates an online learning environment where LLMs receive a continuous flow of feedback stream and iteratively enhance their performance. In addition, we propose several simple yet effective baselines for improving LLMs on StreamBench, and provide a comprehensive analysis to identify critical components that contribute to successful streaming strategies. Our work serves as a stepping stone towards developing effective online learning strategies for LLMs, paving the way for more adaptive AI systems in streaming scenarios. Source code: https://github.com/stream-bench/stream-bench. Benchmark website: https://stream-bench.github.io.

Tasks

Benchmarking Language Modeling Language Modelling Large Language Model

StreamBench: Towards Benchmarking Continuous Improvement of Language Agents

Code

Abstract

Tasks

Reproductions