WebWalker: Benchmarking LLMs in Web Traversal

2025-01-13

Jialong Wu, Wenbiao Yin, Yong Jiang, Zhenglin Wang, Zekun Xi, Runnan Fang, Linhai Zhang, Yulan He, Deyu Zhou, Pengjun Xie, Fei Huang

Code

github.com/alibaba-nlp/webwalker
Official · In paper · ★ 18,542
github.com/alibaba-nlp/webagent
★ 18,522

Abstract

Retrieval-augmented generation (RAG) demonstrates remarkable performance across tasks in open-domain question-answering. However, traditional search engines may retrieve shallow content, limiting the ability of LLMs to handle complex, multi-layered information. To address this, we introduce WebWalkerQA, a benchmark designed to assess the ability of LLMs to perform web traversal. It evaluates the capacity of LLMs to systematically traverse a website's subpages and extract high-quality data. We propose WebWalker, a multi-agent framework that mimics human-like web navigation through an explore-critic paradigm. Extensive experimental results show that WebWalkerQA is challenging and demonstrate the effectiveness of combining RAG with WebWalker through horizontal and vertical integration in real-world scenarios.
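
The abstract does not describe how the explore-critic paradigm is implemented. As a rough illustration only, the sketch below shows what such a loop could look like on a toy in-memory site: one component walks links between subpages, and a second one stores useful observations and decides when enough has been gathered. Everything here is an assumption made for illustration: `TOY_SITE`, `Critic`, and `explore` are hypothetical names, the breadth-first link order stands in for an LLM choosing which link to click, and the keyword check stands in for an LLM judging whether the question can be answered. It is not the authors' code.

```python
from collections import deque
from dataclasses import dataclass, field

# Hypothetical in-memory "website": path -> (page text, outgoing links).
# A real system would fetch and parse live pages instead.
TOY_SITE = {
    "/": ("Welcome to the conference site.", ["/about", "/program"]),
    "/about": ("Organized by the example committee.", []),
    "/program": ("See the schedule for details.", ["/program/schedule"]),
    "/program/schedule": ("The keynote starts at 09:00 on day one.", []),
}


@dataclass
class Critic:
    """Stand-in for an LLM critic (assumption): keeps relevant page text in
    memory and judges whether the memory is sufficient to answer."""
    keywords: list
    memory: list = field(default_factory=list)

    def observe(self, text):
        # Store a page only if it mentions at least one query keyword.
        if any(k in text.lower() for k in self.keywords):
            self.memory.append(text)

    def can_answer(self):
        # Sufficient once every keyword is covered by the stored text.
        joined = " ".join(self.memory).lower()
        return all(k in joined for k in self.keywords)


def explore(site, start, critic, max_steps=10):
    """Stand-in for an LLM explorer (assumption): breadth-first traversal
    over subpage links, stopping as soon as the critic is satisfied."""
    queue = deque([start])
    visited = []
    while queue and len(visited) < max_steps:
        url = queue.popleft()
        if url in visited or url not in site:
            continue
        visited.append(url)
        text, links = site[url]
        critic.observe(text)
        if critic.can_answer():
            break
        queue.extend(links)
    return visited, critic.memory
```

With `Critic(keywords=["keynote"])` and a start page of `"/"`, the traversal has to descend two levels below the root before the critic finds the relevant text, which is the kind of deep, multi-layered lookup the benchmark is described as targeting.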

Tasks

Benchmarking, Open-Domain Question Answering, Question Answering, RAG, Retrieval, Retrieval-augmented Generation