Can Deep Research Agents Retrieve and Organize? Evaluating the Synthesis Gap with Expert Taxonomies
Ming Zhang, Jiabao Zhuang, Wenqing Jing, Kexin Tan, Ziyu Kong, Jingyi Deng, Yujiong Shen, Yuhang Zhao, Ning Luo, Renzhe Zheng, Jiahui Lin, Mingqi Wu, Long Ma, Shihan Dou, Tao Gui, Qi Zhang, Xuanjing Huang
Abstract
Deep Research Agents increasingly automate survey generation, yet it remains unclear whether they match human experts in two core abilities: retrieving essential papers and organizing them into expert-like taxonomies. Existing benchmarks emphasize writing quality or citation correctness, while standard clustering metrics fail to capture hierarchical taxonomy structure. We introduce TaxoBench, a benchmark built from 72 highly cited LLM surveys containing expert-authored taxonomy trees, with 3,815 papers mapped to their expert-assigned categories as ground truth. TaxoBench evaluates both abilities: (1) retrieval, measuring whether agents retrieve expert-cited papers; and (2) organization, assessed at two levels: the leaf level measures paper-to-category assignment, while the hierarchy level measures taxonomy structure via two novel metrics, Unordered Semantic Tree Edit Distance (US-TED/US-NTED) and Semantic Path Similarity (Sem-Path). TaxoBench supports two evaluation modes: Deep Research tests end-to-end capability given only a topic, while Bottom-Up provides the expert paper set to isolate organization ability. Evaluating 7 Deep Research Agents and 12 frontier LLMs reveals a dual bottleneck: the best agent retrieves only 20.92% of expert-cited papers, and even with perfect input, the best model achieves only 31.24% ARI alongside substantial structural gaps. Our benchmark is publicly available at https://github.com/KongLongGeFDU/TaxoBench.
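To make the leaf-level metric concrete, the following minimal sketch scores a toy paper-to-category assignment with the adjusted Rand index (ARI), the measure quoted in the abstract. The paper IDs, category names, and variable names are illustrative assumptions, not TaxoBench's implementation; only the use of ARI over the papers' category assignments follows the paper.

```python
# Minimal sketch of leaf-level scoring with ARI, assuming each paper has one
# expert category and one agent-assigned category. All identifiers here
# (expert_labels, agent_labels, paper/category names) are hypothetical.
from sklearn.metrics import adjusted_rand_score

# Ground-truth category for each paper in the expert taxonomy.
expert_labels = {"paper_a": "prompting", "paper_b": "prompting", "paper_c": "alignment"}
# Category the agent assigned to each of the same papers.
agent_labels = {"paper_a": "in-context", "paper_b": "alignment", "paper_c": "alignment"}

papers = sorted(expert_labels)  # score papers present in both assignments
gold = [expert_labels[p] for p in papers]
pred = [agent_labels[p] for p in papers]

# ARI compares the two partitions up to label renaming, corrected for chance.
print(f"ARI: {adjusted_rand_score(gold, pred):.4f}")
```

ARI is invariant to category names, which matters here: an agent's category labels rarely match the expert's verbatim, so only the induced partition of papers into groups is compared.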
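The abstract names but does not define Sem-Path, so the sketch below shows only one plausible reading: represent each taxonomy as its set of root-to-leaf label paths, match predicted paths to expert paths greedily by similarity, and average the matched scores. The string-overlap similarity (difflib) is a stand-in for whatever semantic similarity the paper actually uses, and every function and example here is hypothetical.

```python
# Hypothetical sketch of a path-based taxonomy similarity inspired by Sem-Path.
# difflib's ratio() is a placeholder for a real semantic similarity model.
from difflib import SequenceMatcher

def label_sim(a: str, b: str) -> float:
    """Placeholder similarity between two node labels."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def path_sim(p: tuple[str, ...], q: tuple[str, ...]) -> float:
    """Average label similarity over aligned depths, penalizing depth mismatch."""
    return sum(label_sim(a, b) for a, b in zip(p, q)) / max(len(p), len(q))

def sem_path_score(pred_paths: list, gold_paths: list) -> float:
    """Greedy one-to-one matching of predicted paths to gold paths."""
    remaining = list(gold_paths)
    total = 0.0
    for p in pred_paths:
        if not remaining:
            break
        best = max(remaining, key=lambda q: path_sim(p, q))
        total += path_sim(p, best)
        remaining.remove(best)
    # Dividing by the larger path count penalizes missing or extra branches.
    return total / max(len(pred_paths), len(gold_paths))

gold = [("LLM surveys", "prompting", "chain-of-thought"),
        ("LLM surveys", "alignment", "RLHF")]
pred = [("LLM surveys", "prompt engineering", "CoT"),
        ("LLM surveys", "safety", "RLHF")]
print(f"Sem-Path (sketch): {sem_path_score(pred, gold):.4f}")
```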