Test-Time Scaling in Reasoning Models Is Not Effective for Knowledge-Intensive Tasks Yet
James Xu Zhao, Bryan Hooi, See-Kiong Ng
Abstract
Test-time scaling increases inference-time computation by allowing models to generate long reasoning chains, and has improved performance across many domains. However, in this work, we show that this approach is not yet effective for knowledge-intensive tasks. We evaluate 14 reasoning models on two knowledge-intensive benchmarks and find that increasing test-time computation does not consistently improve accuracy and often increases hallucinations. Further analysis shows that changes in hallucination rates under increased test-time computation are largely driven by models' willingness to answer. We also observe that extended reasoning can induce confirmation bias, leading to overconfident hallucinations. Finally, we provide an information-theoretic account: compute-only test-time scaling is a post-processing of a fixed trained model and therefore cannot increase information about the ground-truth answer beyond what is already encoded in the model, explaining its limited gains on knowledge-intensive tasks. Code and data are available at https://github.com/XuZhao0/tts-knowledge
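The information-theoretic account sketched in the abstract can be made concrete via the data-processing inequality. The formulation below is a minimal illustration under standard assumptions (it is not taken verbatim from the paper): let $A$ denote the ground-truth answer, $\theta$ the fixed trained parameters, and $Y$ any output produced by a compute-only decoding procedure $g$ (e.g., longer reasoning chains, more samples) applied to the model's conditional distribution $p_\theta(\cdot \mid x)$ for a query $x$.

```latex
% Compute-only test-time scaling as post-processing of a fixed model.
% The output Y depends on the answer A only through the frozen model p_theta,
% forming the Markov chain  A -> p_theta(. | x) -> Y = g(p_theta(. | x)).
%
% By the data-processing inequality, no choice of decoding procedure g
% (longer chains, more samples, reranking) can increase the mutual
% information between the output and the ground-truth answer:
\[
  A \;\longrightarrow\; p_\theta(\cdot \mid x) \;\longrightarrow\; Y
  \quad\Longrightarrow\quad
  I\bigl(A;\, Y\bigr) \;\le\; I\bigl(A;\, p_\theta(\cdot \mid x)\bigr).
\]
% Hence accuracy on knowledge-intensive queries is bounded by the
% knowledge already encoded in theta; extra inference compute cannot
% inject facts the model never learned.
```

This framing explains why the gains from test-time scaling concentrate in domains where the model already encodes the needed information and the bottleneck is search or deliberation, rather than in tasks bottlenecked by missing knowledge.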