Test-Time Scaling in Reasoning Models Is Not Effective for Knowledge-Intensive Tasks Yet

2026-01-31

James Xu Zhao, Bryan Hooi, See-Kiong Ng

Abstract

Test-time scaling increases inference-time computation by allowing models to generate long reasoning chains, and has improved performance across many domains. However, in this work, we show that this approach is not yet effective for knowledge-intensive tasks. We evaluate 14 reasoning models on two knowledge-intensive benchmarks and find that increasing test-time computation does not consistently improve accuracy and often increases hallucinations. Further analysis shows that changes in hallucination rates under increased test-time computation are largely driven by models' willingness to answer. We also observe that extended reasoning can induce confirmation bias, leading to overconfident hallucinations. Finally, we provide an information-theoretic account: compute-only test-time scaling is a post-processing of a fixed trained model and therefore cannot increase information about the ground-truth answer beyond what is already encoded in the model, explaining its limited gains on knowledge-intensive tasks. Code and data are available at https://github.com/XuZhao0/tts-knowledge
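
The information-theoretic argument in the abstract resembles the data-processing inequality: if the final output is a function of a fixed model state, it cannot carry more information about the true answer than that state already does. The following is a minimal numerical sketch of that inequality on a toy distribution, not the paper's formalization. The names `joint_ym`, `reason`, `mutual_information`, and `push_forward` are hypothetical, and the probabilities are made up for illustration.

```python
import math
from collections import defaultdict

def mutual_information(joint):
    """I(A;B) in bits for a joint distribution given as {(a, b): probability}."""
    pa, pb = defaultdict(float), defaultdict(float)
    for (a, b), p in joint.items():
        pa[a] += p
        pb[b] += p
    return sum(p * math.log2(p / (pa[a] * pb[b]))
               for (a, b), p in joint.items() if p > 0)

def push_forward(joint, g):
    """Joint distribution of (Y, g(M)) obtained by post-processing M with g."""
    out = defaultdict(float)
    for (y, m), p in joint.items():
        out[(y, g(m))] += p
    return dict(out)

# Toy joint distribution over (true answer Y, model state M).
# Assumption: these numbers are illustrative only, not taken from the paper.
joint_ym = {
    ("paris", "m0"): 0.30, ("paris", "m1"): 0.15, ("paris", "m2"): 0.05,
    ("rome", "m0"): 0.05, ("rome", "m1"): 0.15, ("rome", "m2"): 0.30,
}

def reason(m):
    """Hypothetical deterministic 'reasoning' step mapping a model state to an output."""
    return {"m0": "paris", "m1": "paris", "m2": "rome"}[m]

i_model = mutual_information(joint_ym)
i_output = mutual_information(push_forward(joint_ym, reason))
print(f"I(Y; M) = {i_model:.4f} bits, I(Y; g(M)) = {i_output:.4f} bits")
```

Whatever mapping is substituted for `reason`, the second quantity never exceeds the first. The sketch uses a deterministic mapping for simplicity; the inequality also holds when the post-processing draws on randomness that is independent of the true answer, as with sampled reasoning chains.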