SOTAVerified

Right Question is Already Half the Answer: Fully Unsupervised LLM Reasoning Incentivization

2025-04-08 · Code Available

Qingyang Zhang, Haitao Wu, Changqing Zhang, Peilin Zhao, Yatao Bian


Abstract

Existing methods to enhance the reasoning capability of large language models predominantly rely on supervised fine-tuning (SFT) followed by reinforcement learning (RL) on reasoning-specific data. These approaches critically depend on external supervision, such as labeled reasoning traces, verified golden answers, or pre-trained reward models. In this work, we propose Entropy Minimized Policy Optimization (EMPO), which makes an early attempt at fully unsupervised LLM reasoning incentivization. By continuously minimizing the predictive entropy of LLMs on unlabeled questions in a latent semantic space, EMPO achieves competitive performance compared to supervised counterparts on both mathematical and free-form natural reasoning tasks. Specifically, without any supervised signals, EMPO boosts the accuracy of Qwen2.5-Math-7B Base from 30.7% to 48.1% on mathematical benchmarks and improves the accuracy of Qwen2.5-7B Base from 32.1% to 50.1% on MMLU-Pro. Preliminary experiments and analysis are also provided to interpret the effectiveness of EMPO. Code is available at https://github.com/QingyangZhang/EMPO.
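The core idea of minimizing predictive entropy over semantic clusters can be illustrated with a minimal sketch. This is an assumed reading of the objective, not the authors' implementation: sampled answers are grouped into clusters (here by exact string match as a stand-in for the latent semantic equivalence check described in the abstract), the empirical entropy over clusters is computed, and each sample is rewarded by its cluster's probability so that reinforcing majority-consistent answers drives entropy down.

```python
from collections import Counter
import math

def semantic_entropy(answers):
    """Empirical entropy over semantic clusters of sampled answers.

    Answers are clustered by exact string match here as a simple
    stand-in; a real system would use an embedding- or verifier-based
    semantic equivalence check."""
    counts = Counter(answers)
    n = len(answers)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def cluster_rewards(answers):
    """Assign each sample the empirical probability of its cluster.

    Using this as an RL reward pushes the policy toward its own
    majority cluster, which lowers the predictive entropy over
    semantic clusters (a hypothetical reward shaping consistent
    with entropy minimization)."""
    counts = Counter(answers)
    n = len(answers)
    return [counts[a] / n for a in answers]

# Four sampled answers to one unlabeled question:
samples = ["42", "42", "41", "42"]
print(round(semantic_entropy(samples), 3))  # entropy of {3/4, 1/4}
print(cluster_rewards(samples))             # majority cluster rewarded more
```

In this toy example the majority answer "42" receives reward 0.75 and the minority answer 0.25, so a policy-gradient update would concentrate probability mass and shrink the entropy on subsequent rollouts.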
