From log π to π: Taming Divergence in Soft Clipping via Bilateral Decoupled Decay of Probability Gradient Weight

2026-03-15

Xiaoliang Fu, Jiaye Lin, Yangyi Fang, Chaowen Hu, Cong Qin, Zekai Shao, Binbin Zheng, Lu Pan, Ke Zeng


Code

github.com/venomrose-juri/dgpo-rl

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has catalyzed a leap in Large Language Model (LLM) reasoning, yet its optimization dynamics remain fragile. Standard algorithms like GRPO enforce stability via "hard clipping", which inadvertently stifles exploration by discarding the gradients of tokens outside the trust region. While recent "soft clipping" methods attempt to recover these gradients, they suffer from a critical challenge: relying on the log-probability gradient ($\nabla_\theta \log \pi_\theta$) yields divergent weights as probabilities vanish, destabilizing LLM training. We rethink this convention by establishing the probability gradient ($\nabla_\theta \pi_\theta$) as the superior optimization primitive. Accordingly, we propose Decoupled Gradient Policy Optimization (DGPO), which employs a decoupled decay mechanism based on importance sampling ratios. By applying asymmetric, continuous decay to boundary tokens, DGPO resolves the conflict between stability and sustained exploration. Extensive experiments across DeepSeek-R1-Distill-Qwen series models (1.5B/7B/14B) demonstrate that DGPO consistently outperforms strong baselines on various mathematical benchmarks, offering a robust and scalable solution for RLVR. Our code and implementation are available at: https://github.com/VenomRose-Juri/DGPO-RL.
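The divergence mentioned in the abstract can be illustrated with the standard log-derivative identity. The sketch below is not taken from the paper: the symbols $a$, $s$, and the generic per-token weight $w$ are assumptions introduced here for illustration, and the paper's actual weighting scheme may differ.

```latex
% Log-derivative identity (standard calculus, not specific to DGPO):
\nabla_\theta \log \pi_\theta(a \mid s)
  = \frac{\nabla_\theta \pi_\theta(a \mid s)}{\pi_\theta(a \mid s)}

% Hence a weight w on the log-probability gradient is equivalent to
% a weight w / \pi_\theta on the probability gradient
% (w is a hypothetical per-token weight used only for illustration):
w \, \nabla_\theta \log \pi_\theta(a \mid s)
  = \frac{w}{\pi_\theta(a \mid s)} \, \nabla_\theta \pi_\theta(a \mid s)

% With w held fixed and nonzero, the effective weight is unbounded:
\lim_{\pi_\theta(a \mid s) \to 0^{+}} \frac{|w|}{\pi_\theta(a \mid s)} = \infty
```

Read this way, the effective weight on $\nabla_\theta \pi_\theta$ stays bounded only if $w$ decays at least as fast as $\pi_\theta$ itself, which is consistent with the abstract's motivation for designing the decay directly on the probability gradient weight.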
