Probability-Consistent Preference Optimization for Enhanced LLM Reasoning

2025-05-29

Yunqiao Yang, Houxing Ren, Zimu Lu, Ke Wang, Weikang Shi, Aojun Zhou, Junting Pan, Mingjie Zhan, Hongsheng Li

Code

github.com/yunqiaoyang/pcpo
Official, in paper, PyTorch, ★ 3

Abstract

Recent advances in preference optimization have demonstrated significant potential for improving mathematical reasoning capabilities in large language models (LLMs). While current approaches build high-quality pairwise preference data using outcome-based criteria such as answer correctness or consistency, they neglect the internal logical coherence of responses. To overcome this, we propose Probability-Consistent Preference Optimization (PCPO), a novel framework that establishes dual quantitative metrics for preference selection: (1) surface-level answer correctness and (2) intrinsic token-level probability consistency across responses. Extensive experiments show that PCPO consistently outperforms existing approaches that rely on outcome-only criteria across a diverse range of LLMs and benchmarks. Our code is publicly available at https://github.com/YunqiaoYang/PCPO.
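
To make the dual-criterion idea concrete, the following is a minimal sketch of selecting one preference pair from sampled responses. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not define the probability-consistency metric, so `pair_score` is a hypothetical stand-in (closeness of average token probability), and `Response`, `mean_token_prob`, and `select_pair` are names invented for this example.

```python
from dataclasses import dataclass
from itertools import product
from math import exp, log


@dataclass
class Response:
    text: str
    answer: str                  # final answer extracted from the response
    token_logprobs: list[float]  # per-token log-probabilities under the policy model


def mean_token_prob(resp: Response) -> float:
    """Average per-token probability of a response (0.0 if it has no tokens)."""
    if not resp.token_logprobs:
        return 0.0
    return sum(exp(lp) for lp in resp.token_logprobs) / len(resp.token_logprobs)


def pair_score(chosen: Response, rejected: Response) -> float:
    """Hypothetical placeholder for a probability-consistency score.

    ASSUMPTION: this is not the paper's metric. It simply scores a pair
    higher when the two responses have similar average token probability.
    """
    return -abs(mean_token_prob(chosen) - mean_token_prob(rejected))


def select_pair(responses: list[Response], gold_answer: str):
    """Pick (chosen, rejected) using both criteria.

    Criterion 1 (answer correctness): chosen must be correct, rejected wrong.
    Criterion 2 (stand-in consistency): keep the highest-scoring such pair.
    Returns None when no correct/incorrect pair exists.
    """
    correct = [r for r in responses if r.answer == gold_answer]
    wrong = [r for r in responses if r.answer != gold_answer]
    if not correct or not wrong:
        return None
    return max(product(correct, wrong), key=lambda pair: pair_score(*pair))


# Toy usage with made-up log-probabilities.
demo = [
    Response("a", "42", [log(0.9), log(0.8)]),  # correct, mean prob 0.85
    Response("b", "41", [log(0.5)]),            # wrong, mean prob 0.50
    Response("c", "40", [log(0.8)]),            # wrong, mean prob 0.80
]
chosen, rejected = select_pair(demo, gold_answer="42")
print(chosen.text, rejected.text)  # prints "a c"
```

The point of the sketch is the structure: correctness filters which responses may be chosen or rejected, and a second, probability-based score decides which of the remaining pairs to keep. Swapping in the metric from the official repository would only change `pair_score`.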

Tasks

Mathematical Reasoning
