
β-DPO: Direct Preference Optimization with Dynamic β

2024-07-11

Junkang Wu, Yuexiang Xie, Zhengyi Yang, Jiancan Wu, Jinyang Gao, Bolin Ding, Xiang Wang, Xiangnan He


Abstract

Direct Preference Optimization (DPO) has emerged as a compelling approach for training Large Language Models (LLMs) to adhere to human preferences. However, the performance of DPO is sensitive to the fine-tuning of its trade-off parameter β, as well as to the quality of the preference data. We analyze the impact of β and data quality on DPO, uncovering that optimal β values vary with the informativeness of pairwise data. Addressing the limitations of static β values, we introduce a novel framework that dynamically calibrates β at the batch level, informed by data quality considerations. Additionally, our method incorporates β-guided data filtering to safeguard against the influence of outliers. Through empirical evaluation, we demonstrate that our dynamic adjustment technique significantly improves DPO's performance across a range of models and datasets, offering a more robust and adaptable training paradigm for aligning LLMs with human feedback. The code is available at https://github.com/junkangwu/beta-DPO.
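To make the two ideas in the abstract concrete, here is a minimal, dependency-free sketch of (a) the per-pair DPO loss parameterized by β, (b) a batch-level dynamic β, and (c) a simple outlier filter on reward margins. The specific functional forms, parameter names (`beta0`, `alpha`, `m0`, `k`), and default values are illustrative assumptions, not the paper's exact formulation; see the linked repository for the authors' implementation.

```python
import math

def dpo_loss(beta, margin):
    # Per-pair DPO loss: -log sigmoid(beta * margin), where `margin` is
    # the implicit reward gap between the chosen and rejected response.
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

def dynamic_beta(margins, beta0=0.1, alpha=0.5, m0=0.0):
    # Batch-level beta (hypothetical form): scale a base beta0 by how the
    # batch's mean reward margin deviates from a reference level m0.
    # More informative batches (larger margins) get a larger beta.
    batch_mean = sum(margins) / len(margins)
    return beta0 * (1.0 + alpha * (batch_mean - m0))

def filter_outliers(margins, k=1.5):
    # Beta-guided filtering sketch: drop pairs whose margin lies more
    # than k standard deviations from the batch mean, so outlier pairs
    # do not distort the batch-level beta or the gradient.
    mean = sum(margins) / len(margins)
    std = math.sqrt(sum((m - mean) ** 2 for m in margins) / len(margins)) or 1e-8
    return [m for m in margins if abs(m - mean) <= k * std]
```

In a training loop one would filter the batch's margins first, compute the dynamic β from the surviving pairs, and then average `dpo_loss(beta, m)` over them.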
