Revisiting Self-Play Preference Optimization: On the Role of Prompt Difficulty
Yao Xiao, Jung-jae Kim, Roy Ka-wei Lee, Lidong Bing · Oct 7, 2025
- Self-play preference optimization has emerged as a prominent paradigm for aligning large language models (LLMs).
- It typically involves a language model generating on-policy responses to prompts and a reward model (RM) guiding the selection of chosen and rejected responses, on which the model can then be further trained with direct preference optimization (DPO).
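The sample-then-select step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `build_preference_pair` and the toy reward model are hypothetical names, and a real RM would be a learned scorer rather than a length heuristic.

```python
def build_preference_pair(prompt, responses, reward_model):
    """Score on-policy responses with the RM and return a (chosen, rejected) pair."""
    scored = sorted(responses, key=lambda r: reward_model(prompt, r))
    # Highest-reward response becomes "chosen", lowest becomes "rejected".
    return scored[-1], scored[0]

# Toy stand-in reward model: prefers longer responses (illustration only).
def toy_rm(prompt, response):
    return len(response)

chosen, rejected = build_preference_pair(
    "Explain DPO.",
    ["Short.", "A much longer, more detailed answer."],
    toy_rm,
)
```

The resulting (chosen, rejected) pairs form the preference dataset that DPO consumes in each self-play round.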