OpenTrain Research Tools

Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Revisiting Self-Play Preference Optimization: On the Role of Prompt Difficulty

Yao Xiao, Jung-jae Kim, Roy Ka-wei Lee, Lidong Bing · Oct 7, 2025

Citations: 0
Tags: Pairwise Preference · Automatic Metrics · General
  • Self-play preference optimization has emerged as a prominent paradigm for aligning large language models (LLMs).
  • It typically uses a language model to generate on-policy responses to prompts and a reward model (RM) to select the chosen and rejected responses, which are then used to train the model with direct preference optimization (DPO); see the sketch below.
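
The data-construction loop described above can be made concrete with a short sketch. This is a minimal illustration, not the authors' implementation: `generate`, `reward_model`, and the best-vs-worst pairing rule are hypothetical stand-ins for whatever sampler, scorer, and selection strategy a given setup uses.

```python
# Minimal sketch of the self-play preference-data loop: sample k on-policy
# responses per prompt, score them with a reward model, and keep the
# highest- and lowest-scoring responses as a (chosen, rejected) DPO pair.
from typing import Callable, List, Tuple

def build_dpo_pairs(
    prompts: List[str],
    generate: Callable[[str, int], List[str]],   # hypothetical: samples k responses
    reward_model: Callable[[str, str], float],   # hypothetical: scores (prompt, response)
    k: int = 4,
) -> List[Tuple[str, str, str]]:
    """Return (prompt, chosen, rejected) triples for DPO training."""
    pairs = []
    for prompt in prompts:
        responses = generate(prompt, k)
        scored = sorted(responses, key=lambda r: reward_model(prompt, r))
        # Best-vs-worst is one common pairing rule; other strategies exist.
        pairs.append((prompt, scored[-1], scored[0]))
    return pairs
```

The resulting triples feed directly into a standard DPO training step, with the chosen response treated as preferred over the rejected one.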
