Duel-Evolve: Reward-Free Test-Time Scaling via LLM Self-Preferences
Sweta Karlekar, Carolina Zheng, Magnus Saebo, Nicolas Beltran-Velez, Shuyang Yu, John Bowlan · Feb 25, 2026
Citations: 0
Pairwise Preference Automatic Metrics Math
- Building on this observation, we introduce Duel-Evolve, an evolutionary optimization algorithm that replaces external scalar rewards with pairwise preferences elicited from the same LLM used to generate candidates.
- Results show that pairwise self-preferences provide strong optimization signal for test-time improvement over large, discrete output spaces.