Orthogonalized Policy Optimization:Policy Optimization as Orthogonal Projection in Hilbert Space
Wang Zixian · Jan 18, 2026 · Citations: 0
Abstract
We propose Orthogonalized Policy Optimization (OPO), a principled framework for large language model alignment derived from optimization in the Hilbert function space L2(pi_k). Lifting policy updates from the probability simplex into L2(pi_k) transforms the nonlinear normalization constraint into a linear orthogonality condition <v, 1>_{pi_k} = 0 on the density fluctuation field v = pi/pi_k - 1. By the Hilbert projection theorem, the unique closed-form update is v_star = (omega_alpha - E[omega_alpha]) / mu, where the subtracted mean acts as a chemical potential enforcing probability conservation. This interpretation reveals advantage z-score normalization as a conservation-law projection rather than a variance-reduction heuristic. OPO cleanly decouples sampling geometry, controlled by the escort exponent alpha, from optimization geometry, governed by the stiffness parameter mu, a separation not attainable under KL-based objectives. The same update can also be derived as a Euclidean mirror-descent step and as the linear-response law of near-equilibrium statistical mechanics, establishing its structural uniqueness within ratio geometry. Structurally, OPO induces constant curvature, non-saturating linear gradient dynamics, and an intrinsic chi-square trust region. Experiments on MATH benchmarks show that the Hilbert projection formulation prevents gradient saturation typical of KL-constrained methods. By sustaining non-vanishing gradients in high-confidence regimes, OPO avoids premature plateaus and achieves stronger long-horizon training rewards and improved out-of-distribution generalization compared to clipping-based baselines.