Skip to content
← Back to explorer

Orthogonalized Policy Optimization:Policy Optimization as Orthogonal Projection in Hilbert Space

Wang Zixian · Jan 18, 2026 · Citations: 0

Abstract

We propose Orthogonalized Policy Optimization (OPO), a principled framework for large language model alignment derived from optimization in the Hilbert function space L2(pi_k). Lifting policy updates from the probability simplex into L2(pi_k) transforms the nonlinear normalization constraint into a linear orthogonality condition <v, 1>_{pi_k} = 0 on the density fluctuation field v = pi/pi_k - 1. By the Hilbert projection theorem, the unique closed-form update is v_star = (omega_alpha - E[omega_alpha]) / mu, where the subtracted mean acts as a chemical potential enforcing probability conservation. This interpretation reveals advantage z-score normalization as a conservation-law projection rather than a variance-reduction heuristic. OPO cleanly decouples sampling geometry, controlled by the escort exponent alpha, from optimization geometry, governed by the stiffness parameter mu, a separation not attainable under KL-based objectives. The same update can also be derived as a Euclidean mirror-descent step and as the linear-response law of near-equilibrium statistical mechanics, establishing its structural uniqueness within ratio geometry. Structurally, OPO induces constant curvature, non-saturating linear gradient dynamics, and an intrinsic chi-square trust region. Experiments on MATH benchmarks show that the Hilbert projection formulation prevents gradient saturation typical of KL-constrained methods. By sustaining non-vanishing gradients in high-confidence regimes, OPO avoids premature plateaus and achieves stronger long-horizon training rewards and improved out-of-distribution generalization compared to clipping-based baselines.

Human Data Lens

  • Uses human feedback: No
  • Feedback types: None
  • Rater population: Unknown
  • Unit of annotation: Unknown
  • Expertise required: Math, Law

Evaluation Lens

  • Evaluation modes: Automatic Metrics
  • Agentic eval: Long Horizon
  • Quality controls: Not reported
  • Confidence: 0.50
  • Flags: ambiguous

Research Summary

Contribution Summary

  • We propose Orthogonalized Policy Optimization (OPO), a principled framework for large language model alignment derived from optimization in the Hilbert function space L2(pi_k).
  • Lifting policy updates from the probability simplex into L2(pi_k) transforms the nonlinear normalization constraint into a linear orthogonality condition <v, 1>_{pi_k} = 0 on the density fluctuation field v = pi/pi_k - 1.
  • By the Hilbert projection theorem, the unique closed-form update is v_star = (omega_alpha - E[omega_alpha]) / mu, where the subtracted mean acts as a chemical potential enforcing probability conservation.

Why It Matters For Eval

  • Experiments on MATH benchmarks show that the Hilbert projection formulation prevents gradient saturation typical of KL-constrained methods.

Related Papers