A Proof of Learning Rate Transfer under $μ$P

Soufiane Hayou · Nov 3, 2025 · Citations: 0

Abstract

We provide the first proof of learning rate transfer with width in a linear multi-layer perceptron (MLP) parametrized with $μ$P, a neural network parametrization designed to "maximize" feature learning in the infinite-width limit. We show that under $μ$P, the optimal learning rate converges to a non-zero constant as width goes to infinity, providing a theoretical explanation for learning rate transfer. In contrast, we show that this property fails to hold under alternative parametrizations such as Standard Parametrization (SP) and Neural Tangent Parametrization (NTP). We provide intuitive proofs and support the theoretical findings with extensive empirical results.

Human Data Lens

Uses human feedback: No
Feedback types: None
Rater population: Unknown
Unit of annotation: Unknown
Expertise required: Math

Evaluation Lens

Evaluation modes: Automatic Metrics
Agentic eval: None
Quality controls: Not reported
Confidence: 0.30
Flags: low_signal, possible_false_positive

Research Summary

Contribution Summary

We provide the first proof of learning rate transfer with width in a linear multi-layer perceptron (MLP) parametrized with $μ$P, a neural network parametrization designed to "maximize" feature learning in the infinite-width limit.
We show that under $μ$P, the optimal learning rate converges to a non-zero constant as width goes to infinity, providing a theoretical explanation for learning rate transfer.
In contrast, we show that this property fails to hold under alternative parametrizations such as Standard Parametrization (SP) and Neural Tangent Parametrization (NTP).
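
To make the contrast between $μ$P and SP concrete, the following is a minimal toy sketch and not the paper's experimental setup. It assumes a two-layer linear network $f(x) = v^\top W x$, a single training example, squared loss, and one step of gradient descent. The $μ$P branch uses one common SGD formulation (output weights initialized with variance $1/n^2$, input-layer learning rate scaled by width $n$, output-layer learning rate scaled by $1/n$); the SP branch uses $1/\text{fan-in}$ initialization with one global learning rate. All function names, the input dimension `d=8`, the grid, and the widths are illustrative assumptions.

```python
import numpy as np


def loss_after_one_step(width, lr, scheme, seed, d=8):
    """Squared loss of f(x) = v^T W x after one gradient step on a single example.

    Toy setting (assumption): fixed input x = ones(d), target y = 1.
    """
    rng = np.random.default_rng(seed)
    x = np.ones(d)
    y = 1.0
    # Input layer: variance 1/fan_in in both schemes.
    W = rng.normal(0.0, 1.0 / np.sqrt(d), size=(width, d))
    if scheme == "mup":
        # Assumed muP-for-SGD scaling: output init std 1/n,
        # input-layer lr multiplied by n, output-layer lr divided by n.
        v = rng.normal(0.0, 1.0 / width, size=width)
        lr_W, lr_v = lr * width, lr / width
    elif scheme == "sp":
        # Standard parametrization: output init std 1/sqrt(n), one global lr.
        v = rng.normal(0.0, 1.0 / np.sqrt(width), size=width)
        lr_W, lr_v = lr, lr
    else:
        raise ValueError(f"unknown scheme: {scheme}")

    h = W @ x                      # hidden representation
    r = v @ h - y                  # residual f(x) - y
    grad_W = r * np.outer(v, x)    # dL/dW for L = 0.5 * r**2
    grad_v = r * h                 # dL/dv
    W_new = W - lr_W * grad_W
    v_new = v - lr_v * grad_v
    return 0.5 * (v_new @ (W_new @ x) - y) ** 2


def optimal_lr(width, scheme, lrs, seeds=range(5)):
    """Grid-search the lr that minimizes the seed-averaged post-step loss."""
    mean_loss = [
        np.mean([loss_after_one_step(width, lr, scheme, s) for s in seeds])
        for lr in lrs
    ]
    return float(lrs[int(np.argmin(mean_loss))])


LRS = np.logspace(-5, 0, 101)      # log-spaced learning-rate grid
WIDTHS = [256, 1024, 4096]
OPT = {
    scheme: [optimal_lr(n, scheme, LRS) for n in WIDTHS]
    for scheme in ("mup", "sp")
}

for scheme, values in OPT.items():
    print(scheme, dict(zip(WIDTHS, values)))
```

In this toy setting the grid-searched optimum under the assumed $μ$P scaling should stay roughly constant across widths, while under SP it should shrink roughly in proportion to $1/n$, which mirrors the qualitative claim above without reproducing the paper's multi-layer analysis.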
