Decoupled Weight Decay Regularization
Ilya Loshchilov, Frank Hutter
No strong AI-core implementation/artifact signals were detected from current providers.
L$_2$ regularization and weight decay regularization are equivalent for standard stochastic gradient descent (when rescaled by the learning rate), but as we demonstrate this is not the case for adaptive gradient algorithms, such as Adam. While common implementations of these algorithms employ L$_2$ regularization (often calling it "weight decay" in what may be misleading due to the inequivalence we expose), we propose a simple modification to recover the original formulation of weight decay regularization by decoupling the weight decay from the optimization steps taken w.r.t. the loss function. We provide empirical evidence that our proposed modification (i) decouples the optimal choice of weight decay factor from the setting of the learning rate for both standard SGD and Adam and (ii) substantially improves Adam's generalization performance, allowing it to compete with SGD with momentum on image classification datasets (on which it was previously typically outperformed by the latter). Our proposed decoupled weight decay has already been adopted by many researchers, and the community has implemented it in TensorFlow and PyTorch; the complete source code for our experiments is available at https://github.com/loshchil/AdamW-and-SGDW
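The equivalence claim in the abstract can be checked numerically on a single scalar weight. The following is a minimal sketch, not the authors' code: the function names, the scalar setup, and the example values are assumptions chosen for illustration, and the schedule multiplier used in the paper is taken to be 1. It shows that an L$_2$ penalty and weight decay give the same SGD step when the decay factor is the L$_2$ coefficient times the learning rate, while the same rescaling does not make them match for an Adam-style step.

```python
import math

def sgd_l2_step(w, grad, lr, l2):
    """SGD on loss + (l2 / 2) * w**2: the penalty gradient l2 * w is added to grad."""
    return w - lr * (grad + l2 * w)

def sgd_decay_step(w, grad, lr, decay):
    """SGD with weight decay: shrink w by (1 - decay), then apply the loss gradient."""
    return (1.0 - decay) * w - lr * grad

def adam_step(w, grad, m, v, t, lr, l2=0.0, decay=0.0,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One scalar Adam step.

    l2 is coupled: it is added to the gradient and so passes through the
    adaptive normalization. decay is decoupled: it is subtracted from the
    weight directly, outside the adaptive update.
    """
    g = grad + l2 * w
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)   # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)   # bias-corrected second moment
    w_new = w - lr * m_hat / (math.sqrt(v_hat) + eps) - decay * w
    return w_new, m, v

# Illustrative values (assumptions, not taken from the paper).
lr, l2 = 0.1, 0.01

# SGD: the two forms agree once the decay factor is rescaled to lr * l2.
sgd_l2 = sgd_l2_step(2.0, 0.5, lr, l2)
sgd_decay = sgd_decay_step(2.0, 0.5, lr, decay=lr * l2)

# Adam: the same rescaling does not make the two forms agree.
adam_l2, _, _ = adam_step(4.0, 0.5, 0.0, 0.0, 1, lr, l2=l2)
adam_decoupled, _, _ = adam_step(4.0, 0.5, 0.0, 0.0, 1, lr, decay=lr * l2)

print("SGD  L2 vs decay:    ", sgd_l2, sgd_decay)
print("Adam L2 vs decoupled:", adam_l2, adam_decoupled)
```

In the coupled Adam step the penalty gradient is divided by the same running magnitude estimate as the loss gradient, so on the first step the update has size close to the learning rate regardless of the L$_2$ coefficient. The decoupled step shrinks the weight in proportion to its value, independent of the gradient statistics.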
Results & Benchmarks
No concrete benchmark grounding is available yet. Treat the page as context or an implementation starting point only.
Implementation Evidence Summary
This is primarily a method paper. Reproduce it within a maintained framework baseline instead of chasing paper-specific repos.
Reproduction Risks
- No maintained paper-verified implementation is currently available
Evidence disclosure
Evidence graph: 2 refs, 1 link.
Utility signals: depth 60/100, grounding 58/100, status medium.
Implementation Status
There is no verified maintained implementation yet. Use this baseline plan to decide whether to prototype now or defer.
- Start with framework-native implementations (e.g. PyTorch optimizer module, Optax, or Transformers training loops).
- Replicate the paper ablation settings first, then compare against modern baselines.
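As a starting point for the framework-native route, the sketch below contrasts PyTorch's `torch.optim.Adam` (whose `weight_decay` argument adds the decay term to the gradient, i.e. L$_2$ style) with `torch.optim.AdamW` (which shrinks the weights directly). The helper `one_step` and the numeric values are assumptions for illustration only. The loss gradient is set to zero so that only the decay term acts. Note that PyTorch's `AdamW` scales the decay by the learning rate, which is a parameterization choice of that library and may differ from the paper's notation.

```python
import torch

def one_step(optimizer_cls, start=4.0, lr=0.1, weight_decay=0.5):
    """Take one optimizer step on a single weight whose loss gradient is zero."""
    w = torch.nn.Parameter(torch.tensor([start]))
    opt = optimizer_cls([w], lr=lr, weight_decay=weight_decay)
    w.grad = torch.zeros_like(w)  # zero loss gradient isolates the decay term
    opt.step()
    return w.item()

# Coupled: the decay term enters the gradient and is then normalized by Adam,
# so the first step has size close to lr.
coupled = one_step(torch.optim.Adam)

# Decoupled: the weight is multiplied by (1 - lr * weight_decay); the adaptive
# part of the update is zero because the loss gradient is zero.
decoupled = one_step(torch.optim.AdamW)

print("Adam  (coupled L2):     ", coupled)
print("AdamW (decoupled decay):", decoupled)
```

Swapping `Adam` for `AdamW` in an existing training loop changes how `weight_decay` is interpreted, so a decay value tuned for one should be re-tuned for the other rather than copied over.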
Reproduction readiness
No verified implementation available
- No maintained repository has been identified for this paper. Check adjacent implementations or HF artifacts below.
No benchmark numbers could be verified. You will not be able to validate reproduction correctness against published numbers.
Framework baselines
- PyTorch Adam optimizer docs
Reference implementation of Adam in PyTorch.
- Optax Adam optimizer docs
JAX/Flax baseline for Adam variants.
- Keras Adam optimizer docs
TensorFlow/Keras baseline for Adam.
Hugging Face artifacts
No trustworthy direct or curated related Hugging Face artifacts were found yet.
Research context
Citations: 9,105
References: 0
Tasks
Regularization (linguistics), Computer science, Engineering, Computational Mechanics, Physical Sciences
Methods
None detected
Domains
Physics, Mathematics
Related papers
- Using the Potential of Social Partners in the Training of Future Teachers (2024). Semantic similarity
- Identification of Multiple Bearing Fault Modes (Multi-Faults) Using Envelope Analysis on Horizontal-Axis Wind Turbines (2018). Semantic similarity
- Analysis of the Accounting Treatment of Defective Products at PT. Ajinomoto Indonesia Mojokerto Factory (2017). Semantic similarity
- Differences in Employee Performance Before and After the Implementation of Job Descriptions (2001). Semantic similarity
- Evaluation of the Accounting Treatment of the Cost of Defective Products at PT. Muliaprima Replicatama Semarang (2000). Semantic similarity