How reproducible is "Decoupled Weight Decay Regularization"?

Estimated time to first reproduction: a few hours. Risk flags: No maintained paper-verified implementation is currently available. This is primarily a method paper. Reproduce it within a maintained framework baseline instead of chasing paper-specific repos.

What framework is used to implement "Decoupled Weight Decay Regularization"?

The primary framework baseline is PyTorch; see the PyTorch Adam optimizer docs.

Decoupled Weight Decay Regularization

Ilya Loshchilov, Frank Hutter

Published: Nov 14, 2017

No direct implementation yet

Evidence: Inferred

Domain fit: Niche / domain-specific

Verified repos: 0

No strong AI-core implementation/artifact signals were detected from current providers.

Framework: PyTorch Adam optimizer docs

Time to first repro: a few hours

1 risk flag


L2 regularization and weight decay regularization are equivalent for standard stochastic gradient descent (when rescaled by the learning rate), but as we demonstrate this is not the case for adaptive gradient algorithms, such as Adam. While common implementations of these algorithms employ L2 regularization (often calling it "weight decay" in what may be misleading due to the inequivalence we expose), we propose a simple modification to recover the original formulation of weight decay regularization by decoupling the weight decay from the optimization steps taken w.r.t. the loss function. We provide empirical evidence that our proposed modification (i) decouples the optimal choice of weight decay factor from the setting of the learning rate for both standard SGD and Adam and (ii) substantially improves Adam's generalization performance, allowing it to compete with SGD with momentum on image classification datasets (on which it was previously typically outperformed by the latter). Our proposed decoupled weight decay has already been adopted by many researchers, and the community has implemented it in TensorFlow and PyTorch; the complete source code for our experiments is available at https://github.com/loshchil/AdamW-and-SGDW
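The SGD equivalence claimed in the abstract can be written out in two lines. This is a sketch in notation assumed here rather than taken from this page: \alpha is the learning rate, \lambda the weight decay factor, \lambda' the L2 coefficient, and f_t the loss on batch t.

```latex
\begin{align*}
\text{SGD with L2 penalty } \tfrac{\lambda'}{2}\lVert\theta\rVert^2\text{:}\quad
  \theta_{t+1} &= \theta_t - \alpha\bigl(\nabla f_t(\theta_t) + \lambda'\theta_t\bigr)
               = (1 - \alpha\lambda')\,\theta_t - \alpha\,\nabla f_t(\theta_t) \\
\text{SGD with weight decay:}\quad
  \theta_{t+1} &= (1 - \lambda)\,\theta_t - \alpha\,\nabla f_t(\theta_t)
\end{align*}
```

The two updates coincide when \lambda' = \lambda / \alpha, which is the "rescaled by the learning rate" condition. In an adaptive method the whole bracket \nabla f_t(\theta_t) + \lambda'\theta_t is multiplied by a per-parameter scaling, so the penalty term is rescaled together with the gradient and no single \lambda' reproduces a uniform decay of the weights.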

Technical details

Canonical key: doi-10.48550_arxiv.1711.05101

Cache status: Fresh

Generated at: Jun 19, 2026, 6:39 PM

Artifact coverage: sparse

HF provider: ok (mixed)

PWC source used: No

LLM status: not_generated

LLM model: n/a

LLM generated: Unknown

LLM content type: n/a

HF policy: hf-relevance-v27

context only

Benchmarks: missing


Results & Benchmarks

Freshness tier: cold

Direct + Inferred Evidence

No concrete benchmark grounding is available yet. Treat the page as context or an implementation starting point only.

Implementation Evidence Summary

Confidence: low

This is primarily a method paper. Reproduce it within a maintained framework baseline instead of chasing paper-specific repos.

Reproduction Risks

No maintained paper-verified implementation is currently available

Evidence disclosure

Evidence graph: 2 refs, 1 link.

Utility signals: depth 60/100, grounding 58/100, status medium.

Implementation Status

No verified maintained repo

There is no verified maintained implementation yet. Use this baseline plan to decide whether to prototype now or defer.

Start with framework-native implementations (e.g. PyTorch optimizer module, Optax, or Transformers training loops).
Replicate the paper ablation settings first, then compare against modern baselines.
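When picking a framework optimizer, the point to check is whether its weight decay setting is added to the gradient (L2 regularization) or applied to the weights separately (decoupled decay). The framework-free sketch below contrasts the two on a single scalar weight. The names `adam_step` and `first_step` are hypothetical helpers, not part of any library, and scaling the decoupled term by the learning rate is an assumed convention, not something this page specifies.

```python
import math


def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, l2=0.0, decoupled_wd=0.0):
    """One Adam step on a scalar weight (illustrative sketch only).

    l2           : coefficient added to the gradient (L2 regularization).
    decoupled_wd : coefficient applied directly to the weight (decoupled decay).
    """
    g = grad + l2 * theta                     # L2 term enters the adaptive statistics
    m = beta1 * m + (1.0 - beta1) * g         # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * g * g     # second-moment estimate
    m_hat = m / (1.0 - beta1 ** t)            # bias corrections
    v_hat = v / (1.0 - beta2 ** t)
    adaptive = lr * m_hat / (math.sqrt(v_hat) + eps)
    decay = lr * decoupled_wd * theta         # bypasses the adaptive scaling
    return theta - adaptive - decay, m, v


def first_step(grad, theta=1.0, coeff=0.1):
    """Return the weight after one step: no regularizer, L2, decoupled decay."""
    plain, _, _ = adam_step(theta, grad, 0.0, 0.0, 1)
    with_l2, _, _ = adam_step(theta, grad, 0.0, 0.0, 1, l2=coeff)
    decoupled, _, _ = adam_step(theta, grad, 0.0, 0.0, 1, decoupled_wd=coeff)
    return plain, with_l2, decoupled


if __name__ == "__main__":
    for grad in (0.01, 10.0):
        plain, with_l2, decoupled = first_step(grad)
        print(f"grad={grad}: extra shrink from L2={plain - with_l2:.2e}, "
              f"from decoupled decay={plain - decoupled:.2e}")
```

In this toy setting the first Adam step is normalized to roughly the learning rate regardless of gradient size, so the L2 term is almost entirely absorbed, while the decoupled term shrinks the weight by `lr * coeff * theta` at both gradient scales. A check like this is only a sanity aid for reading a framework's documentation, not a substitute for the paper's experiments.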


Reproduction readiness

No Repo


Last checked: Jun 19, 2026

No verified implementation available

No maintained repository has been identified for this paper. Check adjacent implementations or HF artifacts below.

No benchmark numbers could be verified. You will not be able to validate reproduction correctness against published numbers.

Framework baselines

PyTorch Adam optimizer docs: Reference implementation of Adam in PyTorch.
Optax Adam optimizer docs: JAX/Flax baseline for Adam variants.
Keras Adam optimizer docs: TensorFlow/Keras baseline for Adam.

Hugging Face artifacts

No trustworthy direct or curated related Hugging Face artifacts were found yet.

Continue with targeted Hugging Face searches derived from the paper title and method context:

Models

Decoupled Weight Decay Regularization; Regularization (linguistics); Regularization (linguistics) model

Datasets

Regularization (linguistics) dataset; Decoupled Weight Decay Regularization dataset

Spaces

Regularization (linguistics) demo; Decoupled Weight Decay Regularization demo

Tip: start with models, then check datasets/spaces if you need evaluation data or demos.

Direct artifact matches are currently sparse. Use targeted Hugging Face searches to quickly locate candidate models, datasets, and demos.


Research context

Citations: 9,105

References

Tasks

Regularization (linguistics), Computer science, Engineering, Computational Mechanics, Physical Sciences

Methods

None detected

Domains

Physics, Mathematics

Evaluation & Human Feedback Data

Open this paper in HFEPX to review benchmark signals, evaluation modes, and human-feedback protocol context.

