What is the best open-source implementation of "Rethinking Early Stopping: Refine, Then Calibrate"?

The best maintained implementation is dholzmueller/pytabkit with 374 stars on GitHub. Confidence: high. Reproducibility: Strong.

How reproducible is "Rethinking Early Stopping: Refine, Then Calibrate"?

Estimated time to first reproduction: a few hours. No risk flags identified. Start with dholzmueller/pytabkit and validate the setup instructions in its README.

What framework is used to implement "Rethinking Early Stopping: Refine, Then Calibrate"?

The primary implementation uses PyTorch.

Rethinking Early Stopping: Refine, Then Calibrate

Eugène Berta, David Holzmüller, Michael I. Jordan, Francis Bach

Published: Jan 31, 2025

Best maintained implementation now

Evidence: Direct

Domain fit: AI-adjacent

Verified repos: 4

Top repo stars: 374

Paper appears method- or tooling-adjacent to AI workflows with partial ecosystem coverage.

Framework: PyTorch

Time to first repro: a few hours

No risk flags

Machine learning classifiers often produce probabilistic predictions that are critical for accurate and interpretable decision-making in various domains. The quality of these predictions is generally evaluated with proper losses, such as cross-entropy, which decompose into two components: calibration error assesses general under/overconfidence, while refinement error measures the ability to distinguish different classes. In this paper, we present a novel variational formulation of the calibration-refinement decomposition that sheds new light on post-hoc calibration, and enables rapid estimation of the different terms. Equipped with this new perspective, we provide theoretical and empirical evidence that calibration and refinement errors are not minimized simultaneously during training. Selecting the best epoch based on validation loss thus leads to a compromise point that is suboptimal for both terms. To address this, we propose minimizing refinement error only during training (Refine,...), before minimizing calibration error post hoc, using standard techniques (...then Calibrate). Our method integrates seamlessly with any classifier and consistently improves performance across diverse classification tasks.
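The two-stage idea in the abstract can be illustrated with a small NumPy sketch. It is not taken from any of the listed repositories. It assumes that temperature scaling is the post-hoc calibrator, that the validation loss after temperature scaling serves as a stand-in for refinement error, and that a simple grid search is good enough to fit the temperature. All function names (`cross_entropy`, `fit_temperature`, `select_epoch`) are hypothetical.

```python
import numpy as np


def log_softmax(logits):
    # Numerically stable log-softmax over the class axis.
    z = logits - logits.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))


def cross_entropy(logits, labels, temperature=1.0):
    # Mean negative log-likelihood of the true labels after dividing logits by T.
    logp = log_softmax(logits / temperature)
    return float(-logp[np.arange(len(labels)), labels].mean())


def fit_temperature(logits, labels, grid=None):
    # Assumption: a log-spaced grid search replaces a proper 1-D optimizer.
    if grid is None:
        grid = np.logspace(-2, 2, 201)
    losses = [cross_entropy(logits, labels, t) for t in grid]
    return float(grid[int(np.argmin(losses))])


def select_epoch(val_logits_per_epoch, val_labels):
    # "Refine": pick the epoch whose validation loss is lowest AFTER calibration.
    # "Then calibrate": return the temperature fitted for that epoch.
    best = None
    for epoch, logits in enumerate(val_logits_per_epoch):
        t = fit_temperature(logits, val_labels)
        loss = cross_entropy(logits, val_labels, t)
        if best is None or loss < best[0]:
            best = (loss, epoch, t)
    return best[1], best[2]
```

Given a list of per-epoch validation logits, `select_epoch` returns the checkpoint index to keep and the temperature to apply at prediction time. Selecting on the post-calibration loss means an epoch that separates classes well but is overconfident is not penalized for its miscalibration.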

Technical details

Canonical key: arxiv-2501.19195

Cache status: Stale (SWR served)

Generated at: Jun 18, 2026, 2:29 PM

Artifact coverage: direct

HF provider: ok (token)

PWC source used: Yes

HF policy: hf-relevance-v27

implementation starting point

Benchmarks: missing

Time to repro: a few hours

Framework: PyTorch

Results & Benchmarks

Freshness tier: cold

Direct + Inferred Evidence

No concrete benchmark grounding is available yet. Treat the page as context or an implementation starting point only.

Use This Implementation Because…

Confidence: high

dholzmueller/pytabkit is the strongest maintained implementation based on ranking signals. CI workflows are present. License is declared (Apache-2.0).

Reproduction Risks

No repository-level red flags were detected, but paper-specific preprocessing and hyperparameter details may still be under-specified.

Evidence disclosure

Evidence graph: 3 refs, 3 links.

Utility signals: depth 55/100, grounding 75/100, status medium.

Implementation Comparison

Top 3 paths

Compare maintenance quality, reproducibility coverage, and evidence confidence before choosing a reproduction baseline.

dholzmueller/pytabkit

best maintained

Maintenance: Recently updated

Confidence: High

Reproducibility: Strong

Official implementation from Papers with Code · Repository link is mentioned in the paper metadata

Stars: 374
Last push: Jan 6, 2026 (165d ago)

CI · Releases · Dependencies

Risk flags

No Docker setup

dholzmueller/probmetrics

historical official

Maintenance: Active

Confidence: High

Reproducibility: Strong

Official implementation from Papers with Code · Community adoption signal (65 stars)

Stars: 65
Last push: May 29, 2026 (22d ago)

CI · Releases · Dependencies

Risk flags

No Docker setup

eugeneberta/refinethencalibrate-vision

alternative

Maintenance: Stale

Confidence: High

Reproducibility: Limited

Official implementation from Papers with Code · Repository link is mentioned in the paper metadata

Stars: 9
Last push: Feb 3, 2025 (502d ago)

Risk flags

No push in 12+ months
No CI pipeline detected
No tagged releases

Best implementation now

dholzmueller/pytabkit

Confidence: High

Reproducibility: Strong

ML models + benchmark for tabular data classification and regression

Stars: 374

Forks: 37

Last push: Jan 6, 2026

License: Apache-2.0

Official implementation from Papers with Code

Repository link is mentioned in the paper metadata

Community adoption signal (374 stars)

License ✓

CI ✓

Deps ✓

Docker –

Selected dholzmueller/pytabkit as the strongest maintained implementation for new work.
Includes CI workflow signals.
Includes dependency/environment manifest signals.
Repository activity is within the last 24 months.

Historical official implementation

Preserved for provenance. Not recommended as the default path for new builds.

dholzmueller/probmetrics

Stars: 65

Last push: May 29, 2026

Reproduction readiness

Ready to Run

Time to first repro: a few hours

Last checked: Jun 18, 2026

Ready to reproduce

· Clone dholzmueller/pytabkit and install dependencies from pyproject.toml.
· CI pipeline detected; automated tests are in place.
· Last updated 165 days ago.

Quick start

git clone https://github.com/dholzmueller/pytabkit.git
cd pytabkit
pip install -e .
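After the editable install, a quick sanity check is to confirm that the package can be found in the active environment. The sketch below uses only the Python standard library. The import name `pytabkit` is an assumption based on the repository name, and the helper `is_importable` is hypothetical.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if a top-level module can be located in this environment."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # Assumption: the installed package is importable as "pytabkit".
    print("pytabkit importable:", is_importable("pytabkit"))
```

If this prints `False`, the install most likely went into a different environment than the one running the script.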

No benchmark numbers could be verified. You will not be able to validate reproduction correctness against published numbers.

Additional implementations

Official

eugeneberta/refinethencalibrate-vision

Confidence: High

Computer vision benchmark for the paper "Rethinking Early Stopping: Refine, Then Calibrate"

Stars: 9

Forks: 0

Last push: Feb 3, 2025

eugeneberta/refinethencalibrate-theory

Confidence: High

Solver and experiments for the theoretical sections of the paper "Rethinking Early Stopping: Refine, Then Calibrate".

Stars: 5

Forks: 0

Last push: Feb 3, 2025