Official implementation from Papers with Code · Repository link is mentioned in the paper metadata
- Stars: 374
- Last push: Jan 6, 2026 (165d ago)
Risk flags
- No Docker setup
Eugène Berta, David Holzmüller, Michael I. Jordan, Francis Bach
Paper appears method- or tooling-adjacent to AI workflows with partial ecosystem coverage.
Machine learning classifiers often produce probabilistic predictions that are critical for accurate and interpretable decision-making in various domains. The quality of these predictions is generally evaluated with proper losses, such as cross-entropy, which decompose into two components: calibration error assesses general under/overconfidence, while refinement error measures the ability to distinguish different classes. In this paper, we present a novel variational formulation of the calibration-refinement decomposition that sheds new light on post-hoc calibration, and enables rapid estimation of the different terms. Equipped with this new perspective, we provide theoretical and empirical evidence that calibration and refinement errors are not minimized simultaneously during training. Selecting the best epoch based on validation loss thus leads to a compromise point that is suboptimal for both terms. To address this, we propose minimizing refinement error only during training (Refine,...), before minimizing calibration error post hoc, using standard techniques (...then Calibrate). Our method integrates seamlessly with any classifier and consistently improves performance across diverse classification tasks.
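The decomposition described in the abstract can be illustrated with a small sketch. It assumes that temperature scaling is the post-hoc calibrator and that refinement error is approximated by the validation cross-entropy remaining after the best temperature is applied; the calibration term is then the gap between the raw loss and that value. The function names (`fit_temperature`, `decompose`, `select_epoch`) and the grid search are hypothetical choices made for illustration, not the paper's or pytabkit's API, and the paper's exact estimator may differ.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the class axis.
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def cross_entropy(logits, labels, temperature=1.0):
    # Mean negative log-likelihood of the true labels after dividing logits by T.
    logp = log_softmax(logits / temperature)
    return float(-logp[np.arange(len(labels)), labels].mean())

def fit_temperature(logits, labels):
    # Assumption: a simple grid search over T stands in for a proper optimizer.
    # T = 1 is included so the fitted loss can never exceed the raw loss.
    grid = np.append(np.geomspace(0.05, 20.0, 400), 1.0)
    losses = [cross_entropy(logits, labels, t) for t in grid]
    return float(grid[int(np.argmin(losses))])

def decompose(logits, labels):
    # Illustrative split: loss = refinement estimate + calibration estimate.
    # Both terms are estimated on the same data used to fit T, which is
    # optimistic; a held-out split would be more careful.
    total = cross_entropy(logits, labels)
    t = fit_temperature(logits, labels)
    refinement = cross_entropy(logits, labels, t)
    return {
        "loss": total,
        "refinement": refinement,
        "calibration": total - refinement,
        "temperature": t,
    }

def select_epoch(val_logits_per_epoch, labels):
    # "Refine": pick the epoch with the lowest refinement estimate,
    # rather than the lowest raw validation loss.
    scores = [decompose(l, labels)["refinement"] for l in val_logits_per_epoch]
    return int(np.argmin(scores))
```

In this sketch, the "then Calibrate" step would amount to applying the temperature returned by `decompose` to the logits of the selected epoch before making predictions.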
No concrete benchmark grounding is available yet. Treat the page as context or an implementation starting point only.
dholzmueller/pytabkit is the strongest maintained implementation based on ranking signals. CI workflows are present. License is declared (Apache-2.0).
Evidence graph: 3 refs, 3 links.
Utility signals: depth 55/100, grounding 75/100, status medium.
Compare maintenance quality, reproducibility coverage, and evidence confidence before choosing a reproduction baseline.
Official implementation from Papers with Code · Community adoption signal (65 stars)
ML models + benchmark for tabular data classification and regression
Preserved for provenance. Not recommended as the default path for new builds.
Ready to reproduce
Quick start
git clone https://github.com/dholzmueller/pytabkit.git
cd pytabkit
pip install -e .
No benchmark numbers could be verified, so reproduction correctness cannot be validated against published numbers.
Computer vision benchmark for the paper "Rethinking Early Stopping: Refine, Then Calibrate"
Solver and experiments for the theoretical sections of the paper "Rethinking Early Stopping: Refine, Then Calibrate".
No additional community repositories detected yet.
No direct paper-linked artifacts were found. Showing strongest curated related artifacts for faster exploration.
No trustworthy model matches right now.
No trustworthy dataset matches right now.