Researcher verdict

Recommended implementation path available

implementation baseline
Benchmark trust: thin evidence
Quality tier: researcher ready

This page has some benchmark evidence and a concrete implementation recommendation anchored on google/uncertainty-baselines. Use it as an implementation baseline, and validate benchmark parity before adapting it.

Why this page is still worth reading

  • A concrete repository path exists via google/uncertainty-baselines, so this page can act as a practical starting point.
  • Reproduction risks are surfaced explicitly, which helps decide whether the paper is worth immediate prototyping.

Benchmark trust

Some benchmark signal exists in the extracted evidence, but it is not structured strongly enough yet for a confident benchmark decision.

Use this page as

Start here when you need the most practical implementation path quickly.

Results & Benchmarks

Freshness tier: cold
Direct + Inferred Evidence

The paper's primary contribution is Uncertainty Baselines, a set of benchmarks for uncertainty and robustness in deep learning.

Use This Implementation Because…

Confidence: high

google/uncertainty-baselines is the strongest maintained implementation based on ranking signals. CI workflows are present. License is declared (Apache-2.0).

Reproduction Risks

  • Dependency manifest is missing

Hardware Notes

Expect multi-day setup/compute for meaningful reproduction based on current guidance.

Evidence disclosure

LLM evidence refs: paper.abstract, evidencePack.paperSections[id=paper_2], evidencePack.paperSections[id=paper_5], evidencePack.paperSections[id=paper_8], researcherSummary.benchmarkSnapshot[0], researcherSummary.benchmarkSnapshot[1], evidencePack.paperSections[id=paper_7], evidencePack.paperSections[id=paper_10], guidance.riskFlags[0], repos[0].fullName, researcherSummary.hardwareNotes[0], researcherSummary.timeToFirstMeaningfulRun, paper.title, summary.hasReliableImplementation

Evidence graph: 3 refs, 3 links.

Utility signals: depth 60/100, grounding 75/100, status medium.

Implementation Comparison

Top 3 paths

Compare maintenance quality, reproducibility coverage, and evidence confidence before choosing a reproduction baseline.

Path 1: google/uncertainty-baselines
Maintenance: Recently updated
Confidence: High
Reproducibility: Moderate

Official implementation from Papers with Code · Repository link is mentioned in the paper metadata

Stars: 1,567
Last push: Feb 2, 2026 (33d ago)
CI: present

Risk flags

  • No tagged releases
  • No Docker setup
  • Dependency manifest missing

Path 2
Maintenance: Active
Confidence: Low
Reproducibility: Moderate

Matched via arXiv identifier search · Partial overlap with paper title keywords

Stars: 0
Last push: Feb 24, 2026 (11d ago)
CI: present

Risk flags

  • No tagged releases
  • No Docker setup
  • Dependency manifest missing

Path 3
Maintenance: Stale
Confidence: Low
Reproducibility: Limited

Matched via arXiv identifier search · Repository appears stale (>24 months since last push)

Stars: 17
Last push: Nov 6, 2022 (1217d ago)

Risk flags

  • No push in 12+ months
  • No CI pipeline detected
  • No tagged releases

Paper summary

AI-generated

AI-generated summary grounded in paper metadata and artifact signals.

The paper presents Uncertainty Baselines, a collection of standardized benchmarks for uncertainty and robustness in deep learning. It covers at least the ImageNet and Diabetic Retinopathy tasks, each with multiple baseline methods. The benchmark evidence on this page is limited to uncertainty and robustness results on ImageNet. The reproduction guidance focuses on implementation viability and concrete risk controls.

Key contributions

  • The work defines Uncertainty Baselines as a collection of standardized benchmarks for uncertainty and robustness in deep learning, covering at least ImageNet and Diabetic Retinopathy tasks with multiple baseline methods.
  • The benchmarks provide unified implementations of multiple uncertainty methods built on common backbones, including Wide ResNet for CIFAR-10/100 and ResNet-50 (plus ResNet-101/152 and EfficientNet).
  • For the Diabetic Retinopathy benchmark, the authors include detailed hyperparameter tuning results from two rounds of quasirandom search to help others retune their own uncertainty methods.
  • The paper illustrates the capabilities of Uncertainty Baselines on only one of its nine tasks, ImageNet, and deliberately omits a legend identifying the individual baselines, which limits direct method-to-method performance comparison.

Implementation guidance

Start with google/uncertainty-baselines: the deterministic ranking and the extracted evidence both point to it as the most viable implementation. Follow the repository's setup path first, then confirm that you can reproduce the benchmark results before adapting the code.
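One way to make "validate benchmark reproduction" concrete is a small parity check that compares reproduced metrics against the reported ones under a per-metric tolerance. This is a minimal sketch: the function name `check_parity`, the metric names, and the tolerances are hypothetical and are not taken from the paper or the repository.

```python
def check_parity(reported, reproduced, tolerances):
    """Compare reproduced metrics with reported ones.

    Returns {metric: (ok, abs_diff)}. A metric missing from `reproduced`
    is marked (False, None). Metrics without an entry in `tolerances`
    must match exactly (tolerance 0.0).
    """
    report = {}
    for name, expected in reported.items():
        if name not in reproduced:
            report[name] = (False, None)
            continue
        diff = abs(reproduced[name] - expected)
        report[name] = (diff <= tolerances.get(name, 0.0), diff)
    return report


def all_within_tolerance(report):
    """True only if every metric in the parity report passed."""
    return all(ok for ok, _ in report.values())
```

Call it with the numbers you read from the paper and from your own run, for example `check_parity({"accuracy": reported_acc}, {"accuracy": my_acc}, {"accuracy": 0.01})`, and only start modifying the code once `all_within_tolerance` returns `True`. Choosing the tolerance is a judgment call; run-to-run variance across seeds is a reasonable lower bound.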

Reproducibility notes

  • Because the repository lacks an explicit dependency manifest, recreating the environment may fail or pull mismatched library versions that produce inconsistent results.
  • Insufficient compute or time allocation can prevent full convergence on large benchmarks like ImageNet and Diabetic Retinopathy, yielding weaker or irreproducible uncertainty estimates.
  • Departing from the specified data preprocessing pipelines, such as padding, cropping, and ResNet-style normalization, can invalidate comparisons with the provided baselines.
  • Incorrectly configuring or omitting the documented quasirandom hyperparameter search for the Diabetic Retinopathy benchmark can result in misleading performance and uncertainty estimates.
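For readers unfamiliar with quasirandom search, the sketch below shows the general idea using a Halton sequence, which is one common low-discrepancy generator. It is an assumption that a Halton-style sequence is a fair stand-in here: this page does not say which generator the authors used, and the hyperparameter names and ranges (`learning_rate`, `l2`) are hypothetical placeholders, not the paper's search space.

```python
import math


def halton(index, base):
    """Return element `index` (1-based) of the van der Corput sequence in `base`."""
    result, fraction = 0.0, 1.0 / base
    while index > 0:
        index, digit = divmod(index, base)
        result += digit * fraction
        fraction /= base
    return result


def log_uniform(u, low, high):
    """Map u in [0, 1) onto [low, high) on a logarithmic scale."""
    return math.exp(math.log(low) + u * (math.log(high) - math.log(low)))


def quasirandom_trials(num_trials):
    """Generate trial configurations; one coprime base per dimension."""
    trials = []
    for i in range(1, num_trials + 1):
        trials.append({
            # Hypothetical ranges chosen for illustration only.
            "learning_rate": log_uniform(halton(i, 2), 1e-4, 1e-1),
            "l2": log_uniform(halton(i, 3), 1e-6, 1e-3),
        })
    return trials
```

Unlike independent random draws, successive points fill the search space evenly, so a short search covers the range with fewer gaps. When retuning, keep the sequence indices with each trial so the same configurations can be regenerated later.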

Best implementation now

google/uncertainty-baselines
Confidence: High
Reproducibility: Moderate

High-quality implementations of standard and SOTA methods on a variety of tasks.

Stars: 1,567
Forks: 215
Last push: Feb 2, 2026
License: Apache-2.0
Official implementation from Papers with Code
Repository link is mentioned in the paper metadata
Matched via arXiv identifier search
Partial overlap with paper title keywords
Community adoption signal (1567 stars)
License ✓
CI ✓
Deps –
Docker –
  • Selected google/uncertainty-baselines as the strongest maintained implementation for new work.
  • Includes CI workflow signals.
  • Repository activity is within the last 24 months.

Reproduction path

Direct

Follow the direct implementation path

  1. Start with google/uncertainty-baselines and validate setup instructions in README.
  2. Reproduce the baseline result with the provided defaults before modifying hyperparameters.
  3. Log exact dependency versions and runtime environment for reproducibility.
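Since the dependency manifest is missing, the logging step can be covered with a short standard-library script that snapshots the interpreter, platform, and every installed package version. This is a minimal sketch, not part of the repository; the function names and the default file name `environment_snapshot.json` are hypothetical.

```python
import json
import platform
import sys
from importlib import metadata


def capture_environment():
    """Return a JSON-serializable snapshot of the runtime environment."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions with missing or broken metadata
            packages[name] = dist.metadata.get("Version", "unknown")
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "machine": platform.machine(),
        "packages": dict(sorted(packages.items(), key=lambda kv: kv[0].lower())),
    }


def write_snapshot(path="environment_snapshot.json"):
    """Write the snapshot next to the run outputs and return it."""
    snapshot = capture_environment()
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(snapshot, fh, indent=2)
    return snapshot
```

Call `write_snapshot()` at the start of each training or evaluation run and store the file with the results. The snapshot does not capture accelerator drivers or system libraries, so record those separately if the run depends on them.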

Time to first repro: a few days
Dependency manifest is missing

Additional implementations

No additional verified repositories beyond the primary recommendation.

The other candidate repositories had low-confidence matching signals and are hidden by default.

Hugging Face artifacts

No trustworthy direct or curated related Hugging Face artifacts were found yet.

Direct artifact matches are currently sparse. Use targeted Hugging Face searches derived from the paper title and method context to locate candidate models, datasets, and demos.

Tip: start with models, then check datasets/spaces if you need evaluation data or demos.

Research context

Evaluation & Human Feedback Data

Open this paper in HFEPX to review benchmark signals, evaluation modes, and human-feedback protocol context.
