What framework is used to implement "Size-adaptive Hypothesis Testing for Fairness"?

No framework was identified for the primary implementation.

Size-adaptive Hypothesis Testing for Fairness

Q: How reproducible is "Size-adaptive Hypothesis Testing for Fairness"?

Estimated time to first reproduction: a few days. Risk flag: only a historical official implementation is available (alanturin-g/saft).

Antonio Ferrara, Francesco Cozzi, Alan Perotti, André Panisson, Francesco Bonchi

Published: Jun 12, 2025

Historical official implementation (not recommended for new builds)

Evidence: Historical

Domain fit: AI-adjacent

Verified repos: 1

Paper appears method- or tooling-adjacent to AI workflows with partial ecosystem coverage.

Framework: none

Time to first repro: a few days

1 risk flag

Determining whether an algorithmic decision-making system discriminates against a specific demographic typically involves comparing a single point estimate of a fairness metric against a predefined threshold. This practice is statistically brittle: it ignores sampling error and treats small demographic subgroups the same as large ones. The problem intensifies in intersectional analyses, where multiple sensitive attributes are considered jointly, giving rise to a larger number of smaller groups. As these groups become more granular, the data representing them becomes too sparse for reliable estimation, and fairness metrics yield excessively wide confidence intervals, precluding meaningful conclusions about potential unfair treatments. In this paper, we introduce a unified, size-adaptive, hypothesis-testing framework that turns fairness assessment into an evidence-based statistical decision. Our contribution is twofold. (i) For sufficiently large subgroups, we prove a Central-Limit result for the statistical parity difference, leading to analytic confidence intervals and a Wald test whose type-I (false positive) error is guaranteed at level $\alpha$. (ii) For the long tail of small intersectional groups, we derive a fully Bayesian Dirichlet-multinomial estimator; Monte-Carlo credible intervals are calibrated for any sample size and naturally converge to Wald intervals as more data becomes available. We validate our approach empirically on benchmark datasets, demonstrating how our tests provide interpretable, statistically rigorous decisions under varying degrees of data availability and intersectionality.
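To make the two regimes described in the abstract concrete, here is a minimal Python sketch using only the standard library. It is not taken from alanturin-g/saft and may differ from the paper's exact estimators. The function names `wald_spd` and `bayes_spd` are hypothetical, the null hypothesis is assumed to be "statistical parity difference equals zero", and the Bayesian part uses a Beta-Binomial model with a uniform prior, which is the two-outcome special case of a Dirichlet-multinomial model.

```python
import math
import random
from statistics import NormalDist


def wald_spd(pos_a, n_a, pos_b, n_b, alpha=0.05):
    """Wald interval and two-sided test of H0: SPD = 0 (assumed null).

    pos_x / n_x are positive-outcome counts and group sizes.
    """
    p_a, p_b = pos_a / n_a, pos_b / n_b
    spd = p_a - p_b  # statistical parity difference between the two groups
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    ci = (spd - z_crit * se, spd + z_crit * se)
    if se == 0:
        # Degenerate case: both sample proportions are exactly 0 or 1.
        p_value = 1.0 if spd == 0 else 0.0
    else:
        p_value = 2 * (1 - NormalDist().cdf(abs(spd / se)))
    return spd, ci, p_value


def bayes_spd(pos_a, n_a, pos_b, n_b, alpha=0.05, prior=1.0,
              draws=20000, seed=0):
    """Monte Carlo credible interval for SPD under Beta(prior, prior) priors.

    The prior strength, number of draws, and seed are illustrative choices.
    """
    rng = random.Random(seed)
    samples = sorted(
        rng.betavariate(pos_a + prior, n_a - pos_a + prior)
        - rng.betavariate(pos_b + prior, n_b - pos_b + prior)
        for _ in range(draws)
    )
    lo = samples[int(alpha / 2 * draws)]
    hi = samples[int((1 - alpha / 2) * draws) - 1]
    return lo, hi


# Large subgroup: the normal approximation is reasonable.
print(wald_spd(30, 100, 50, 100))
# Small subgroup with the same observed rate: the credible interval is wider.
print(bayes_spd(3, 10, 50, 100))
```

With the same observed positive rate, the ten-person subgroup yields a much wider interval than the hundred-person one, which is the size dependence a fixed-threshold comparison of point estimates ignores.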

Technical details

Canonical key: arxiv-2506.10586

Generated at: Apr 16, 2026, 4:40 AM

Artifact coverage: sparse

PWC source used: Yes

implementation starting point

Benchmarks: thin evidence

Results & Benchmarks

Freshness tier: warm

Direct + Inferred Evidence

Some benchmark signal exists in the extracted evidence, but it is not structured strongly enough yet for a confident benchmark decision.

Use This Implementation Because…

Confidence: low

Only historical official repository was found (alanturin-g/saft).

Reproduction Risks

Only historical official implementation is available
No direct maintained implementation is currently verified.

Hardware Notes

Expect multi-day setup/compute for meaningful reproduction based on current guidance.

Evidence disclosure

Evidence graph: 2 refs, 1 link.

Utility signals: depth 95/100, grounding 68/100, status medium.

Implementation Comparison

Top 1 paths

Compare maintenance quality, reproducibility coverage, and evidence confidence before choosing a reproduction baseline.

alanturin-g/saft

historical official

Maintenance: Recently updated

Confidence: High

Reproducibility: Limited

Official implementation from Papers with Code · Repository link is mentioned in the paper metadata

Stars: 0
Last push: Oct 22, 2025 (177d ago)

Dependencies

Risk flags

No CI pipeline detected
No tagged releases
No Docker setup

Best implementation now

Only a historical official implementation is available.

Use with caution for new projects; verify against current tooling and maintained community alternatives.

Only historical official repository was found: alanturin-g/saft.
No maintained paper-verified implementation met reliability thresholds.

Reproduction readiness

Setup Required

Time to first repro: a few days

Last checked: Apr 16, 2026

Dependencies pinned, manual setup needed

· alanturin-g/saft has requirements.txt but requires manual environment setup.
· No Dockerfile — you will set up the environment manually.
· No CI pipeline — test coverage is unknown.

Quick start

git clone https://github.com/alanturin-g/saft.git
cd saft
pip install -r requirements.txt

Hugging Face artifacts

No trustworthy Hugging Face artifacts, either direct or curated, have been found yet.

Continue with targeted Hugging Face searches derived from the paper title and method context:

Models

arxiv:2506.10586

Datasets

arxiv:2506.10586

Spaces

arxiv:2506.10586

Tip: start with models, then check datasets/spaces if you need evaluation data or demos.

Direct artifact matches are currently sparse. Use targeted Hugging Face searches to quickly locate candidate models, datasets, and demos.
