
Weekly Archive

HFEPX Weekly Archive: 2026-W01

Updated from the current HFEPX corpus (Feb 27, 2026). 6 papers are grouped on this weekly page. Common evaluation mode: Automatic Metrics. Most common rater population: Domain Experts. Common annotation unit: Pairwise. Frequent quality control: Calibration. Frequently cited benchmark: Needle In A Haystack. Common metric signal: cost. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. The newest paper in this set is from Jan 3, 2026.

Papers: 6 · Last published: Jan 3, 2026

Research Narrative

Grounded narrative · Model: deterministic-grounded · Source: persisted

Updated from the current HFEPX corpus (Feb 27, 2026). This page tracks 6 papers for HFEPX Weekly Archive: 2026-W01. Dominant protocol signals include automatic metrics, with frequent benchmark focus on Needle In A Haystack and Retrieval, and metric focus on cost and accuracy. Use the grounded sections below to prioritize reproducible protocol choices, benchmark-matched comparisons, and judge-vs-human evaluation checks.

Why This Matters For Eval Research

Protocol Takeaways

Benchmark Interpretation

  • Needle In A Haystack appears in 16.7% of hub papers (1/6); use this cohort for benchmark-matched comparisons.
  • Retrieval appears in 16.7% of hub papers (1/6); use this cohort for benchmark-matched comparisons.

Metric Interpretation

  • cost is reported in 33.3% of hub papers (2/6); compare with a secondary metric before ranking methods.
  • accuracy is reported in 16.7% of hub papers (1/6); compare with a secondary metric before ranking methods.

Researcher Checklist

  • Close the gap on papers with explicit human feedback: coverage is a replication risk (16.7% vs 45% target).
  • Close the gap on papers reporting quality controls: coverage is a replication risk (16.7% vs 30% target).
  • Close the gap on papers naming benchmarks/datasets: coverage is a replication risk (16.7% vs 35% target).
  • Maintain strength on papers naming evaluation metrics: coverage is strong (66.7% vs 35% target).
  • Close the gap on papers with known rater population: coverage is a replication risk (16.7% vs 35% target).
  • Close the gap on papers with known annotation unit: coverage is a replication risk (16.7% vs 35% target).
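
The checklist verdicts above follow a simple coverage-versus-target rule: a dimension is flagged as a replication risk when its coverage percentage falls below the target, and as a strength otherwise. A minimal sketch of that rule, using the counts and targets reported on this page (the helper name `checklist_status` is hypothetical, not part of HFEPX):

```python
def checklist_status(covered: int, total: int, target_pct: float) -> str:
    """Return the checklist verdict for one protocol dimension."""
    coverage_pct = 100.0 * covered / total
    if coverage_pct >= target_pct:
        return f"Maintain strength ({coverage_pct:.1f}% vs {target_pct:.0f}% target)"
    return f"Close gap ({coverage_pct:.1f}% vs {target_pct:.0f}% target)"

# Dimensions from this hub: (papers covered, total papers, target %)
dimensions = {
    "explicit human feedback": (1, 6, 45),
    "quality controls": (1, 6, 30),
    "named benchmarks/datasets": (1, 6, 35),
    "named evaluation metrics": (4, 6, 35),
    "known rater population": (1, 6, 35),
    "known annotation unit": (1, 6, 35),
}

for name, (covered, total, target) in dimensions.items():
    print(f"{name}: {checklist_status(covered, total, target)}")
```

Only the metrics dimension (4/6 = 66.7%) clears its target; every other dimension sits at 1/6 = 16.7% and is flagged.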

Suggested Reading Order

  1. ARGUS: Adaptive Rotation-Invariant Geometric Unsupervised System

    Start here for detailed protocol reporting, including rater and quality-control evidence.

  2. Improving Variational Autoencoder using Random Fourier Transformation: An Aviation Safety Anomaly Detection Case-Study

    Continues the detailed protocol reporting, including rater and quality-control evidence.

  3. Fast-weight Product Key Memory

    Continues the detailed protocol reporting, including rater and quality-control evidence.

  4. RAIR: A Rule-Aware Benchmark Uniting Challenging Long-Tail and Visual Salience Subset for E-commerce Relevance Assessment

    Adds automatic metrics for broader coverage within this hub.

  5. WISE: Web Information Satire and Fakeness Evaluation

    Adds automatic metrics for broader coverage within this hub.

  6. Coupling Experts and Routers in Mixture-of-Experts via an Auxiliary Loss

    Adds automatic metrics for broader coverage within this hub.

Known Limitations

  • Only 16.7% of papers report quality controls; prioritize calibration/adjudication evidence.
  • Rater population is under-specified (16.7% coverage).
  • Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Links

Benchmark Brief

Needle In A Haystack

Coverage: 1 paper (16.7%) mentions Needle In A Haystack.

Examples: Fast-weight Product Key Memory

Benchmark Brief

Retrieval

Coverage: 1 paper (16.7%) mentions Retrieval.

Examples: Fast-weight Product Key Memory

Metric Brief

cost

Coverage: 2 papers (33.3%) mention cost.

Examples: Fast-weight Product Key Memory, Coupling Experts and Routers in Mixture-of-Experts via an Auxiliary Loss

Metric Brief

accuracy

Coverage: 1 paper (16.7%) mentions accuracy.

Examples: WISE: Web Information Satire and Fakeness Evaluation

Metric Brief

auc

Coverage: 1 paper (16.7%) mentions auc.

Examples: WISE: Web Information Satire and Fakeness Evaluation
