
Fortnight Archive

HFEPX Fortnight Archive: 2025-F19

Updated from the current HFEPX corpus (Feb 27, 2026). This page groups 10 papers. Common evaluation modes: Automatic Metrics, Simulation Env. Most common rater population: Domain Experts. Frequent quality control: Calibration. Frequently cited benchmark: AdvBench. Common metric signal: accuracy. Use this page to compare protocol setups, judge behavior, and labeling-design decisions before running new eval experiments. The newest paper in this set is from Sep 20, 2025.

Papers: 10 · Last published: Sep 20, 2025

Research Narrative

Grounded narrative · Model: deterministic-grounded · Source: persisted

This page tracks 10 papers for HFEPX Fortnight Archive 2025-F19. Dominant protocol signals are automatic metrics and simulation environments, with benchmark focus on AdvBench and AIME and metric focus on accuracy and AUROC. Use the grounded sections below to prioritize reproducible protocol choices, benchmark-matched comparisons, and judge-vs-human evaluation checks.

Why This Matters For Eval Research

Protocol Takeaways

Benchmark Interpretation

  • AdvBench appears in 10% of hub papers (1/10); use this cohort for benchmark-matched comparisons.
  • AIME appears in 10% of hub papers (1/10); use this cohort for benchmark-matched comparisons.

Metric Interpretation

  • accuracy is reported in 30% of hub papers (3/10); compare with a secondary metric before ranking methods.
  • auroc is reported in 10% of hub papers (1/10); compare with a secondary metric before ranking methods.
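The limitations section notes that this synthesis is grounded in metadata and abstracts only, so mention percentages like those above amount to keyword coverage over the cohort. A minimal sketch of how such numbers could be computed; the abstracts and the whole-word, case-insensitive matching rule are illustrative assumptions, not the hub's actual pipeline:

```python
import re

# Illustrative abstracts (invented); the real hub parses its own metadata.
abstracts = {
    "p1": "We report accuracy and AUROC on a held-out split.",
    "p2": "Our method improves accuracy on AIME problems.",
    "p3": "A qualitative study of creative workflows.",
}

def metric_coverage(term: str) -> tuple[int, float]:
    """Count papers whose abstract mentions `term` (whole word, any case)."""
    hits = sum(
        bool(re.search(rf"\b{re.escape(term)}\b", text, re.IGNORECASE))
        for text in abstracts.values()
    )
    return hits, hits / len(abstracts)

hits, frac = metric_coverage("accuracy")
print(f"accuracy: {hits}/{len(abstracts)} papers ({frac:.0%})")
```

On this toy corpus, "accuracy" is found in 2 of 3 abstracts; the same scan per metric name yields the percentages reported in these briefs.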

Researcher Checklist

  • Close the gap on papers with explicit human feedback: coverage is a replication risk (20% vs 45% target).
  • Close the gap on papers reporting quality controls: coverage is a replication risk (10% vs 30% target).
  • Tighten coverage of papers naming benchmarks/datasets: usable but incomplete (30% vs 35% target).
  • Maintain strength on papers naming evaluation metrics: coverage is strong (50% vs 35% target).
  • Tighten coverage of papers with a known rater population: usable but incomplete (30% vs 35% target).
  • Close the gap on papers with a known annotation unit: coverage is a replication risk (0% vs 35% target).
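The checklist labels above follow a simple observed-vs-target comparison. A small sketch that reproduces them from the figures on this page; the three-tier cutoff (within 5 percentage points of target counts as "usable but incomplete") is an assumption inferred from the numbers, not a documented rule:

```python
# Observed coverage vs target, taken from the checklist on this page.
coverage = {
    "explicit human feedback": (0.20, 0.45),
    "quality controls": (0.10, 0.30),
    "named benchmarks/datasets": (0.30, 0.35),
    "named evaluation metrics": (0.50, 0.35),
    "known rater population": (0.30, 0.35),
    "known annotation unit": (0.00, 0.35),
}

def label(observed: float, target: float) -> str:
    """Mirror the page's three-tier wording (cutoffs assumed)."""
    if observed >= target:
        return "strong"
    if target - observed <= 0.05:
        return "usable but incomplete"
    return "replication risk"

for field, (obs, tgt) in coverage.items():
    print(f"{field}: {obs:.0%} vs {tgt:.0%} target -> {label(obs, tgt)}")
```

With that assumed cutoff, the six fields land on exactly the labels the checklist reports.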


Suggested Reading Order

  1. KANO: Kolmogorov-Arnold Neural Operator

    Start here for detailed protocol reporting, including rater and quality-control evidence.

  2. Evolving Language Models without Labels: Majority Drives Selection, Novelty Promotes Variation

    Continues the detailed protocol reporting, with rater and quality-control evidence.

  3. ATTS: Asynchronous Test-Time Scaling via Conformal Prediction

    Continues the detailed protocol reporting, with rater and quality-control evidence.

  4. ClearFairy: Capturing Creative Workflows through Decision Structuring, In-Situ Questioning, and Rationale Inference

    Adds automatic metrics with critique/edit feedback for broader coverage within this hub.

  5. A Simple and Efficient Jailbreak Method Exploiting LLMs' Helpfulness

    Adds automatic metrics with red-team protocols for broader coverage within this hub.

  6. The AI Memory Gap: Users Misremember What They Created With AI or Without

    Adds automatic metrics for broader coverage within this hub.

  7. Collaborative Document Editing with Multiple Users and AI Agents

    Adds simulation environments for broader coverage within this hub.

  8. PeruMedQA: Benchmarking Large Language Models (LLMs) on Peruvian Medical Exams -- Dataset Construction and Evaluation

    Adds automatic metrics for broader coverage within this hub.

Known Limitations

  • Only 10% of papers report quality controls; prioritize calibration/adjudication evidence.
  • Annotation unit is under-specified (0% coverage).
  • Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Links

automatic_metrics vs simulation_env

both=0, left_only=9, right_only=1: no paper in this cohort uses both Automatic Metrics and Simulation Env.
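The left/right/both split above is a set comparison over per-paper evaluation-mode tags. A minimal sketch of that computation; the paper IDs and tag assignments are illustrative placeholders, not the hub's real records:

```python
# Evaluation-mode tags per paper (invented IDs for illustration).
papers = {
    "paper-01": {"automatic_metrics"},
    "paper-02": {"simulation_env"},
    "paper-03": {"automatic_metrics"},
}

left = {p for p, modes in papers.items() if "automatic_metrics" in modes}
right = {p for p, modes in papers.items() if "simulation_env" in modes}

both = left & right          # papers tagged with both modes
left_only = left - right     # automatic metrics only
right_only = right - left    # simulation environments only
print(f"both={len(both)}, left_only={len(left_only)}, right_only={len(right_only)}")
```

Run over the real cohort, the same intersection/difference logic yields the both=0, left_only=9, right_only=1 split reported here.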

Benchmark Brief

AdvBench

Coverage: 1 paper (10%) mentions AdvBench.

Example: A Simple and Efficient Jailbreak Method Exploiting LLMs' Helpfulness

AIME

Coverage: 1 paper (10%) mentions AIME.

Example: ATTS: Asynchronous Test-Time Scaling via Conformal Prediction

MATH

Coverage: 1 paper (10%) mentions MATH.

Example: Evolving Language Models without Labels: Majority Drives Selection, Novelty Promotes Variation

Metric Brief

auroc

Coverage: 1 paper (10%) mentions auroc.

Example: MedicalPatchNet: A Patch-Based Self-Explainable AI Architecture for Chest X-ray Classification

helpfulness

Coverage: 1 paper (10%) mentions helpfulness.

Example: A Simple and Efficient Jailbreak Method Exploiting LLMs' Helpfulness

Papers Published In This Period
