

HFEPX Daily Archive: 2025-09-26

Updated from the current HFEPX corpus (Apr 12, 2026). This daily page groups 25 papers. Common evaluation modes: Automatic Metrics, Llm As Judge. Most common rater population: Domain Experts. Most frequent quality control: Calibration. Frequently cited benchmark: GenEval. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new evaluation experiments. The newest paper in this set is from Sep 26, 2025.
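
For readers who script their own cross-paper comparisons, the fields this page aggregates can be held in a small per-paper record. The layout below is a minimal sketch; the field names are assumptions for illustration, not the HFEPX export schema.

```python
from dataclasses import dataclass, field

@dataclass
class ProtocolRecord:
    """One paper's extracted evaluation-protocol metadata.

    Hypothetical layout: field names are illustrative assumptions,
    not the actual HFEPX export schema.
    """
    title: str
    published: str                                              # ISO date, e.g. "2025-09-26"
    eval_modes: list[str] = field(default_factory=list)         # e.g. ["Automatic Metrics"]
    benchmarks: list[str] = field(default_factory=list)         # e.g. ["GenEval"]
    metrics: list[str] = field(default_factory=list)            # e.g. ["Accuracy"]
    quality_controls: list[str] = field(default_factory=list)   # e.g. ["Calibration"]
    rater_population: str | None = None                         # e.g. "Domain Experts"
    annotation_unit: str | None = None                          # e.g. "mixed"
```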

Papers: 25 · Last published: Sep 26, 2025

Researcher Quick Triage

Use this archive page for time-slice monitoring (what changed in evaluation methods, metrics, and protocol quality this period). Quality band: High.

High-Signal Coverage: 100.0% (25 / 25 papers are not flagged as low-signal)

Benchmark Anchors: 20.0% (papers with benchmark/dataset mentions in extraction output)

Metric Anchors: 48.0% (papers with reported metric mentions in extraction output)

  • 3 papers report explicit quality controls for this archive period.
  • Prioritize papers with both benchmark and metric anchors for reliable longitudinal comparisons (a minimal filtering sketch follows the primary action below).

Primary action: Use this slice as an early signal only; benchmark/metric anchoring is too limited for rigorous period-over-period claims.
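
A minimal sketch of that anchor-based triage, reusing the hypothetical ProtocolRecord above and the "Not reported" convention from the matrix below:

```python
from typing import Iterable

def has_anchors(paper: ProtocolRecord) -> bool:
    """True if the extraction output names at least one benchmark and one metric."""
    def reported(values: list[str]) -> bool:
        return any(v and v != "Not reported" for v in values)
    return reported(paper.benchmarks) and reported(paper.metrics)

def longitudinal_candidates(papers: Iterable[ProtocolRecord]) -> list[ProtocolRecord]:
    # Papers with both anchors are the safest basis for
    # period-over-period comparisons.
    return [p for p in papers if has_anchors(p)]
```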


Why This Time Slice Matters

  • 20% of papers report explicit human-feedback signals, led by demonstration data.
  • The Automatic Metrics evaluation mode appears in 44% of papers in this hub.
  • GenEval is a recurring benchmark anchor for cross-paper comparisons in this page.

Protocol Takeaways For This Period

  • Most common quality-control signal is rater calibration (12% of papers).
  • Rater context is mostly domain experts, and annotation commonly uses mixed annotation units; use this to scope replication staffing.
  • Pair this hub with a human_eval-heavy hub to validate judge-model calibration (a minimal check is sketched below).
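
One lightweight way to run that calibration check, assuming you have paired scores for the same items from an LLM judge and from human raters. This is an illustrative sketch, not an HFEPX feature; both inputs are hypothetical.

```python
from statistics import mean

def judge_calibration(judge: list[float], human: list[float]) -> dict[str, float]:
    """Compare LLM-judge scores to human reference scores on the same items."""
    assert len(judge) == len(human) and judge, "need paired, non-empty scores"
    bias = mean(j - h for j, h in zip(judge, human))      # systematic over/under-scoring
    mae = mean(abs(j - h) for j, h in zip(judge, human))  # average miscalibration
    exact = mean(j == h for j, h in zip(judge, human))    # raw agreement rate
    return {"bias": bias, "mae": mae, "exact_agreement": exact}

# Example: the judge slightly over-scores relative to humans.
print(judge_calibration([4, 5, 3, 4], [4, 4, 3, 3]))
# {'bias': 0.5, 'mae': 0.5, 'exact_agreement': 0.5}
```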

Start Here (Highest-Signal Papers In This Slice)

Ranked by protocol completeness and evidence density for faster period-over-period review.

Protocol Matrix (Top 10)

Quickly compare method ingredients across this archive slice.

All ten papers were published Sep 26, 2025.

| Paper | Eval Modes | Benchmarks | Metrics | Quality Controls |
| --- | --- | --- | --- | --- |
| IA2: Alignment with ICL Activations Improves Supervised Fine-Tuning | Automatic Metrics | Not reported | Accuracy, Cost | Calibration |
| Induction Signatures Are Not Enough: A Matched-Compute Study of Load-Bearing Structure in In-Context Learning | Automatic Metrics | DROP | Perplexity | Not reported |
| HEART: Emotionally-Driven Test-Time Scaling of Language Models | Automatic Metrics | GPQA, LiveCodeBench | Accuracy | Not reported |
| LogiPart: Local Large Language Models for Data Exploration at Scale with Logical Partitioning | Llm As Judge, Automatic Metrics | Not reported | Accuracy | Calibration |
| CoSpaDi: Compressing LLMs via Calibration-Guided Sparse Dictionary Learning | Automatic Metrics | Not reported | Accuracy, Perplexity | Calibration |
| AutoPK: Leveraging LLMs and a Hybrid Similarity Metric for Advanced Retrieval of Pharmacokinetic Data from Complex Tables and Documents | Automatic Metrics | Not reported | F1, Precision | Not reported |
| Compute-Optimal Quantization-Aware Training | Automatic Metrics | Not reported | Accuracy, Precision | Not reported |
| Critique-Coder: Enhancing Coder Models by Critique Reinforcement Learning | Not reported | LiveCodeBench | Not reported | Not reported |
| Benefits and Pitfalls of Reinforcement Learning for Language Model Planning: A Theoretical Perspective | Automatic Metrics | Not reported | Accuracy | Not reported |
| SciTS: Scientific Time Series Understanding and Generation with LLMs | Automatic Metrics | Not reported | Precision | Not reported |
Researcher Workflow (Detailed)

Checklist

Each item below falls short of its coverage target and is flagged as a replication risk (the flagging rule is sketched after this list).

  • Papers with explicit human feedback: 20% (target 45%)
  • Papers reporting quality controls: 12% (target 30%)
  • Papers naming benchmarks/datasets: 12% (target 35%)
  • Papers naming evaluation metrics: 16% (target 35%)
  • Papers with known rater population: 4% (target 35%)
  • Papers with known annotation unit: 0% (target 35%)
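
A minimal sketch of that flagging rule; the coverage and target values come from the checklist above, while the function and variable names are illustrative assumptions.

```python
def replication_risk(coverage_pct: float, target_pct: float) -> bool:
    """Flag a protocol dimension whose coverage falls below its target."""
    return coverage_pct < target_pct

# Checklist values from this archive slice: dimension -> (coverage %, target %).
checklist = {
    "explicit human feedback": (20.0, 45.0),
    "quality controls": (12.0, 30.0),
    "benchmarks/datasets named": (12.0, 35.0),
    "evaluation metrics named": (16.0, 35.0),
    "known rater population": (4.0, 35.0),
    "known annotation unit": (0.0, 35.0),
}

for dimension, (coverage, target) in checklist.items():
    if replication_risk(coverage, target):
        print(f"Gap: {dimension} ({coverage:.0f}% vs {target:.0f}% target)")
```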

Strengths

  • This hub still surfaces a concentrated paper set for protocol triage and replication planning.

Known Gaps

  • Only 12% of papers report quality controls; prioritize calibration/adjudication evidence.
  • Rater population is under-specified (4% coverage).
  • Annotation unit is under-specified (0% coverage).

Suggested Next Analyses

  • Stratify by benchmark (GenEval vs GSM8K) before comparing methods.
  • Track metric sensitivity by reporting both accuracy and agreement.
  • Add inter-annotator agreement checks (e.g., Cohen's kappa; see the sketch below) when reproducing these protocols.
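
A minimal sketch of one such agreement check: Cohen's kappa for two raters over categorical labels, in pure Python with no dependencies. The label set and variable names are illustrative.

```python
from collections import Counter

def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Chance-corrected agreement between two raters on the same items."""
    assert len(rater_a) == len(rater_b) and rater_a, "need paired, non-empty labels"
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Expected agreement if both raters labeled independently at their base rates.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # degenerate case: both raters always use the same single label
    return (observed - expected) / (1.0 - expected)

# Example: moderate agreement on pass/fail judgments (kappa = 0.5).
print(cohens_kappa(["pass", "pass", "fail", "pass"],
                   ["pass", "fail", "fail", "pass"]))
```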

Known Limitations
  • Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.
Research Utility Snapshot (Detailed)

Evaluation Modes

  • Automatic Metrics (11)
  • Llm As Judge (2)
  • Simulation Env (1)

Top Metrics

  • Accuracy (3)
  • Agreement (1)
  • Cost (1)

Top Benchmarks

  • GenEval (1)
  • GSM8K (1)
  • HumanEval+ (1)
  • LiveCodeBench (1)

Quality Controls

  • Calibration (3)
