

HFEPX Daily Archive: 2025-09-30


Updated from the current HFEPX corpus (Mar 10, 2026). 8 papers are grouped in this daily page. Common evaluation modes: automatic metrics, LLM-as-judge. Most common rater population: domain experts. Common annotation unit: multi-dimensional rubric. Frequent quality control: inter-annotator agreement reported. Frequently cited benchmark: AURORA-Bench. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. The newest paper in this set is from Sep 30, 2025.

Papers: 8 · Last published: Sep 30, 2025

Researcher Quick Triage

Use this archive page for time-slice monitoring (what changed in evaluation methods, metrics, and protocol quality this period). Quality band: Medium.

High-Signal Coverage

100.0%

8 of 8 papers are not flagged as low-signal.

Benchmark Anchors

25.0%

Papers with benchmark/dataset mentions in extraction output.

Metric Anchors

25.0%

Papers with reported metric mentions in extraction output.

  • 1 paper reports explicit quality controls for this archive period.
  • Prioritize papers with both benchmark and metric anchors for reliable longitudinal comparisons (a filtering sketch follows below).

Primary action: Use this slice as early signal only; benchmark/metric anchoring is limited for rigorous period-over-period claims.
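
A minimal filtering sketch for that triage step, assuming a simple per-paper record shape (the field names are illustrative, not the HFEPX extraction schema):

```python
# Keep only papers that carry both benchmark and metric anchors.
# Record shape is assumed for illustration.
papers = [
    {"title": "MENLO", "benchmarks": [], "metrics": ["agreement"]},
    {"title": "PrefDisco", "benchmarks": [], "metrics": ["accuracy"]},
    {"title": "EditReward", "benchmarks": ["GenAI-Bench", "AURORA-Bench"], "metrics": []},
]

anchored = [p for p in papers if p["benchmarks"] and p["metrics"]]
print([p["title"] for p in anchored])
# [] -- in this slice no paper has both anchors, hence "early signal only"
```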


Why This Time Slice Matters

  • 50% of papers report explicit human-feedback signals, led by pairwise preferences.
  • Automatic metrics appear in 25% of papers in this hub.
  • AURORA-Bench is a recurring benchmark anchor for cross-paper comparisons on this page.

Protocol Takeaways For This Period

  • The most common quality-control signal is inter-annotator agreement reporting (12.5% of papers); a kappa sketch follows this list.
  • Raters are mostly domain experts, and annotation commonly uses multi-dimensional rubrics; use this to scope replication staffing.
  • Pair this hub with a human_eval-heavy hub to validate judge-model calibration.
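
As a concrete reference for the agreement signal above, inter-annotator agreement is commonly summarized with Cohen's kappa; a minimal scikit-learn sketch with made-up ratings:

```python
# Cohen's kappa between two raters: chance-corrected agreement.
# The ratings below are invented for illustration.
from sklearn.metrics import cohen_kappa_score

rater_a = ["good", "bad", "good", "good", "bad", "good"]
rater_b = ["good", "bad", "bad", "good", "bad", "good"]

print(f"kappa = {cohen_kappa_score(rater_a, rater_b):.2f}")  # kappa = 0.67
```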

Start Here (Highest-Signal Papers In This Slice)

Ranked by protocol completeness and evidence density for faster period-over-period review.
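
A minimal sketch of one way such a completeness ranking could be computed, counting reported protocol fields per paper (the field names and scoring are assumptions, not this page's actual formula):

```python
# Hypothetical completeness score: count protocol-matrix fields that are
# reported (i.e., not "Not reported"). Assumed record shape, not the
# actual HFEPX ranking logic.
FIELDS = ("eval_modes", "benchmarks", "metrics", "quality_controls")

def completeness(paper: dict) -> int:
    """Number of protocol fields with a reported value."""
    return sum(paper.get(f, "Not reported") != "Not reported" for f in FIELDS)

papers = [
    {"title": "MENLO", "eval_modes": "Automatic metrics", "benchmarks": "Not reported",
     "metrics": "Agreement", "quality_controls": "Inter-annotator agreement reported"},
    {"title": "LD-MoLE"},  # nothing reported
]
ranked = sorted(papers, key=completeness, reverse=True)
print([(p["title"], completeness(p)) for p in ranked])
# [('MENLO', 3), ('LD-MoLE', 0)]
```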

Protocol Matrix (Top 10)

Quickly compare method ingredients across this archive slice.

| Paper | Date | Eval Modes | Benchmarks | Metrics | Quality Controls |
| --- | --- | --- | --- | --- | --- |
| MENLO: From Preferences to Proficiency -- Evaluating and Modeling Native-like Quality Across 47 Languages | Sep 30, 2025 | Automatic metrics | Not reported | Agreement | Inter-annotator agreement reported |
| PrefDisco: Benchmarking Proactive Personalized Reasoning | Sep 30, 2025 | Automatic metrics | Not reported | Accuracy | Not reported |
| EditReward: A Human-Aligned Reward Model for Instruction-Guided Image Editing | Sep 30, 2025 | LLM-as-judge | GenAI-Bench, AURORA-Bench | Not reported | Not reported |
| BiasFreeBench: a Benchmark for Mitigating Bias in Large Language Model Responses | Sep 30, 2025 | Not reported | BiasFreeBench | Not reported | Not reported |
| ProfVLM: A Lightweight Video-Language Model for Multi-View Proficiency Estimation | Sep 30, 2025 | Not reported | Not reported | Not reported | Not reported |
| Your Agent May Misevolve: Emergent Risks in Self-evolving LLM Agents | Sep 30, 2025 | Not reported | Not reported | Not reported | Not reported |
| Latent Thinking Optimization: Your Latent Reasoning Language Model Secretly Encodes Reward Signals in Its Latent Thoughts | Sep 30, 2025 | Not reported | Not reported | Not reported | Not reported |
| LD-MoLE: Learnable Dynamic Routing for Mixture of LoRA Experts | Sep 30, 2025 | Not reported | Not reported | Not reported | Not reported |

Researcher Workflow (Detailed)

Checklist

  • Strong: Papers with explicit human feedback

    Coverage is strong (50% vs 45% target).

  • Gap: Papers reporting quality controls

    Coverage is a replication risk (12.5% vs 30% target).

  • Moderate: Papers naming benchmarks/datasets

    Coverage is usable but incomplete (25% vs 35% target).

  • Moderate: Papers naming evaluation metrics

    Coverage is usable but incomplete (25% vs 35% target).

  • Strong: Papers with known rater population

    Coverage is strong (50% vs 35% target).

  • Moderate: Papers with known annotation unit

    Coverage is usable but incomplete (25% vs 35% target).
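
A sketch of how the Strong/Moderate/Gap bands above could be derived; the 50%-of-target cutoff for Moderate is inferred from the reported numbers, not a documented rule:

```python
# Banding rule inferred from the checklist: Strong at or above target,
# Moderate at half the target or above, Gap below that. An assumption,
# not this page's documented logic.
def band(coverage: float, target: float) -> str:
    if coverage >= target:
        return "Strong"
    if coverage >= 0.5 * target:
        return "Moderate"
    return "Gap"

print(band(50.0, 45.0))  # Strong
print(band(25.0, 35.0))  # Moderate
print(band(12.5, 30.0))  # Gap
```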

Strengths

  • Strong human-feedback signal (50% of papers).

Known Gaps

  • Only 12.5% of papers report quality controls; prioritize calibration/adjudication evidence.
  • Benchmark coverage is thin (25% of papers mention benchmarks/datasets).

Suggested Next Analyses

  • Pair this hub with a human_eval-heavy hub to validate judge-model calibration.
  • Stratify by benchmark (AURORA-Bench vs EditReward-Bench) before comparing methods.
  • Track metric sensitivity by reporting both accuracy and agreement.

Known Limitations

  • Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Snapshot (Detailed)

Evaluation Modes

  • Automatic metrics (2)
  • LLM-as-judge (1)

Top Metrics

  • Accuracy (1)
  • Agreement (1)

Top Benchmarks

  • AURORA-Bench (1)
  • EditReward-Bench (1)
  • GenAI-Bench (1)

Quality Controls

  • Inter-annotator agreement reported (1)
