
Weekly Archive

HFEPX Weekly Archive: 2025-W41

Updated from the current HFEPX corpus (Feb 27, 2026). This weekly page groups 20 papers. Common evaluation modes: Automatic Metrics, Human Eval. Most common rater population: Domain Experts. Common annotation unit: Pairwise. Frequent quality control: Calibration. Frequently cited benchmark: AlpacaEval. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. The newest paper in this set is from Oct 12, 2025.

Papers: 20 · Last published: Oct 12, 2025

Research Narrative

Grounded narrative · Model: deterministic-grounded · Source: persisted

Updated from the current HFEPX corpus (Feb 27, 2026). This page tracks 20 papers for HFEPX Weekly Archive: 2025-W41. Dominant protocol signals include automatic metrics, human evaluation, and simulation environments, with frequent benchmark focus on AlpacaEval and Arena-Hard, and metric focus on accuracy and cost. Use the grounded sections below to prioritize reproducible protocol choices, benchmark-matched comparisons, and judge-vs-human evaluation checks.

Why This Matters For Eval Research

Protocol Takeaways

Benchmark Interpretation

  • AlpacaEval appears in 5% of hub papers (1/20); use this cohort for benchmark-matched comparisons.
  • Arena-Hard appears in 5% of hub papers (1/20); use this cohort for benchmark-matched comparisons.

Metric Interpretation

  • accuracy is reported in 25% of hub papers (5/20); compare with a secondary metric before ranking methods.
  • cost is reported in 10% of hub papers (2/20); compare with a secondary metric before ranking methods.
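The "compare with a secondary metric before ranking methods" guidance can be sketched as a two-key sort; the method names and scores below are hypothetical, not drawn from the hub papers:

```python
# Hypothetical per-method scores; names and values are illustrative only.
methods = {
    "method_a": {"accuracy": 0.82, "cost": 1.20},
    "method_b": {"accuracy": 0.82, "cost": 0.90},
    "method_c": {"accuracy": 0.78, "cost": 0.50},
}

# Rank primarily by accuracy (descending) and break ties by cost
# (ascending), so methods with equal accuracy are ordered by the
# secondary metric instead of arbitrarily.
ranked = sorted(
    methods,
    key=lambda m: (-methods[m]["accuracy"], methods[m]["cost"]),
)
print(ranked)  # ['method_b', 'method_a', 'method_c']
```

Here method_a and method_b tie on accuracy, so the cheaper method_b ranks first; a single-metric ranking would have ordered them arbitrarily.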

Researcher Checklist

  • Close the gap on papers with explicit human feedback: coverage is a replication risk (15% vs 45% target).
  • Close the gap on papers reporting quality controls: coverage is a replication risk (5% vs 30% target).
  • Tighten coverage on papers naming benchmarks/datasets: coverage is usable but incomplete (25% vs 35% target).
  • Maintain strength on papers naming evaluation metrics: coverage is strong (50% vs 35% target).
  • Close the gap on papers with known rater population: coverage is a replication risk (10% vs 35% target).
  • Close the gap on papers with known annotation unit: coverage is a replication risk (10% vs 35% target).
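As a minimal sketch, the coverage labels in the checklist can be reproduced from the stated (coverage, target) pairs; the 10-point "usable but incomplete" band is an assumption inferred from the labels above, not a rule stated by the hub:

```python
# (actual coverage, target) pairs taken from the checklist above.
coverage = {
    "explicit human feedback": (0.15, 0.45),
    "quality controls": (0.05, 0.30),
    "benchmarks/datasets named": (0.25, 0.35),
    "evaluation metrics named": (0.50, 0.35),
    "rater population known": (0.10, 0.35),
    "annotation unit known": (0.10, 0.35),
}

statuses = {}
for dim, (actual, target) in coverage.items():
    if actual >= target:
        status = "strong"
    elif actual >= target - 0.10:  # assumed 10-point tolerance band
        status = "usable but incomplete"
    else:
        status = "replication risk"
    statuses[dim] = status
    print(f"{dim}: {actual:.0%} vs {target:.0%} target -> {status}")
```

Under that assumed band, the script reproduces all six checklist labels, which suggests the hub applies a similar threshold rule.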

Suggested Reading Order

  1. FML-bench: Benchmarking Machine Learning Agents for Scientific Research

    Start here for detailed protocol reporting, including rater and quality-control evidence.

  2. Mapping Semantic & Syntactic Relationships with Geometric Rotation

    Also offers detailed protocol reporting, including rater and quality-control evidence.

  3. The Speech-LLM Takes It All: A Truly Fully End-to-End Spoken Dialogue State Tracking Approach

    Also offers detailed protocol reporting, including rater and quality-control evidence.

  4. Peeking inside the Black-Box: Reinforcement Learning for Explainable and Accurate Relation Extraction

    Include a human-eval paper to anchor calibration against automated judge settings.

  5. Chlorophyll-a Mapping and Prediction in the Mar Menor Lagoon Using C2RCC-Processed Sentinel 2 Imagery

    Adds automatic metrics for broader coverage within this hub.

  6. Verifying Chain-of-Thought Reasoning via Its Computational Graph

    Adds automatic metrics for broader coverage within this hub.

  7. Multimodal Prompt Optimization: Why Not Leverage Multiple Modalities for MLLMs

    Adds automatic metrics for broader coverage within this hub.

  8. FinAuditing: A Financial Taxonomy-Structured Multi-Document Benchmark for Evaluating LLMs

    Adds automatic metrics for broader coverage within this hub.

Known Limitations

  • Only 5% of papers report quality controls; prioritize calibration/adjudication evidence.
  • Rater population is under-specified (10% coverage).
  • Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Links

human_eval vs automatic_metrics

both=1, left_only=0, right_only=19

1 paper uses both Human Eval and Automatic Metrics.

automatic_metrics vs simulation_env

both=1, left_only=19, right_only=0

1 paper uses both Automatic Metrics and Simulation Env.

human_eval vs simulation_env

both=0, left_only=1, right_only=1

No papers use both Human Eval and Simulation Env.
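The both/left_only/right_only tallies above are plain set-overlap counts. A minimal sketch, using hypothetical paper IDs chosen to reproduce the human_eval vs automatic_metrics figures:

```python
# Hypothetical paper IDs; only the overlap arithmetic mirrors the page.
human_eval = {"paper_04"}  # the one paper reporting human eval
automatic_metrics = {f"paper_{i:02d}" for i in range(1, 21)}  # all 20

both = human_eval & automatic_metrics        # papers in both cohorts
left_only = human_eval - automatic_metrics   # human eval only
right_only = automatic_metrics - human_eval  # automatic metrics only

print(len(both), len(left_only), len(right_only))  # 1 0 19
```

The same three set operations produce each comparison on this page; only the membership sets change.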

Benchmark Brief

AlpacaEval

Coverage: 1 paper (5%)

1 paper (5%) mentions AlpacaEval.

Examples: Revisiting Self-Play Preference Optimization: On the Role of Prompt Difficulty

Benchmark Brief

Arena-Hard

Coverage: 1 paper (5%)

1 paper (5%) mentions Arena-Hard.

Examples: Revisiting Self-Play Preference Optimization: On the Role of Prompt Difficulty

Benchmark Brief

FML-bench

Coverage: 1 paper (5%)

1 paper (5%) mentions FML-bench.

Examples: FML-bench: Benchmarking Machine Learning Agents for Scientific Research

Metric Brief

calibration

Coverage: 1 paper (5%)

1 paper (5%) mentions calibration.

Examples: Chlorophyll-a Mapping and Prediction in the Mar Menor Lagoon Using C2RCC-Processed Sentinel 2 Imagery

Papers Published This Week
