
Daily Archive

HFEPX Fortnight Archive: 2025-F23

Updated from the current HFEPX corpus (Feb 27, 2026). This archive page groups 15 papers. Common evaluation modes: Automatic Metrics, Simulation Env. Most common rater population: Domain Experts. Frequently cited benchmark: Retrieval. Common metric signal: accuracy. Use this page to compare protocol setups, judge behavior, and labeling-design decisions before running new eval experiments. The newest paper in this set is from Nov 15, 2025.

Papers: 15 · Last published: Nov 15, 2025

Research Narrative

Grounded narrative · Model: deterministic-grounded · Source: persisted

This page tracks 15 papers for HFEPX Fortnight Archive: 2025-F23. Dominant protocol signals include automatic metrics and simulation environments, with frequent benchmark focus on Retrieval and Cv-Bench and metric focus on accuracy and cost. Use the grounded sections below to prioritize reproducible protocol choices, benchmark-matched comparisons, and judge-vs-human evaluation checks.

Why This Matters For Eval Research

Protocol Takeaways

Benchmark Interpretation

  • Retrieval appears in 20% of hub papers (3/15); use this cohort for benchmark-matched comparisons.
  • Cv-Bench appears in 6.7% of hub papers (1/15); use this cohort for benchmark-matched comparisons.

Metric Interpretation

  • accuracy is reported in 40% of hub papers (6/15); compare with a secondary metric before ranking methods.
  • cost is reported in 13.3% of hub papers (2/15); compare with a secondary metric before ranking methods.

Researcher Checklist

  • Close gap on Papers with explicit human feedback. Coverage is a replication risk (20% vs 45% target).
  • Close gap on Papers reporting quality controls. Coverage is a replication risk (0% vs 30% target).
  • Maintain strength on Papers naming benchmarks/datasets. Coverage is strong (53.3% vs 35% target).
  • Maintain strength on Papers naming evaluation metrics. Coverage is strong (60% vs 35% target).
  • Close gap on Papers with known rater population. Coverage is a replication risk (6.7% vs 35% target).
  • Close gap on Papers with known annotation unit. Coverage is a replication risk (0% vs 35% target).
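The coverage-versus-target checks in the checklist above can be reproduced with a short script. The coverage and target values are copied from this page; the function and dictionary names are illustrative, not part of the HFEPX tooling.

```python
# Flag replication risks by comparing protocol coverage against targets.
# Values are taken from the Researcher Checklist above; names are illustrative.
CHECKS = {
    "explicit human feedback": (0.200, 0.45),
    "quality controls": (0.000, 0.30),
    "named benchmarks/datasets": (0.533, 0.35),
    "named evaluation metrics": (0.600, 0.35),
    "known rater population": (0.067, 0.35),
    "known annotation unit": (0.000, 0.35),
}

def classify(coverage: float, target: float) -> str:
    """A check is a replication risk when coverage falls below its target."""
    return "replication risk" if coverage < target else "strong"

for name, (coverage, target) in CHECKS.items():
    print(f"{name}: {coverage:.1%} vs {target:.0%} target -> {classify(coverage, target)}")
```

Under these numbers, four of the six checks come out as replication risks, matching the "Close gap" items above.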


Suggested Reading Order

  1. EARL: Entropy-Aware RL Alignment of LLMs for Reliable RTL Code Generation

    Start here for detailed protocol reporting, including rater and quality-control evidence.

  2. CLARITY: Contextual Linguistic Adaptation and Accent Retrieval for Dual-Bias Mitigation in Text-to-Speech Generation

    Also reports detailed protocol, including rater and quality-control evidence.

  3. Multimodal Peer Review Simulation with Actionable To-Do Recommendations for Community-Aware Manuscript Revisions

    Also reports detailed protocol, including rater and quality-control evidence.

  4. Mastering Olympiad-Level Physics with Artificial Intelligence

    Adds automatic metrics for broader coverage within this hub.

  5. Chain of Summaries: Summarization Through Iterative Questioning

    Adds automatic metrics for broader coverage within this hub.

  6. State of the Art in Text Classification for South Slavic Languages: Fine-Tuning or Prompting?

    Adds automatic metrics for broader coverage within this hub.

  7. Intelligence per Watt: Measuring Intelligence Efficiency of Local AI

    Adds automatic metrics for broader coverage within this hub.

  8. Beyond Fact Retrieval: Episodic Memory for RAG with Generative Semantic Workspaces

    Adds automatic metrics for broader coverage within this hub.

Known Limitations

  • No papers (0%) report quality controls; prioritize calibration/adjudication evidence.
  • Rater population is under-specified (6.7% coverage).
  • Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Links

automatic_metrics vs simulation_env

both=0, left_only=14, right_only=1

No papers use both Automatic Metrics and Simulation Env: 14 use automatic metrics only, and 1 uses a simulation environment only.
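The both/left-only/right-only split is a plain set comparison over paper cohorts. The paper IDs below are hypothetical placeholders sized to match the counts reported on this page (14 automatic-metrics papers, 1 simulation-env paper, no overlap); only the set operations are the point.

```python
# Compare two evaluation-mode cohorts as sets of paper IDs.
# IDs are placeholders matching the counts on this page; no overlap by construction.
automatic_metrics = {f"paper-{i}" for i in range(14)}
simulation_env = {"paper-14"}

both = automatic_metrics & simulation_env        # papers using both modes
left_only = automatic_metrics - simulation_env   # automatic metrics only
right_only = simulation_env - automatic_metrics  # simulation env only

print(f"both={len(both)}, left_only={len(left_only)}, right_only={len(right_only)}")
# prints: both=0, left_only=14, right_only=1
```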

Benchmark Brief

Cv-Bench

Coverage: 1 paper (6.7%)

Example: Long Grounded Thoughts: Synthesizing Visual Problems and Reasoning Chains at Scale

Benchmark Brief

MATH

Coverage: 1 paper (6.7%)

Example: Long Grounded Thoughts: Synthesizing Visual Problems and Reasoning Chains at Scale

Metric Brief

latency

Coverage: 2 papers (13.3%)

Examples: Intelligence per Watt: Measuring Intelligence Efficiency of Local AI; OckBench: Measuring the Efficiency of LLM Reasoning
