
Weekly Archive

HFEPX Weekly Archive: 2026-W05

Updated from the current HFEPX corpus (Feb 27, 2026). This weekly page groups 11 papers. Common evaluation modes: Automatic Metrics, Simulation Env. Most common rater population: Domain Experts. Common annotation unit: Freeform. Frequently cited benchmark: ALFWorld. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling-design decisions before running new eval experiments. The newest paper in this set is from Feb 1, 2026.

Papers: 11 · Last published: Feb 1, 2026

Research Narrative

Grounded narrative · Model: deterministic-grounded · Source: persisted

Updated from the current HFEPX corpus (Feb 27, 2026). This page tracks 11 papers for HFEPX Weekly Archive 2026-W05. Dominant protocol signals are automatic metrics and simulation environments, with frequent benchmark focus on ALFWorld and Amo-Bench, and metric focus on accuracy and coherence. Use the grounded sections below to prioritize reproducible protocol choices, benchmark-matched comparisons, and judge-vs-human evaluation checks.

Why This Matters For Eval Research

Protocol Takeaways

Benchmark Interpretation

  • ALFWorld appears in 9.1% of hub papers (1/11); use this cohort for benchmark-matched comparisons.
  • Amo-Bench appears in 9.1% of hub papers (1/11); use this cohort for benchmark-matched comparisons.

Metric Interpretation

  • accuracy is reported in 27.3% of hub papers (3/11); compare with a secondary metric before ranking methods.
  • coherence is reported in 9.1% of hub papers (1/11); compare with a secondary metric before ranking methods.
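The advice above, to compare with a secondary metric before ranking methods, can be made concrete as an agreement check: only declare a winner when both metrics order the two methods the same way. This is a minimal sketch; the method labels and scores are hypothetical, not taken from any paper in this hub.

```python
# Rank two methods only when a primary metric (e.g. accuracy) and a
# secondary metric (e.g. coherence) agree on the ordering; otherwise
# flag the comparison as inconclusive. Scores are illustrative.
def compare(primary_a: float, primary_b: float,
            secondary_a: float, secondary_b: float) -> str:
    """Return 'A', 'B', or 'inconclusive' based on metric agreement."""
    p = (primary_a > primary_b) - (primary_a < primary_b)    # sign of primary gap
    s = (secondary_a > secondary_b) - (secondary_a < secondary_b)
    if p == s and p != 0:
        return "A" if p > 0 else "B"
    return "inconclusive"

print(compare(0.82, 0.79, 0.74, 0.70))  # both metrics favor A -> "A"
print(compare(0.82, 0.79, 0.61, 0.68))  # metrics disagree -> "inconclusive"
```

The point of the check is to avoid ranking on a single noisy signal; a disagreement between metrics is itself a finding worth reporting.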

Researcher Checklist

  • Close gap on Papers with explicit human feedback. Coverage is a replication risk (0% vs 45% target).
  • Close gap on Papers reporting quality controls. Coverage is a replication risk (0% vs 30% target).
  • Tighten coverage on Papers naming benchmarks/datasets. Coverage is usable but incomplete (27.3% vs 35% target).
  • Maintain strength on Papers naming evaluation metrics. Coverage is strong (45.5% vs 35% target).
  • Close gap on Papers with known rater population. Coverage is a replication risk (9.1% vs 35% target).
  • Close gap on Papers with known annotation unit. Coverage is a replication risk (18.2% vs 35% target).
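The checklist verdicts above can be reproduced from the coverage figures on this page. The numbers are copied from the checklist itself; the banding rule (a gap of at most 10 points counts as "usable but incomplete") is an assumption about how the hub classifies coverage, not its documented logic.

```python
# Classify protocol-reporting coverage against the hub's targets.
# Figures are copied from this page's checklist; the 10-point
# "usable but incomplete" band is an assumed rule.
COVERAGE = {
    "explicit human feedback":    (0.0, 45.0),
    "quality controls":           (0.0, 30.0),
    "named benchmarks/datasets":  (27.3, 35.0),
    "named evaluation metrics":   (45.5, 35.0),
    "known rater population":     (9.1, 35.0),
    "known annotation unit":      (18.2, 35.0),
}

def classify(actual: float, target: float) -> str:
    """Map a coverage figure to the labels used on this page."""
    if actual >= target:
        return "strong"
    if target - actual <= 10.0:
        return "usable but incomplete"
    return "replication risk"

for name, (actual, target) in COVERAGE.items():
    print(f"{name}: {classify(actual, target)} ({actual}% vs {target}% target)")
```

Running this reproduces the six verdicts in the checklist: one "strong", one "usable but incomplete", and four "replication risk" items.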


Suggested Reading Order

  1. What If We Allocate Test-Time Compute Adaptively?

    Start here for detailed protocol reporting, including rater and quality-control evidence.

  2. From Associations to Activations: Comparing Behavioral and Hidden-State Semantic Geometry in LLMs

    Also offers detailed protocol reporting, including rater and quality-control evidence.

  3. Unmasking Reasoning Processes: A Process-aware Benchmark for Evaluating Structural Mathematical Reasoning in LLMs

    Also offers detailed protocol reporting, including rater and quality-control evidence.

  4. KBVQ-MoE: KLT-guided SVD with Bias-Corrected Vector Quantization for MoE Large Language Models

    Adds automatic metrics for broader coverage within this hub.

  5. Embodied Task Planning via Graph-Informed Action Generation with Large Language Model

    Adds simulation environments for broader coverage within this hub.

  6. Indic-TunedLens: Interpreting Multilingual Models in Indian Languages

    Adds automatic metrics for broader coverage within this hub.

  7. INSURE-Dial: A Phase-Aware Conversational Dataset & Benchmark for Compliance Verification and Phase Detection

    Adds automatic metrics for broader coverage within this hub.

  8. Understanding LLM Failures: A Multi-Tape Turing Machine Analysis of Systematic Errors in Language Model Reasoning

    Adds automatic metrics for broader coverage within this hub.

Known Limitations

  • No papers (0%) report quality controls; prioritize calibration/adjudication evidence.
  • Rater population is under-specified (9.1% coverage).
  • Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Links

automatic_metrics vs simulation_env

both=1, left_only=9, right_only=1

1 paper uses both Automatic Metrics and Simulation Env.
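The both/left_only/right_only split above is a standard indicator-style partition of two cohorts. A minimal sketch with Python sets follows; the paper IDs are hypothetical placeholders, since this page does not name which papers fall in each bucket.

```python
# Partition two cohorts into both / left_only / right_only, matching
# the counts reported above (both=1, left_only=9, right_only=1).
# Paper IDs are placeholders, not real paper identifiers.
automatic_metrics = {f"paper_{i}" for i in range(10)}  # 10 papers use automatic metrics
simulation_env = {"paper_0", "paper_10"}               # 2 papers use simulation environments

both = automatic_metrics & simulation_env        # intersection
left_only = automatic_metrics - simulation_env   # automatic metrics only
right_only = simulation_env - automatic_metrics  # simulation env only

print(len(both), len(left_only), len(right_only))  # 1 9 1
```

The same split is what pandas produces with `merge(..., how="outer", indicator=True)`, which is likely where the left_only/right_only naming comes from.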

Benchmark Brief

ALFWorld

Coverage: 1 paper (9.1%)

1 paper (9.1%) mentions ALFWorld.

Examples: Embodied Task Planning via Graph-Informed Action Generation with Large Language Model

Benchmark Brief

Amo-Bench

Coverage: 1 paper (9.1%)

1 paper (9.1%) mentions Amo-Bench.

Examples: What If We Allocate Test-Time Compute Adaptively?

Benchmark Brief

MATH-500

Coverage: 1 paper (9.1%)

1 paper (9.1%) mentions MATH-500.

Examples: What If We Allocate Test-Time Compute Adaptively?

Metric Brief

coherence

Coverage: 1 paper (9.1%)

1 paper (9.1%) mentions coherence.

Examples: Embodied Task Planning via Graph-Informed Action Generation with Large Language Model

Metric Brief

cost

Coverage: 1 paper (9.1%)

1 paper (9.1%) mentions cost.

Examples: Embodied Task Planning via Graph-Informed Action Generation with Large Language Model
