
HFEPX Hub

Coding + Long Horizon Papers

Updated from the current HFEPX corpus (Feb 27, 2026). This hub page groups 27 papers. Common evaluation modes: Automatic Metrics and Simulation Env. Most common rater population: Domain Experts. Common annotation unit: Trajectory. Frequently cited benchmark: Retrieval. Common metric signal: cost. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. The newest paper in this set is from Feb 26, 2026.

Papers: 27 · Last published: Feb 26, 2026
Tags: Coding, Long Horizon

Research Narrative

Grounded narrative · Model: deterministic-grounded · Source: persisted

Updated from the current HFEPX corpus (Feb 27, 2026). This page tracks 27 papers for Coding + Long Horizon. Dominant protocol signals include automatic metrics, simulation environments, and human evaluation, with frequent benchmark focus on Retrieval and SWE-bench and metric focus on cost and accuracy. Use the grounded sections below to prioritize reproducible protocol choices, benchmark-matched comparisons, and judge-vs-human evaluation checks.

Why This Matters For Eval Research

Protocol Takeaways

Benchmark Interpretation

  • Retrieval appears in 14.8% of hub papers (4/27); use this cohort for benchmark-matched comparisons.
  • SWE-bench appears in 11.1% of hub papers (3/27); use this cohort for benchmark-matched comparisons.

Metric Interpretation

  • cost is reported in 29.6% of hub papers (8/27); compare with a secondary metric before ranking methods.
  • accuracy is reported in 25.9% of hub papers (7/27); compare with a secondary metric before ranking methods.
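
For readers reproducing these coverage numbers, the sketch below shows how a benchmark or metric coverage fraction (e.g. 4/27 = 14.8%) can be computed from per-paper hub metadata. The record schema and field names (`benchmarks`, `metrics`) are assumptions for illustration, not the hub's actual data model.

```python
# Minimal sketch, assuming hub metadata is available as per-paper records.
papers = [
    {"id": "p01", "benchmarks": ["Retrieval"], "metrics": ["cost", "accuracy"]},
    {"id": "p02", "benchmarks": ["SWE-bench"], "metrics": ["cost"]},
    {"id": "p03", "benchmarks": [], "metrics": []},
    # ... remaining papers in the 27-paper hub
]

def coverage(papers, field, value):
    """Fraction of papers whose `field` list contains `value`, as (count, total, pct)."""
    count = sum(1 for p in papers if value in p.get(field, []))
    total = len(papers)
    return count, total, (100.0 * count / total if total else 0.0)

count, total, pct = coverage(papers, "benchmarks", "Retrieval")
print(f"Retrieval appears in {pct:.1f}% of hub papers ({count}/{total})")
```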

Researcher Checklist

  • Close the gap on papers with explicit human feedback: coverage is a replication risk (11.1% vs 45% target).
  • Close the gap on papers reporting quality controls: coverage is a replication risk (0% vs 30% target).
  • Tighten coverage on papers naming benchmarks/datasets: coverage is usable but incomplete (29.6% vs 35% target).
  • Maintain strength on papers naming evaluation metrics: coverage is strong (63% vs 35% target).
  • Close the gap on papers with known rater population: coverage is a replication risk (14.8% vs 35% target).
  • Maintain strength on papers with known annotation unit: coverage is strong (51.9% vs 35% target).
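
The status labels in this checklist can be reproduced with a simple banding rule over coverage versus target. The sketch below assumes a 10-point slack band separating "usable but incomplete" from "replication risk"; that margin is an assumption chosen to match the labels on this page, not a documented hub rule.

```python
def coverage_status(coverage_pct: float, target_pct: float, slack: float = 10.0) -> str:
    """Classify a coverage signal against its target.

    The band labels mirror the checklist wording; the `slack` margin is an
    assumption, not a documented hub rule.
    """
    if coverage_pct >= target_pct:
        return "strong"
    if coverage_pct >= target_pct - slack:
        return "usable but incomplete"
    return "replication risk"

checklist = {
    "Papers with explicit human feedback": (11.1, 45.0),
    "Papers reporting quality controls": (0.0, 30.0),
    "Papers naming benchmarks/datasets": (29.6, 35.0),
    "Papers naming evaluation metrics": (63.0, 35.0),
    "Papers with known rater population": (14.8, 35.0),
    "Papers with known annotation unit": (51.9, 35.0),
}

for signal, (cov, target) in checklist.items():
    print(f"{signal}: {coverage_status(cov, target)} ({cov}% vs {target}% target)")
```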


Suggested Reading Order

  1. Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching

    Start here for detailed protocol reporting, including rater and quality-control evidence.

  2. How Do Latent Reasoning Methods Perform Under Weak and Strong Supervision?

    Also offers detailed protocol reporting, including rater and quality-control evidence.

  3. GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL

    Also offers detailed protocol reporting, including rater and quality-control evidence.

  4. FrameRef: A Framing Dataset and Simulation Testbed for Modeling Bounded Rational Information Health

    Include a human-eval paper to anchor calibration against automated judge settings.

  5. SWE-Protégé: Learning to Selectively Collaborate With an Expert Unlocks Small Language Models as Software Engineering Agents

    Adds automatic metrics for broader coverage within this hub.

  6. Self-Correcting VLA: Online Action Refinement via Sparse World Imagination

    Adds simulation environments for broader coverage within this hub.

  7. Structurally Aligned Subtask-Level Memory for Software Engineering Agents

    Adds automatic metrics for broader coverage within this hub.

  8. A Hierarchical Multi-Agent System for Autonomous Discovery in Geoscientific Data Archives

    Adds automatic metrics for broader coverage within this hub.

Known Limitations

  • No papers in this hub report quality controls; prioritize calibration/adjudication evidence.
  • Rater population is under-specified (14.8% coverage).
  • Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Links

human_eval vs automatic_metrics

both=0, left_only=1, right_only=20

0 papers use both Human Eval and Automatic Metrics.

automatic_metrics vs simulation_env

both=0, left_only=20, right_only=7

0 papers use both Automatic Metrics and Simulation Env.

simulation_env vs human_eval

both=1, left_only=6, right_only=0

1 paper uses both Simulation Env and Human Eval.
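
The both/left_only/right_only counts above are plain set overlaps over per-paper evaluation-mode tags. The sketch below reproduces the reported counts from hypothetical paper-ID sets; only the overlap counts themselves are taken from this page.

```python
def overlap(left: set[str], right: set[str]) -> dict[str, int]:
    """Count papers tagged with both modes, only the left mode, or only the right mode."""
    return {"both": len(left & right),
            "left_only": len(left - right),
            "right_only": len(right - left)}

# Hypothetical paper IDs; only the resulting overlap counts match the hub page.
human_eval        = {"p06"}                                          # 1 paper
automatic_metrics = {f"p{i:02d}" for i in range(7, 27)}              # 20 papers
simulation_env    = {"p06"} | {f"p{i:02d}" for i in range(27, 33)}   # 7 papers

print(overlap(human_eval, automatic_metrics))     # {'both': 0, 'left_only': 1, 'right_only': 20}
print(overlap(automatic_metrics, simulation_env)) # {'both': 0, 'left_only': 20, 'right_only': 7}
print(overlap(simulation_env, human_eval))        # {'both': 1, 'left_only': 6, 'right_only': 0}
```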

Top Papers

Related Hubs