
HFEPX Daily Archive: 2026-02-14

Updated from the current HFEPX corpus (Feb 27, 2026). This daily page groups 5 papers. Common evaluation modes: Simulation Env, Automatic Metrics. Common annotation unit: Multi Dim Rubric. Frequent quality control: Inter Annotator Agreement Reported. Common metric signal: agreement. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. The newest paper in this set is from Feb 14, 2026.

Papers: 5 · Last published: Feb 14, 2026

Research Narrative

Grounded narrative · Model: deterministic-grounded

Updated from the current HFEPX corpus (Feb 27, 2026). This page covers 5 papers centered on the HFEPX daily archive for 2026-02-14. Common evaluation modes include Simulation Env and Automatic Metrics, with benchmark emphasis spread across multiple datasets. Metric concentration includes agreement and coherence, and the agentic footprint highlights Multi Agent. Use the anchored takeaways below to compare protocol choices, quality-control patterns, and evidence depth before allocating new eval budget.

Why This Matters For Eval Research

Protocol Takeaways

Metric Interpretation

  • agreement is a commonly reported metric and should be paired with protocol context before ranking methods.
  • 1 paper (20%) mentions agreement.
  • Most common evaluation mode among those papers: Human Eval.
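Since agreement is the dominant metric signal in this window, it helps to recall how a chance-corrected agreement statistic is computed before comparing reported numbers. The sketch below implements Cohen's kappa for two raters; the rater labels are hypothetical and do not come from any paper on this page.

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two raters labeling the same items.

    Undefined (division by zero) when expected agreement is exactly 1.
    """
    assert len(a) == len(b), "raters must label the same items"
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Expected chance agreement from each rater's marginal label counts.
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[k] * cb[k] for k in ca.keys() | cb.keys()) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical rubric labels from two annotators (illustration only).
r1 = ["good", "good", "bad", "good", "bad", "bad"]
r2 = ["good", "bad", "bad", "good", "bad", "good"]
print(round(cohens_kappa(r1, r2), 3))  # prints 0.333
```

Raw percent agreement here is 4/6 ≈ 0.67, but kappa drops to 0.33 once chance agreement is removed, which is why headline agreement numbers should always be read alongside the protocol that produced them.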

Researcher Checklist

  • Papers with explicit human feedback: Coverage is usable but incomplete (40% vs 45% target).
  • Papers reporting quality controls: Coverage is usable but incomplete (20% vs 30% target).
  • Papers naming benchmarks/datasets: Coverage is a replication risk (0% vs 35% target).
  • Papers naming evaluation metrics: Coverage is strong (40% vs 35% target).
  • Papers with known rater population: Coverage is a replication risk (0% vs 35% target).
  • Papers with known annotation unit: Coverage is a replication risk (20% vs 35% target).
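The checklist labels above appear to follow a simple threshold rule. The sketch below reproduces them under an assumed rule reverse-engineered from this page's own numbers (the 0.6 floor ratio is a guess that happens to match all six rows, not a documented HFEPX constant):

```python
def coverage_flag(coverage, target, floor_ratio=0.6):
    """Map a coverage percentage against its target to a checklist label.

    Assumed rule: at/above target -> strong; within floor_ratio of the
    target -> usable but incomplete; otherwise -> replication risk.
    """
    if coverage >= target:
        return "strong"
    if target > 0 and coverage / target >= floor_ratio:
        return "usable but incomplete"
    return "replication risk"

# The six checklist rows from this page: (coverage %, target %).
checklist = {
    "explicit human feedback": (40, 45),
    "quality controls": (20, 30),
    "benchmarks/datasets": (0, 35),
    "evaluation metrics": (40, 35),
    "rater population": (0, 35),
    "annotation unit": (20, 35),
}
for field, (cov, tgt) in checklist.items():
    print(f"{field}: {coverage_flag(cov, tgt)} ({cov}% vs {tgt}% target)")
```

Running this reproduces all six labels shown in the checklist, e.g. 40% vs a 45% target is "usable but incomplete" while 20% vs a 35% target falls to "replication risk".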


Suggested Reading Order

  1. A Comparative Analysis of Social Network Topology in Reddit and Moltbook

     Start with this anchor paper for scope and protocol framing. Covers Simulation Env.

  2. From Pixels to Policies: Reinforcing Spatial Reasoning in Language Models for Content-Aware Layout Design

     Covers Simulation Env. Includes human-feedback signal: Critique Edit.

  3. ADAB: Arabic Dataset for Automated Politeness Benchmarking -- A Large-Scale Resource for Computational Sociopragmatics

     Covers Human Eval.

  4. OR-Agent: Bridging Evolutionary Search and Structured Research for Automated Algorithm Discovery

     Covers Simulation Env.

  5. Small Reward Models via Backward Inference

     Covers Automatic Metrics. Includes human-feedback signal: Rubric Rating.

Known Limitations

  • Narrative synthesis is grounded in metadata and abstracts only; full-paper method details may be missing.
  • Extraction fields are conservative and can under-report implicit protocol details.
  • Daily and rolling archives can be sparse and should be cross-checked with neighboring windows.

Research Utility Links

human_eval vs automatic_metrics

both=0, left_only=1, right_only=1: 0 papers use both Human Eval and Automatic Metrics.

simulation_env vs automatic_metrics

both=0, left_only=3, right_only=1: 0 papers use both Simulation Env and Automatic Metrics.

simulation_env vs human_eval

both=0, left_only=3, right_only=1: 0 papers use both Simulation Env and Human Eval.
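The overlap counts above are plain set comparisons over each mode's paper set. The sketch below reproduces them; the paper-to-mode assignment is inferred from the reading-order notes on this page (papers 1, 2, 4 cover Simulation Env; paper 3 Human Eval; paper 5 Automatic Metrics) and is an assumption, not verified against the underlying corpus.

```python
# Hypothetical paper-to-mode mapping consistent with this page's counts.
modes = {
    "simulation_env": {1, 2, 4},
    "human_eval": {3},
    "automatic_metrics": {5},
}

def overlap(left, right):
    """both/left_only/right_only counts for two evaluation-mode sets."""
    l, r = modes[left], modes[right]
    return {
        "both": len(l & r),        # papers using both modes
        "left_only": len(l - r),   # papers using only the left mode
        "right_only": len(r - l),  # papers using only the right mode
    }

print(overlap("simulation_env", "automatic_metrics"))
# prints {'both': 0, 'left_only': 3, 'right_only': 1}
```

With this assignment every pairwise intersection is empty, which matches the both=0 entries in all three comparisons above.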
