Daily Archive

HFEPX Daily Archive: 2026-02-14

Updated from current HFEPX corpus (Feb 27, 2026). 5 papers are grouped in this daily page. Common evaluation modes: Simulation Env, Automatic Metrics. Common annotation unit: Multi Dim Rubric. Frequent quality control: Inter Annotator Agreement Reported. Common metric signal: agreement. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Feb 14, 2026.

Papers: 5 Last published: Feb 14, 2026 Global RSS

Research Narrative

Grounded narrative Model: deterministic-grounded

Updated from current HFEPX corpus (Feb 27, 2026). This page covers 5 papers centered on HFEPX Daily Archive: 2026-02-14. Common evaluation modes include Simulation Env, Automatic Metrics, with benchmark emphasis on multiple datasets. Metric concentration includes agreement, coherence, and the agentic footprint highlights Multi Agent. Use the anchored takeaways below to compare protocol choices, quality-control patterns, and evidence depth before allocating new eval budget.

Why This Matters For Eval Research

Evaluation emphasis: Simulation Env and Automatic Metrics appear frequently in this slice.

Evidence: A Comparative Analysis of Social Network Topology in Reddit and Moltbook , From Pixels to Policies: Reinforcing Spatial Reasoning in Language Models for Content-Aware Layout Design , ADAB: Arabic Dataset for Automated Politeness Benchmarking -- A Large-Scale Resource for Computational Sociopragmatics
Metric concentration: agreement, coherence is repeatedly reported in this group.

Evidence: From Pixels to Policies: Reinforcing Spatial Reasoning in Language Models for Content-Aware Layout Design , ADAB: Arabic Dataset for Automated Politeness Benchmarking -- A Large-Scale Resource for Computational Sociopragmatics , OR-Agent: Bridging Evolutionary Search and Structured Research for Automated Algorithm Discovery
Agentic behavior appears in Multi Agent settings, useful for tool-use and long-horizon evaluation triage.

Evidence: ADAB: Arabic Dataset for Automated Politeness Benchmarking -- A Large-Scale Resource for Computational Sociopragmatics , OR-Agent: Bridging Evolutionary Search and Structured Research for Automated Algorithm Discovery , Small Reward Models via Backward Inference

Protocol Takeaways

Pair this hub with llm_as_judge pages to benchmark automated-vs-human evaluation tradeoffs.

Evidence: OR-Agent: Bridging Evolutionary Search and Structured Research for Automated Algorithm Discovery , Small Reward Models via Backward Inference , A Comparative Analysis of Social Network Topology in Reddit and Moltbook
Track metric sensitivity by reporting both agreement and coherence.

Evidence: Small Reward Models via Backward Inference , A Comparative Analysis of Social Network Topology in Reddit and Moltbook , From Pixels to Policies: Reinforcing Spatial Reasoning in Language Models for Content-Aware Layout Design
Papers with explicit human feedback is visible in approximately 40% of papers in this set.

Evidence: A Comparative Analysis of Social Network Topology in Reddit and Moltbook , From Pixels to Policies: Reinforcing Spatial Reasoning in Language Models for Content-Aware Layout Design , ADAB: Arabic Dataset for Automated Politeness Benchmarking -- A Large-Scale Resource for Computational Sociopragmatics

Metric Interpretation

agreement is a common reported metric and should be paired with protocol context before ranking methods.
1 papers (20%) mention agreement.
Most common evaluation modes: Human Eval.

Researcher Checklist

Papers with explicit human feedback: Coverage is usable but incomplete (40% vs 45% target).
Papers reporting quality controls: Coverage is usable but incomplete (20% vs 30% target).
Papers naming benchmarks/datasets: Coverage is a replication risk (0% vs 35% target).
Papers naming evaluation metrics: Coverage is strong (40% vs 35% target).
Papers with known rater population: Coverage is a replication risk (0% vs 35% target).
Papers with known annotation unit: Coverage is a replication risk (20% vs 35% target).

Papers with explicit human feedback

Coverage is usable but incomplete (40% vs 45% target).

Papers reporting quality controls

Coverage is usable but incomplete (20% vs 30% target).

Papers naming benchmarks/datasets

Coverage is a replication risk (0% vs 35% target).

Papers naming evaluation metrics

Coverage is strong (40% vs 35% target).

Papers with known rater population

Coverage is a replication risk (0% vs 35% target).

Papers with known annotation unit

Coverage is a replication risk (20% vs 35% target).

Known Limitations

Narrative synthesis is grounded in metadata and abstracts only; full-paper method details may be missing.
Extraction fields are conservative and can under-report implicit protocol details.
Daily and rolling archives can be sparse and should be cross-checked with neighboring windows.

Research Utility Links

Human Eval Protocols - Surfaces human-rating workflows for rubric and annotator quality analysis.
Metric Slice: agreement - Finds papers where reported metrics are directly comparable.
IAA-Reported Evaluations - Highlights evaluations that explicitly report inter-annotator agreement.
Recent High-Signal Papers - Keeps the hub connected to the latest HFEPX corpus updates.

human_eval vs automatic_metrics

both=0, left_only=1, right_only=1

0 papers use both Human Eval and Automatic Metrics.

simulation_env vs automatic_metrics

both=0, left_only=3, right_only=1

0 papers use both Simulation Env and Automatic Metrics.

simulation_env vs human_eval