Daily Archive

HFEPX Daily Archive: 2026-02-15

Updated from current HFEPX corpus (Feb 27, 2026). 6 papers are grouped in this daily page. Common evaluation modes: Automatic Metrics, Human Eval. Most common rater population: Domain Experts. Common annotation unit: Pairwise. Frequent quality control: Adjudication. Frequently cited benchmark: MMBench. Common metric signal: accuracy. Newest paper in this set is from Feb 15, 2026.

Papers: 6 | Last published: Feb 15, 2026

Research Narrative

Grounded narrative | Model: deterministic-grounded

This page covers the 6 papers in the HFEPX daily archive for 2026-02-15. Common evaluation modes are Automatic Metrics and Human Eval, with benchmark emphasis on MMBench and Retrieval. Use the anchored takeaways below to compare protocol choices and identify papers with stronger evidence depth.

Why This Matters For Eval Research

Protocol Takeaways

Benchmark Interpretation

  • MMBench is the most frequently cited benchmark on this page.
  • 1 paper (16.7%) mentions MMBench.
  • Most common evaluation mode: Human Eval.

Metric Interpretation

  • Accuracy is a commonly reported metric; pair it with protocol context before using it to rank methods.
  • 2 papers (33.3%) mention accuracy.
  • Most common evaluation mode: Automatic Metrics.

Researcher Checklist

  • Papers with explicit human feedback: Coverage is usable but incomplete (33.3% vs 45% target).
  • Papers reporting quality controls: Coverage is strong (33.3% vs 30% target).
  • Papers naming benchmarks/datasets: Coverage is usable but incomplete (33.3% vs 35% target).
  • Papers naming evaluation metrics: Coverage is strong (50% vs 35% target).
  • Papers with known rater population: Coverage is a replication risk (16.7% vs 35% target).
  • Papers with known annotation unit: Coverage is a replication risk (16.7% vs 35% target).
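The checklist labels can be reproduced with a small threshold rule. The following is only a sketch under stated assumptions: the page does not say how labels are assigned, so the half-of-target cutoff for "replication risk" is a guess that happens to match the six rows above, and the per-field paper counts are inferred from the percentages over 6 papers. The names `coverage_status` and `RISK_FRACTION_OF_TARGET` are hypothetical.

```python
# Assumption: the page does not document its labeling rule. A cutoff at half of
# the target is a guess that is consistent with the six checklist rows.
RISK_FRACTION_OF_TARGET = 0.5


def coverage_status(count: int, total: int, target_pct: float) -> tuple[float, str]:
    """Return (coverage %, label) for `count` of `total` papers against a target %."""
    pct = round(100.0 * count / total, 1)
    if pct >= target_pct:
        label = "strong"
    elif pct >= RISK_FRACTION_OF_TARGET * target_pct:
        label = "usable but incomplete"
    else:
        label = "replication risk"
    return pct, label


TOTAL_PAPERS = 6

# (field, papers reporting it, target %). Counts are inferred from the
# percentages on the page (33.3% = 2/6, 50% = 3/6, 16.7% = 1/6).
CHECKLIST = [
    ("explicit human feedback", 2, 45.0),
    ("quality controls", 2, 30.0),
    ("benchmarks/datasets", 2, 35.0),
    ("evaluation metrics", 3, 35.0),
    ("known rater population", 1, 35.0),
    ("known annotation unit", 1, 35.0),
]

for field, count, target in CHECKLIST:
    pct, label = coverage_status(count, TOTAL_PAPERS, target)
    print(f"{field}: {label} ({pct}% vs {target:g}% target)")
```

With only 6 papers, each paper moves a field by about 16.7 percentage points, so a single paper can change a label.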

Suggested Reading Order

  1. Does Socialization Emerge in AI Agent Society? A Case Study of Moltbook

    Start with this anchor paper for scope and protocol framing. Covers Simulation Env.

  2. MCPShield: A Security Cognition Layer for Adaptive Trust Calibration in Model Context Protocol Agents

    Covers Automatic Metrics.

  3. Investigation for Relative Voice Impression Estimation

    Covers Automatic Metrics. Includes human-feedback signal: Pairwise Preference.

  4. Annotation-Efficient Vision-Language Model Adaptation to the Polish Language Using the LLaVA Framework

    Covers Human Eval.

  5. Context Shapes LLMs Retrieval-Augmented Fact-Checking Effectiveness

    Covers Automatic Metrics.

  6. HLE-Verified: A Systematic Verification and Structured Revision of Humanity's Last Exam

    Covers Automatic Metrics. Includes human-feedback signal: Expert Verification, Critique Edit.

Known Limitations

  • Narrative synthesis is grounded in metadata and abstracts only; full-paper method details may be missing.
  • Extraction fields are conservative and can under-report implicit protocol details.
  • Daily and rolling archives can be sparse and should be cross-checked with neighboring windows.

Research Utility Links

human_eval vs automatic_metrics

both=0, left_only=1, right_only=4

0 papers use both Human Eval and Automatic Metrics.

automatic_metrics vs simulation_env

both=0, left_only=4, right_only=1

0 papers use both Automatic Metrics and Simulation Env.

human_eval vs simulation_env

both=0, left_only=1, right_only=1

0 papers use both Human Eval and Simulation Env.
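The both/left_only/right_only counts above are plain set arithmetic over the papers tagged with each evaluation mode. A minimal sketch, assuming each paper carries exactly the modes listed in the "Covers ..." notes of the suggested reading order; the shortened paper keys and the `overlap` helper are illustrative names, not part of HFEPX.

```python
# Evaluation modes per paper, taken from the "Covers ..." notes on this page.
# Keys are shortened paper titles used only for illustration.
PAPER_MODES = {
    "Moltbook case study": {"simulation_env"},
    "MCPShield": {"automatic_metrics"},
    "Relative Voice Impression Estimation": {"automatic_metrics"},
    "Polish LLaVA adaptation": {"human_eval"},
    "Retrieval-Augmented Fact-Checking": {"automatic_metrics"},
    "HLE-Verified": {"automatic_metrics"},
}


def overlap(left: str, right: str) -> dict[str, int]:
    """Count papers tagged with both modes, only the left one, or only the right one."""
    has_left = {paper for paper, modes in PAPER_MODES.items() if left in modes}
    has_right = {paper for paper, modes in PAPER_MODES.items() if right in modes}
    return {
        "both": len(has_left & has_right),
        "left_only": len(has_left - has_right),
        "right_only": len(has_right - has_left),
    }


PAIRS = [
    ("human_eval", "automatic_metrics"),
    ("automatic_metrics", "simulation_env"),
    ("human_eval", "simulation_env"),
]

for left, right in PAIRS:
    counts = overlap(left, right)
    print(f"{left} vs {right}: " + ", ".join(f"{k}={v}" for k, v in counts.items()))
```

Because every paper here carries a single mode, `both` is 0 for every pair; a paper tagged with two modes would be counted under `both` and in neither of the `_only` counts.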

Papers Published On This Date

Recent Daily Archives