

HFEPX Fortnight Archive: 2026-F02

Updated from the current HFEPX corpus (Feb 27, 2026). This page groups 27 papers. Common evaluation modes: Automatic Metrics, Simulation Env. Most common rater population: Domain Experts. Most common annotation unit: Multi Dim Rubric. Frequent quality control: Calibration. Frequently cited benchmark: Retrieval. Common metric signal: f1. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new evaluation experiments. The newest paper in this set is from Jan 24, 2026.

Papers: 27 · Last published: Jan 24, 2026

Research Narrative

Grounded narrative · Model: deterministic-grounded · Source: persisted

Updated from the current HFEPX corpus (Feb 27, 2026). This page tracks 27 papers for HFEPX Fortnight Archive: 2026-F02. Dominant protocol signals include automatic metrics, simulation environments, and human evaluation; benchmark focus falls most often on Retrieval and DocVQA, and metric focus on f1 and latency. Use the grounded sections below to prioritize reproducible protocol choices, benchmark-matched comparisons, and judge-vs-human evaluation checks.

Why This Matters For Eval Research

Protocol Takeaways

Benchmark Interpretation

  • Retrieval appears in 7.4% of hub papers (2/27); use this cohort for benchmark-matched comparisons.
  • DocVQA appears in 3.7% of hub papers (1/27); use this cohort for benchmark-matched comparisons.

Metric Interpretation

  • f1 is reported in 11.1% of hub papers (3/27); compare with a secondary metric before ranking methods.
  • latency is reported in 7.4% of hub papers (2/27); compare with a secondary metric before ranking methods.
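The percentages in these bullets are simple shares of the 27-paper cohort, rounded to one decimal. A quick sanity check (counts taken from the bullets above; the helper name is an assumption, not part of the hub):

```python
def share(count: int, total: int = 27) -> float:
    """Percentage of hub papers, rounded to one decimal as reported above."""
    return round(100 * count / total, 1)

print(share(2))  # 7.4  (Retrieval, latency)
print(share(1))  # 3.7  (DocVQA)
print(share(3))  # 11.1 (f1)
```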

Researcher Checklist

  • Tighten coverage on Papers with explicit human feedback. Coverage is usable but incomplete (29.6% vs 45% target).
  • Close gap on Papers reporting quality controls. Coverage is a replication risk (11.1% vs 30% target).
  • Maintain strength on Papers naming benchmarks/datasets. Coverage is strong (37% vs 35% target).
  • Maintain strength on Papers naming evaluation metrics. Coverage is strong (37% vs 35% target).
  • Tighten coverage on Papers with known rater population. Coverage is usable but incomplete (22.2% vs 35% target).
  • Close gap on Papers with known annotation unit. Coverage is a replication risk (14.8% vs 35% target).
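The checklist's three coverage labels follow a consistent pattern across the figures above. One plausible rule, assuming the cutoffs are the target and half the target (the band names are the page's own wording, but the half-target threshold is inferred, not documented):

```python
def coverage_band(coverage_pct: float, target_pct: float) -> str:
    """Classify coverage against a target, mirroring the checklist wording.

    Assumed rule (inferred from the figures above):
      - at or above the target      -> "strong"
      - at or above half the target -> "usable but incomplete"
      - below half the target       -> "replication risk"
    """
    if coverage_pct >= target_pct:
        return "strong"
    if coverage_pct >= target_pct / 2:
        return "usable but incomplete"
    return "replication risk"

# Spot-checks against the checklist figures:
print(coverage_band(29.6, 45))  # usable but incomplete
print(coverage_band(11.1, 30))  # replication risk
print(coverage_band(37.0, 35))  # strong
```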


Suggested Reading Order

  1. Cross-Cultural Expert-Level Art Critique Evaluation with Vision-Language Models

    Start here for detailed protocol reporting, including rater and quality-control evidence.

  2. Decoupling Strategy and Execution in Task-Focused Dialogue via Goal-Oriented Preference Optimization

    High citation traction makes this a useful baseline for method and protocol context.

  3. Building Safe and Deployable Clinical Natural Language Processing under Temporal Leakage Constraints

    High citation traction makes this a useful baseline for method and protocol context.

  4. Large Language Models as Automatic Annotators and Annotation Adjudicators for Fine-Grained Opinion Analysis

    High citation traction makes this a useful baseline for method and protocol context.

  5. RebuttalAgent: Strategic Persuasion in Academic Rebuttal via Theory of Mind

    Include a human-eval paper to anchor calibration against automated judge settings.

  6. PhysE-Inv: A Physics-Encoded Inverse Modeling approach for Arctic Snow Depth Prediction

    Adds automatic metrics for broader coverage within this hub.

  7. Between Search and Platform: ChatGPT Under the DSA

    Adds automatic metrics for broader coverage within this hub.

  8. ErrorMap and ErrorAtlas: Charting the Failure Landscape of Large Language Models

    Adds automatic metrics for broader coverage within this hub.

Known Limitations

  • Only 11.1% of papers report quality controls; prioritize calibration/adjudication evidence.
  • Rater population is under-specified (22.2% coverage).
  • Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Links

human_eval vs automatic_metrics

both=0, left_only=2, right_only=21

No papers use both Human Eval and Automatic Metrics.

automatic_metrics vs simulation_env

both=0, left_only=21, right_only=4

No papers use both Automatic Metrics and Simulation Env.

simulation_env vs human_eval

both=0, left_only=4, right_only=2

No papers use both Simulation Env and Human Eval.
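The both/left_only/right_only counts above are a plain set comparison over each cohort's paper IDs. A minimal sketch (the paper IDs are hypothetical; only the cohort sizes match the human_eval vs automatic_metrics row):

```python
def compare_cohorts(left: set, right: set) -> dict:
    """Count cohort overlap the way the utility links above report it."""
    return {
        "both": len(left & right),
        "left_only": len(left - right),
        "right_only": len(right - left),
    }

# Hypothetical paper IDs sized to match human_eval (2) vs automatic_metrics (21).
human_eval = {"h1", "h2"}
automatic_metrics = {f"a{i}" for i in range(1, 22)}
print(compare_cohorts(human_eval, automatic_metrics))
# {'both': 0, 'left_only': 2, 'right_only': 21}
```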

Benchmark Brief

DocVQA

Coverage: 1 paper (3.7%)

Examples: Chain-of-Thought Compression Should Not Be Blind: V-Skip for Efficient Multimodal Reasoning via Dual-Path Anchoring

Benchmark Brief

GAIA

Coverage: 1 paper (3.7%)

Examples: CLiMB: A Domain-Informed Novelty Detection Clustering Framework for Galactic Archaeology and Scientific Discovery

Papers Published On This Date
