
Weekly Archive

HFEPX Weekly Archive: 2026-W02

Updated from the current HFEPX corpus (Feb 27, 2026). This weekly page groups 10 papers. Common evaluation modes: Automatic Metrics, Human Eval. Most common rater population: Domain Experts. Most common annotation unit: Pairwise. Frequent quality control: inter-annotator agreement reported. Frequently cited benchmark: Retrieval. Common metric signal: relevance. Use this page to compare protocol setup, judge behavior, and labeling-design decisions before running new eval experiments. The newest paper in this set is from Jan 11, 2026.

Papers: 10 · Last published: Jan 11, 2026

Research Narrative

Grounded narrative · Model: deterministic-grounded · Source: persisted

Updated from the current HFEPX corpus (Feb 27, 2026). This page tracks 10 papers for HFEPX Weekly Archive: 2026-W02. Dominant protocol signals include automatic metrics, human evaluation, and LLM-as-judge, with frequent benchmark focus on Retrieval and DROP and metric focus on relevance and agreement. Use the grounded sections below to prioritize reproducible protocol choices, benchmark-matched comparisons, and judge-vs-human evaluation checks.

Why This Matters For Eval Research

Protocol Takeaways

Benchmark Interpretation

  • Retrieval appears in 40% of hub papers (4/10); use this cohort for benchmark-matched comparisons.
  • DROP appears in 10% of hub papers (1/10); use this cohort for benchmark-matched comparisons.

Metric Interpretation

  • relevance is reported in 20% of hub papers (2/10); compare with a secondary metric before ranking methods.
  • agreement is reported in 10% of hub papers (1/10); compare with a secondary metric before ranking methods.
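The "compare with a secondary metric before ranking" guidance above can be sketched as a tie-broken sort. This is a minimal illustration, not HFEPX tooling; the method names and scores are invented, and relevance/agreement stand in for whatever primary and secondary metrics a given cohort reports.

```python
# Hypothetical per-method scores: "relevance" is the primary metric,
# "agreement" the secondary tiebreaker. All values are invented.
methods = [
    {"name": "m1", "relevance": 0.81, "agreement": 0.62},
    {"name": "m2", "relevance": 0.81, "agreement": 0.70},
    {"name": "m3", "relevance": 0.78, "agreement": 0.75},
]

# Sort by primary metric first, then break ties with the secondary metric,
# so equal-relevance methods are not ranked arbitrarily.
ranked = sorted(
    methods,
    key=lambda m: (m["relevance"], m["agreement"]),
    reverse=True,
)
```

With the invented scores, m1 and m2 tie on relevance, and the secondary metric decides their order.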

Researcher Checklist

  • Close the gap on papers with explicit human feedback: coverage is a replication risk (20% vs a 45% target).
  • Close the gap on papers reporting quality controls: coverage is a replication risk (10% vs a 30% target).
  • Maintain strength on papers naming benchmarks/datasets: coverage is strong (50% vs a 35% target).
  • Maintain strength on papers naming evaluation metrics: coverage is strong (50% vs a 35% target).
  • Close the gap on papers with known rater population: coverage is a replication risk (10% vs a 35% target).
  • Close the gap on papers with known annotation unit: coverage is a replication risk (10% vs a 35% target).
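The checklist reduces to comparing observed coverage against a per-signal target. A minimal sketch, assuming the figures above; the dictionary keys and the `coverage_gaps` helper are illustrative names, not part of any HFEPX code.

```python
# Signal -> (observed coverage, target coverage), taken from the checklist.
COVERAGE_TARGETS = {
    "explicit_human_feedback": (0.20, 0.45),
    "quality_controls":        (0.10, 0.30),
    "named_benchmarks":        (0.50, 0.35),
    "named_metrics":           (0.50, 0.35),
    "known_rater_population":  (0.10, 0.35),
    "known_annotation_unit":   (0.10, 0.35),
}

def coverage_gaps(targets):
    """Split signals into replication risks (below target) and strengths,
    keyed by how far each signal sits from its target."""
    risks = {k: round(t - c, 2) for k, (c, t) in targets.items() if c < t}
    strengths = {k: round(c - t, 2) for k, (c, t) in targets.items() if c >= t}
    return risks, strengths

risks, strengths = coverage_gaps(COVERAGE_TARGETS)
```

Under these numbers, four signals land in the risk bucket and two in the strength bucket, matching the checklist.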


Suggested Reading Order

  1. Symphonym: Universal Phonetic Embeddings for Cross-Script Name Matching

    Start here for detailed protocol reporting, including rater and quality-control evidence.

  2. Mixture-of-Experts as Soft Clustering: A Dual Jacobian-PCA Spectral Geometry Perspective

    Also offers detailed protocol reporting, including rater and quality-control evidence.

  3. HEART: A Unified Benchmark for Assessing Humans and LLMs in Emotional Support Dialogue

    Also offers detailed protocol reporting, including rater and quality-control evidence.

  4. Neurosymbolic Retrievers for Retrieval-augmented Generation

    Adds automatic metrics for broader coverage within this hub.

  5. What Matters For Safety Alignment?

    Adds automatic metrics with red-team protocols for broader coverage within this hub.

  6. Stratified Hazard Sampling: Minimal-Variance Event Scheduling for CTMC/DTMC Discrete Diffusion and Flow Models

    Adds automatic metrics for broader coverage within this hub.

  7. SYNAPSE: Empowering LLM Agents with Episodic-Semantic Memory via Spreading Activation

    Adds automatic metrics for broader coverage within this hub.

  8. Embedding Retrofitting: Data Engineering for better RAG

    Adds automatic metrics for broader coverage within this hub.

Known Limitations

  • Only 10% of papers report quality controls; prioritize calibration/adjudication evidence.
  • Rater population is under-specified (10% coverage).
  • Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Links

human_eval vs llm_as_judge

both=1, left_only=0, right_only=0

1 paper uses both Human Eval and LLM-as-Judge.

human_eval vs automatic_metrics

both=0, left_only=1, right_only=9

0 papers use both Human Eval and Automatic Metrics.

llm_as_judge vs automatic_metrics

both=0, left_only=1, right_only=9

0 papers use both LLM-as-Judge and Automatic Metrics.
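The `both` / `left_only` / `right_only` counts above are plain set arithmetic over per-paper protocol tags. A minimal sketch, assuming each paper carries a set of evaluation-mode tags; the paper IDs below are placeholders, not papers from this hub.

```python
def overlap(left_ids, right_ids):
    """Count papers tagged with the left mode only, the right mode only,
    or both, from two collections of paper IDs."""
    left, right = set(left_ids), set(right_ids)
    return {
        "both": len(left & right),
        "left_only": len(left - right),
        "right_only": len(right - left),
    }

# Placeholder IDs chosen to reproduce the human_eval vs automatic_metrics
# row above (both=0, left_only=1, right_only=9).
human_eval = {"p3"}
automatic_metrics = {"p0", "p1", "p2", "p4", "p5", "p6", "p7", "p8", "p9"}
counts = overlap(human_eval, automatic_metrics)
```

The same helper reproduces any of the three rows once the tag sets are extracted from paper metadata.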

Benchmark Brief

DROP

Coverage: 1 paper (10%)

1 paper (10%) mentions DROP.

Examples: Stratified Hazard Sampling: Minimal-Variance Event Scheduling for CTMC/DTMC Discrete Diffusion and Flow Models

Benchmark Brief

Medieval

Coverage: 1 paper (10%)

1 paper (10%) mentions Medieval.

Examples: Symphonym: Universal Phonetic Embeddings for Cross-Script Name Matching

Metric Brief

agreement

Coverage: 1 paper (10%)

1 paper (10%) mentions agreement.

Examples: HEART: A Unified Benchmark for Assessing Humans and LLMs in Emotional Support Dialogue
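When a paper reports "agreement" as a metric signal, a common concrete form is Cohen's kappa between two raters over the same items. A minimal sketch of that computation; the labels and rater judgments below are invented for illustration, and nothing here is taken from the HEART paper itself.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters over the same items:
    (observed agreement - expected-by-chance agreement) / (1 - expected)."""
    assert len(rater_a) == len(rater_b), "raters must label the same items"
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    expected = sum((freq_a[l] / n) * (freq_b[l] / n) for l in labels)
    return (observed - expected) / (1 - expected)

# Invented pairwise-style judgments from two hypothetical raters.
a = ["good", "good", "bad", "good", "bad", "bad"]
b = ["good", "bad", "bad", "good", "bad", "good"]
kappa = cohens_kappa(a, b)
```

With these invented labels, observed agreement is 4/6 and chance agreement is 0.5, giving kappa of 1/3; perfect agreement would give 1.0.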

Metric Brief

jailbreak success rate

Coverage: 1 paper (10%)

1 paper (10%) mentions jailbreak success rate.

Examples: What Matters For Safety Alignment?
