

HFEPX Daily Archive: 2025-05-19


Updated from the current HFEPX corpus (Apr 5, 2026). Seven papers are grouped in this daily page. Common evaluation modes: automatic metrics and LLM-as-judge. Most common rater population: domain experts. Most common annotation unit: multi-dimensional rubric. Most frequent quality control: calibration. Most common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new evaluation experiments. The newest paper in this set is from May 19, 2025.

Papers: 7 · Last published: May 19, 2025

Researcher Quick Triage

Use this archive page for time-slice monitoring (what changed in evaluation methods, metrics, and protocol quality this period). Quality band: Developing.

High-Signal Coverage

100.0%

7 of 7 papers are not flagged as low-signal.

Benchmark Anchors

0.0%

Papers with benchmark/dataset mentions in extraction output.

Metric Anchors

42.9%

Papers with reported metric mentions in extraction output.

  • 1 paper reports explicit quality controls for this archive period.
  • Prioritize papers with both benchmark and metric anchors for reliable longitudinal comparisons.

Primary action: Use this slice as an early signal only; benchmark/metric anchoring is too limited to support rigorous period-over-period claims.
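The triage percentages above are simple coverage ratios over the 7-paper set (for example, 3 of 7 papers with metric mentions gives the 42.9% metric-anchor figure). A minimal sketch of that arithmetic, assuming per-paper extraction records shaped like the Protocol Matrix rows further down; the field names are illustrative, not the actual HFEPX schema.

```python
# Hypothetical per-paper extraction records for this slice, mirroring the
# Protocol Matrix below. Field names are illustrative assumptions.
papers = [
    {"metrics": ["accuracy"],           "benchmarks": [], "quality_controls": []},
    {"metrics": ["accuracy", "recall"], "benchmarks": [], "quality_controls": []},
    {"metrics": ["accuracy", "recall"], "benchmarks": [], "quality_controls": []},
    {"metrics": [],                     "benchmarks": [], "quality_controls": []},
    {"metrics": [],                     "benchmarks": [], "quality_controls": ["calibration"]},
    {"metrics": [],                     "benchmarks": [], "quality_controls": []},
    {"metrics": [],                     "benchmarks": [], "quality_controls": []},
]

def coverage(field: str) -> float:
    """Percentage of papers whose extraction output is non-empty for `field`."""
    hits = sum(1 for paper in papers if paper[field])
    return 100.0 * hits / len(papers)

print(f"Metric anchors:    {coverage('metrics'):.1f}%")            # 42.9%
print(f"Benchmark anchors: {coverage('benchmarks'):.1f}%")         # 0.0%
print(f"Quality controls:  {coverage('quality_controls'):.1f}%")   # 14.3%
```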


Why This Time Slice Matters

  • 14.3% of papers report explicit human-feedback signals, led by rubric ratings.
  • Automatic metrics appear in 42.9% of papers in this hub.
  • Long-horizon tasks appear in 14.3% of papers, indicating demand for agentic evaluation.

Protocol Takeaways For This Period

  • The most common quality-control signal is rater calibration (14.3% of papers).
  • Raters are mostly domain experts, and annotation commonly uses multi-dimensional rubrics; use this to scope replication staffing.
  • Pair this hub with a human_eval-heavy hub to validate judge-model calibration; a minimal check is sketched below.
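One way to run that calibration check, shown here as a minimal sketch rather than a prescribed protocol: score the same items with the LLM judge and with domain-expert raters, then report rank correlation. All scores below are illustrative placeholders, not data from this slice.

```python
# Minimal judge-calibration sketch: rank correlation between LLM-as-judge
# ratings and domain-expert rubric ratings on the same items.
# All values are illustrative placeholders, not data from this archive slice.
from scipy.stats import spearmanr

judge_scores  = [4, 3, 5, 2, 4, 3, 5, 1]   # hypothetical LLM-judge ratings (1-5 rubric)
expert_scores = [4, 2, 5, 2, 3, 3, 4, 1]   # hypothetical domain-expert ratings (1-5 rubric)

rho, p_value = spearmanr(judge_scores, expert_scores)
print(f"Spearman rho between judge and experts: {rho:.2f} (p={p_value:.3f})")
```

A low correlation would argue for recalibrating the judge prompt or rubric before treating judge scores as a stand-in for expert ratings.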


Protocol Matrix (Top 10)

Quickly compare method ingredients across this archive slice.

Paper | Date | Eval Modes | Benchmarks | Metrics | Quality Controls
What if Deception Cannot be Detected? A Cross-Linguistic Study on the Limits of Deception Detection from Text | May 19, 2025 | Automatic Metrics | Not reported | Accuracy | Not reported
FreeKV: Boosting KV Cache Retrieval for Efficient LLM Inference | May 19, 2025 | Automatic Metrics | Not reported | Accuracy, Recall | Not reported
LEXam: Benchmarking Legal Reasoning on 340 Law Exams | May 19, 2025 | LLM-as-Judge, Automatic Metrics | Not reported | Accuracy, Recall | Not reported
Ice Cream Doesn't Cause Drowning: Benchmarking LLMs Against Statistical Pitfalls in Causal Inference | May 19, 2025 | Not reported | Not reported | Not reported | Not reported
Advancing Software Quality: A Standards-Focused Review of LLM-Based Assurance Techniques | May 19, 2025 | Not reported | Not reported | Not reported | Calibration
A Reality Check of Language Models as Formalizers on Constraint Satisfaction Problems | May 19, 2025 | Not reported | Not reported | Not reported | Not reported
Complexity counts: global and local perspectives on Indo-Aryan numeral systems | May 19, 2025 | Not reported | Not reported | Not reported | Not reported

Researcher Workflow (Detailed)

Checklist

  • Gap: Papers with explicit human feedback

    Coverage is a replication risk (14.3% vs 45% target).

  • Gap: Papers reporting quality controls

    Coverage is a replication risk (14.3% vs 30% target).

  • Gap: Papers naming benchmarks/datasets

    Coverage is a replication risk (0% vs 35% target).

  • Gap: Papers naming evaluation metrics

    Coverage is a replication risk (14.3% vs 35% target).

  • Moderate: Papers with known rater population

    Coverage is usable but incomplete (28.6% vs 35% target).

  • Gap: Papers with known annotation unit

    Coverage is a replication risk (14.3% vs 35% target).

Strengths

  • This hub still surfaces a concentrated paper set for protocol triage and replication planning.

Known Gaps

  • Only 14.3% of papers report quality controls; prioritize calibration/adjudication evidence.
  • Annotation unit is under-specified (14.3% coverage).
  • Benchmark coverage is thin (0% of papers mention benchmarks/datasets).

Suggested Next Analyses

  • Pair this hub with a human_eval-heavy hub to validate judge-model calibration.
  • Track metric sensitivity by reporting both accuracy and recall.
  • Add inter-annotator agreement checks when reproducing these protocols; see the sketch after this list.
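For the inter-annotator agreement item above, a minimal sketch using Cohen's kappa on categorical rubric labels; the two label lists are illustrative placeholders, and for multi-dimensional rubrics the same check would be run per rubric dimension.

```python
# Minimal inter-annotator agreement sketch: chance-corrected agreement
# (Cohen's kappa) between two annotators labeling the same items.
# Labels are illustrative placeholders, not data from these papers.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["pass", "fail", "pass", "pass", "fail", "pass"]
annotator_b = ["pass", "fail", "pass", "fail", "fail", "pass"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")
```

For ordinal rubric scores, a weighted kappa (the `weights` argument of `cohen_kappa_score`) or Krippendorff's alpha per rubric dimension is a common alternative.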

Known Limitations

  • Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Snapshot (Detailed)

Evaluation Modes

  • Automatic Metrics (3)
  • LLM-as-Judge (1)

Top Metrics

  • Accuracy (1)
  • Recall (1)

Top Benchmarks

  • None reported in this archive slice.

Quality Controls

  • Calibration (1)
