
HFEPX Archive Slice

HFEPX Daily Archive: 2026-02-23

Updated from the current HFEPX corpus (Apr 12, 2026). This daily page groups 61 papers. Common evaluation modes: Automatic Metrics, Simulation Env. Most common rater population: Domain Experts. Common annotation unit: Ranking. Frequent quality control: Calibration. Frequently cited benchmark: ContentBench. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. The newest paper in this set is from Feb 23, 2026.

Papers: 61 · Last published: Feb 23, 2026

Researcher Quick Triage

Use this archive page for time-slice monitoring (what changed in evaluation methods, metrics, and protocol quality this period). Quality band: High.

Analysis blocks are computed from the loaded sample (60 of 61 papers).

High-Signal Coverage: 100.0% (60 of 60 papers are not flagged as low-signal)

Benchmark Anchors: 15.0% (papers with benchmark/dataset mentions in extraction output)

Metric Anchors: 41.7% (papers with reported metric mentions in extraction output)

  • 3 papers report explicit quality controls for this archive period.
  • Prioritize papers with both benchmark and metric anchors for reliable longitudinal comparisons (see the filtering sketch below).

Primary action: use this slice for trend comparison; review the top papers first, then validate shifts in the protocol matrix.
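
A minimal Python sketch of that anchor filter, assuming each paper is a plain dict; the field names ("title", "benchmarks", "metrics") and the helper are assumptions, since this page does not document an export schema.

```python
# Minimal sketch: keep only papers that carry both benchmark and metric anchors.
# Field names are assumed; they are not the HFEPX export schema.

def papers_with_both_anchors(papers):
    """Return papers that name at least one benchmark and at least one metric."""
    return [p for p in papers if p.get("benchmarks") and p.get("metrics")]

# Two rows transcribed from the protocol matrix further down this page.
slice_papers = [
    {"title": "Pyramid MoA", "benchmarks": ["GSM8K"], "metrics": ["Accuracy", "Precision"]},
    {"title": "No One Size Fits All: QueryBandits", "benchmarks": [], "metrics": ["Win rate"]},
]

for paper in papers_with_both_anchors(slice_papers):
    print(paper["title"])  # only the paper with both anchors is printed
```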

Why This Time Slice Matters

  • 9.8% of papers report explicit human-feedback signals, led by expert verification.
  • Automatic metrics appear in 37.7% of papers in this hub.
  • ContentBench is a recurring benchmark anchor for cross-paper comparisons on this page.

Protocol Takeaways For This Period

  • 1 sampled paper reports both human evaluation and LLM-as-judge, supporting direct agreement checks (an agreement-check sketch follows this list).
  • The most common quality-control signal is rater calibration (3.3% of papers).
  • Raters are mostly domain experts and the most common annotation unit is ranking; use this to scope replication staffing.
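
For that agreement check, a minimal sketch computing Cohen's kappa between paired human and LLM-judge verdicts; the labels are illustrative and the helper is not taken from any paper in this slice.

```python
from collections import Counter

def cohen_kappa(human, judge):
    """Cohen's kappa between two parallel label sequences (assumes some disagreement exists)."""
    assert len(human) == len(judge) and human
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    human_freq, judge_freq = Counter(human), Counter(judge)
    labels = set(human) | set(judge)
    expected = sum((human_freq[l] / n) * (judge_freq[l] / n) for l in labels)
    return (observed - expected) / (1 - expected)

# Illustrative verdicts only; real inputs would be per-item labels released with a paper.
human_labels = ["good", "bad", "good", "good", "bad", "good"]
judge_labels = ["good", "bad", "bad", "good", "bad", "good"]
print(round(cohen_kappa(human_labels, judge_labels), 3))
```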

Start Here (Highest-Signal Papers In This Slice)

Ranked by protocol completeness and evidence density for faster period-over-period review.
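
The exact ranking formula is not published on this page; the sketch below shows one plausible completeness score that simply counts how many protocol ingredients a paper reports. The field names and the scoring rule are assumptions.

```python
# Hypothetical completeness score: one point per reported protocol ingredient.
FIELDS = ("eval_modes", "benchmarks", "metrics", "quality_controls")

def protocol_completeness(paper):
    """Count protocol fields that are reported (not empty and not 'Not reported')."""
    return sum(1 for field in FIELDS if paper.get(field) not in (None, [], "Not reported"))

# Example transcribed from the matrix below: KNIGHT reports all four ingredients.
knight = {
    "eval_modes": ["Automatic Metrics"],
    "benchmarks": ["MMLU"],
    "metrics": ["Cost", "Relevance"],
    "quality_controls": ["Calibration"],
}
print(protocol_completeness(knight))  # 4
```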

Protocol Matrix (Top 10)

Quickly compare method ingredients across this archive slice.

All papers in this matrix are dated Feb 23, 2026.

Paper | Eval Modes | Benchmarks | Metrics | Quality Controls
KNIGHT: Knowledge Graph-Driven Multiple-Choice Question Generation with Adaptive Hardness Calibration | Automatic Metrics | MMLU | Cost, Relevance | Calibration
Pyramid MoA: A Probabilistic Framework for Cost-Optimized Anytime Inference | Automatic Metrics | GSM8K | Accuracy, Precision | Calibration
Can Large Language Models Replace Human Coders? Introducing ContentBench | Automatic Metrics | ContentBench | Agreement, Cost | Not reported
An artificial intelligence framework for end-to-end rare disease phenotyping from clinical notes using large language models | Automatic Metrics | Not reported | F1, Precision | Gold Questions
NanoKnow: How to Know What Your Language Model Knows | Automatic Metrics | NQ, SQuAD | Accuracy | Not reported
KGHaluBench: A Knowledge Graph-Based Hallucination Benchmark for Evaluating the Breadth and Depth of LLM Knowledge | Automatic Metrics | KGHaluBench | Accuracy, Coherence | Not reported
MedCLIPSeg: Probabilistic Vision-Language Adaptation for Data-Efficient and Generalizable Medical Image Segmentation | Automatic Metrics | Not reported | Accuracy | Not reported
Case-Aware LLM-as-a-Judge Evaluation for Enterprise-Scale RAG Systems | LLM-as-Judge | Not reported | Precision | Not reported
Natural Language Processing Models for Robust Document Categorization | Automatic Metrics | Not reported | Accuracy, Throughput | Not reported
No One Size Fits All: QueryBandits for Hallucination Mitigation | Automatic Metrics | Not reported | Win rate | Not reported

Researcher Workflow (Detailed)

Checklist

  • Gap: Papers with explicit human feedback

    Coverage is a replication risk (9.8% vs 45% target).

  • Gap: Papers reporting quality controls

    Coverage is a replication risk (6.6% vs 30% target).

  • Gap: Papers naming benchmarks/datasets

    Coverage is a replication risk (1.6% vs 35% target).

  • Gap: Papers naming evaluation metrics

    Coverage is a replication risk (11.5% vs 35% target).

  • Gap: Papers with known rater population

    Coverage is a replication risk (8.2% vs 35% target).

  • Gap: Papers with known annotation unit

    Coverage is a replication risk (11.5% vs 35% target). A coverage-vs-target sketch follows this checklist.
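
The checklist above compares observed coverage against fixed targets. A minimal sketch of that gap check follows, with the percentages and targets copied from the bullets; the dimension keys and the flagging rule are assumptions.

```python
# Targets as stated in the checklist above; the keys and the flagging rule are assumed.
TARGETS = {
    "human_feedback": 0.45,
    "quality_controls": 0.30,
    "benchmarks": 0.35,
    "metrics": 0.35,
    "rater_population": 0.35,
    "annotation_unit": 0.35,
}

def replication_risks(coverage):
    """Return dimensions whose observed coverage falls below the target."""
    return {dim: (obs, TARGETS[dim])
            for dim, obs in coverage.items()
            if obs < TARGETS.get(dim, 0.0)}

observed = {"human_feedback": 0.098, "quality_controls": 0.066, "benchmarks": 0.016,
            "metrics": 0.115, "rater_population": 0.082, "annotation_unit": 0.115}
for dim, (obs, target) in replication_risks(observed).items():
    print(f"Gap: {dim}: {obs:.1%} vs {target:.0%} target")
```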

Strengths

  • Contains both human-eval and LLM-as-judge protocols for head-to-head methodology comparison.

Known Gaps

  • Only 6.6% of papers report quality controls; prioritize calibration/adjudication evidence.
  • Rater population is under-specified (8.2% coverage).
  • Annotation unit is under-specified (11.5% coverage).

Suggested Next Analyses

  • Compare papers that report both human_eval and llm_as_judge to quantify judge-human agreement drift.
  • Track metric sensitivity by reporting both accuracy and cost; a Pareto-style sketch follows this list.
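
One way to operationalize the accuracy/cost suggestion is to keep only the runs on the accuracy-cost Pareto frontier. The sketch below uses illustrative numbers, since this slice does not expose paired accuracy/cost values.

```python
def pareto_frontier(points):
    """Keep (name, accuracy, cost) entries that no other entry dominates
    (higher-or-equal accuracy and lower-or-equal cost, strictly better in one)."""
    frontier = []
    for name, acc, cost in points:
        dominated = any(a >= acc and c <= cost and (a > acc or c < cost)
                        for _, a, c in points)
        if not dominated:
            frontier.append((name, acc, cost))
    return frontier

# Illustrative values only; they are not drawn from the papers in this slice.
runs = [("system_a", 0.81, 1.00), ("system_b", 0.78, 0.40), ("system_c", 0.74, 0.55)]
print(pareto_frontier(runs))  # system_c is dominated by system_b
```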

Known Limitations

  • Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Snapshot (Detailed)

Evaluation Modes

  • Automatic Metrics (23)
  • Simulation Env (4)
  • LLM-as-Judge (2)
  • Human Eval (1)

Top Metrics

  • Accuracy (4)
  • Cost (2)
  • F1 (2)
  • Agreement (1)

Top Benchmarks

  • ContentBench (1)

Quality Controls

  • Calibration (2)
  • Gold Questions (1)
  • Inter-Annotator Agreement Reported (1)

