HFEPX Monthly Archive: 2026-03

Updated from the current HFEPX corpus (Mar 8, 2026). 268 papers are grouped on this monthly archive page. Common evaluation modes: Automatic Metrics, Simulation Env. Most common rater population: Domain Experts. Common annotation unit: Trajectory. Frequent quality control: Calibration. Frequently cited benchmark: AIME. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. The newest paper in this set is from Mar 5, 2026.

Papers: 268 | Last published: Mar 5, 2026

Researcher Quick Triage

Use this archive page for time-slice monitoring (what changed in evaluation methods, metrics, and protocol quality this period). Quality band: High.

Analysis blocks are computed from the loaded sample (60 of 268 papers).

High-Signal Coverage

100.0%

60 / 60 papers are not flagged as low-signal.

Benchmark Anchors

10.0%

Papers with benchmark/dataset mentions in extraction output.

Metric Anchors

20.0%

Papers with reported metric mentions in extraction output.

  • 1 paper reports explicit quality controls for this archive period.
  • Prioritize papers with both benchmark and metric anchors for reliable longitudinal comparisons; a sketch of how these coverage figures are derived follows.
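
The coverage figures above reduce to simple counts over the loaded sample. Below is a minimal sketch of that computation, assuming each paper is an extraction record with low_signal, benchmarks, metrics, and quality_controls fields; the field names and example records are illustrative, not the actual HFEPX schema.

```python
# Hypothetical per-paper extraction records standing in for the 60-paper sample.
sample = [
    {"title": "Paper A", "low_signal": False, "benchmarks": ["GSM8K"],
     "metrics": ["accuracy"], "quality_controls": []},
    {"title": "Paper B", "low_signal": False, "benchmarks": [],
     "metrics": ["latency"], "quality_controls": ["calibration"]},
    # ... remaining records in the real loaded sample
]

def coverage(records, predicate):
    """Share of records satisfying the predicate, as a percentage."""
    return 100.0 * sum(predicate(r) for r in records) / len(records)

high_signal = coverage(sample, lambda r: not r["low_signal"])
benchmark_anchors = coverage(sample, lambda r: bool(r["benchmarks"]))
metric_anchors = coverage(sample, lambda r: bool(r["metrics"]))
quality_control_count = sum(bool(r["quality_controls"]) for r in sample)

print(f"High-signal coverage: {high_signal:.1f}%")
print(f"Benchmark anchors:    {benchmark_anchors:.1f}%")
print(f"Metric anchors:       {metric_anchors:.1f}%")
print(f"Papers with explicit quality controls: {quality_control_count}")
```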

Primary action: use this slice for trend comparison by reviewing top papers first, then validating shifts in the protocol matrix.

Why This Time Slice Matters

  • 8.2% of papers report explicit human-feedback signals, led by pairwise preferences.
  • Automatic-metrics evaluation appears in 17.5% of papers in this hub.
  • AIME is a recurring benchmark anchor for cross-paper comparisons on this page.

Protocol Takeaways For This Period

  • Most common quality-control signal is rater calibration (0.7% of papers).
  • Rater context is mostly domain experts, and annotation is commonly at the trajectory level; use this to scope replication staffing.
  • Compare papers that report both human_eval and llm_as_judge to quantify judge-human agreement drift (see the sketch after this list).
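
The agreement-drift comparison flagged above can be run with a few lines once paired verdicts are collected. This is a minimal sketch, assuming papers that report both human_eval and llm_as_judge provide human and judge labels on the same items per archive period; the labels shown are hypothetical.

```python
def agreement_rate(human_labels, judge_labels):
    """Fraction of items where the LLM judge matches the human verdict."""
    assert len(human_labels) == len(judge_labels)
    return sum(h == j for h, j in zip(human_labels, judge_labels)) / len(human_labels)

# Hypothetical paired verdicts per archive period; a drop over time indicates drift.
periods = {
    "2026-01": (["win", "loss", "win", "tie"], ["win", "loss", "tie", "tie"]),
    "2026-03": (["win", "win", "loss", "tie"], ["win", "loss", "loss", "win"]),
}

for period, (human, judge) in sorted(periods.items()):
    print(period, f"judge-human agreement = {agreement_rate(human, judge):.2f}")
```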

Start Here (Highest-Signal Papers In This Slice)

Ranked by protocol completeness and evidence density for faster period-over-period review.
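
The page does not publish its exact ranking formula; the sketch below shows one plausible way to score protocol completeness and evidence density from per-paper extraction records. The field names, example papers, and weights are assumptions.

```python
FIELDS = ("eval_modes", "benchmarks", "metrics", "quality_controls")

def rank_score(record):
    # Protocol completeness: fraction of key fields that are reported at all.
    completeness = sum(bool(record.get(f)) for f in FIELDS) / len(FIELDS)
    # Evidence density: total number of distinct anchors reported.
    density = sum(len(record.get(f, [])) for f in FIELDS)
    return completeness + 0.1 * density

papers = [
    {"title": "Paper A", "eval_modes": ["automatic_metrics"], "benchmarks": ["AIME"],
     "metrics": ["accuracy"], "quality_controls": []},
    {"title": "Paper B", "eval_modes": ["human_eval", "llm_as_judge"], "benchmarks": [],
     "metrics": ["coherence"], "quality_controls": ["calibration"]},
]

for paper in sorted(papers, key=rank_score, reverse=True):
    print(f"{rank_score(paper):.2f}  {paper['title']}")
```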

Protocol Matrix (Top 10)

Quickly compare method ingredients across this archive slice.

Paper | Date | Eval Modes | Benchmarks | Metrics | Quality Controls
ThaiSafetyBench: Assessing Language Model Safety in Thai Cultural Contexts | Mar 5, 2026 | LLM As Judge, Automatic Metrics | ThaiSafetyBench | F1, F1 weighted | Not reported
AILS-NTUA at SemEval-2026 Task 10: Agentic LLMs for Psycholinguistic Marker Extraction and Conspiracy Endorsement Detection | Mar 5, 2026 | Automatic Metrics | SemEval | F1, F1 macro | Not reported
Free Lunch for Pass@$k$? Low Cost Diverse Sampling for Diffusion Language Models | Mar 5, 2026 | Automatic Metrics | GSM8K, HumanEval+ | Cost | Not reported
VRM: Teaching Reward Models to Understand Authentic Human Preferences | Mar 5, 2026 | Human Eval | Not reported | Coherence | Not reported
When Weak LLMs Speak with Confidence, Preference Alignment Gets Stronger | Mar 5, 2026 | Automatic Metrics | Not reported | Cost | Not reported
LocalSUG: Geography-Aware LLM for Query Suggestion in Local-Life Services | Mar 5, 2026 | Automatic Metrics | Not reported | Latency, Relevance | Not reported
Can LLMs Capture Expert Uncertainty? A Comparative Analysis of Value Alignment in Ethnographic Qualitative Research | Mar 5, 2026 | Automatic Metrics | Not reported | F1, Agreement | Adjudication
Functionality-Oriented LLM Merging on the Fisher-Rao Manifold | Mar 5, 2026 | Automatic Metrics | Not reported | Accuracy | Not reported
Replaying pre-training data improves fine-tuning | Mar 5, 2026 | Automatic Metrics | Not reported | Accuracy | Not reported
POET-X: Memory-efficient LLM Training by Scaling Orthogonal Transformation | Mar 5, 2026 | Not reported | Not reported | Throughput, Cost | Not reported

Researcher Workflow (Detailed)

Checklist

  • Gap: Papers with explicit human feedback

    Coverage is a replication risk (8.2% vs 45% target).

  • Gap: Papers reporting quality controls

    Coverage is a replication risk (1.5% vs 30% target).

  • Gap: Papers naming benchmarks/datasets

    Coverage is a replication risk (9.3% vs 35% target).

  • Moderate: Papers naming evaluation metrics

    Coverage is usable but incomplete (28.7% vs 35% target).

  • Gap: Papers with known rater population

    Coverage is a replication risk (6.7% vs 35% target).

  • Gap: Papers with known annotation unit

    Coverage is a replication risk (9.3% vs 35% target).
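
The Gap and Moderate labels in the checklist above follow from comparing observed coverage against each target. The sketch below reproduces them under the assumption that coverage at or above half the target counts as Moderate and anything lower as a Gap; only the coverage values and targets come from this page, while the 50% cutoff is an assumption.

```python
# (name, observed coverage %, target %) taken from the checklist above.
CHECKS = [
    ("Papers with explicit human feedback", 8.2, 45.0),
    ("Papers reporting quality controls", 1.5, 30.0),
    ("Papers naming benchmarks/datasets", 9.3, 35.0),
    ("Papers naming evaluation metrics", 28.7, 35.0),
    ("Papers with known rater population", 6.7, 35.0),
    ("Papers with known annotation unit", 9.3, 35.0),
]

def band(coverage, target):
    """Assumed banding: OK at target, Moderate at >= 50% of target, else Gap."""
    if coverage >= target:
        return "OK"
    return "Moderate" if coverage >= 0.5 * target else "Gap"

for name, cov, target in CHECKS:
    print(f"{band(cov, target):9s} {name}: {cov:.1f}% vs {target:.0f}% target")
```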

Strengths

  • Contains both human-eval and LLM-as-judge protocols for head-to-head methodology comparison.

Known Gaps

  • Only 1.5% of papers report quality controls; prioritize calibration/adjudication evidence.
  • Rater population is under-specified (6.7% coverage).
  • Annotation unit is under-specified (9.3% coverage).

Suggested Next Analyses

  • Compare papers that report both human_eval and llm_as_judge to quantify judge-human agreement drift.
  • Stratify by benchmark (AIME vs GPQA) before comparing methods (see the sketch after this list).
  • Track metric sensitivity by reporting both accuracy and latency.
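
The benchmark stratification suggested above amounts to grouping results by benchmark before any method comparison, so AIME and GPQA numbers are never pooled. A minimal sketch with hypothetical result rows:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-paper results; only the grouping idea comes from this page.
results = [
    {"benchmark": "AIME", "method": "method_a", "accuracy": 0.41},
    {"benchmark": "AIME", "method": "method_b", "accuracy": 0.38},
    {"benchmark": "GPQA", "method": "method_a", "accuracy": 0.55},
    {"benchmark": "GPQA", "method": "method_b", "accuracy": 0.58},
]

by_stratum = defaultdict(list)
for row in results:
    by_stratum[(row["benchmark"], row["method"])].append(row["accuracy"])

# Compare methods within each benchmark stratum, never across strata.
for (benchmark, method), scores in sorted(by_stratum.items()):
    print(f"{benchmark:6s} {method:10s} mean accuracy = {mean(scores):.2f}")
```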

Known Limitations

  • Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Snapshot (Detailed)

Evaluation Modes

  • Automatic Metrics (47)
  • Simulation Env (7)
  • Llm As Judge (5)
  • Human Eval (3)

Top Metrics

  • Accuracy (28)
  • Latency (11)
  • Cost (9)
  • F1 (9)

Top Benchmarks

  • AIME (3)
  • GPQA (2)
  • LongBench (2)
  • MMLU (2)

Quality Controls

  • Calibration (2)
  • Adjudication (1)
  • Inter Annotator Agreement Reported (1)
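
The snapshot counts above are plain tallies over the loaded sample. A minimal sketch, assuming per-paper extraction records with list-valued fields (field names and example records are illustrative):

```python
from collections import Counter

sample = [
    {"eval_modes": ["automatic_metrics"], "metrics": ["accuracy"],
     "benchmarks": ["AIME"], "quality_controls": []},
    {"eval_modes": ["llm_as_judge", "automatic_metrics"], "metrics": ["f1"],
     "benchmarks": [], "quality_controls": ["calibration"]},
    # ... remaining records in the loaded sample
]

def tally(records, field, top=4):
    """Count occurrences of each value in a list-valued field and keep the top entries."""
    return Counter(item for r in records for item in r[field]).most_common(top)

for field in ("eval_modes", "metrics", "benchmarks", "quality_controls"):
    print(field, tally(sample, field))
```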

Papers In This Archive Slice
