
HFEPX Archive Slice

HFEPX Fortnight Archive: 2025-F04

Updated from the current HFEPX corpus (Mar 8, 2026). 9 papers are grouped in this fortnight page. Common evaluation mode: Automatic Metrics. Most common rater population: Domain Experts. Common annotation unit: Freeform. Frequent quality control: Calibration. Frequently cited benchmark: AlpacaEval 2.0. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. The newest paper in this set is from Feb 22, 2025.

Papers: 9 · Last published: Feb 22, 2025

Researcher Quick Triage

Use this archive page for time-slice monitoring (what changed in evaluation methods, metrics, and protocol quality this period). Quality band: Medium.

High-Signal Coverage

100.0%

9 / 9 papers are not flagged as low-signal.

Benchmark Anchors

33.3%

Papers with benchmark/dataset mentions in extraction output.

Metric Anchors

33.3%

Papers with reported metric mentions in extraction output.

  • 1 paper reports explicit quality controls for this archive period.
  • Prioritize papers with both benchmark and metric anchors for reliable longitudinal comparisons.

Primary action: Use this slice as early signal only; benchmark/metric anchoring is too limited for rigorous period-over-period claims. A minimal anchor-filtering sketch follows.
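To make the anchor-prioritization bullet above concrete, here is a minimal filtering sketch. It assumes each paper's extraction output is reduced to a flat record; the `papers` list is distilled from the Protocol Matrix below (titles abbreviated), and the field names are illustrative assumptions, not an HFEPX schema.

```python
# Anchor fields distilled from the Protocol Matrix (titles abbreviated).
# Record shape is an illustrative assumption, not an HFEPX schema.
papers = [
    {"title": "Moving Beyond Medical Exams", "benchmarks": [],                 "metrics": ["Accuracy"]},
    {"title": "Hallucination & Monofacts",   "benchmarks": [],                 "metrics": ["Accuracy"]},
    {"title": "Less is More",                "benchmarks": ["AlpacaEval 2.0"], "metrics": []},
    {"title": "Glycemic-Aware Forecasting",  "benchmarks": [],                 "metrics": ["Accuracy", "F1"]},
    {"title": "MathFimer",                   "benchmarks": ["GSM8K"],          "metrics": []},
    {"title": "Multilingual Pretraining",    "benchmarks": ["MMLU"],           "metrics": []},
    {"title": "SEFL",                        "benchmarks": [],                 "metrics": []},
    {"title": "Path of Least Resistance",    "benchmarks": [],                 "metrics": []},
    {"title": "Sparse Shift Autoencoders",   "benchmarks": [],                 "metrics": []},
]

with_bench = [p for p in papers if p["benchmarks"]]
with_metric = [p for p in papers if p["metrics"]]
with_both = [p for p in papers if p["benchmarks"] and p["metrics"]]

# Reproduces the triage cards above: 3/9 = 33.3% for each anchor type.
print(f"benchmark anchors: {len(with_bench)}/{len(papers)} = {len(with_bench) / len(papers):.1%}")
print(f"metric anchors:    {len(with_metric)}/{len(papers)} = {len(with_metric) / len(papers):.1%}")
print(f"prioritize (both anchors): {[p['title'] for p in with_both]}")
```

Note that on this slice the both-anchors list comes back empty: the benchmark-anchored papers report no metrics and vice versa, which is exactly why the primary action treats this period as early signal only.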

Why This Time Slice Matters

  • 33.3% of papers report explicit human-feedback signals, led by pairwise preferences.
  • Automatic metrics appear in 33.3% of papers in this hub.
  • AlpacaEval 2.0 is a recurring benchmark anchor for cross-paper comparisons in this page.

Protocol Takeaways For This Period

  • The most common quality-control signal is rater calibration (11.1% of papers).
  • Rater context is mostly domain experts, and annotation is commonly Freeform; use this to scope replication staffing.
  • Add inter-annotator agreement checks when reproducing these protocols (a minimal agreement sketch follows this list).
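For the agreement check recommended in the last bullet, a single Cohen's kappa over paired labels is often enough to start. The sketch below is a from-scratch illustration under the assumption of two raters labeling the same items; the toy pairwise-preference labels are invented, and none of this is HFEPX tooling.

```python
from collections import Counter

def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Chance-corrected agreement between two raters over the same items."""
    assert len(rater_a) == len(rater_b) and rater_a, "need paired, non-empty labels"
    n = len(rater_a)
    # Observed agreement: fraction of items where the raters match.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement from each rater's marginal label distribution.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    return 1.0 if p_e == 1.0 else (p_o - p_e) / (1.0 - p_e)

# Invented pairwise-preference labels from two hypothetical raters.
a = ["win", "win", "tie", "loss", "win", "loss"]
b = ["win", "tie", "tie", "loss", "win", "win"]
print(f"kappa = {cohens_kappa(a, b):.2f}")  # 0.48: moderate agreement
```

As a rough rule of thumb, kappa below about 0.4 on a replication run is a sign the rubric or rater calibration needs work before trusting period-over-period comparisons.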

Start Here (Highest-Signal Papers In This Slice)

Ranked by protocol completeness and evidence density for faster period-over-period review; the Protocol Matrix below lists papers in this order.

Protocol Matrix (Top 10; all 9 papers in this slice)

Quickly compare method ingredients across this archive slice.

Paper | Date | Eval Modes | Benchmarks | Metrics | Quality Controls
Moving Beyond Medical Exams: A Clinician-Annotated Fairness Dataset of Real-World Tasks and Ambiguity in Mental Healthcare | Feb 22, 2025 | Automatic Metrics | Not reported | Accuracy | Not reported
Hallucination, Monofacts, and Miscalibration: An Empirical Investigation | Feb 11, 2025 | Automatic Metrics | Not reported | Accuracy | Calibration
Less is More: Improving LLM Alignment via Preference Data Selection | Feb 20, 2025 | Not reported | AlpacaEval 2.0 | Not reported | Not reported
Glycemic-Aware and Architecture-Agnostic Training Framework for Blood Glucose Forecasting in Type 1 Diabetes | Feb 20, 2025 | Automatic Metrics | Not reported | Accuracy, F1 | Not reported
MathFimer: Enhancing Mathematical Reasoning by Expanding Reasoning Steps through Fill-in-the-Middle Task | Feb 17, 2025 | Not reported | GSM8K | Not reported | Not reported
Enhancing Multilingual LLM Pretraining with Model-Based Data Selection | Feb 14, 2025 | Not reported | MMLU | Not reported | Not reported
SEFL: A Framework for Generating Synthetic Educational Assignment Feedback with LLM Agents | Feb 18, 2025 | Not reported | Not reported | Not reported | Not reported
Using the Path of Least Resistance to Explain Deep Networks | Feb 17, 2025 | Not reported | Not reported | Not reported | Not reported
Sparse Shift Autoencoders for Identifying Concepts from Large Language Model Activations | Feb 14, 2025 | Not reported | Not reported | Not reported | Not reported
Researcher Workflow (Detailed)

Checklist

  • Moderate: Papers with explicit human feedback

    Coverage is usable but incomplete (33.3% vs 45% target).

  • Gap: Papers reporting quality controls

    Coverage is a replication risk (11.1% vs 30% target).

  • Gap: Papers naming benchmarks/datasets

    Coverage is a replication risk (11.1% vs 35% target).

  • Gap: Papers naming evaluation metrics

    Coverage is a replication risk (11.1% vs 35% target).

  • Moderate: Papers with known rater population

    Coverage is usable but incomplete (22.2% vs 35% target).

  • Gap: Papers with known annotation unit

    Coverage is a replication risk (11.1% vs 35% target). The banding rule behind these Moderate/Gap labels is sketched below.
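The Moderate/Gap labels in this checklist come from comparing coverage to a per-dimension target. The banding rule below is an assumption chosen because it reproduces every label on this page (coverage at or above half the target reads as Moderate, anything lower as Gap); HFEPX's actual thresholds are not documented here.

```python
# Checklist rows as (dimension, coverage %, target %), copied from this page.
checklist = [
    ("explicit human feedback", 33.3, 45.0),
    ("quality controls",        11.1, 30.0),
    ("benchmarks/datasets",     11.1, 35.0),
    ("evaluation metrics",      11.1, 35.0),
    ("known rater population",  22.2, 35.0),
    ("known annotation unit",   11.1, 35.0),
]

def band(coverage: float, target: float) -> str:
    # Assumed thresholds: meeting the target is fine; at least half of it
    # reads as "Moderate"; anything lower is a replication-risk "Gap".
    if coverage >= target:
        return "OK"
    return "Moderate" if coverage >= 0.5 * target else "Gap"

for dim, cov, tgt in checklist:
    print(f"{band(cov, tgt):8s} {dim}: {cov:.1f}% vs {tgt:.0f}% target")
```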

Strengths

  • Despite the gaps above, this hub surfaces a concentrated paper set for protocol triage and replication planning.

Known Gaps

  • Only 11.1% of papers report quality controls; prioritize calibration/adjudication evidence.
  • Rater population is under-specified (22.2% coverage).
  • Annotation unit is under-specified (11.1% coverage).

Suggested Next Analyses

  • Add inter-annotator agreement checks when reproducing these protocols.


Known Limitations

  • Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.
Research Utility Snapshot (Detailed)

Evaluation Modes

  • Automatic Metrics (3)

Top Metrics

  • Accuracy (3)
  • F1 (1)

Top Benchmarks

  • AlpacaEval 2.0 (1)
  • GSM8K (1)
  • MMLU (1)

Quality Controls

  • Calibration (1)
