HFEPX Archive Slice

HFEPX Daily Archive: 2026-03-01

Updated from current HFEPX corpus (Mar 10, 2026). 13 papers are grouped in this daily page.

Read Full Context

Updated from current HFEPX corpus (Mar 10, 2026). 13 papers are grouped in this daily page. Common evaluation modes: Automatic Metrics. Most common rater population: Domain Experts. Frequent quality control: Calibration. Frequently cited benchmark: AIME. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Mar 1, 2026.

Papers: 13 Last published: Mar 1, 2026 Global RSS

Researcher Quick Triage

Use this archive page for time-slice monitoring (what changed in evaluation methods, metrics, and protocol quality this period). Quality band: Medium .

High-Signal Coverage

100.0%

13 / 13 papers are not low-signal flagged.

Benchmark Anchors

7.7%

Papers with benchmark/dataset mentions in extraction output.

Metric Anchors

23.1%

Papers with reported metric mentions in extraction output.

1 papers report explicit quality controls for this archive period.
Prioritize papers with both benchmark and metric anchors for reliable longitudinal comparisons.

Primary action: Use this slice as early signal only; benchmark/metric anchoring is limited for rigorous period-over-period claims.

Get this digest every Friday →

Why This Time Slice Matters

automatic metrics appears in 23.1% of papers in this hub.
AIME is a recurring benchmark anchor for cross-paper comparisons in this page.
accuracy is a repeated reporting metric here, enabling more consistent cross-paper score interpretation.

Protocol Takeaways For This Period

Most common quality-control signal is rater calibration (7.7% of papers).
Rater context is mostly domain experts, and annotation is commonly mixed annotation units; use this to scope replication staffing.
Stratify by benchmark (AIME vs BioProBench) before comparing methods.

Start Here (Highest-Signal Papers In This Slice)

Ranked by protocol completeness and evidence density for faster period-over-period review.

KVSlimmer: Theoretical Insights and Practical Optimizations for Asymmetric KV Merging
Mar 1, 2026 · Citations: 0 · Score: 6.5

Eval: Automatic Metrics · Metrics: Latency
Conformal Prediction for Risk-Controlled Medical Entity Extraction Across Clinical Domains
Mar 1, 2026 · Citations: 0 · Score: 6.0

Eval: Automatic Metrics · Metrics: Accuracy, F1
Learn Hard Problems During RL with Reference Guided Fine-tuning
Mar 1, 2026 · Citations: 0 · Score: 5.0

Eval: Automatic Metrics · Metrics: Accuracy
VoxKnesset: A Large-Scale Longitudinal Hebrew Speech Dataset for Aging Speaker Modeling
Mar 1, 2026 · Citations: 0 · Score: 2.5

Eval: Not reported · Metrics: Not reported
A Study on Building Efficient Zero-Shot Relation Extraction Models
Mar 1, 2026 · Citations: 0 · Score: 0.0

Eval: Not reported · Metrics: Not reported
Prompt Sensitivity and Answer Consistency of Small Open-Source Large Language Models on Clinical Question Answering: Implications for Low-Resource Healthcare Deployment
Mar 1, 2026 · Citations: 0 · Score: 0.0

Eval: Not reported · Metrics: Not reported

Protocol Matrix (Top 10)

Quickly compare method ingredients across this archive slice.

Paper	Eval Modes	Benchmarks	Metrics	Quality Controls
KVSlimmer: Theoretical Insights and Practical Optimizations for Asymmetric KV Merging Mar 1, 2026	Automatic Metrics	LongBench	Latency	Not reported
Conformal Prediction for Risk-Controlled Medical Entity Extraction Across Clinical Domains Mar 1, 2026	Automatic Metrics	Not reported	Accuracy, F1	Calibration
Learn Hard Problems During RL with Reference Guided Fine-tuning Mar 1, 2026	Automatic Metrics	Not reported	Accuracy	Not reported
VoxKnesset: A Large-Scale Longitudinal Hebrew Speech Dataset for Aging Speaker Modeling Mar 1, 2026	Not reported	Not reported	Not reported	Not reported
A Study on Building Efficient Zero-Shot Relation Extraction Models Mar 1, 2026	Not reported	Not reported	Not reported	Not reported
Prompt Sensitivity and Answer Consistency of Small Open-Source Large Language Models on Clinical Question Answering: Implications for Low-Resource Healthcare Deployment Mar 1, 2026	Not reported	Not reported	Not reported	Not reported
Curvature-Weighted Capacity Allocation: A Minimum Description Length Framework for Layer-Adaptive Large Language Model Optimization Mar 1, 2026	Not reported	Not reported	Not reported	Not reported
CHIMERA: Compact Synthetic Data for Generalizable LLM Reasoning Mar 1, 2026	Not reported	Not reported	Not reported	Not reported
Knowledge without Wisdom: Measuring Misalignment between LLMs and Intended Impact Mar 1, 2026	Not reported	Not reported	Not reported	Not reported
BioProAgent: Neuro-Symbolic Grounding for Constrained Scientific Planning Mar 1, 2026	Not reported	Not reported	Not reported	Not reported

Researcher Workflow (Detailed)

Checklist

Gap: Papers with explicit human feedback

Coverage is a replication risk (0% vs 45% target).
Gap: Papers reporting quality controls

Coverage is a replication risk (7.7% vs 30% target).
Moderate: Papers naming benchmarks/datasets

Coverage is usable but incomplete (23.1% vs 35% target).
Moderate: Papers naming evaluation metrics

Coverage is usable but incomplete (23.1% vs 35% target).
Gap: Papers with known rater population

Coverage is a replication risk (7.7% vs 35% target).
Gap: Papers with known annotation unit

Coverage is a replication risk (0% vs 35% target).

Strengths

This hub still surfaces a concentrated paper set for protocol triage and replication planning.

Known Gaps

Only 7.7% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (7.7% coverage).
Annotation unit is under-specified (0% coverage).

Suggested Next Analyses

Stratify by benchmark (AIME vs BioProBench) before comparing methods.
Add inter-annotator agreement checks when reproducing these protocols.

Recommended Queries

Benchmark Slice: AIME Metric Slice: accuracy Recent High-Signal Papers

Known Limitations

Only 7.7% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (7.7% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Snapshot (Detailed)

Evaluation Modes

Automatic Metrics (3)

Top Metrics

Accuracy (3)

Top Benchmarks

AIME (1)
BioProBench (1)
GPQA (1)
HLE (1)

Quality Controls

Calibration (1)

Papers In This Archive Slice

VoxKnesset: A Large-Scale Longitudinal Hebrew Speech Dataset for Aging Speaker Modeling
Yanir Marmor, Arad Zulti, David Krongauz, Adam Gabet, Yoad Snapir · Mar 1, 2026 · Citations: 0

Speech processing systems face a fundamental challenge: the human voice changes with age, yet few datasets support rigorous longitudinal evaluation.
A Study on Building Efficient Zero-Shot Relation Extraction Models
Hugo Thomas, Caio Corro, Guillaume Gravier, Pascale Sébillot · Mar 1, 2026 · Citations: 0
Learn Hard Problems During RL with Reference Guided Fine-tuning
Yangzhen Wu, Shanda Li, Zixin Wen, Xin Zhou, Ameet Talwalkar · Mar 1, 2026 · Citations: 0

We introduce Reference-Guided Fine-Tuning (ReGFT), a simple and effective method that utilizes human-written reference solutions to synthesize positive trajectories on hard problems and train on them before RL.
Conformal Prediction for Risk-Controlled Medical Entity Extraction Across Clinical Domains
Manil Shrestha, Edward Kim · Mar 1, 2026 · Citations: 0

First, we extract structured entities from 1,000 FDA drug labels across eight sections using GPT-4.1, verified via FactScore-based atomic statement evaluation (97.7\% accuracy over 128,906 entities).
Prompt Sensitivity and Answer Consistency of Small Open-Source Large Language Models on Clinical Question Answering: Implications for Low-Resource Healthcare Deployment
Shravani Hariprasad · Mar 1, 2026 · Citations: 0
Curvature-Weighted Capacity Allocation: A Minimum Description Length Framework for Layer-Adaptive Large Language Model Optimization
Theophilus Amaefuna, Hitesh Vaidya, Anshuman Chhabra, Ankur Mali · Mar 1, 2026 · Citations: 0
KVSlimmer: Theoretical Insights and Practical Optimizations for Asymmetric KV Merging
Lianjun Liu, Hongli An, Weiqi Yan, Xin Du, Shengchuan Zhang · Mar 1, 2026 · Citations: 0

Extensive experiments across various models and benchmarks demonstrate that KVSlimmer consistently outperforms SOTA methods.
CHIMERA: Compact Synthetic Data for Generalizable LLM Reasoning
Xinyu Zhu, Yihao Feng, Yanchao Sun, Xianzhi Du, Pingzhi Li · Mar 1, 2026 · Citations: 0
Knowledge without Wisdom: Measuring Misalignment between LLMs and Intended Impact
Michael Hardy, Yunsung Kim · Mar 1, 2026 · Citations: 0
BioProAgent: Neuro-Symbolic Grounding for Constrained Scientific Planning
Yuyang Liu, Jingya Wang, Liuzhenghao Lv, Yonghong Tian · Mar 1, 2026 · Citations: 0
MC-Search: Evaluating and Enhancing Multimodal Agentic Search with Structured Long Reasoning Chains
Xuying Ning, Dongqi Fu, Tianxin Wei, Mengting Ai, Jiaru Zou · Mar 1, 2026 · Citations: 0
PPC-MT: Parallel Point Cloud Completion with Mamba-Transformer Hybrid Architecture
Jie Li, Shengwei Tian, Long Yu, Xin Ning · Mar 1, 2026 · Citations: 0
MedGPT-oss: Training a General-Purpose Vision-Language Model for Biomedicine
Kai Zhang, Zhengqing Yuan, Cheng Peng, Songlin Zhao, Mengxian Lyu · Mar 1, 2026 · Citations: 0

Need human evaluators for your AI research? Scale annotation with expert AI Trainers.

Post a Job Get a Quote