
HFEPX Archive Slice

HFEPX Weekly Archive: 2025-W26

Updated from the current HFEPX corpus (Mar 8, 2026). 10 papers are grouped in this weekly page. Common evaluation mode: Automatic Metrics. Most common rater population: Domain Experts. Common annotation unit: Ranking. Frequent quality control: Adjudication. Frequently cited benchmark: LMSYS Chatbot Arena. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. The newest paper in this set is from Jun 26, 2025.

Papers: 10 | Last published: Jun 26, 2025

Researcher Quick Triage

Use this archive page for time-slice monitoring (what changed in evaluation methods, metrics, and protocol quality this period). Quality band: Medium.

High-Signal Coverage

100.0%

10 / 10 papers are not flagged as low-signal.

Benchmark Anchors

0.0%

Papers with benchmark/dataset mentions in extraction output.

Metric Anchors

30.0%

Papers with reported metric mentions in extraction output.

  • 1 paper reports explicit quality controls for this archive period.
  • Prioritize papers with both benchmark and metric anchors for reliable longitudinal comparisons.

Primary action: treat this slice as an early signal only; benchmark/metric anchoring is too limited to support rigorous period-over-period claims (see the filter sketch below).
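
As a minimal sketch of that triage step, the snippet below filters an archive slice down to papers carrying both benchmark and metric anchors before any period-over-period comparison. The `Paper` record and its field names are illustrative assumptions, not an export format of this page.

```python
from dataclasses import dataclass, field

@dataclass
class Paper:
    """Hypothetical record for one extracted paper in an archive slice."""
    title: str
    benchmarks: list = field(default_factory=list)  # e.g. ["LMSYS Chatbot Arena"]
    metrics: list = field(default_factory=list)     # e.g. ["Accuracy", "Agreement"]

def anchored_papers(papers):
    """Keep only papers that have both a benchmark anchor and a metric anchor."""
    return [p for p in papers if p.benchmarks and p.metrics]

# With anchoring as sparse as in this slice, the filtered set can be empty,
# which is itself the signal to treat cross-period claims as provisional.
slice_w26 = [
    Paper("Complexity-aware fine-tuning", metrics=["Accuracy", "Cost"]),
    Paper("TTSDS2", metrics=["Spearman"]),
]
print(anchored_papers(slice_w26))  # [] -> nothing safely comparable across periods
```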

Why This Time Slice Matters

  • 20% of papers report explicit human-feedback signals, led by demonstration data.
  • Automatic metrics appear in 30% of papers in this hub.
  • LMSYS Chatbot Arena is a recurring benchmark anchor for cross-paper comparisons on this page.

Protocol Takeaways For This Period

  • The most common quality-control signal is adjudication (10% of papers).
  • Rater context is mostly domain experts, and the annotation unit is commonly ranking; use this to scope replication staffing.
  • Stratify by benchmark (LMSYS Chatbot Arena vs WritingBench) before comparing methods (see the sketch after this list).
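
A minimal sketch of that stratification, assuming hypothetical per-benchmark scores; methods are compared only within the same benchmark stratum, never across strata.

```python
from collections import defaultdict

# Hypothetical (method, benchmark, score) rows; the scores are illustrative only.
results = [
    ("Method A", "LMSYS Chatbot Arena", 0.61),
    ("Method B", "LMSYS Chatbot Arena", 0.58),
    ("Method A", "WritingBench", 0.47),
]

by_benchmark = defaultdict(list)
for method, benchmark, score in results:
    by_benchmark[benchmark].append((method, score))

# Report the best method per benchmark stratum instead of one pooled ranking.
for benchmark, rows in by_benchmark.items():
    best = max(rows, key=lambda r: r[1])
    print(f"{benchmark}: best within stratum = {best[0]} ({best[1]:.2f})")
```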

Start Here (Highest-Signal Papers In This Slice)

Ranked by protocol completeness and evidence density for faster period-over-period review.

Protocol Matrix (Top 10)

Quickly compare method ingredients across this archive slice.

Paper | Date | Eval Modes | Benchmarks | Metrics | Quality Controls
An Agentic System for Rare Disease Diagnosis with Traceable Reasoning | Jun 25, 2025 | Automatic Metrics | Not reported | Recall, Agreement | Adjudication
Complexity-aware fine-tuning | Jun 26, 2025 | Automatic Metrics | Not reported | Accuracy, Cost | Not reported
TTSDS2: Resources and Benchmark for Evaluating Human-Quality Text to Speech Systems | Jun 24, 2025 | Automatic Metrics | Not reported | Spearman | Not reported
Programming by Backprop: An Instruction is Worth 100 Examples When Finetuning LLMs | Jun 23, 2025 | Not reported | Not reported | Not reported | Not reported
π-CoT: Prolog-Initialized Chain-of-Thought Prompting for Multi-Hop Question-Answering | Jun 25, 2025 | Not reported | Not reported | Not reported | Not reported
Parallel Continuous Chain-of-Thought with Jacobi Iteration | Jun 23, 2025 | Not reported | Not reported | Not reported | Not reported
Cognitive models can reveal interpretable value trade-offs in language models | Jun 25, 2025 | Not reported | Not reported | Not reported | Not reported
Breaking Barriers: Do Reinforcement Post Training Gains Transfer To Unseen Domains? | Jun 24, 2025 | Not reported | Not reported | Not reported | Not reported
LongWriter-Zero: Mastering Ultra-Long Text Generation via Reinforcement Learning | Jun 23, 2025 | Not reported | Not reported | Not reported | Not reported
Context Biasing for Pronunciation-Orthography Mismatch in Automatic Speech Recognition | Jun 23, 2025 | Not reported | Not reported | Not reported | Not reported

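The matrix rows above can be read as plain records to reproduce the coverage figures quoted earlier on this page (30% metric anchors, 10% quality controls). The tuple encoding below is a sketch, not a data export of this archive.

```python
# Each row: (eval_modes, benchmarks, metrics, quality_controls); None stands for "Not reported".
rows = [
    ("Automatic Metrics", None, "Recall, Agreement", "Adjudication"),
    ("Automatic Metrics", None, "Accuracy, Cost", None),
    ("Automatic Metrics", None, "Spearman", None),
] + [(None, None, None, None)] * 7  # the seven fully "Not reported" rows

def coverage(rows, column):
    """Share of rows (in percent) that report a value in the given column."""
    return 100.0 * sum(r[column] is not None for r in rows) / len(rows)

print(coverage(rows, 0))  # 30.0 -> evaluation modes
print(coverage(rows, 2))  # 30.0 -> metric anchors
print(coverage(rows, 3))  # 10.0 -> quality controls
```
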
Researcher Workflow (Detailed)

Checklist

  • Gap: Papers with explicit human feedback

    Coverage is a replication risk (20% vs 45% target).

  • Gap: Papers reporting quality controls

    Coverage is a replication risk (10% vs 30% target).

  • Gap: Papers naming benchmarks/datasets

    Coverage is a replication risk (10% vs 35% target).

  • Moderate: Papers naming evaluation metrics

    Coverage is usable but incomplete (30% vs 35% target).

  • Gap: Papers with known rater population

    Coverage is a replication risk (10% vs 35% target).

  • Gap: Papers with known annotation unit

    Coverage is a replication risk (10% vs 35% target); the Gap/Moderate banding used in this checklist is sketched below.
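
The Gap/Moderate labels in the checklist follow a simple coverage-versus-target comparison. The sketch below reproduces that banding; the 10-point margin used for "Moderate" is an assumption, not a documented threshold.

```python
def coverage_band(coverage_pct, target_pct, moderate_margin=10.0):
    """Band a coverage figure against its target; the margin is an assumed value."""
    if coverage_pct >= target_pct:
        return "OK"
    if coverage_pct >= target_pct - moderate_margin:
        return "Moderate"
    return "Gap"

print(coverage_band(30, 35))  # Moderate (metrics: 30% vs 35% target)
print(coverage_band(20, 45))  # Gap      (human feedback: 20% vs 45% target)
print(coverage_band(10, 35))  # Gap      (benchmarks: 10% vs 35% target)
```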

Strengths

  • This hub still surfaces a concentrated paper set for protocol triage and replication planning.

Known Gaps

  • Only 10% of papers report quality controls; prioritize calibration/adjudication evidence.
  • Rater population is under-specified (10% coverage).
  • Annotation unit is under-specified (10% coverage).

Suggested Next Analyses

  • Stratify by benchmark (LMSYS Chatbot Arena vs WritingBench) before comparing methods.
  • Track metric sensitivity by reporting both accuracy and agreement.
  • Add inter-annotator agreement checks when reproducing these protocols (a minimal example follows this list).
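
For the last two suggestions, reporting a chance-corrected agreement score alongside raw agreement is the usual minimal check. The example below uses Cohen's kappa from scikit-learn with illustrative labels; it assumes scikit-learn is installed and is not tied to any specific paper in this slice.

```python
from sklearn.metrics import accuracy_score, cohen_kappa_score

# Two annotators labeling the same items (illustrative labels only).
rater_a = ["A", "B", "A", "A", "B", "A"]
rater_b = ["A", "B", "B", "A", "B", "A"]

# Report both signals: raw agreement (accuracy-style) and chance-corrected agreement.
print("raw agreement:", accuracy_score(rater_a, rater_b))
print("cohen's kappa:", cohen_kappa_score(rater_a, rater_b))
```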

Known Limitations

  • Only 10% of papers report quality controls; prioritize calibration/adjudication evidence.
  • Rater population is under-specified (10% coverage).
  • Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Snapshot (Detailed)

Evaluation Modes

  • Automatic Metrics (3)

Top Metrics

  • Accuracy (1)
  • Agreement (1)
  • Coherence (1)
  • Error rate (1)

Top Benchmarks

  • LMSYS Chatbot Arena (1)
  • WritingBench (1)

Quality Controls

  • Adjudication (1)
