HFEPX Archive Slice

HFEPX Fortnight Archive: 2025-F13

Updated from the current HFEPX corpus (Mar 1, 2026). This fortnight page groups 10 papers. Common evaluation modes: Automatic Metrics, LLM-as-Judge. Most common rater population: Domain Experts. Common annotation unit: Ranking. Frequent quality control: Adjudication. Common metric signal: agreement. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. The newest paper in this set is from Jun 26, 2025.

Papers: 10 · Last published: Jun 26, 2025

Researcher Quick Triage

Use this archive page for time-slice monitoring: what changed in evaluation methods, metrics, and protocol quality this period. Quality band: Medium.

High-Signal Coverage

100.0%

10 / 10 papers are not flagged as low-signal.

Benchmark Anchors

20.0%

Papers with benchmark/dataset mentions in extraction output.

Metric Anchors

60.0%

Papers with reported metric mentions in extraction output.

1 paper reports explicit quality controls for this archive period.
Prioritize papers with both benchmark and metric anchors for reliable longitudinal comparisons.

Primary action: Use this slice as an early signal only; benchmark/metric anchoring is too limited for rigorous period-over-period claims.

Why This Time Slice Matters

30% of papers report explicit human-feedback signals, led by expert verification.
Automatic metrics appear in 60% of papers in this hub.
Multi-agent setups appear in 10% of papers, suggesting some demand for agentic evaluation.

Protocol Takeaways For This Period

The most common quality-control signal is adjudication (10% of papers).
Raters are mostly domain experts, and the most common annotation unit is ranking; use this to scope replication staffing.
Pair this hub with a human_eval-heavy hub to validate judge-model calibration.

Start Here (Highest-Signal Papers In This Slice)

Ranked by protocol completeness and evidence density for faster period-over-period review.

An Agentic System for Rare Disease Diagnosis with Traceable Reasoning
Jun 25, 2025 · Citations: 0 · Score: 5.5
Eval: Automatic Metrics · Metrics: Recall, Agreement

PersonalAI: A Systematic Comparison of Knowledge Graph Storage and Retrieval Approaches for Personalized LLM agents
Jun 20, 2025 · Citations: 0 · Score: 5.0
Eval: Automatic Metrics · Metrics: Accuracy

DistillNote: Toward a Functional Evaluation Framework of LLM-Generated Clinical Note Summaries
Jun 20, 2025 · Citations: 0 · Score: 4.5
Eval: LLM-as-Judge, Automatic Metrics · Metrics: AUROC

Complexity-aware fine-tuning
Jun 26, 2025 · Citations: 0 · Score: 3.5
Eval: Automatic Metrics · Metrics: Accuracy, Cost

A Scoping Review of Synthetic Data Generation by Language Models in Biomedical Research and Application: Data Utility and Quality Perspectives
Jun 19, 2025 · Citations: 0 · Score: 3.5
Eval: Automatic Metrics · Metrics: Relevance

DeVisE: Behavioral Testing of Medical Large Language Models
Jun 18, 2025 · Citations: 0 · Score: 3.5
Eval: Automatic Metrics · Metrics: Perplexity

Protocol Matrix (Top 10)

Quickly compare method ingredients across this archive slice.

Paper	Date	Eval Modes	Benchmarks	Metrics	Quality Controls
An Agentic System for Rare Disease Diagnosis with Traceable Reasoning	Jun 25, 2025	Automatic Metrics	Not reported	Recall, Agreement	Adjudication
PersonalAI: A Systematic Comparison of Knowledge Graph Storage and Retrieval Approaches for Personalized LLM agents	Jun 20, 2025	Automatic Metrics	HotpotQA, TriviaQA	Accuracy	Not reported
DistillNote: Toward a Functional Evaluation Framework of LLM-Generated Clinical Note Summaries	Jun 20, 2025	LLM-as-Judge, Automatic Metrics	Not reported	AUROC	Not reported
Complexity-aware fine-tuning	Jun 26, 2025	Automatic Metrics	Not reported	Accuracy, Cost	Not reported
A Scoping Review of Synthetic Data Generation by Language Models in Biomedical Research and Application: Data Utility and Quality Perspectives	Jun 19, 2025	Automatic Metrics	Not reported	Relevance	Not reported
DeVisE: Behavioral Testing of Medical Large Language Models	Jun 18, 2025	Automatic Metrics	Not reported	Perplexity	Not reported
Revela: Dense Retriever Learning via Language Modeling	Jun 19, 2025	Not reported	BEIR	Not reported	Not reported
Programming by Backprop: An Instruction is Worth 100 Examples When Finetuning LLMs	Jun 23, 2025	Not reported	Not reported	Not reported	Not reported
π-CoT: Prolog-Initialized Chain-of-Thought Prompting for Multi-Hop Question-Answering	Jun 25, 2025	Not reported	Not reported	Not reported	Not reported
Parallel Continuous Chain-of-Thought with Jacobi Iteration	Jun 23, 2025	Not reported	Not reported	Not reported	Not reported
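
The anchor percentages quoted on this page can be recomputed from the matrix rows. The following is a minimal Python sketch, not the HFEPX pipeline: the row tuples are transcribed by hand from the table above, "Not reported" is mapped to None, and the names MATRIX, FIELDS, and coverage are illustrative assumptions.

```python
# Rows transcribed from the Protocol Matrix; "Not reported" becomes None.
# Column order: title, eval modes, benchmarks, metrics, quality controls.
MATRIX = [
    ("An Agentic System for Rare Disease Diagnosis with Traceable Reasoning",
     "Automatic Metrics", None, "Recall, Agreement", "Adjudication"),
    ("PersonalAI", "Automatic Metrics", "HotpotQA, TriviaQA", "Accuracy", None),
    ("DistillNote", "LLM-as-Judge, Automatic Metrics", None, "AUROC", None),
    ("Complexity-aware fine-tuning", "Automatic Metrics", None, "Accuracy, Cost", None),
    ("A Scoping Review of Synthetic Data Generation", "Automatic Metrics", None, "Relevance", None),
    ("DeVisE", "Automatic Metrics", None, "Perplexity", None),
    ("Revela", None, "BEIR", None, None),
    ("Programming by Backprop", None, None, None, None),
    ("π-CoT", None, None, None, None),
    ("Parallel Continuous Chain-of-Thought with Jacobi Iteration", None, None, None, None),
]

# Hypothetical field names mapped to tuple positions.
FIELDS = {"eval_modes": 1, "benchmarks": 2, "metrics": 3, "quality_controls": 4}


def coverage(rows, column):
    """Return the percentage of rows in which the given column is reported."""
    reported = sum(1 for row in rows if row[column] is not None)
    return 100.0 * reported / len(rows)


summary = {name: coverage(MATRIX, idx) for name, idx in FIELDS.items()}
for name, pct in summary.items():
    print(f"{name}: {pct:.1f}%")
```

Recomputing coverage this way is a quick consistency check before citing a percentage from the page in a period-over-period comparison.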

Researcher Workflow (Detailed)

Checklist

Moderate: Papers with explicit human feedback
Coverage is usable but incomplete (30% vs 45% target).

Gap: Papers reporting quality controls
Coverage is a replication risk (10% vs 30% target).

Gap: Papers naming benchmarks/datasets
Coverage is a replication risk (0% vs 35% target).

Gap: Papers naming evaluation metrics
Coverage is a replication risk (20% vs 35% target).

Gap: Papers with known rater population
Coverage is a replication risk (20% vs 35% target).

Gap: Papers with known annotation unit
Coverage is a replication risk (10% vs 35% target).
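
The checklist compares each coverage figure with a target and labels the result Moderate or Gap. The page does not state the rule behind those labels, so the sketch below is only one possible reading: the 0.6 ratio cut-off and the "Met" label are assumptions chosen to reproduce the labels shown, and classify and CHECKLIST are hypothetical names.

```python
def classify(coverage_pct, target_pct, moderate_ratio=0.6):
    """Label coverage relative to its target.

    moderate_ratio is an assumed cut-off (not taken from the page):
    coverage at or above moderate_ratio * target counts as "Moderate".
    "Met" is likewise a hypothetical label for coverage at or above target.
    """
    if coverage_pct >= target_pct:
        return "Met"
    if coverage_pct >= moderate_ratio * target_pct:
        return "Moderate"
    return "Gap"


# (item, coverage %, target %) as listed in the checklist.
CHECKLIST = [
    ("explicit human feedback", 30, 45),
    ("quality controls", 10, 30),
    ("benchmarks/datasets", 0, 35),
    ("evaluation metrics", 20, 35),
    ("known rater population", 20, 35),
    ("known annotation unit", 10, 35),
]

labels = {item: classify(cov, target) for item, cov, target in CHECKLIST}
for item, label in labels.items():
    print(f"{label}: {item}")
```

Under these assumptions only the human-feedback item lands in Moderate; the other five are Gaps, matching the checklist.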

Strengths

This hub still surfaces a concentrated paper set for protocol triage and replication planning.

Known Gaps

Only 10% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (20% coverage).
Annotation unit is under-specified (10% coverage).

Suggested Next Analyses

Pair this hub with a human_eval-heavy hub to validate judge-model calibration.
Track metric sensitivity by reporting both agreement and AUROC.
Add inter-annotator agreement checks when reproducing these protocols.
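
The agreement and AUROC suggestions above can be sketched in plain Python. This is an illustration under stated assumptions, not a method taken from any paper in this slice: Cohen's kappa stands in for "agreement" between two raters (for example a human expert and a judge model), AUROC is computed by pairwise comparison of positive and negative scores, and the input lists are made-up toy values.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two raters."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected by chance from each rater's label frequencies.
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


def auroc(scores, labels):
    """AUROC as the share of (positive, negative) pairs ranked correctly.

    Ties count as half. Labels must be 0 or 1.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): human labels, judge labels, judge scores.
human = [1, 1, 0, 0, 1, 0]
judge_labels = [1, 0, 0, 0, 1, 1]
judge_scores = [0.9, 0.4, 0.2, 0.3, 0.8, 0.6]

kappa = cohen_kappa(human, judge_labels)
auc = auroc(judge_scores, human)
print(f"kappa={kappa:.3f} auroc={auc:.3f}")
```

Reporting both numbers side by side shows whether a judge that ranks items well (high AUROC) also agrees with humans once its scores are thresholded into labels (kappa).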

Recommended Queries

LLM-as-Judge Protocols
Metric Slice: agreement
Recent High-Signal Papers

Known Limitations

Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Snapshot (Detailed)

Evaluation Modes

Automatic Metrics (6)
LLM-as-Judge (1)

Top Metrics

Agreement (1)
AUROC (1)
Recall (1)
Recall@1 (1)

Top Benchmarks

Quality Controls

Adjudication (1)

Papers In This Archive Slice

Complexity-aware fine-tuning
Andrey Goncharov, Daniil Vyazhev, Petr Sychev, Edvard Khalafyan, Alexey Zaytsev · Jun 26, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

π-CoT: Prolog-Initialized Chain-of-Thought Prompting for Multi-Hop Question-Answering
Chao Wan, Albert Gong, Mihir Mishra, Carl-Leander Henneking, Claas Beger · Jun 25, 2025 · Citations: 0

Extensive experiments demonstrate that π-CoT significantly outperforms standard RAG and in-context CoT on multi-hop question-answering benchmarks.

An Agentic System for Rare Disease Diagnosis with Traceable Reasoning
Weike Zhao, Chaoyi Wu, Yanjie Fan, Xiaoman Zhang, Pengcheng Qiu · Jun 25, 2025 · Citations: 0

Expert Verification · Multi Agent

Here we present DeepRare, a multi-agent system for rare disease differential diagnosis decision support powered by large language models, integrating over 40 specialized tools and up-to-date knowledge sources.

Programming by Backprop: An Instruction is Worth 100 Examples When Finetuning LLMs
Jonathan Cook, Silvia Sapora, Arash Ahmadian, Akbir Khan, Tim Rocktaschel · Jun 23, 2025 · Citations: 0

Demonstrations

Though execution of instructions in training data remains less reliable than when instructions are given in-context, our results demonstrate that procedural knowledge can be noisily 'programmed' into LLMs through PBB, with important…

Parallel Continuous Chain-of-Thought with Jacobi Iteration
Haoyi Wu, Zhihao Teng, Kewei Tu · Jun 23, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

PersonalAI: A Systematic Comparison of Knowledge Graph Storage and Retrieval Approaches for Personalized LLM agents
Mikhail Menschikov, Dmitry Evseev, Victoria Dochkina, Ruslan Kostoev, Ilia Perepechkin · Jun 20, 2025 · Citations: 0

We evaluate our system on three benchmarks: TriviaQA, HotpotQA, DiaASQ and demonstrate that different memory and retrieval configurations yield optimal performance depending on the task.

DistillNote: Toward a Functional Evaluation Framework of LLM-Generated Clinical Note Summaries
Heloisa Oss Boll, Antonio Oss Boll, Leticia Puttlitz Boll, Ameen Abu Hanna, Iacer Calixto · Jun 20, 2025 · Citations: 0

Expert Verification

This study introduces DistillNote, an evaluation framework for LLM summaries that targets their functional utility by applying the generated summary downstream in a complex clinical prediction task, explicitly quantifying how much…

A Scoping Review of Synthetic Data Generation by Language Models in Biomedical Research and Application: Data Utility and Quality Perspectives
Hanshu Rao, Weisi Liu, Haohan Wang, I-Chan Huang, Zhe He · Jun 19, 2025 · Citations: 0

Evaluations were heterogeneous: intrinsic metrics (27.1%), human-in-the-loop assessments (44.1%), and LLM-based evaluations (13.6%).

Revela: Dense Retriever Learning via Language Modeling
Fengyu Cai, Tong Chen, Xinran Zhao, Sihao Chen, Hongming Zhang · Jun 19, 2025 · Citations: 0

We evaluate Revela on domain-specific (CoIR), reasoning-intensive (BRIGHT), and general-domain (BEIR) benchmarks across various retriever backbones.

DeVisE: Behavioral Testing of Medical Large Language Models
Camila Zurdo Tagliabue, Heloisa Oss Boll, Aykut Erdem, Erkut Erdem, Iacer Calixto · Jun 18, 2025 · Citations: 0

Large language models (LLMs) are increasingly applied in clinical decision support, yet current evaluations rarely reveal whether their outputs reflect genuine medical reasoning or superficial correlations.
