HFEPX Archive Slice

HFEPX Weekly Archive: 2025-W25

Updated from current HFEPX corpus (Mar 8, 2026). 13 papers are grouped in this daily page.

Read Full Context

Updated from current HFEPX corpus (Mar 8, 2026). 13 papers are grouped in this daily page. Common evaluation modes: Automatic Metrics, Llm As Judge. Most common rater population: Domain Experts. Common annotation unit: Ranking. Frequently cited benchmark: DROP. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Jun 22, 2025.

Papers: 13 Last published: Jun 22, 2025 Global RSS

Researcher Quick Triage

Use this archive page for time-slice monitoring (what changed in evaluation methods, metrics, and protocol quality this period). Quality band: Medium .

High-Signal Coverage

100.0%

13 / 13 papers are not low-signal flagged.

Benchmark Anchors

30.8%

Papers with benchmark/dataset mentions in extraction output.

Metric Anchors

53.8%

Papers with reported metric mentions in extraction output.

0 papers report explicit quality controls for this archive period.
Prioritize papers with both benchmark and metric anchors for reliable longitudinal comparisons.

Primary action: Use this slice as early signal only; benchmark/metric anchoring is limited for rigorous period-over-period claims.

Why This Time Slice Matters

7.7% of papers report explicit human-feedback signals, led by expert verification.
automatic metrics appears in 53.8% of papers in this hub.
DROP is a recurring benchmark anchor for cross-paper comparisons in this page.

Protocol Takeaways For This Period

Quality-control reporting is sparse in this slice; prioritize papers with explicit calibration or adjudication steps.
Rater context is mostly domain experts, and annotation is commonly ranking annotation; use this to scope replication staffing.
Pair this hub with a human_eval-heavy hub to validate judge-model calibration.

Start Here (Highest-Signal Papers In This Slice)

Ranked by protocol completeness and evidence density for faster period-over-period review.

PersonalAI: A Systematic Comparison of Knowledge Graph Storage and Retrieval Approaches for Personalized LLM agents
Jun 20, 2025 · Citations: 0 · Score: 5.0

Eval: Automatic Metrics · Metrics: Accuracy
SPARE: Single-Pass Annotation with Reference-Guided Evaluation for Automatic Process Supervision and Reward Modelling
Jun 18, 2025 · Citations: 0 · Score: 5.0

Eval: Automatic Metrics · Metrics: Accuracy, Precision
AgentSynth: Scalable Task Generation for Generalist Computer-Use Agents
Jun 17, 2025 · Citations: 0 · Score: 5.0

Eval: Automatic Metrics · Metrics: Cost
DistillNote: Toward a Functional Evaluation Framework of LLM-Generated Clinical Note Summaries
Jun 20, 2025 · Citations: 0 · Score: 4.5

Eval: Llm As Judge, Automatic Metrics · Metrics: Auroc
Long-Context Generalization with Sparse Attention
Jun 19, 2025 · Citations: 0 · Score: 3.5

Eval: Automatic Metrics · Metrics: Perplexity
A Scoping Review of Synthetic Data Generation by Language Models in Biomedical Research and Application: Data Utility and Quality Perspectives
Jun 19, 2025 · Citations: 0 · Score: 3.5

Eval: Automatic Metrics · Metrics: Relevance

Protocol Matrix (Top 10)

Quickly compare method ingredients across this archive slice.

Paper	Eval Modes	Benchmarks	Metrics	Quality Controls
PersonalAI: A Systematic Comparison of Knowledge Graph Storage and Retrieval Approaches for Personalized LLM agents Jun 20, 2025	Automatic Metrics	HotpotQA, TriviaQA	Accuracy	Not reported
SPARE: Single-Pass Annotation with Reference-Guided Evaluation for Automatic Process Supervision and Reward Modelling Jun 18, 2025	Automatic Metrics	GSM8K, Processbench	Accuracy, Precision	Not reported
AgentSynth: Scalable Task Generation for Generalist Computer-Use Agents Jun 17, 2025	Automatic Metrics	DROP	Cost	Not reported
DistillNote: Toward a Functional Evaluation Framework of LLM-Generated Clinical Note Summaries Jun 20, 2025	Llm As Judge, Automatic Metrics	Not reported	Auroc	Not reported
Long-Context Generalization with Sparse Attention Jun 19, 2025	Automatic Metrics	Not reported	Perplexity	Not reported
A Scoping Review of Synthetic Data Generation by Language Models in Biomedical Research and Application: Data Utility and Quality Perspectives Jun 19, 2025	Automatic Metrics	Not reported	Relevance	Not reported
DeVisE: Behavioral Testing of Medical Large Language Models Jun 18, 2025	Automatic Metrics	Not reported	Perplexity	Not reported
Revela: Dense Retriever Learning via Language Modeling Jun 19, 2025	Not reported	BEIR	Not reported	Not reported
LLM Probability Concentration: How Alignment Shrinks the Generative Horizon Jun 22, 2025	Not reported	Not reported	Not reported	Not reported
When Does Divide and Conquer Work for Long Context LLM? A Noise Decomposition Framework Jun 19, 2025	Not reported	Not reported	Not reported	Not reported

Researcher Workflow (Detailed)

Checklist

Gap: Papers with explicit human feedback

Coverage is a replication risk (7.7% vs 45% target).
Gap: Papers reporting quality controls

Coverage is a replication risk (0% vs 30% target).
Gap: Papers naming benchmarks/datasets

Coverage is a replication risk (15.4% vs 35% target).
Moderate: Papers naming evaluation metrics

Coverage is usable but incomplete (23.1% vs 35% target).
Gap: Papers with known rater population

Coverage is a replication risk (7.7% vs 35% target).
Gap: Papers with known annotation unit

Coverage is a replication risk (15.4% vs 35% target).

Strengths

This hub still surfaces a concentrated paper set for protocol triage and replication planning.

Known Gaps

Only 0% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (7.7% coverage).
Annotation unit is under-specified (15.4% coverage).

Suggested Next Analyses

Pair this hub with a human_eval-heavy hub to validate judge-model calibration.
Stratify by benchmark (DROP vs GSM8K) before comparing methods.
Track metric sensitivity by reporting both accuracy and auroc.

Recommended Queries

LLM-as-Judge Protocols Benchmark Slice: DROP Metric Slice: accuracy Recent High-Signal Papers

Known Limitations

Only 0% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (7.7% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Snapshot (Detailed)

Evaluation Modes

Automatic Metrics (7)
Llm As Judge (1)

Top Metrics

Accuracy (1)
Auroc (1)
Cost (1)
Precision (1)

Top Benchmarks

DROP (1)
GSM8K (1)
Processbench (1)

Quality Controls

Papers In This Archive Slice

LLM Probability Concentration: How Alignment Shrinks the Generative Horizon
Chenghao Yang, Sida Li, Ari Holtzman · Jun 22, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
PersonalAI: A Systematic Comparison of Knowledge Graph Storage and Retrieval Approaches for Personalized LLM agents
Mikhail Menschikov, Dmitry Evseev, Victoria Dochkina, Ruslan Kostoev, Ilia Perepechkin · Jun 20, 2025 · Citations: 0

We evaluate our system on three benchmarks: TriviaQA, HotpotQA, DiaASQ and demonstrate that different memory and retrieval configurations yield optimal performance depending on the task.
DistillNote: Toward a Functional Evaluation Framework of LLM-Generated Clinical Note Summaries
Heloisa Oss Boll, Antonio Oss Boll, Leticia Puttlitz Boll, Ameen Abu Hanna, Iacer Calixto · Jun 20, 2025 · Citations: 0

Expert Verification

This study introduces DistillNote, an evaluation framework for LLM summaries that targets their functional utility by applying the generated summary downstream in a complex clinical prediction task, explicitly quantifying how much…
Long-Context Generalization with Sparse Attention
Pavlo Vasylenko, Hugo Pitorro, André F. T. Martins, Marcos Treviso · Jun 19, 2025 · Citations: 0

Our empirical evaluation on synthetic tasks and language modeling demonstrates that ASEntmax substantially outperforms softmax, scalable softmax, and fixed-temperature α-entmax baselines, achieving up to 1000\times length extrapolation on…
A Scoping Review of Synthetic Data Generation by Language Models in Biomedical Research and Application: Data Utility and Quality Perspectives
Hanshu Rao, Weisi Liu, Haohan Wang, I-Chan Huang, Zhe He · Jun 19, 2025 · Citations: 0

Evaluations were heterogeneous: intrinsic metrics (27.1\%), human-in-the-loop assessments (44.1\%), and LLM-based evaluations (13.6\%).
Revela: Dense Retriever Learning via Language Modeling
Fengyu Cai, Tong Chen, Xinran Zhao, Sihao Chen, Hongming Zhang · Jun 19, 2025 · Citations: 0

We evaluate Revela on domain-specific (CoIR), reasoning-intensive (BRIGHT), and general-domain (BEIR) benchmarks across various retriever backbones.
When Does Divide and Conquer Work for Long Context LLM? A Noise Decomposition Framework
Zhen Xu, Shang Zhu, Jue Wang, Junlin Wang, Ben Athiwaratkun · Jun 19, 2025 · Citations: 0
OJBench: A Competition Level Code Benchmark For Large Language Models
Zhexu Wang, Yiping Liu, Yejie Wang, Wenyang He, Bofei Gao · Jun 19, 2025 · Citations: 0
GenRecal: Generation after Recalibration from Large to Small Vision-Language Models
Byung-Kwan Lee, Ryo Hachiuma, Yong Man Ro, Yu-Chiang Frank Wang, Yueh-Hua Wu · Jun 18, 2025 · Citations: 0
SPARE: Single-Pass Annotation with Reference-Guided Evaluation for Automatic Process Supervision and Reward Modelling
Md Imbesat Hassan Rizvi, Xiaodan Zhu, Iryna Gurevych · Jun 18, 2025 · Citations: 0

Long Horizon

To address this, we introduce Single-Pass Annotation with Reference-Guided Evaluation (SPARE), a novel structured framework that enables efficient per-step annotation by jointly aligning solution steps to reference solutions and determine…
DeVisE: Behavioral Testing of Medical Large Language Models
Camila Zurdo Tagliabue, Heloisa Oss Boll, Aykut Erdem, Erkut Erdem, Iacer Calixto · Jun 18, 2025 · Citations: 0

Large language models (LLMs) are increasingly applied in clinical decision support, yet current evaluations rarely reveal whether their outputs reflect genuine medical reasoning or superficial correlations.
AgentSynth: Scalable Task Generation for Generalist Computer-Use Agents
Jingxu Xie, Dylan Xu, Xuandong Zhao, Dawn Song · Jun 17, 2025 · Citations: 0

Long Horizon

We introduce AgentSynth, a scalable and cost-efficient pipeline for automatically synthesizing high-quality tasks and trajectory datasets for generalist computer-use agents.
Language Agents for Hypothesis-driven Clinical Decision Making with Reinforcement Learning
David Bani-Harouni, Chantal Pellegrini, Ege Özsoy, Nassir Navab, Matthias Keicher · Jun 16, 2025 · Citations: 0

Need human evaluators for your AI research? Scale annotation with expert AI Trainers.

Post a Job Get a Quote