

HFEPX Weekly Archive: 2025-W17

Updated from the current HFEPX corpus (Mar 10, 2026). 7 papers are grouped on this weekly page. Common evaluation modes: Automatic Metrics, Human Eval. Most common rater population: Domain Experts. Common annotation unit: Ranking. Frequently cited benchmark: Paperbench. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. The newest paper in this set is from Apr 26, 2025.

Papers: 7 · Last published: Apr 26, 2025

Researcher Quick Triage

Use this archive page for time-slice monitoring (what changed in evaluation methods, metrics, and protocol quality this period). Quality band: Developing.

High-Signal Coverage

100.0%

7 of 7 papers are not flagged as low-signal.

Benchmark Anchors

14.3%

Papers with benchmark/dataset mentions in extraction output.

Metric Anchors

71.4%

Papers with reported metric mentions in extraction output.

  • 0 papers report explicit quality controls for this archive period.
  • Prioritize papers with both benchmark and metric anchors for reliable longitudinal comparisons.

Primary action: Treat this slice as an early signal only; benchmark/metric anchoring is too limited for rigorous period-over-period claims.
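
The three coverage tiles above are simple share-of-papers ratios over the extraction output, and the both-anchors rule from the triage notes can be checked the same way. A minimal sketch, assuming a per-paper record layout like the one below (field names and abbreviated titles are illustrative, not the actual HFEPX schema):

```python
# Illustrative per-paper extraction records for this slice (titles abbreviated;
# empty lists stand in for "Not reported").
papers = [
    {"title": "Game Conversational Recommendation", "low_signal": False, "benchmarks": [],             "metrics": ["Hit@5"]},
    {"title": "MOFs text mining",                   "low_signal": False, "benchmarks": [],             "metrics": ["Accuracy", "Precision"]},
    {"title": "Reason Like a Radiologist",          "low_signal": False, "benchmarks": [],             "metrics": ["Rouge"]},
    {"title": "AI health advice context",           "low_signal": False, "benchmarks": [],             "metrics": ["Accuracy"]},
    {"title": "Paper2Code",                         "low_signal": False, "benchmarks": ["Paperbench"], "metrics": []},
    {"title": "ConformalNL2LTL",                    "low_signal": False, "benchmarks": [],             "metrics": ["Accuracy"]},
    {"title": "FLUKE",                              "low_signal": False, "benchmarks": [],             "metrics": []},
]

def coverage(predicate) -> float:
    """Share of papers (in percent) satisfying the predicate."""
    return 100.0 * sum(predicate(p) for p in papers) / len(papers)

high_signal   = coverage(lambda p: not p["low_signal"])
benchmark_cov = coverage(lambda p: bool(p["benchmarks"]))
metric_cov    = coverage(lambda p: bool(p["metrics"]))

print(f"High-signal coverage: {high_signal:.1f}%")   # 100.0%
print(f"Benchmark anchors:    {benchmark_cov:.1f}%")  # 14.3%
print(f"Metric anchors:       {metric_cov:.1f}%")     # 71.4%

# Papers with both a benchmark and a metric anchor -- the set to prioritize for
# longitudinal comparisons. For this slice it is empty, hence the caution above.
both = [p["title"] for p in papers if p["benchmarks"] and p["metrics"]]
print(both)  # []
```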


Why This Time Slice Matters

  • 14.3% of papers report explicit human-feedback signals, led by pairwise preferences.
  • The Automatic Metrics evaluation mode appears in 71.4% of papers in this hub.
  • Paperbench is the only named benchmark anchor for cross-paper comparisons on this page.

Protocol Takeaways For This Period

  • Quality-control reporting is sparse in this slice; prioritize papers with explicit calibration or adjudication steps.
  • Rater context is mostly domain experts, and annotation is commonly ranking annotation; use this to scope replication staffing.
  • Pair this hub with llm_as_judge pages to benchmark automated-vs-human evaluation tradeoffs (see the agreement sketch below).
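
For the llm_as_judge pairing suggested above, a simple starting point for the automated-vs-human tradeoff is raw agreement plus a chance-corrected statistic such as Cohen's kappa between judge verdicts and human verdicts on the same items. A minimal sketch with made-up labels (the label set and values are illustrative only):

```python
from collections import Counter

# Hypothetical paired verdicts on the same eight items.
judge = ["win", "win", "lose", "tie", "win", "lose", "win", "tie"]
human = ["win", "lose", "lose", "tie", "win", "win", "win", "tie"]

def raw_agreement(a, b):
    """Fraction of items where both raters gave the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Chance-corrected agreement for nominal labels."""
    n = len(a)
    p_o = raw_agreement(a, b)
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[k] * cb[k] for k in set(a) | set(b)) / (n * n)
    return (p_o - p_e) / (1 - p_e)

print(f"raw agreement: {raw_agreement(judge, human):.2f}")  # 0.75
print(f"Cohen's kappa: {cohens_kappa(judge, human):.2f}")   # 0.60
```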

Start Here (Highest-Signal Papers In This Slice)

Ranked by protocol completeness and evidence density for faster period-over-period review.

Protocol Matrix (Top 10)

Quickly compare method ingredients across this archive slice.

  • Toward Safe and Human-Aligned Game Conversational Recommendation via Multi-Agent Decomposition (Apr 26, 2025)
    Eval modes: Automatic Metrics · Benchmarks: Not reported · Metrics: Hit@5 · Quality controls: Not reported
  • Reshaping MOFs text mining with a dynamic multi-agents framework of large language model (Apr 26, 2025)
    Eval modes: Automatic Metrics · Benchmarks: Not reported · Metrics: Accuracy, Precision · Quality controls: Not reported
  • Reason Like a Radiologist: Chain-of-Thought and Reinforcement Learning for Verifiable Report Generation (Apr 25, 2025)
    Eval modes: Automatic Metrics · Benchmarks: Not reported · Metrics: Rouge · Quality controls: Not reported
  • How much does context affect the accuracy of AI health advice? (Apr 25, 2025)
    Eval modes: Automatic Metrics · Benchmarks: Not reported · Metrics: Accuracy · Quality controls: Not reported
  • Paper2Code: Automating Code Generation from Scientific Papers in Machine Learning (Apr 24, 2025)
    Eval modes: Human Eval · Benchmarks: Paperbench · Metrics: Not reported · Quality controls: Not reported
  • ConformalNL2LTL: Translating Natural Language Instructions into Temporal Logic Formulas with Conformal Correctness Guarantees (Apr 22, 2025)
    Eval modes: Automatic Metrics · Benchmarks: Not reported · Metrics: Accuracy · Quality controls: Not reported
  • FLUKE: A Linguistically-Driven and Task-Agnostic Framework for Robustness Evaluation (Apr 24, 2025)
    Eval modes: Not reported · Benchmarks: Not reported · Metrics: Not reported · Quality controls: Not reported
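
One way to produce the "protocol completeness" ordering mentioned above is to score each matrix entry by how many of its protocol cells are actually reported. A hedged sketch (the scoring rule is an assumption about how such a ranking could work, not this page's documented algorithm; titles abbreviated):

```python
# Rows of the protocol matrix above, with None standing in for "Not reported".
rows = [
    {"paper": "Game Conversational Recommendation", "eval_modes": "Automatic Metrics", "benchmarks": None,         "metrics": "Hit@5",               "quality_controls": None},
    {"paper": "MOFs text mining",                   "eval_modes": "Automatic Metrics", "benchmarks": None,         "metrics": "Accuracy, Precision", "quality_controls": None},
    {"paper": "Reason Like a Radiologist",          "eval_modes": "Automatic Metrics", "benchmarks": None,         "metrics": "Rouge",               "quality_controls": None},
    {"paper": "AI health advice context",           "eval_modes": "Automatic Metrics", "benchmarks": None,         "metrics": "Accuracy",            "quality_controls": None},
    {"paper": "Paper2Code",                         "eval_modes": "Human Eval",        "benchmarks": "Paperbench", "metrics": None,                  "quality_controls": None},
    {"paper": "ConformalNL2LTL",                    "eval_modes": "Automatic Metrics", "benchmarks": None,         "metrics": "Accuracy",            "quality_controls": None},
    {"paper": "FLUKE",                              "eval_modes": None,                "benchmarks": None,         "metrics": None,                  "quality_controls": None},
]

FIELDS = ("eval_modes", "benchmarks", "metrics", "quality_controls")

def completeness(row) -> int:
    """Number of reported protocol cells (0-4)."""
    return sum(row[f] is not None for f in FIELDS)

# Most-complete first; FLUKE (0/4) drops to the bottom.
for row in sorted(rows, key=completeness, reverse=True):
    print(f"{completeness(row)}/4  {row['paper']}")
```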
Researcher Workflow (Detailed)

Checklist

  • Gap: Papers with explicit human feedback

    Coverage is a replication risk (14.3% vs 45% target).

  • Gap: Papers reporting quality controls

    Coverage is a replication risk (0% vs 30% target).

  • Gap: Papers naming benchmarks/datasets

    Coverage is a replication risk (14.3% vs 35% target).

  • Moderate: Papers naming evaluation metrics

    Coverage is usable but incomplete (28.6% vs 35% target).

  • Moderate: Papers with known rater population

    Coverage is usable but incomplete (28.6% vs 35% target).

  • Gap: Papers with known annotation unit

    Coverage is a replication risk (14.3% vs 35% target).
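
The Gap / Moderate labels in the checklist above behave like a simple threshold rule on coverage versus target. A minimal sketch, assuming a band boundary at half the target (the boundary is an assumption; this page does not document the actual rule):

```python
def band(coverage_pct: float, target_pct: float) -> str:
    """Classify checklist coverage against its target (illustrative thresholds)."""
    if coverage_pct >= target_pct:
        return "OK"
    if coverage_pct >= 0.5 * target_pct:
        return "Moderate"
    return "Gap"

checks = [
    ("explicit human feedback", 14.3, 45),
    ("quality controls",         0.0, 30),
    ("benchmarks/datasets",     14.3, 35),
    ("evaluation metrics",      28.6, 35),
    ("rater population",        28.6, 35),
    ("annotation unit",         14.3, 35),
]
for name, cov, tgt in checks:
    print(f"{band(cov, tgt):8s} {name}: {cov}% vs {tgt}% target")
```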

Strengths

  • Agentic evaluation appears in 42.9% of papers.

Known Gaps

  • No papers in this slice report quality controls; prioritize calibration/adjudication evidence.
  • Annotation unit is under-specified (14.3% coverage).
  • Benchmark coverage is thin (14.3% of papers mention benchmarks/datasets).

Suggested Next Analyses

  • Pair this hub with llm_as_judge pages to benchmark automated-vs-human evaluation tradeoffs.
  • Track metric sensitivity by reporting both accuracy and hit@5 (see the sketch below).
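
For the metric-sensitivity item above: accuracy scores only the top-1 prediction, while Hit@5 credits any gold label appearing in the top five, so the two can diverge sharply on ranking tasks. A minimal sketch with toy data (items and labels are illustrative):

```python
def accuracy(ranked_preds, gold):
    """Share of items whose top-1 prediction matches the gold label."""
    return sum(p[0] == g for p, g in zip(ranked_preds, gold)) / len(gold)

def hit_at_k(ranked_preds, gold, k=5):
    """Share of items whose gold label appears in the top-k ranked predictions."""
    return sum(g in p[:k] for p, g in zip(ranked_preds, gold)) / len(gold)

# Toy ranked predictions (best first) and gold labels for four items.
preds = [
    ["a", "b", "c", "d", "e"],
    ["b", "a", "c", "d", "e"],
    ["c", "d", "a", "b", "e"],
    ["e", "d", "c", "b", "a"],
]
gold = ["a", "a", "a", "a"]

print(f"accuracy: {accuracy(preds, gold):.2f}")    # 0.25 -- only one item ranks 'a' first
print(f"hit@5:    {hit_at_k(preds, gold, 5):.2f}") # 1.00 -- 'a' is always in the top 5
```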

Known Limitations
  • Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.
Research Utility Snapshot (Detailed)

Evaluation Modes

  • Automatic Metrics (5)
  • Human Eval (1)

Top Metrics

  • Accuracy (3)
  • Hit@5 (1)
  • Precision (1)
  • Rouge (1)

Top Benchmarks

  • Paperbench (1)

Quality Controls

  • None reported for this archive period.
