HFEPX Archive Slice

HFEPX Fortnight Archive: 2025-F09

Updated from current HFEPX corpus (Mar 8, 2026). 12 papers are grouped in this fortnight page. Common evaluation modes: Automatic Metrics, Simulation Env. Most common rater population: Domain Experts. Common annotation unit: Freeform. Frequently cited benchmark: PaperBench. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from May 4, 2025.

Papers: 12 · Last published: May 4, 2025

Researcher Quick Triage

Use this archive page for time-slice monitoring (what changed in evaluation methods, metrics, and protocol quality this period). Quality band: Medium.

High-Signal Coverage

100.0%

12 / 12 papers are not low-signal flagged.

Benchmark Anchors

0.0%

Papers with benchmark/dataset mentions in extraction output.

Metric Anchors

41.7%

Papers with reported metric mentions in extraction output.

0 papers report explicit quality controls for this archive period.
Prioritize papers with both benchmark and metric anchors for reliable longitudinal comparisons.

Primary action: Use this slice as early signal only; benchmark/metric anchoring is limited for rigorous period-over-period claims.

Why This Time Slice Matters

8.3% of papers report explicit human-feedback signals, led by pairwise preferences.
Automatic metrics appear in 41.7% of papers in this hub.
PaperBench is a recurring benchmark anchor for cross-paper comparisons on this page.
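
The percentages above read as shares of the 12 papers in this slice. A minimal sketch of that arithmetic follows; the function name `coverage_pct` is hypothetical, and the counts (1 paper with human-feedback signals, 5 with automatic metrics) are inferred from the reported percentages rather than taken from the corpus itself.

```python
def coverage_pct(count: int, total: int) -> float:
    """Share of papers with a given attribute, as a percentage rounded to one decimal."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100.0 * count / total, 1)


TOTAL_PAPERS = 12  # papers in this slice, from the page header

# Assumption: counts inferred from the reported percentages, not read from the corpus.
human_feedback_pct = coverage_pct(1, TOTAL_PAPERS)
automatic_metrics_pct = coverage_pct(5, TOTAL_PAPERS)

print(human_feedback_pct, automatic_metrics_pct)
```

With only 12 papers, a single paper moves any share by about 8 percentage points, so small period-over-period shifts should be read with caution.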

Protocol Takeaways For This Period

Quality-control reporting is sparse in this slice; prioritize papers with explicit calibration or adjudication steps.
Rater context is mostly domain experts, and annotation is commonly Freeform; use this to scope replication staffing.
Track metric sensitivity by reporting both accuracy and hit@5.
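
To make the last takeaway concrete, the sketch below computes both metrics on the same ranked outputs. It uses the common definitions (accuracy as top-1 match, hit@5 as the gold item appearing anywhere in the top 5), which may differ from the definitions used in individual papers; the function names and the toy data are hypothetical.

```python
from typing import Sequence


def _check(ranked: Sequence[Sequence[str]], gold: Sequence[str]) -> None:
    """Reject empty input or mismatched lengths."""
    if not gold or len(ranked) != len(gold):
        raise ValueError("ranked and gold must be non-empty and the same length")


def accuracy(ranked: Sequence[Sequence[str]], gold: Sequence[str]) -> float:
    """Fraction of queries whose top-ranked item equals the gold item."""
    _check(ranked, gold)
    return sum(1 for r, g in zip(ranked, gold) if r and r[0] == g) / len(gold)


def hit_at_k(ranked: Sequence[Sequence[str]], gold: Sequence[str], k: int = 5) -> float:
    """Fraction of queries whose gold item appears in the top k ranked items."""
    _check(ranked, gold)
    return sum(1 for r, g in zip(ranked, gold) if g in r[:k]) / len(gold)


# Hypothetical toy data: three queries, each with a ranked candidate list.
ranked = [["a", "b", "c"], ["x", "y", "z"], ["m", "n", "o", "p", "q", "r"]]
gold = ["a", "z", "r"]

print(f"accuracy={accuracy(ranked, gold):.3f}  hit@5={hit_at_k(ranked, gold):.3f}")
```

Reporting the two side by side shows how much of a system's apparent quality depends on the cutoff: in the toy data one extra query is credited under hit@5 that accuracy misses.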

Start Here (Highest-Signal Papers In This Slice)

Ranked by protocol completeness and evidence density for faster period-over-period review.

Toward Safe and Human-Aligned Game Conversational Recommendation via Multi-Agent Decomposition
Apr 26, 2025 · Citations: 0 · Score: 4.5
Eval: Automatic Metrics · Metrics: Hit@5

Reshaping MOFs text mining with a dynamic multi-agents framework of large language model
Apr 26, 2025 · Citations: 0 · Score: 3.5
Eval: Automatic Metrics · Metrics: Accuracy, Precision

Reason Like a Radiologist: Chain-of-Thought and Reinforcement Learning for Verifiable Report Generation
Apr 25, 2025 · Citations: 0 · Score: 3.5
Eval: Automatic Metrics · Metrics: Rouge

How much does context affect the accuracy of AI health advice?
Apr 25, 2025 · Citations: 0 · Score: 3.5
Eval: Automatic Metrics · Metrics: Accuracy

ConformalNL2LTL: Translating Natural Language Instructions into Temporal Logic Formulas with Conformal Correctness Guarantees
Apr 22, 2025 · Citations: 0 · Score: 3.5
Eval: Automatic Metrics · Metrics: Accuracy

Adaptive Social Learning via Mode Policy Optimization for Language Agents
May 4, 2025 · Citations: 0 · Score: 2.0
Eval: Simulation Env · Metrics: Not reported

Protocol Matrix (Top 10)

Quickly compare method ingredients across this archive slice.

Paper	Date	Eval Modes	Benchmarks	Metrics	Quality Controls
Toward Safe and Human-Aligned Game Conversational Recommendation via Multi-Agent Decomposition	Apr 26, 2025	Automatic Metrics	Not reported	Hit@5	Not reported
Reshaping MOFs text mining with a dynamic multi-agents framework of large language model	Apr 26, 2025	Automatic Metrics	Not reported	Accuracy, Precision	Not reported
Reason Like a Radiologist: Chain-of-Thought and Reinforcement Learning for Verifiable Report Generation	Apr 25, 2025	Automatic Metrics	Not reported	Rouge	Not reported
How much does context affect the accuracy of AI health advice?	Apr 25, 2025	Automatic Metrics	Not reported	Accuracy	Not reported
ConformalNL2LTL: Translating Natural Language Instructions into Temporal Logic Formulas with Conformal Correctness Guarantees	Apr 22, 2025	Automatic Metrics	Not reported	Accuracy	Not reported
Adaptive Social Learning via Mode Policy Optimization for Language Agents	May 4, 2025	Simulation Env	Not reported	Not reported	Not reported
Decoding Open-Ended Information Seeking Goals from Eye Movements in Reading	May 4, 2025	Not reported	Not reported	Not reported	Not reported
Large Language Model Compression with Global Rank and Sparsity Optimization	May 2, 2025	Not reported	Not reported	Not reported	Not reported
A False Sense of Privacy: Evaluating Textual Data Sanitization Beyond Surface-level Privacy Leakage	Apr 28, 2025	Not reported	Not reported	Not reported	Not reported
FLUKE: A Linguistically-Driven and Task-Agnostic Framework for Robustness Evaluation	Apr 24, 2025	Not reported	Not reported	Not reported	Not reported
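
The matrix can be filtered for papers that carry both a benchmark and a metric anchor, which this page recommends prioritizing. The sketch below is one way to do that, assuming the rows have been loaded as dictionaries; the field names and the helper `is_reported` are hypothetical, and only three rows from the matrix are included for brevity.

```python
NOT_REPORTED = "Not reported"

# Three rows copied from the matrix above; field names are illustrative.
rows = [
    {
        "paper": "Toward Safe and Human-Aligned Game Conversational Recommendation via Multi-Agent Decomposition",
        "benchmarks": NOT_REPORTED,
        "metrics": "Hit@5",
    },
    {
        "paper": "How much does context affect the accuracy of AI health advice?",
        "benchmarks": NOT_REPORTED,
        "metrics": "Accuracy",
    },
    {
        "paper": "Adaptive Social Learning via Mode Policy Optimization for Language Agents",
        "benchmarks": NOT_REPORTED,
        "metrics": NOT_REPORTED,
    },
]


def is_reported(row: dict, field: str) -> bool:
    """True when the field holds something other than the 'Not reported' placeholder."""
    return row[field] != NOT_REPORTED


# Papers with both anchors, and the fallback set with a metric anchor only.
both_anchors = [r["paper"] for r in rows if is_reported(r, "benchmarks") and is_reported(r, "metrics")]
metric_only = [r["paper"] for r in rows if is_reported(r, "metrics") and not is_reported(r, "benchmarks")]

print(len(both_anchors), len(metric_only))
```

For these rows the first list is empty because every Benchmarks cell reads "Not reported", so the metric-only set is the practical starting point for this slice.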

Researcher Workflow (Detailed)

Checklist

Gap: Papers with explicit human feedback
Coverage is a replication risk (8.3% vs 45% target).

Gap: Papers reporting quality controls
Coverage is a replication risk (0% vs 30% target).

Gap: Papers naming benchmarks/datasets
Coverage is a replication risk (8.3% vs 35% target).

Moderate: Papers naming evaluation metrics
Coverage is usable but incomplete (25% vs 35% target).

Gap: Papers with known rater population
Coverage is a replication risk (16.7% vs 35% target).

Gap: Papers with known annotation unit
Coverage is a replication risk (16.7% vs 35% target).
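
The checklist labels each item by comparing coverage against a target. The page does not state the rule that separates "Gap" from "Moderate", so the sketch below assumes one: coverage at or above 60% of the target counts as "Moderate", anything lower as "Gap". That threshold, the "Met" label, and the function name are assumptions chosen only because they reproduce the labels shown above.

```python
def classify_coverage(coverage_pct: float, target_pct: float, moderate_ratio: float = 0.6) -> str:
    """Label coverage relative to a target.

    Assumption: the 0.6 ratio and the 'Met' label are hypothetical; the page
    only shows the 'Gap' and 'Moderate' labels, not the rule behind them.
    """
    if target_pct <= 0:
        raise ValueError("target_pct must be positive")
    ratio = coverage_pct / target_pct
    if ratio >= 1.0:
        return "Met"
    if ratio >= moderate_ratio:
        return "Moderate"
    return "Gap"


# (item, coverage %, target %) taken from the checklist above.
checklist = [
    ("explicit human feedback", 8.3, 45),
    ("quality controls", 0.0, 30),
    ("benchmarks/datasets", 8.3, 35),
    ("evaluation metrics", 25.0, 35),
    ("rater population", 16.7, 35),
    ("annotation unit", 16.7, 35),
]

for item, coverage, target in checklist:
    print(f"{classify_coverage(coverage, target)}: {item} ({coverage}% vs {target}% target)")
```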

Strengths

This hub still surfaces a concentrated paper set for protocol triage and replication planning.

Known Gaps

No papers (0%) report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (16.7% coverage).
Annotation unit is under-specified (16.7% coverage).

Suggested Next Analyses

Track metric sensitivity by reporting both accuracy and hit@5.

Recommended Queries

Benchmark Slice: PaperBench
Metric Slice: accuracy
Recent High-Signal Papers

Known Limitations

No papers (0%) report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (16.7% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Snapshot (Detailed)

Evaluation Modes

Automatic Metrics (5)
Simulation Env (1)

Top Metrics

Accuracy (2)
Hit@5 (1)
Precision (1)

Top Benchmarks

PaperBench (1)

Quality Controls

None reported (0 papers).

Papers In This Archive Slice

Adaptive Social Learning via Mode Policy Optimization for Language Agents
Minzheng Wang, Yongbin Li, Haobo Wang, Xinghua Zhang, Nan Xu · May 4, 2025 · Citations: 0
To address this, we propose an Adaptive Social Learning (ASL) framework in this paper, aiming to improve the adaptive reasoning ability of language agents in dynamic social interactions.

Decoding Open-Ended Information Seeking Goals from Eye Movements in Reading
Cfir Avraham Hadar, Omer Shubi, Yoav Meiri, Amit Heshes, Yevgeni Berzak · May 4, 2025 · Citations: 0
To address this question, we introduce goal decoding tasks and evaluation frameworks using large-scale eye tracking for reading data in English with hundreds of text-specific information seeking tasks.

Large Language Model Compression with Global Rank and Sparsity Optimization
Changhai Zhou, Qian Qiao, Yuhua Zhou, Yuxin Wu, Shichao Weng · May 2, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

FineScope: SAE-guided Data Selection Enables Domain Specific LLM Pruning and Finetuning
Chaitali Bhattacharyya, Hyunsei Lee, Junyoung Lee, Shinhyoung Jang, Il hong Suh · May 1, 2025 · Citations: 0

A False Sense of Privacy: Evaluating Textual Data Sanitization Beyond Surface-level Privacy Leakage
Rui Xin, Niloofar Mireshghallah, Shuyue Stella Li, Michael Duan, Hyunwoo Kim · Apr 28, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

Reshaping MOFs text mining with a dynamic multi-agents framework of large language model
Zuhong Lin, Daoyuan Ren, Kai Ran, Jing Sun, Songlin Yu · Apr 26, 2025 · Citations: 0
Multi Agent
We present MOFh6, a large language model driven system that reads raw articles or crystal codes and converts them into standardized synthesis tables.

Toward Safe and Human-Aligned Game Conversational Recommendation via Multi-Agent Decomposition
Zheng Hui, Xiaokai Wei, Yexi Jiang, Kevin Gao, Chen Wang · Apr 26, 2025 · Citations: 0
Pairwise Preference · Multi Agent
These domains typically involve fixed content and passive consumption, where user preferences can be matched by genre or theme.

Reason Like a Radiologist: Chain-of-Thought and Reinforcement Learning for Verifiable Report Generation
Peiyuan Jing, Kinhei Lee, Zhenxuan Zhang, Huichi Zhou, Zhengqing Yuan · Apr 25, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

How much does context affect the accuracy of AI health advice?
Prashant Garg, Thiemo Fetzer · Apr 25, 2025 · Citations: 0
English-language performance does not reliably generalise across contexts, underscoring the need for multilingual, domain-specific evaluation before deployment in public-health communication.

FLUKE: A Linguistically-Driven and Task-Agnostic Framework for Robustness Evaluation
Yulia Otmakhova, Hung Thinh Truong, Rahmad Mahendra, Zenan Zhai, Rongxin Zhu · Apr 24, 2025 · Citations: 0
We present FLUKE (Framework for LingUistically-driven and tasK-agnostic robustness Evaluation), a framework for assessing model robustness through systematic minimal variations of test data.

Paper2Code: Automating Code Generation from Scientific Papers in Machine Learning
Minju Seo, Jinheon Baek, Seongyun Lee, Sung Ju Hwang · Apr 24, 2025 · Citations: 0

ConformalNL2LTL: Translating Natural Language Instructions into Temporal Logic Formulas with Conformal Correctness Guarantees
David Smith Sundarsingh, Jun Wang, Jyotirmoy V. Deshmukh, Yiannis Kantaros · Apr 22, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
