HFEPX Archive Slice

HFEPX Daily Archive: 2025-10-17

Updated from current HFEPX corpus (Apr 5, 2026). 12 papers are grouped in this daily page.

Read Full Context

Updated from current HFEPX corpus (Apr 5, 2026). 12 papers are grouped in this daily page. Common evaluation modes: Automatic Metrics. Most common rater population: Domain Experts. Common annotation unit: Multi Dim Rubric. Frequently cited benchmark: Mind2Web. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Oct 17, 2025.

Papers: 12 Last published: Oct 17, 2025 Global RSS

Researcher Quick Triage

Use this archive page for time-slice monitoring (what changed in evaluation methods, metrics, and protocol quality this period). Quality band: Medium .

High-Signal Coverage

100.0%

12 / 12 papers are not low-signal flagged.

Benchmark Anchors

16.7%

Papers with benchmark/dataset mentions in extraction output.

Metric Anchors

33.3%

Papers with reported metric mentions in extraction output.

0 papers report explicit quality controls for this archive period.
Prioritize papers with both benchmark and metric anchors for reliable longitudinal comparisons.

Primary action: Use this slice as early signal only; benchmark/metric anchoring is limited for rigorous period-over-period claims.

Get this digest every Friday →

Why This Time Slice Matters

16.7% of papers report explicit human-feedback signals, led by demonstration data.
automatic metrics appears in 41.7% of papers in this hub.
Mind2Web is a recurring benchmark anchor for cross-paper comparisons in this page.

Protocol Takeaways For This Period

Quality-control reporting is sparse in this slice; prioritize papers with explicit calibration or adjudication steps.
Rater context is mostly domain experts, and annotation is commonly multi-dimensional rubrics; use this to scope replication staffing.
Stratify by benchmark (Mind2Web vs Scholareval) before comparing methods.

Start Here (Highest-Signal Papers In This Slice)

Ranked by protocol completeness and evidence density for faster period-over-period review.

When to Ensemble: Identifying Token-Level Points for Stable and Fast LLM Ensembling
Oct 17, 2025 · Citations: 0 · Score: 5.5

Eval: Automatic Metrics · Metrics: Accuracy
ScholarEval: Research Idea Evaluation Grounded in Literature
Oct 17, 2025 · Citations: 0 · Score: 4.0

Eval: Not reported · Metrics: Not reported
BIOGEN: Evidence-Grounded Multi-Agent Reasoning Framework for Transcriptomic Interpretation in Antimicrobial Resistance
Oct 17, 2025 · Citations: 0 · Score: 4.0

Eval: Automatic Metrics · Metrics: Bertscore, Hallucination rate
HypoSpace: Evaluating LLM Creativity as Set-Valued Hypothesis Generators under Underdetermination
Oct 17, 2025 · Citations: 0 · Score: 4.0

Eval: Automatic Metrics · Metrics: Precision
MNO: Multiscale Neural Operator for 3D Computational Fluid Dynamics
Oct 17, 2025 · Citations: 0 · Score: 4.0

Eval: Automatic Metrics · Metrics: Accuracy
Learning to Answer from Correct Demonstrations
Oct 17, 2025 · Citations: 0 · Score: 3.5

Eval: Automatic Metrics · Metrics: Not reported

Protocol Matrix (Top 10)

Quickly compare method ingredients across this archive slice.

Paper	Eval Modes	Benchmarks	Metrics	Quality Controls
When to Ensemble: Identifying Token-Level Points for Stable and Fast LLM Ensembling Oct 17, 2025	Automatic Metrics	MATH 500, BBH	Accuracy	Not reported
ScholarEval: Research Idea Evaluation Grounded in Literature Oct 17, 2025	Not reported	Scholareval	Not reported	Not reported
BIOGEN: Evidence-Grounded Multi-Agent Reasoning Framework for Transcriptomic Interpretation in Antimicrobial Resistance Oct 17, 2025	Automatic Metrics	Not reported	Bertscore, Hallucination rate	Not reported
HypoSpace: Evaluating LLM Creativity as Set-Valued Hypothesis Generators under Underdetermination Oct 17, 2025	Automatic Metrics	Not reported	Precision	Not reported
MNO: Multiscale Neural Operator for 3D Computational Fluid Dynamics Oct 17, 2025	Automatic Metrics	Not reported	Accuracy	Not reported
Learning to Answer from Correct Demonstrations Oct 17, 2025	Automatic Metrics	Not reported	Not reported	Not reported
In Generative AI We (Dis)Trust? Computational Analysis of Trust and Distrust in Reddit Discussions Oct 17, 2025	Not reported	Not reported	Not reported	Not reported
SentinelNet: Safeguarding Multi-Agent Collaboration Through Credit-Based Dynamic Threat Detection Oct 17, 2025	Not reported	Not reported	Not reported	Not reported
PolySkill: Learning Generalizable Skills Through Polymorphic Abstraction Oct 17, 2025	Not reported	Not reported	Not reported	Not reported
Language Models are Injective and Hence Invertible Oct 17, 2025	Not reported	Not reported	Not reported	Not reported

Researcher Workflow (Detailed)

Checklist

Gap: Papers with explicit human feedback

Coverage is a replication risk (16.7% vs 45% target).
Gap: Papers reporting quality controls

Coverage is a replication risk (0% vs 30% target).
Gap: Papers naming benchmarks/datasets

Coverage is a replication risk (16.7% vs 35% target).
Gap: Papers naming evaluation metrics

Coverage is a replication risk (16.7% vs 35% target).
Moderate: Papers with known rater population

Coverage is usable but incomplete (25% vs 35% target).
Gap: Papers with known annotation unit

Coverage is a replication risk (8.3% vs 35% target).

Strengths

This hub still surfaces a concentrated paper set for protocol triage and replication planning.

Known Gaps

Only 0% of papers report quality controls; prioritize calibration/adjudication evidence.
Annotation unit is under-specified (8.3% coverage).
Benchmark coverage is thin (16.7% of papers mention benchmarks/datasets).

Suggested Next Analyses

Stratify by benchmark (Mind2Web vs Scholareval) before comparing methods.
Track metric sensitivity by reporting both accuracy and bertscore.

Recommended Queries

Benchmark Slice: Mind2Web Metric Slice: accuracy Recent High-Signal Papers

Known Limitations

Only 0% of papers report quality controls; prioritize calibration/adjudication evidence.
Annotation unit is under-specified (8.3% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Snapshot (Detailed)

Evaluation Modes

Automatic Metrics (5)

Top Metrics

Accuracy (1)
Bertscore (1)
Hallucination rate (1)

Top Benchmarks

Mind2Web (1)
Scholareval (1)

Quality Controls

Papers In This Archive Slice

ScholarEval: Research Idea Evaluation Grounded in Literature
Hanane Nour Moussa, Patrick Queiroz Da Silva, Daniel Adu-Ampratwum, Alyson East, Zitong Lu · Oct 17, 2025 · Citations: 0

Rubric Rating

As AI tools become increasingly common for research ideation, robust evaluation is critical to ensure the validity and usefulness of generated ideas.
SentinelNet: Safeguarding Multi-Agent Collaboration Through Credit-Based Dynamic Threat Detection
Yang Feng, Xudong Pan · Oct 17, 2025 · Citations: 0
In Generative AI We (Dis)Trust? Computational Analysis of Trust and Distrust in Reddit Discussions
Aria Pessianzadeh, Naima Sultana, Hildegarde Van den Bulck, David Gefen, Shahin Jabbari · Oct 17, 2025 · Citations: 0

The rise of generative AI (GenAI) has impacted many aspects of human life.
PolySkill: Learning Generalizable Skills Through Polymorphic Abstraction
Simon Yu, Gang Li, Weiyan Shi, Peng Qi · Oct 17, 2025 · Citations: 0
BIOGEN: Evidence-Grounded Multi-Agent Reasoning Framework for Transcriptomic Interpretation in Antimicrobial Resistance
Elias Hossain, Mehrdad Shoeibi, Ivan Garibay, Niloofar Yousefi · Oct 17, 2025 · Citations: 0

Multi Agent

We present BIOGEN, an evidence-grounded multi-agent framework for post hoc interpretation of RNA-seq transcriptional modules.
HypoSpace: Evaluating LLM Creativity as Set-Valued Hypothesis Generators under Underdetermination
Tingting Chen, Beibei Lin, Zifeng Yuan, Qiran Zou, Hongyu He · Oct 17, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Language Models are Injective and Hence Invertible
Giorgos Nikolaou, Tommaso Mencattini, Donato Crisostomi, Andrea Santilli, Yannis Panagakis · Oct 17, 2025 · Citations: 0
OffSim: Offline Simulator for Model-based Offline Inverse Reinforcement Learning
Woo-Jin Ahn, Sang-Ryul Baek, Yong-Jun Lee, Hyun-Duck Choi, Myo-Taeg Lim · Oct 17, 2025 · Citations: 0
Learning to Answer from Correct Demonstrations
Nirmit Joshi, Gene Li, Siddharth Bhandari, Shiva Prasad Kasiviswanathan, Cong Ma · Oct 17, 2025 · Citations: 0

Demonstrations

We study the problem of learning to generate an answer (or completion) to a question (or prompt), where there could be multiple correct answers, any one of which is acceptable at test time.
MNO: Multiscale Neural Operator for 3D Computational Fluid Dynamics
Qinxuan Wang, Chuang Wang, Mingyu Zhang, Jingwei Sun, Peipei Yang · Oct 17, 2025 · Citations: 0

We evaluate MNO on diverse benchmarks, covering steady-state and unsteady flow scenarios with up to 300k points.
When to Ensemble: Identifying Token-Level Points for Stable and Fast LLM Ensembling
Heecheol Yun, Kwangmin Ki, Junghyun Lee, Eunho Yang · Oct 17, 2025 · Citations: 0

Our experiments on diverse benchmarks, including MATH500 and BBH, demonstrate that SAFE outperforms existing methods in both accuracy and efficiency, with gains achieved even when ensembling fewer than 1% of tokens.
SAG-Agent: Enabling Long-Horizon Reasoning in Strategy Games via Dynamic Knowledge Graphs
Chenwei Tang, Lin Long, Xinyu Liu, Jingyu Xing, Zizhou Wang · Oct 17, 2025 · Citations: 0

Need human evaluators for your AI research? Scale annotation with expert AI Trainers.

Post a Job Get a Quote