HFEPX Weekly Archive: 2025-W46

Updated from the current HFEPX corpus (Feb 27, 2026). This weekly page groups 10 papers. Common evaluation modes: Automatic Metrics, Simulation Env. Most common rater population: Domain Experts. Most frequently cited benchmark: Retrieval. Most common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. The newest paper in this set is from Nov 15, 2025.

Papers: 10 · Last published: Nov 15, 2025

Research Narrative

Grounded narrative · Model: deterministic-grounded · Source: persisted

This page tracks 10 papers for HFEPX Weekly Archive: 2025-W46. The dominant protocol signals are automatic metrics and simulation environments; the most frequent benchmarks are Retrieval and Rpts-Eval, and the most frequent metrics are accuracy and coherence. Use the grounded sections below to prioritize reproducible protocol choices, benchmark-matched comparisons, and judge-vs-human evaluation checks.

Why This Matters For Eval Research

10% of papers report explicit human-feedback signals, led by critique/edit feedback.

Evidence: Multimodal Peer Review Simulation with Actionable To-Do Recommendations for Community-Aware Manuscript Revisions; EARL: Entropy-Aware RL Alignment of LLMs for Reliable RTL Code Generation; CLARITY: Contextual Linguistic Adaptation and Accent Retrieval for Dual-Bias Mitigation in Text-to-Speech Generation; Mastering Olympiad-Level Physics with Artificial Intelligence

Automatic metrics appear in 90% of papers in this hub.

Evidence: EARL: Entropy-Aware RL Alignment of LLMs for Reliable RTL Code Generation; CLARITY: Contextual Linguistic Adaptation and Accent Retrieval for Dual-Bias Mitigation in Text-to-Speech Generation; Mastering Olympiad-Level Physics with Artificial Intelligence; Chain of Summaries: Summarization Through Iterative Questioning

Retrieval is a recurring benchmark anchor for cross-paper comparisons on this page.

Evidence: CLARITY: Contextual Linguistic Adaptation and Accent Retrieval for Dual-Bias Mitigation in Text-to-Speech Generation; Multimodal Peer Review Simulation with Actionable To-Do Recommendations for Community-Aware Manuscript Revisions; Beyond Fact Retrieval: Episodic Memory for RAG with Generative Semantic Workspaces; EARL: Entropy-Aware RL Alignment of LLMs for Reliable RTL Code Generation

Protocol Takeaways

Quality-control reporting is sparse in this slice; prioritize papers with explicit calibration or adjudication steps.

Evidence: EARL: Entropy-Aware RL Alignment of LLMs for Reliable RTL Code Generation; CLARITY: Contextual Linguistic Adaptation and Accent Retrieval for Dual-Bias Mitigation in Text-to-Speech Generation; Multimodal Peer Review Simulation with Actionable To-Do Recommendations for Community-Aware Manuscript Revisions; Mastering Olympiad-Level Physics with Artificial Intelligence

Raters are mostly domain experts, and annotation units are commonly mixed; use this to scope replication staffing.

Evidence: Multimodal Peer Review Simulation with Actionable To-Do Recommendations for Community-Aware Manuscript Revisions; EARL: Entropy-Aware RL Alignment of LLMs for Reliable RTL Code Generation; CLARITY: Contextual Linguistic Adaptation and Accent Retrieval for Dual-Bias Mitigation in Text-to-Speech Generation; Mastering Olympiad-Level Physics with Artificial Intelligence

Stratify by benchmark (Retrieval vs Rpts-Eval) before comparing methods.

Evidence: EARL: Entropy-Aware RL Alignment of LLMs for Reliable RTL Code Generation; CLARITY: Contextual Linguistic Adaptation and Accent Retrieval for Dual-Bias Mitigation in Text-to-Speech Generation; Multimodal Peer Review Simulation with Actionable To-Do Recommendations for Community-Aware Manuscript Revisions; Mastering Olympiad-Level Physics with Artificial Intelligence
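The stratification takeaway can be sketched in a few lines of Python. This is a minimal illustration, not HFEPX tooling: the record layout, the shortened titles, and the function name `stratify_by_benchmark` are assumptions, and the benchmark tags are the ones listed in this page's Benchmark Brief sections.

```python
from collections import defaultdict

# (short title, benchmark tags) pairs. Tags follow this page's Benchmark Brief
# sections; the tuple layout and shortened titles are assumptions for illustration.
PAPERS = [
    ("CLARITY", ["Retrieval"]),
    ("Multimodal Peer Review Simulation", ["Retrieval"]),
    ("Beyond Fact Retrieval", ["Retrieval"]),
    ("RPTS", ["Rpts-Eval"]),
    ("Chain of Summaries", ["SQuAD"]),
]


def stratify_by_benchmark(papers):
    """Group paper titles into one cohort per benchmark tag."""
    cohorts = defaultdict(list)
    for title, benchmarks in papers:
        for benchmark in benchmarks:
            cohorts[benchmark].append(title)
    return dict(cohorts)


cohorts = stratify_by_benchmark(PAPERS)
for benchmark, titles in cohorts.items():
    # Compare methods only within a cohort, never across cohorts.
    print(f"{benchmark}: {len(titles)} paper(s)")
```

A cohort with a single paper, such as Rpts-Eval here, offers no within-benchmark comparison, so it is better read as a standalone data point.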

Benchmark Interpretation

Retrieval appears in 30% of hub papers (3/10); use this cohort for benchmark-matched comparisons.
Rpts-Eval appears in 10% of hub papers (1/10); use this cohort for benchmark-matched comparisons.

Metric Interpretation

accuracy is reported in 30% of hub papers (3/10); compare with a secondary metric before ranking methods.
coherence is reported in 10% of hub papers (1/10); compare with a secondary metric before ranking methods.

Researcher Checklist

Close gap on Papers with explicit human feedback. Coverage is a replication risk (10% vs 45% target).
Close gap on Papers reporting quality controls. Coverage is a replication risk (0% vs 30% target).
Maintain strength on Papers naming benchmarks/datasets. Coverage is strong (60% vs 35% target).
Maintain strength on Papers naming evaluation metrics. Coverage is strong (60% vs 35% target).
Close gap on Papers with known rater population. Coverage is a replication risk (10% vs 35% target).
Close gap on Papers with known annotation unit. Coverage is a replication risk (0% vs 35% target).
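The checklist labels appear to follow a simple threshold comparison. The sketch below assumes that rule (coverage at or above target is "strong", anything below is a "replication risk"); the page does not state the rule explicitly, and the names `coverage_status` and `gap_report` are hypothetical.

```python
# (label, coverage %, target %) as listed in the checklist above.
CHECKLIST = [
    ("explicit human feedback", 10, 45),
    ("quality controls", 0, 30),
    ("benchmarks/datasets named", 60, 35),
    ("evaluation metrics named", 60, 35),
    ("known rater population", 10, 35),
    ("known annotation unit", 0, 35),
]


def coverage_status(coverage, target):
    """Assumed rule: at or above target is 'strong', below it is a 'replication risk'."""
    return "strong" if coverage >= target else "replication risk"


def gap_report(checklist):
    """Return (label, status, shortfall) rows, largest shortfall first."""
    rows = [
        (label, coverage_status(coverage, target), target - coverage)
        for label, coverage, target in checklist
    ]
    return sorted(rows, key=lambda row: row[2], reverse=True)


report = gap_report(CHECKLIST)
for label, status, shortfall in report:
    # A positive shortfall means coverage is below target by that many points.
    print(f"{label}: {status} (shortfall {shortfall:+d} points)")
```

Sorting by shortfall puts the largest gaps first, which is one way to decide which reporting gap to close first.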

Known Limitations

No papers (0%) report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (10% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Links

Benchmark Slice: Retrieval - Prioritizes benchmark-specific protocol comparisons.
Metric Slice: accuracy - Finds papers where reported metrics are directly comparable.
Recent High-Signal Papers - Keeps the hub connected to the latest HFEPX corpus updates.

automatic_metrics vs simulation_env

both=0, left_only=9 (automatic_metrics), right_only=1 (simulation_env)

No papers use both Automatic Metrics and Simulation Env.
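The three counts above (both, left-only, right-only) are plain set arithmetic. The sketch below reproduces them with placeholder paper ids, since the page does not say which paper uses a simulation environment; `overlap_counts` and the `paper_N` ids are hypothetical.

```python
def overlap_counts(left, right):
    """Count items in both sets, only in the left set, and only in the right set."""
    left, right = set(left), set(right)
    return {
        "both": len(left & right),
        "left_only": len(left - right),
        "right_only": len(right - left),
    }


# Placeholder ids: 9 papers tagged automatic_metrics, 1 tagged simulation_env.
automatic_metrics = {f"paper_{i}" for i in range(1, 10)}
simulation_env = {"paper_10"}

counts = overlap_counts(automatic_metrics, simulation_env)
print(counts)  # {'both': 0, 'left_only': 9, 'right_only': 1}
```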

Benchmark Brief

Retrieval

Coverage: 3 papers (30%)

3 papers (30%) mention Retrieval.

Examples: CLARITY: Contextual Linguistic Adaptation and Accent Retrieval for Dual-Bias Mitigation in Text-to-Speech Generation; Multimodal Peer Review Simulation with Actionable To-Do Recommendations for Community-Aware Manuscript Revisions; Beyond Fact Retrieval: Episodic Memory for RAG with Generative Semantic Workspaces

Benchmark Brief

Rpts-Eval

Coverage: 1 paper (10%)

1 paper (10%) mentions Rpts-Eval.

Examples: RPTS: Tree-Structured Reasoning Process Scoring for Faithful Multimodal Evaluation

Benchmark Brief

SQuAD

Coverage: 1 paper (10%)

1 paper (10%) mentions SQuAD.

Examples: Chain of Summaries: Summarization Through Iterative Questioning

Metric Brief

accuracy

Coverage: 3 papers (30%)

3 papers (30%) mention accuracy.

Examples: CLARITY: Contextual Linguistic Adaptation and Accent Retrieval for Dual-Bias Mitigation in Text-to-Speech Generation; Intelligence per Watt: Measuring Intelligence Efficiency of Local AI; Graph Representation-based Model Poisoning on the Heterogeneous Internet of Agents

Metric Brief

coherence

Coverage: 1 paper (10%)

1 paper (10%) mentions coherence.

Examples: Beyond Fact Retrieval: Episodic Memory for RAG with Generative Semantic Workspaces

Metric Brief

context length

Coverage: 1 paper (10%)

1 paper (10%) mentions context length.

Examples: Chain of Summaries: Summarization Through Iterative Questioning

Most Cited In This Hub

Fast path to methods with the strongest citation traction in this scope.

Papers: EARL: Entropy-Aware RL Alignment of LLMs for Reliable RTL Code Generation; CLARITY: Contextual Linguistic Adaptation and Accent Retrieval for Dual-Bias Mitigation in Text-to-Speech Generation; Multimodal Peer Review Simulation with Actionable To-Do Recommendations for Community-Aware Manuscript Revisions

Most Recent

Fast path to latest protocol changes and newly published evaluation setups.

Best Protocol Detail

Papers with explicit rater/unit metadata and quality-control signals for reproducibility.

Papers Published On This Date

EARL: Entropy-Aware RL Alignment of LLMs for Reliable RTL Code Generation
Jiahe Shi, Zhengqi Gao, Ching-Yun Ko, Duane Boning · Nov 15, 2025

Recent advances in large language models (LLMs) have demonstrated significant potential in hardware design automation, particularly in using natural language to synthesize Register-Transfer Level (RTL) code.

CLARITY: Contextual Linguistic Adaptation and Accent Retrieval for Dual-Bias Mitigation in Text-to-Speech Generation
Crystal Min Hui Poon, Pai Chet Ng, Xiaoxiao Miao, Immanuel Jun Kai Loh, Bowen Zhang · Nov 14, 2025

Instruction-guided text-to-speech (TTS) research has reached a maturity level where excellent speech generation quality is possible on demand, yet two coupled biases persist in reducing perceived quality: accent bias, where models default t

Multimodal Peer Review Simulation with Actionable To-Do Recommendations for Community-Aware Manuscript Revisions
Mengze Hong, Di Jiang, Weiwei Zhao, Yawen Li, Yihang Wang · Nov 14, 2025

Critique Edit

Experimental results highlight the effectiveness of the proposed system in generating more comprehensive and useful reviews aligned with expert standards, surpassing ablated baselines and advancing transparent, human-centered scholarly assi

Mastering Olympiad-Level Physics with Artificial Intelligence
Dong-Shan Jian, Xiang Li, Chen-Xu Yan, Hui-Wen Zheng, Zhi-Zhang Bian · Nov 13, 2025

Olympiad-level physics problem-solving significantly challenges both humans and artificial intelligence (AI), as it requires integrating appropriate modeling, application of physical principles, and precise calculation within long reasoning

Chain of Summaries: Summarization Through Iterative Questioning
William Brach, Kristián Košťál, Lukas Galke Poech · Nov 12, 2025

CoS thus resembles an appealing option for website maintainers to make their content more accessible for LLMs, while retaining possibilities for human oversight.

State of the Art in Text Classification for South Slavic Languages: Fine-Tuning or Prompting?
Taja Kuzman Pungeršek, Peter Rupnik, Ivan Porupski, Vuk Dinić, Nikola Ljubešić · Nov 11, 2025

Until recently, fine-tuned BERT-like models provided state-of-the-art performance on text classification tasks.

Intelligence per Watt: Measuring Intelligence Efficiency of Local AI
Jon Saad-Falcon, Avanika Narayan, Hakki Orhun Akengin, J. Wes Griffin, Herumb Shandilya · Nov 11, 2025

Large language model (LLM) queries are predominantly processed by frontier models in centralized cloud infrastructure.

Beyond Fact Retrieval: Episodic Memory for RAG with Generative Semantic Workspaces
Shreyas Rajesh, Pavan Holur, Chenda Duan, David Chong, Vwani Roychowdhury · Nov 10, 2025

Long Horizon

On the Episodic Memory Benchmark (EpBench) [huet_episodic_2025] comprising corpora ranging from 100k to 1M tokens in length, GSW outperforms existing RAG based baselines by up to 20%.

Graph Representation-based Model Poisoning on the Heterogeneous Internet of Agents
Hanlin Cai, Houtianfu Wang, Haofan Dong, Kai Li, Sai Zou · Nov 10, 2025

Internet of Agents (IoA) envisions a unified, agent-centric paradigm where heterogeneous large language model (LLM) agents can interconnect and collaborate at scale.

RPTS: Tree-Structured Reasoning Process Scoring for Faithful Multimodal Evaluation
Haofeng Wang, Yu Zhang · Nov 10, 2025

Large Vision-Language Models (LVLMs) excel in multimodal reasoning and have shown impressive performance on various multimodal benchmarks.
