
Daily Archive

HFEPX Fortnight Archive: 2025-F23

Updated from the current HFEPX corpus (Feb 27, 2026). This archive page groups 15 papers. Common evaluation modes: Automatic Metrics, Simulation Env. Most common rater population: Domain Experts. Frequently cited benchmark: Retrieval. Common metric signal: accuracy. Use this page to compare protocol setups, judge behavior, and labeling-design decisions before running new eval experiments. The newest paper in this set is from Nov 15, 2025.

Papers: 15 · Last published: Nov 15, 2025

Research Narrative

Grounded narrative · Model: deterministic-grounded · Source: persisted

This page tracks 15 papers for HFEPX Fortnight Archive: 2025-F23. Dominant protocol signals include automatic metrics and simulation environments, with frequent benchmark focus on Retrieval and Cv-Bench and metric focus on accuracy and cost. Use the grounded sections below to prioritize reproducible protocol choices, benchmark-matched comparisons, and judge-vs-human evaluation checks.

Why This Matters For Eval Research

Protocol Takeaways

Benchmark Interpretation

  • Retrieval appears in 20% of hub papers (3/15); use this cohort for benchmark-matched comparisons.
  • Cv-Bench appears in 6.7% of hub papers (1/15); use this cohort for benchmark-matched comparisons.

Metric Interpretation

  • accuracy is reported in 40% of hub papers (6/15); compare with a secondary metric before ranking methods.
  • cost is reported in 13.3% of hub papers (2/15); compare with a secondary metric before ranking methods.

Researcher Checklist

  • Close gap on Papers with explicit human feedback. Coverage is a replication risk (20% vs 45% target).
  • Close gap on Papers reporting quality controls. Coverage is a replication risk (0% vs 30% target).
  • Maintain strength on Papers naming benchmarks/datasets. Coverage is strong (53.3% vs 35% target).
  • Maintain strength on Papers naming evaluation metrics. Coverage is strong (60% vs 35% target).
  • Close gap on Papers with known rater population. Coverage is a replication risk (6.7% vs 35% target).
  • Close gap on Papers with known annotation unit. Coverage is a replication risk (0% vs 35% target).
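The coverage-versus-target checks in the checklist above can be reproduced with a short script. The coverage and target values are copied from this page; the function and dictionary names are illustrative, not part of the HFEPX tooling.

```python
# Flag replication risks by comparing protocol coverage against targets.
# Values are taken from the Researcher Checklist above; names are illustrative.
CHECKS = {
    "explicit human feedback": (0.200, 0.45),
    "quality controls": (0.000, 0.30),
    "named benchmarks/datasets": (0.533, 0.35),
    "named evaluation metrics": (0.600, 0.35),
    "known rater population": (0.067, 0.35),
    "known annotation unit": (0.000, 0.35),
}

def classify(coverage: float, target: float) -> str:
    """A check is a replication risk when coverage falls below its target."""
    return "replication risk" if coverage < target else "strong"

for name, (coverage, target) in CHECKS.items():
    print(f"{name}: {coverage:.1%} vs {target:.0%} target -> {classify(coverage, target)}")
```

Under these numbers, four of the six checks come out as replication risks, matching the "Close gap" items above.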


Suggested Reading Order

  1. EARL: Entropy-Aware RL Alignment of LLMs for Reliable RTL Code Generation

    Start here for detailed protocol reporting, including rater and quality-control evidence.

  2. CLARITY: Contextual Linguistic Adaptation and Accent Retrieval for Dual-Bias Mitigation in Text-to-Speech Generation

    Also reports detailed protocol, including rater and quality-control evidence.

  3. Multimodal Peer Review Simulation with Actionable To-Do Recommendations for Community-Aware Manuscript Revisions

    Also reports detailed protocol, including rater and quality-control evidence.

  4. Mastering Olympiad-Level Physics with Artificial Intelligence

    Adds automatic metrics for broader coverage within this hub.

  5. Chain of Summaries: Summarization Through Iterative Questioning

    Adds automatic metrics for broader coverage within this hub.

  6. State of the Art in Text Classification for South Slavic Languages: Fine-Tuning or Prompting?

    Adds automatic metrics for broader coverage within this hub.

  7. Intelligence per Watt: Measuring Intelligence Efficiency of Local AI

    Adds automatic metrics for broader coverage within this hub.

  8. Beyond Fact Retrieval: Episodic Memory for RAG with Generative Semantic Workspaces

    Adds automatic metrics for broader coverage within this hub.

Known Limitations

  • No papers (0%) report quality controls; prioritize calibration/adjudication evidence.
  • Rater population is under-specified (6.7% coverage).
  • Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Links

automatic_metrics vs simulation_env

both=0, left_only=14, right_only=1

No papers use both Automatic Metrics and Simulation Env: 14 use automatic metrics only, and 1 uses a simulation environment only.
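The both/left-only/right-only split is a plain set comparison over paper cohorts. The paper IDs below are hypothetical placeholders sized to match the counts reported on this page (14 automatic-metrics papers, 1 simulation-env paper, no overlap); only the set operations are the point.

```python
# Compare two evaluation-mode cohorts as sets of paper IDs.
# IDs are placeholders matching the counts on this page; no overlap by construction.
automatic_metrics = {f"paper-{i}" for i in range(14)}
simulation_env = {"paper-14"}

both = automatic_metrics & simulation_env        # papers using both modes
left_only = automatic_metrics - simulation_env   # automatic metrics only
right_only = simulation_env - automatic_metrics  # simulation env only

print(f"both={len(both)}, left_only={len(left_only)}, right_only={len(right_only)}")
# prints: both=0, left_only=14, right_only=1
```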

Benchmark Brief

Cv-Bench

Coverage: 1 paper (6.7%)

Example: Long Grounded Thoughts: Synthesizing Visual Problems and Reasoning Chains at Scale

Benchmark Brief

MATH

Coverage: 1 paper (6.7%)

Example: Long Grounded Thoughts: Synthesizing Visual Problems and Reasoning Chains at Scale

Metric Brief

latency

Coverage: 2 papers (13.3%)

Examples: Intelligence per Watt: Measuring Intelligence Efficiency of Local AI; OckBench: Measuring the Efficiency of LLM Reasoning
