HFEPX Weekly Archive: 2025-W43

Updated from the current HFEPX corpus (Feb 27, 2026). 14 papers are grouped on this weekly page. Common evaluation modes: Automatic Metrics, Simulation Env. Most common rater population: Domain Experts. Common annotation unit: Freeform. Frequent quality control: Calibration. Frequently cited benchmark: Caparena. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. The newest paper in this set is from Oct 26, 2025.

Papers: 14 · Last published: Oct 26, 2025

Research Narrative

Grounded narrative · Model: deterministic-grounded · Source: persisted

Updated from the current HFEPX corpus (Feb 27, 2026). This page tracks 14 papers for HFEPX Weekly Archive: 2025-W43. Dominant protocol signals include automatic metrics, simulation environments, and human evaluation, with frequent benchmark focus on Caparena and HonestyBench and metric focus on accuracy and calibration error. Use the grounded sections below to prioritize reproducible protocol choices, benchmark-matched comparisons, and judge-vs-human evaluation checks.

Why This Matters For Eval Research

35.7% of papers (5/14) report explicit human-feedback signals, led by demonstration data.

Evidence: MoMaGen: Generating Demonstrations under Soft and Hard Constraints for Multi-Step Bimanual Mobile Manipulation; SPACeR: Self-Play Anchoring with Centralized Reference Models; Towards Scalable Oversight via Partitioned Human Supervision; ATLAS: Adaptive Transfer Scaling Laws for Multilingual Pretraining, Finetuning, and Decoding the Curse of Multilinguality

Automatic metrics appear in 64.3% of papers (9/14) in this hub.

Evidence: Towards Scalable Oversight via Partitioned Human Supervision; ATLAS: Adaptive Transfer Scaling Laws for Multilingual Pretraining, Finetuning, and Decoding the Curse of Multilinguality; Designing and Evaluating Chain-of-Hints for Scientific Question Answering; RELOOP: Recursive Retrieval with Multi-Hop Reasoner and Planners for Heterogeneous QA

Caparena is listed as the benchmark anchor for cross-paper comparisons on this page, though it appears in only 1 of 14 papers (7.1%).

Evidence: PoSh: Using Scene Graphs To Guide LLMs-as-a-Judge For Detailed Image Descriptions; Towards Scalable Oversight via Partitioned Human Supervision; ATLAS: Adaptive Transfer Scaling Laws for Multilingual Pretraining, Finetuning, and Decoding the Curse of Multilinguality; PARL: Prompt-based Agents for Reinforcement Learning

Protocol Takeaways

1 sampled paper reports both human evaluation and LLM-as-judge, supporting a direct agreement check.

Evidence: PoSh: Using Scene Graphs To Guide LLMs-as-a-Judge For Detailed Image Descriptions; Estonian Native Large Language Model Benchmark; Towards Scalable Oversight via Partitioned Human Supervision; ATLAS: Adaptive Transfer Scaling Laws for Multilingual Pretraining, Finetuning, and Decoding the Curse of Multilinguality

The most common quality-control signal is rater calibration (14.3% of papers).

Evidence: A Multi-faceted Analysis of Cognitive Abilities: Evaluating Prompt Methods with Large Language Models on the CONSORT Checklist; Annotation-Efficient Universal Honesty Alignment; Towards Scalable Oversight via Partitioned Human Supervision; ATLAS: Adaptive Transfer Scaling Laws for Multilingual Pretraining, Finetuning, and Decoding the Curse of Multilinguality

Raters are mostly domain experts, and the annotation unit is commonly freeform; use this to scope replication staffing.

Evidence: Towards Scalable Oversight via Partitioned Human Supervision; A Multi-faceted Analysis of Cognitive Abilities: Evaluating Prompt Methods with Large Language Models on the CONSORT Checklist; PoSh: Using Scene Graphs To Guide LLMs-as-a-Judge For Detailed Image Descriptions; ATLAS: Adaptive Transfer Scaling Laws for Multilingual Pretraining, Finetuning, and Decoding the Curse of Multilinguality
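
A judge-versus-human agreement check of the kind mentioned in the first takeaway could be sketched with Cohen's kappa. This is only an illustration: the labels below are invented and do not come from any paper on this page, and `cohens_kappa` is a hypothetical helper, not part of HFEPX.

```python
from collections import Counter


def cohens_kappa(human, judge):
    """Cohen's kappa between two equal-length label sequences."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(human)
    # Observed agreement: fraction of items where both raters give the same label.
    p_observed = sum(h == j for h, j in zip(human, judge)) / n
    # Chance agreement, estimated from each rater's own label frequencies.
    human_freq = Counter(human)
    judge_freq = Counter(judge)
    p_expected = sum(human_freq[c] * judge_freq[c] for c in human_freq) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_observed - p_expected) / (1 - p_expected)


# Invented labels for illustration only (assumption, not data from the corpus).
human_labels = ["good", "good", "bad", "good", "bad", "bad", "good", "bad"]
judge_labels = ["good", "bad", "bad", "good", "bad", "good", "good", "bad"]

kappa = cohens_kappa(human_labels, judge_labels)
print(f"kappa = {kappa:.2f}")
```

Kappa is used here rather than raw agreement because it discounts the agreement two raters would reach by chance; other agreement statistics may suit ordinal or multi-rater setups better.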

Benchmark Interpretation

Caparena appears in 7.1% of hub papers (1/14); use this cohort for benchmark-matched comparisons.
HonestyBench appears in 7.1% of hub papers (1/14); use this cohort for benchmark-matched comparisons.

Metric Interpretation

Accuracy is reported in 14.3% of hub papers (2/14); compare with a secondary metric before ranking methods.
Calibration error is reported in 7.1% of hub papers (1/14); compare with a secondary metric before ranking methods.

Researcher Checklist

Tighten coverage on Papers with explicit human feedback. Coverage is usable but incomplete (35.7% vs 45% target).
Close gap on Papers reporting quality controls. Coverage is a replication risk (14.3% vs 30% target).
Tighten coverage on Papers naming benchmarks/datasets. Coverage is usable but incomplete (21.4% vs 35% target).
Maintain strength on Papers naming evaluation metrics. Coverage is strong (35.7% vs 35% target).
Tighten coverage on Papers with known rater population. Coverage is usable but incomplete (21.4% vs 35% target).
Close gap on Papers with known annotation unit. Coverage is a replication risk (14.3% vs 35% target).
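
The checklist statuses above can be reproduced with a simple coverage-to-target ratio. The sketch below is an assumption about how such labels might be assigned: the page does not state its rule, and the `risk_ratio=0.5` cut-off and the `coverage_status` helper are hypothetical choices that happen to match the six items listed.

```python
def coverage_status(coverage_pct, target_pct, risk_ratio=0.5):
    """Label coverage relative to its target.

    risk_ratio is an assumed cut-off, not a documented HFEPX value.
    """
    ratio = coverage_pct / target_pct
    if ratio >= 1.0:
        return "strong"
    if ratio >= risk_ratio:
        return "usable but incomplete"
    return "replication risk"


# (coverage %, target %) pairs copied from the checklist above.
checklist = {
    "explicit human feedback": (35.7, 45),
    "quality controls": (14.3, 30),
    "benchmarks/datasets": (21.4, 35),
    "evaluation metrics": (35.7, 35),
    "known rater population": (21.4, 35),
    "known annotation unit": (14.3, 35),
}

statuses = {name: coverage_status(c, t) for name, (c, t) in checklist.items()}
for name, status in statuses.items():
    print(f"{name}: {status}")
```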

Known Limitations

Only 14.3% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (21.4% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Links

Judge vs Human Agreement - Compares papers that evaluate with both human raters and LLM judges.
Benchmark Slice: Caparena - Prioritizes benchmark-specific protocol comparisons.
Metric Slice: accuracy - Finds papers where reported metrics are directly comparable.
Recent High-Signal Papers - Keeps the hub connected to the latest HFEPX corpus updates.

human_eval vs llm_as_judge

both=1, left_only=1, right_only=0

1 paper uses both human evaluation and LLM-as-judge.

human_eval vs automatic_metrics

both=0, left_only=2, right_only=9

0 papers use both human evaluation and automatic metrics.

llm_as_judge vs automatic_metrics

both=0, left_only=1, right_only=9

0 papers use both LLM-as-judge and automatic metrics.
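
The `both`, `left_only`, and `right_only` counts above are plain set operations over the papers tagged with each evaluation mode. A minimal sketch, assuming placeholder paper IDs (`p1` to `p11` are hypothetical and chosen only to reproduce the reported counts; `overlap_counts` is not an HFEPX function):

```python
def overlap_counts(left, right):
    """Count papers in both sets, only in the left set, and only in the right set."""
    left, right = set(left), set(right)
    return {
        "both": len(left & right),
        "left_only": len(left - right),
        "right_only": len(right - left),
    }


# Placeholder IDs (assumption): 2 human-eval papers, 1 LLM-as-judge paper,
# 9 automatic-metrics papers, matching the totals implied by the page.
human_eval = {"p1", "p2"}
llm_as_judge = {"p1"}
automatic_metrics = {f"p{i}" for i in range(3, 12)}

human_vs_judge = overlap_counts(human_eval, llm_as_judge)
human_vs_auto = overlap_counts(human_eval, automatic_metrics)
judge_vs_auto = overlap_counts(llm_as_judge, automatic_metrics)

print(human_vs_judge)
print(human_vs_auto)
print(judge_vs_auto)
```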

Benchmark Brief

Caparena

Coverage: 1 paper (7.1%)

1 paper (7.1%) mentions Caparena.

Examples: PoSh: Using Scene Graphs To Guide LLMs-as-a-Judge For Detailed Image Descriptions

Benchmark Brief

HonestyBench

Coverage: 1 paper (7.1%)

1 paper (7.1%) mentions HonestyBench.

Examples: Annotation-Efficient Universal Honesty Alignment

Benchmark Brief

HotpotQA

Coverage: 1 paper (7.1%)

1 paper (7.1%) mentions HotpotQA.

Examples: RELOOP: Recursive Retrieval with Multi-Hop Reasoner and Planners for Heterogeneous QA

Metric Brief

accuracy

Coverage: 2 papers (14.3%)

2 papers (14.3%) mention accuracy.

Examples: Towards Scalable Oversight via Partitioned Human Supervision; RELOOP: Recursive Retrieval with Multi-Hop Reasoner and Planners for Heterogeneous QA

Metric Brief

calibration error

Coverage: 1 paper (7.1%)

1 paper (7.1%) mentions calibration error.

Examples: A Multi-faceted Analysis of Cognitive Abilities: Evaluating Prompt Methods with Large Language Models on the CONSORT Checklist

Metric Brief

f1

Coverage: 1 paper (7.1%)

1 paper (7.1%) mentions f1.

Examples: RELOOP: Recursive Retrieval with Multi-Hop Reasoner and Planners for Heterogeneous QA

Most Cited In This Hub

Fast path to methods with the strongest citation traction in this scope.

Papers: Towards Scalable Oversight via Partitioned Human Supervision; ATLAS: Adaptive Transfer Scaling Laws for Multilingual Pretraining, Finetuning, and Decoding the Curse of Multilinguality; PARL: Prompt-based Agents for Reinforcement Learning

Papers Published On This Date

Towards Scalable Oversight via Partitioned Human Supervision
Ren Yin, Takashi Ishida, Masashi Sugiyama · Oct 26, 2025

As artificial intelligence (AI) systems approach and surpass expert human performance across a broad range of tasks, obtaining high-quality human supervision for evaluation and training becomes increasingly challenging.

ATLAS: Adaptive Transfer Scaling Laws for Multilingual Pretraining, Finetuning, and Decoding the Curse of Multilinguality
Shayne Longpre, Sneha Kudugunta, Niklas Muennighoff, I-Hung Hsu, Isaac Caswell · Oct 24, 2025

In this work, we undertake the largest multilingual scaling laws study to date, totaling 774 multilingual training experiments, spanning 10M-8B model parameters, 400+ training languages and 48 evaluation languages.

PARL: Prompt-based Agents for Reinforcement Learning
Yarik Menchaca Resendiz, Roman Klinger · Oct 24, 2025

However, limited work evaluates LLMs as agents in reinforcement learning (RL) tasks (e.g., playing games), where learning occurs through interaction with an environment and a reward system.

Estonian Native Large Language Model Benchmark
Helena Grete Lillepalu, Tanel Alumäe · Oct 24, 2025

The availability of LLM benchmarks for the Estonian language is limited, and a comprehensive evaluation comparing the performance of different LLMs on Estonian tasks has yet to be conducted.

Designing and Evaluating Chain-of-Hints for Scientific Question Answering
Anubhav Jangra, Smaranda Muresan · Oct 24, 2025

Tags: Pairwise Preference

Using the best performing LLM as the backbone of a quantitative study with 41 participants, we uncover distinct user preferences across hinting strategies, and identify the limitations of automatic evaluation metrics to capture them.

RELOOP: Recursive Retrieval with Multi-Hop Reasoner and Planners for Heterogeneous QA
Ruiyi Yang, Hao Xue, Imran Razzak, Hakim Hacid, Flora D. Salim · Oct 23, 2025

Tags: Long Horizon

A Head Agent provides guidance that leads retrieval, while an Iteration Agent selects and expands HSeq via structure-respecting actions (e.g., parent/child hops, table row/column neighbors, KG relations); Finally the head agent composes can

Robust Preference Alignment via Directional Neighborhood Consensus
Ruochen Mao, Yuling Shi, Xiaodong Gu, Jiaheng Wei · Oct 23, 2025

Tags: Pairwise Preference

Aligning large language models with human preferences is critical for creating reliable and controllable AI systems.

CreativityPrism: A Holistic Evaluation Framework for Large Language Model Creativity
Zhaoyi Joey Hou, Bowei Alvin Zhang, Yining Lu, Bhiman Kumar Baghel, Anneliese Brei · Oct 23, 2025

Creativity is often seen as a hallmark of human intelligence.

A Multi-faceted Analysis of Cognitive Abilities: Evaluating Prompt Methods with Large Language Models on the CONSORT Checklist
Sohyeon Jeon, Hyung-Chul Lee · Oct 22, 2025

Despite the rapid expansion of Large Language Models (LLMs) in healthcare, robust and explainable evaluation of their ability to assess clinical trial reporting according to CONSORT standards remains an open challenge.

PoSh: Using Scene Graphs To Guide LLMs-as-a-Judge For Detailed Image Descriptions
Amith Ananthram, Elias Stengel-Eskin, Lorena A. Bradford, Julia Demarest, Adam Purvis · Oct 21, 2025

Tags: Rubric Rating

While vision-language models (VLMs) have advanced into detailed image description, evaluation remains a challenge.

MoMaGen: Generating Demonstrations under Soft and Hard Constraints for Multi-Step Bimanual Mobile Manipulation
Chengshu Li, Mengdi Xu, Arpit Bahety, Hang Yin, Yunfan Jiang · Oct 21, 2025

Tags: Demonstrations, Long Horizon

Imitation learning from large-scale, diverse human demonstrations has been shown to be effective for training robots, but collecting such data is costly and time-consuming.

Latent-Augmented Discrete Diffusion Models
Dario Shariatian, Alain Durmus, Umut Simsekli, Stefano Peluchetti · Oct 20, 2025

Discrete diffusion models have emerged as a powerful class of models and a promising route to fast language generation, but practical implementations typically rely on factored reverse transitions that ignore cross-token dependencies and de

SPACeR: Self-Play Anchoring with Centralized Reference Models
Wei-Jer Chang, Akshay Rangesh, Kevin Joseph, Matthew Strong, Masayoshi Tomizuka · Oct 20, 2025

Tags: Demonstrations, Multi Agent

Developing autonomous vehicles (AVs) requires not only safety and efficiency, but also realistic, human-like behaviors that are socially aware and predictable.

Annotation-Efficient Universal Honesty Alignment
Shiyu Ni, Keping Bi, Jiafeng Guo, Minghao Tang, Jingtong Wu · Oct 20, 2025

To support a large-scale study, we release HonestyBench, a benchmark covering ten free-form QA datasets with 560k training and 70k evaluation instances annotated with correctness and self-consistency signals.
