HFEPX Weekly Archive: 2025-W24

Updated from current HFEPX corpus (Feb 27, 2026). 10 papers are grouped in this weekly page. Common evaluation modes: Automatic Metrics, Simulation Env. Most common rater population: Domain Experts. Frequently cited benchmark: MATH. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. The newest paper in this set is from Jun 15, 2025.

Papers: 10 · Last published: Jun 15, 2025

Research Narrative

Grounded narrative · Model: deterministic-grounded · Source: persisted

Updated from current HFEPX corpus (Feb 27, 2026). This page tracks 10 papers for HFEPX Weekly Archive: 2025-W24. Dominant protocol signals are automatic metrics and simulation environments, with frequent benchmark focus on MATH and MATH-500 and metric focus on accuracy and coherence. Use the grounded sections below to prioritize reproducible protocol choices, benchmark-matched comparisons, and judge-vs-human evaluation checks.

Why This Matters For Eval Research

20% of papers report explicit human-feedback signals, led by expert verification.

Evidence: From Raw Corpora to Domain Benchmarks: Automated Evaluation of LLM Domain Expertise; $\texttt{SPECS}$: Faster Test-Time Scaling through Speculative Drafts; Persona-driven Simulation of Voting Behavior in the European Parliament with Large Language Models; Spurious Rewards: Rethinking Training Signals in RLVR

Automatic metrics appear in 100% of papers in this hub.

Evidence: $\texttt{SPECS}$: Faster Test-Time Scaling through Speculative Drafts; Persona-driven Simulation of Voting Behavior in the European Parliament with Large Language Models; Spurious Rewards: Rethinking Training Signals in RLVR; Probabilistic distances-based hallucination detection in LLMs with RAG

MATH is a recurring benchmark anchor for cross-paper comparisons on this page.

Evidence: Spurious Rewards: Rethinking Training Signals in RLVR; AbstRaL: Augmenting LLMs' Reasoning by Reinforcing Abstract Thinking; $\texttt{SPECS}$: Faster Test-Time Scaling through Speculative Drafts; Persona-driven Simulation of Voting Behavior in the European Parliament with Large Language Models

Protocol Takeaways

Quality-control reporting is sparse in this slice; prioritize papers with explicit calibration or adjudication steps.

Evidence: $\texttt{SPECS}$: Faster Test-Time Scaling through Speculative Drafts; Persona-driven Simulation of Voting Behavior in the European Parliament with Large Language Models; Spurious Rewards: Rethinking Training Signals in RLVR; Probabilistic distances-based hallucination detection in LLMs with RAG

Rater context is mostly domain experts, and annotation units are commonly mixed; use this to scope replication staffing.

Evidence: ICE-ID: A Novel Historical Census Dataset for Longitudinal Identity Resolution; From Raw Corpora to Domain Benchmarks: Automated Evaluation of LLM Domain Expertise; $\texttt{SPECS}$: Faster Test-Time Scaling through Speculative Drafts; Persona-driven Simulation of Voting Behavior in the European Parliament with Large Language Models

Stratify by benchmark (MATH vs MATH-500) before comparing methods.

Evidence: $\texttt{SPECS}$: Faster Test-Time Scaling through Speculative Drafts; Persona-driven Simulation of Voting Behavior in the European Parliament with Large Language Models; Spurious Rewards: Rethinking Training Signals in RLVR; Probabilistic distances-based hallucination detection in LLMs with RAG

Benchmark Interpretation

MATH appears in 20% of hub papers (2/10); use this cohort for benchmark-matched comparisons.
MATH-500 appears in 20% of hub papers (2/10); use this cohort for benchmark-matched comparisons.
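
The benchmark cohorts above can be rebuilt from a paper-to-benchmark mapping. The following is a minimal sketch, assuming the example papers listed in this hub's MATH and MATH-500 briefs are the full cohorts; the shortened titles, the `PAPER_BENCHMARKS` mapping, and the `benchmark_cohorts` helper are illustrative names, not part of HFEPX.

```python
from collections import defaultdict

# Benchmark mentions taken from this hub's benchmark briefs.
# Short titles are abbreviations chosen here for readability (assumption).
PAPER_BENCHMARKS = {
    "Spurious Rewards": ["MATH", "MATH-500"],
    "AbstRaL": ["MATH"],
    "SPECS": ["MATH-500"],
}
HUB_SIZE = 10  # total number of papers in this hub


def benchmark_cohorts(paper_benchmarks):
    """Invert paper -> benchmarks into benchmark -> sorted list of papers."""
    cohorts = defaultdict(list)
    for paper, benchmarks in paper_benchmarks.items():
        for benchmark in benchmarks:
            cohorts[benchmark].append(paper)
    return {name: sorted(papers) for name, papers in cohorts.items()}


cohorts = benchmark_cohorts(PAPER_BENCHMARKS)
for name, papers in sorted(cohorts.items()):
    share = 100 * len(papers) / HUB_SIZE
    print(f"{name}: {len(papers)}/{HUB_SIZE} ({share:.0f}%) -> {papers}")
```

Comparing methods only within one cohort keeps a MATH result from being ranked against a MATH-500 result.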

Metric Interpretation

accuracy is reported in 30% of hub papers (3/10); compare with a secondary metric before ranking methods.
coherence is reported in 10% of hub papers (1/10); compare with a secondary metric before ranking methods.

Researcher Checklist

Close gap on Papers with explicit human feedback. Coverage is a replication risk (20% vs 45% target).
Close gap on Papers reporting quality controls. Coverage is a replication risk (0% vs 30% target).
Maintain strength on Papers naming benchmarks/datasets. Coverage is strong (50% vs 35% target).
Maintain strength on Papers naming evaluation metrics. Coverage is strong (60% vs 35% target).
Close gap on Papers with known rater population. Coverage is a replication risk (20% vs 35% target).
Close gap on Papers with known annotation unit. Coverage is a replication risk (0% vs 35% target).
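
The checklist's risk and strength labels are consistent with a simple threshold comparison. The sketch below assumes the rule is "coverage at or above target is strong, otherwise a replication risk"; that rule is inferred from the six rows above and is not documented by HFEPX, and `coverage_status` is a hypothetical helper.

```python
# (label, coverage %, target %) as reported in the checklist above.
CHECKLIST = [
    ("explicit human feedback", 20, 45),
    ("quality controls", 0, 30),
    ("benchmarks/datasets named", 50, 35),
    ("evaluation metrics named", 60, 35),
    ("known rater population", 20, 35),
    ("known annotation unit", 0, 35),
]


def coverage_status(coverage, target):
    """Assumed rule: meeting or exceeding the target counts as strong."""
    return "strong" if coverage >= target else "replication risk"


# Each row: (label, status, percentage points still missing; negative = surplus).
report = [(label, coverage_status(c, t), t - c) for label, c, t in CHECKLIST]
for label, status, gap in report:
    print(f"{label}: {status} (gap to target: {gap} points)")
```

Sorting `report` by the gap column gives a rough priority order for which reporting fields to look for first when choosing papers to replicate.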

Known Limitations

No papers (0%) report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (20% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Links

Benchmark Slice: MATH - Prioritizes benchmark-specific protocol comparisons.
Metric Slice: accuracy - Finds papers where reported metrics are directly comparable.
Recent High-Signal Papers - Keeps the hub connected to the latest HFEPX corpus updates.

automatic_metrics vs simulation_env

both=1, left_only=9, right_only=0

1 paper uses both Automatic Metrics and Simulation Env.
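
The `both`, `left_only`, and `right_only` counts correspond to set intersection and set differences over the papers tagged with each evaluation mode. A minimal sketch follows, assuming papers are identified by opaque IDs; the IDs `p01` to `p10` and the choice of which paper carries both tags are placeholders, and `overlap_counts` is a hypothetical helper rather than an HFEPX function.

```python
def overlap_counts(left, right):
    """Count items in both sets, only in the left set, and only in the right set."""
    left, right = set(left), set(right)
    return {
        "both": len(left & right),
        "left_only": len(left - right),
        "right_only": len(right - left),
    }


# Placeholder IDs: all 10 hub papers use automatic metrics, and one of them
# also uses a simulation environment (which one is arbitrary here).
automatic_metrics = {f"p{i:02d}" for i in range(1, 11)}
simulation_env = {"p01"}

counts = overlap_counts(automatic_metrics, simulation_env)
print(counts)
```

With these inputs the three counts sum to 10, matching the statement that automatic metrics appear in every paper in the hub.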

Benchmark Brief

MATH

Coverage: 2 papers (20%)

2 papers (20%) mention MATH.

Examples: Spurious Rewards: Rethinking Training Signals in RLVR; AbstRaL: Augmenting LLMs' Reasoning by Reinforcing Abstract Thinking

Benchmark Brief

MATH-500

Coverage: 2 papers (20%)

2 papers (20%) mention MATH-500.

Examples: $\texttt{SPECS}$: Faster Test-Time Scaling through Speculative Drafts; Spurious Rewards: Rethinking Training Signals in RLVR

Benchmark Brief

Retrieval

Coverage: 2 papers (20%)

2 papers (20%) mention Retrieval.

Examples: Probabilistic distances-based hallucination detection in LLMs with RAG; Structure-Augmented Reasoning Generation

Metric Brief

accuracy

Coverage: 3 papers (30%)

3 papers (30%) mention accuracy.

Examples: $\texttt{SPECS}$: Faster Test-Time Scaling through Speculative Drafts; Towards Robust Real-World Multivariate Time Series Forecasting: A Unified Framework for Dependency, Asynchrony, and Missingness; Structure-Augmented Reasoning Generation

Metric Brief

coherence

Coverage: 1 paper (10%)

1 paper (10%) mentions coherence.

Examples: Structure-Augmented Reasoning Generation

Metric Brief

cost

Coverage: 1 paper (10%)

1 paper (10%) mentions cost.

Examples: From Raw Corpora to Domain Benchmarks: Automated Evaluation of LLM Domain Expertise

Most Cited In This Hub

Fast path to methods with the strongest citation traction in this scope.

Papers: $\texttt{SPECS}$: Faster Test-Time Scaling through Speculative Drafts; Persona-driven Simulation of Voting Behavior in the European Parliament with Large Language Models; Spurious Rewards: Rethinking Training Signals in RLVR

Most Recent

Fast path to latest protocol changes and newly published evaluation setups.

Best Protocol Detail

Papers with explicit rater/unit metadata and quality-control signals for reproducibility.

Papers Published On This Date

$\texttt{SPECS}$: Faster Test-Time Scaling through Speculative Drafts
Mert Cemri, Nived Rajaraman, Rishabh Tiwari, Xiaoxuan Liu, Kurt Keutzer · Jun 15, 2025

Scaling test-time compute has driven the recent advances in the reasoning capabilities of large language models (LLMs), typically by allocating additional computation for more thorough exploration.

Persona-driven Simulation of Voting Behavior in the European Parliament with Large Language Models
Maximilian Kreutner, Marlene Lutz, Markus Strohmaier · Jun 13, 2025

Large Language Models (LLMs) display remarkable capabilities to understand or even produce political discourse but have been found to consistently exhibit a progressive left-leaning bias.

Spurious Rewards: Rethinking Training Signals in RLVR
Rulin Shao, Shuyue Stella Li, Rui Xin, Scott Geng, Yiping Wang · Jun 12, 2025

We show that reinforcement learning with verifiable rewards (RLVR) can elicit strong mathematical reasoning in certain language models even with spurious rewards that have little, no, or even negative correlation with the correct answer.

Probabilistic distances-based hallucination detection in LLMs with RAG
Rodion Oblovatny, Alexandra Kuleshova, Konstantin Polev, Alexey Zaytsev · Jun 11, 2025

Detecting hallucinations in large language models (LLMs) is critical for their safety in many applications.

ICE-ID: A Novel Historical Census Dataset for Longitudinal Identity Resolution
Gonçalo Hora de Carvalho, Lazar S. Popov, Sander Kaatee, Mário S. Correia, Kristinn R. Thórisson · Jun 11, 2025

We introduce ICE-ID, a benchmark dataset comprising 984,028 records from 16 Icelandic census waves spanning 220 years (1703–1920), with 226,864 expert-curated person identifiers.

Towards Robust Real-World Multivariate Time Series Forecasting: A Unified Framework for Dependency, Asynchrony, and Missingness
Jinkwan Jang, Hyungjin Park, Jinmyeong Choi, Taesup Kim · Jun 10, 2025

Extensive experiments on public benchmark datasets reflecting practical settings, along with one private real-world industrial dataset, demonstrate the superior robustness and accuracy of ChannelTokenFormer under challenging real-world cond

Structure-Augmented Reasoning Generation
Jash Rajesh Parekh, Pengcheng Jiang, Jiawei Han · Jun 10, 2025

Extensive experiments on open-domain QA benchmarks and specialized reasoning datasets in finance and medicine demonstrate that SARG significantly outperforms state-of-the-art flat-context RAG baselines in both factual accuracy and reasoning

AbstRaL: Augmenting LLMs' Reasoning by Reinforcing Abstract Thinking
Silin Gao, Antoine Bosselut, Samy Bengio, Emmanuel Abbe · Jun 9, 2025

Our method, AbstRaL -- which promotes abstract reasoning in LLMs using RL on granular abstraction data -- significantly mitigates performance degradation on recent GSM perturbation benchmarks.

From Raw Corpora to Domain Benchmarks: Automated Evaluation of LLM Domain Expertise
Nitin Sharma, Thomas Wolfers, Çağatay Yıldız · Jun 9, 2025

Expert Verification

Accurate domain-specific benchmarking of LLMs is essential, specifically in domains with direct implications for humans, such as law, healthcare, and education.

When Style Breaks Safety: Defending LLMs Against Superficial Style Alignment
Yuxin Xiao, Sana Tonekaboni, Walter Gerych, Vinith Suriyakumar, Marzyeh Ghassemi · Jun 9, 2025

Red Team

In this work, we seek to understand whether style patterns compromise LLM safety, how superficial style alignment increases model vulnerability, and how best to mitigate these risks during alignment.
