Daily Archive

HFEPX Fortnight Archive: 2025-F12

Updated from current HFEPX corpus (Feb 27, 2026). 29 papers are grouped in this daily page. Common evaluation modes: Automatic Metrics, Simulation Env. Most common rater population: Domain Experts. Common annotation unit: Trajectory. Frequently cited benchmark: Retrieval. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Jun 15, 2025.

Papers: 29 Last published: Jun 15, 2025 Global RSS

Research Narrative

Grounded narrative Model: deterministic-grounded Source: persisted

Updated from current HFEPX corpus (Feb 27, 2026). This page tracks 29 papers for HFEPX Fortnight Archive: 2025-F12. Dominant protocol signals include automatic metrics, simulation environments, human evaluation, with frequent benchmark focus on Retrieval, MATH and metric focus on accuracy, cost. Use the grounded sections below to prioritize reproducible protocol choices, benchmark-matched comparisons, and judge-vs-human evaluation checks.

Why This Matters For Eval Research

13.8% of papers report explicit human-feedback signals, led by expert verification.

Evidence: From Raw Corpora to Domain Benchmarks: Automated Evaluation of LLM Domain Expertise , $\texttt{SPECS}$: Faster Test-Time Scaling through Speculative Drafts , Persona-driven Simulation of Voting Behavior in the European Parliament with Large Language Models , Spurious Rewards: Rethinking Training Signals in RLVR
automatic metrics appears in 93.1% of papers in this hub.

Evidence: $\texttt{SPECS}$: Faster Test-Time Scaling through Speculative Drafts , Persona-driven Simulation of Voting Behavior in the European Parliament with Large Language Models , Spurious Rewards: Rethinking Training Signals in RLVR , Probabilistic distances-based hallucination detection in LLMs with RAG
Retrieval is a recurring benchmark anchor for cross-paper comparisons in this page.

Evidence: Probabilistic distances-based hallucination detection in LLMs with RAG , Structure-Augmented Reasoning Generation , Resisting Contextual Interference in RAG via Parametric-Knowledge Reinforcement , $\texttt{SPECS}$: Faster Test-Time Scaling through Speculative Drafts

Protocol Takeaways

Quality-control reporting is sparse in this slice; prioritize papers with explicit calibration or adjudication steps.

Evidence: $\texttt{SPECS}$: Faster Test-Time Scaling through Speculative Drafts , Persona-driven Simulation of Voting Behavior in the European Parliament with Large Language Models , Spurious Rewards: Rethinking Training Signals in RLVR , Probabilistic distances-based hallucination detection in LLMs with RAG
Rater context is mostly domain experts, and annotation is commonly trajectory-level annotation; use this to scope replication staffing.

Evidence: ICE-ID: A Novel Historical Census Dataset for Longitudinal Identity Resolution , From Raw Corpora to Domain Benchmarks: Automated Evaluation of LLM Domain Expertise , $\texttt{SPECS}$: Faster Test-Time Scaling through Speculative Drafts , Persona-driven Simulation of Voting Behavior in the European Parliament with Large Language Models
Pair this hub with llm_as_judge pages to benchmark automated-vs-human evaluation tradeoffs.

Evidence: $\texttt{SPECS}$: Faster Test-Time Scaling through Speculative Drafts , Persona-driven Simulation of Voting Behavior in the European Parliament with Large Language Models , Spurious Rewards: Rethinking Training Signals in RLVR , Probabilistic distances-based hallucination detection in LLMs with RAG

Benchmark Interpretation

Retrieval appears in 13.8% of hub papers (4/29); use this cohort for benchmark-matched comparisons.
MATH appears in 10.3% of hub papers (3/29); use this cohort for benchmark-matched comparisons.

Metric Interpretation

accuracy is reported in 24.1% of hub papers (7/29); compare with a secondary metric before ranking methods.
cost is reported in 10.3% of hub papers (3/29); compare with a secondary metric before ranking methods.

Researcher Checklist

Close gap on Papers with explicit human feedback. Coverage is a replication risk (13.8% vs 45% target).
Close gap on Papers reporting quality controls. Coverage is a replication risk (0% vs 30% target).
Maintain strength on Papers naming benchmarks/datasets. Coverage is strong (37.9% vs 35% target).
Maintain strength on Papers naming evaluation metrics. Coverage is strong (51.7% vs 35% target).
Close gap on Papers with known rater population. Coverage is a replication risk (13.8% vs 35% target).
Close gap on Papers with known annotation unit. Coverage is a replication risk (6.9% vs 35% target).

Papers with explicit human feedback

Coverage is a replication risk (13.8% vs 45% target).

Papers reporting quality controls

Coverage is a replication risk (0% vs 30% target).

Papers naming benchmarks/datasets

Coverage is strong (37.9% vs 35% target).

Papers naming evaluation metrics

Coverage is strong (51.7% vs 35% target).

Papers with known rater population

Coverage is a replication risk (13.8% vs 35% target).

Papers with known annotation unit

Coverage is a replication risk (6.9% vs 35% target).

Known Limitations

Only 0% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (13.8% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Links

Human Eval Protocols - Surfaces human-rating workflows for rubric and annotator quality analysis.
Benchmark Slice: Retrieval - Prioritizes benchmark-specific protocol comparisons.
Metric Slice: accuracy - Finds papers where reported metrics are directly comparable.
Recent High-Signal Papers - Keeps the hub connected to the latest HFEPX corpus updates.

human_eval vs automatic_metrics

both=1, left_only=0, right_only=26

1 papers use both Human Eval and Automatic Metrics.

automatic_metrics vs simulation_env

both=2, left_only=25, right_only=2

2 papers use both Automatic Metrics and Simulation Env.

simulation_env vs human_eval

both=0, left_only=4, right_only=1

0 papers use both Simulation Env and Human Eval.

Benchmark Brief

Retrieval

Coverage: 4 papers (13.8%)

4 papers (13.8%) mention Retrieval.

Examples: Probabilistic distances-based hallucination detection in LLMs with RAG , Structure-Augmented Reasoning Generation , When to use Graphs in RAG: A Comprehensive Analysis for Graph Retrieval-Augmented Generation

Benchmark Brief

MATH

Coverage: 3 papers (10.3%)

3 papers (10.3%) mention MATH.

Examples: Spurious Rewards: Rethinking Training Signals in RLVR , AbstRaL: Augmenting LLMs' Reasoning by Reinforcing Abstract Thinking , Cross-lingual Collapse: How Language-Centric Foundation Models Shape Reasoning in Large Language Models

Benchmark Brief

MATH-500

Coverage: 2 papers (6.9%)

2 papers (6.9%) mention MATH-500.

Examples: $\texttt{SPECS}$: Faster Test-Time Scaling through Speculative Drafts , Spurious Rewards: Rethinking Training Signals in RLVR

Metric Brief

accuracy

Coverage: 7 papers (24.1%)

7 papers (24.1%) mention accuracy.

Examples: $\texttt{SPECS}$: Faster Test-Time Scaling through Speculative Drafts , Towards Robust Real-World Multivariate Time Series Forecasting: A Unified Framework for Dependency, Asynchrony, and Missingness , Structure-Augmented Reasoning Generation

Metric Brief

cost

Coverage: 3 papers (10.3%)

3 papers (10.3%) mention cost.

Examples: From Raw Corpora to Domain Benchmarks: Automated Evaluation of LLM Domain Expertise , Improving Data Efficiency for LLM Reinforcement Fine-tuning Through Difficulty-targeted Online Data Selection and Rollout Replay , "Don't Do That!": Guiding Embodied Systems through Large Language Model-based Constraint Generation

Metric Brief

perplexity

Coverage: 2 papers (6.9%)

2 papers (6.9%) mention perplexity.

Examples: Watermarking Degrades Alignment in Language Models: Analysis and Mitigation , Esoteric Language Models: Bridging Autoregressive and Masked Diffusion LLMs

Most Cited In This Hub

Fast path to methods with the strongest citation traction in this scope.

Papers: $\texttt{SPECS}$: Faster Test-Time Scaling through Speculative Drafts , Persona-driven Simulation of Voting Behavior in the European Parliament with Large Language Models , Spurious Rewards: Rethinking Training Signals in RLVR

Most Recent

Fast path to latest protocol changes and newly published evaluation setups.

Best Protocol Detail

Papers with explicit rater/unit metadata and quality-control signals for reproducibility.

Papers Published On This Date

$\texttt{SPECS}$: Faster Test-Time Scaling through Speculative Drafts
Mert Cemri, Nived Rajaraman, Rishabh Tiwari, Xiaoxuan Liu, Kurt Keutzer · Jun 15, 2025

Scaling test-time compute has driven the recent advances in the reasoning capabilities of large language models (LLMs), typically by allocating additional computation for more thorough exploration.
Persona-driven Simulation of Voting Behavior in the European Parliament with Large Language Models
Maximilian Kreutner, Marlene Lutz, Markus Strohmaier · Jun 13, 2025

Large Language Models (LLMs) display remarkable capabilities to understand or even produce political discourse but have been found to consistently exhibit a progressive left-leaning bias.
Spurious Rewards: Rethinking Training Signals in RLVR
Rulin Shao, Shuyue Stella Li, Rui Xin, Scott Geng, Yiping Wang · Jun 12, 2025

We show that reinforcement learning with verifiable rewards (RLVR) can elicit strong mathematical reasoning in certain language models even with spurious rewards that have little, no, or even negative correlation with the correct answer.
Probabilistic distances-based hallucination detection in LLMs with RAG
Rodion Oblovatny, Alexandra Kuleshova, Konstantin Polev, Alexey Zaytsev · Jun 11, 2025

Detecting hallucinations in large language models (LLMs) is critical for their safety in many applications.
ICE-ID: A Novel Historical Census Dataset for Longitudinal Identity Resolution
Gonçalo Hora de Carvalho, Lazar S. Popov, Sander Kaatee, Mário S. Correia, Kristinn R. Thórisson · Jun 11, 2025

We introduce \textbf{ICE-ID}, a benchmark dataset comprising 984,028 records from 16 Icelandic census waves spanning 220 years (1703--1920), with 226,864 expert-curated person identifiers.
Towards Robust Real-World Multivariate Time Series Forecasting: A Unified Framework for Dependency, Asynchrony, and Missingness
Jinkwan Jang, Hyungjin Park, Jinmyeong Choi, Taesup Kim · Jun 10, 2025

Extensive experiments on public benchmark datasets reflecting practical settings, along with one private real-world industrial dataset, demonstrate the superior robustness and accuracy of ChannelTokenFormer under challenging real-world cond
Structure-Augmented Reasoning Generation
Jash Rajesh Parekh, Pengcheng Jiang, Jiawei Han · Jun 10, 2025

Extensive experiments on open-domain QA benchmarks and specialized reasoning datasets in finance and medicine demonstrate that SARG significantly outperforms state-of-the-art flat-context RAG baselines in both factual accuracy and reasoning
AbstRaL: Augmenting LLMs' Reasoning by Reinforcing Abstract Thinking
Silin Gao, Antoine Bosselut, Samy Bengio, Emmanuel Abbe · Jun 9, 2025

Our method, AbstRaL -- which promotes abstract reasoning in LLMs using RL on granular abstraction data -- significantly mitigates performance degradation on recent GSM perturbation benchmarks.
From Raw Corpora to Domain Benchmarks: Automated Evaluation of LLM Domain Expertise
Nitin Sharma, Thomas Wolfers, Çağatay Yıldız · Jun 9, 2025

Expert Verification

Accurate domain-specific benchmarking of LLMs is essential, specifically in domains with direct implications for humans, such as law, healthcare, and education.
When Style Breaks Safety: Defending LLMs Against Superficial Style Alignment
Yuxin Xiao, Sana Tonekaboni, Walter Gerych, Vinith Suriyakumar, Marzyeh Ghassemi · Jun 9, 2025

Red Team

In this work, we seek to understand whether style patterns compromise LLM safety, how superficial style alignment increases model vulnerability, and how best to mitigate these risks during alignment.
A dependently-typed calculus of event telicity and culminativity
Pavel Kovalev, Carlo Angiuli · Jun 8, 2025

We present a dependently-typed cross-linguistic framework for analyzing the telicity and culminativity of events, accompanied by examples of using our framework to model English sentences.
DesignBench: A Comprehensive Benchmark for MLLM-based Front-end Code Generation
Jingyu Xiao, Man Ho Lam, Ming Wang, Yuxuan Wan, Junliang Liu · Jun 6, 2025

However, existing front-end UI code generation benchmarks have the following limitations: (1) While framework-based development becomes predominant in modern front-end programming, current benchmarks fail to incorporate mainstream developme
Simple Yet Effective: Extracting Private Data Across Clients in Federated Fine-Tuning of Large Language Models
Yingqi Hu, Zhuo Zhang, Jingyuan Zhang, Jinghua Wang, Qifan Wang · Jun 6, 2025

These findings highlight concrete privacy risks in FedLLMs and establish a benchmark and evaluation framework for future research on privacy-preserving federated learning.
Cross-lingual Collapse: How Language-Centric Foundation Models Shape Reasoning in Large Language Models
Cheonbok Park, Jeonghoon Kim, Joosung Lee, Sanghwan Bae, Jaegul Choo · Jun 6, 2025

Reinforcement learning with verifiable reward (RLVR) has been instrumental in eliciting strong reasoning capabilities from large language models (LLMs) via long chains of thought (CoT).
When to use Graphs in RAG: A Comprehensive Analysis for Graph Retrieval-Augmented Generation
Zhishang Xiang, Chuanjie Wu, Qinggang Zhang, Shengyuan Chen, Zijin Hong · Jun 6, 2025

To address this, we propose GraphRAG-Bench, a comprehensive benchmark designed to evaluate GraphRAG models onboth hierarchical knowledge retrieval and deep contextual reasoning.
Voice Impression Control in Zero-Shot TTS
Kenichi Fujita, Shota Horiguchi, Yusuke Ijima · Jun 6, 2025

The results of both objective and subjective evaluations have demonstrated our method's effectiveness in impression control.
Improving Data Efficiency for LLM Reinforcement Fine-tuning Through Difficulty-targeted Online Data Selection and Rollout Replay
Yifan Sun, Jingyan Shen, Yibin Wang, Tianyu Chen, Zhendong Wang · Jun 5, 2025

Reinforcement learning (RL) has become an effective approach for fine-tuning large language models (LLMs), particularly to enhance their reasoning capabilities.
Resisting Contextual Interference in RAG via Parametric-Knowledge Reinforcement
Chenyu Lin, Yilin Wen, Du Su, Hexiang Tan, Fei Sun · Jun 5, 2025

Retrieval-augmented generation (RAG) improves performance on knowledge-intensive tasks but can be derailed by wrong, irrelevant, or conflicting retrieved text, causing models to rely on inaccurate evidence and cascade errors.
Sensory-Motor Control with Large Language Models via Iterative Policy Refinement
Jônata Tyska Carvalho, Stefano Nolfi · Jun 5, 2025

We propose a method that enables large language models (LLMs) to control embodied agents through the generation of control policies that directly map continuous observation vectors to continuous action vectors.
"Don't Do That!": Guiding Embodied Systems through Large Language Model-based Constraint Generation
Amin Seffo, Aladin Djuhera, Masataro Asai, Holger Boche · Jun 4, 2025

Web Browsing

Recent advancements in large language models (LLMs) have spurred interest in robotic navigation that incorporates complex spatial, mathematical, and conditional constraints from natural language into the planning problem.
Watermarking Degrades Alignment in Language Models: Analysis and Mitigation
Apurv Verma, NhatHai Phan, Shubhendu Trivedi · Jun 4, 2025

In practice, sampling as few as two to four candidates largely restores unwatermarked alignment performance in truthfulness, safety, and helpfulness, without hurting watermark detection.
Toward Beginner-Friendly LLMs for Language Learning: Controlling Difficulty in Conversation
Meiqing Jin, Liam Dugan, Chris Callison-Burch · Jun 4, 2025

We further introduce a new token-level evaluation metric, Token Miss Rate (TMR), that quantifies the proportion of incomprehensible tokens per utterance and correlates strongly with human judgments.
HSSBench: Benchmarking Humanities and Social Sciences Ability for Multimodal Large Language Models
Zhaolu Kang, Junhao Gong, Jiaxu Yan, Wanke Xia, Yian Wang · Jun 4, 2025

Expert Verification

However, current benchmarks for evaluating MLLMs primarily emphasize general knowledge and vertical step-by-step reasoning typical of STEM disciplines, while overlooking the distinct needs and potential of the Humanities and Social Sciences
EuroGEST: Investigating gender stereotypes in multilingual language models
Jacqueline Rowe, Mateusz Klimaszewski, Liane Guillou, Shannon Vallor, Alexandra Birch · Jun 4, 2025

Large language models increasingly support multiple languages, yet most benchmarks for gender bias remain English-centric.
Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback
Xiaoying Zhang, Yipeng Zhang, Hao Sun, Kaituo Feng, Chaochao Lu · Jun 3, 2025

Critique Edit

Recent advances in reinforcement learning (RL) using numerical rewards have significantly enhanced the complex reasoning capabilities of large language models (LLMs).
Automated Web Application Testing: End-to-End Test Case Generation with Large Language Models and Screen Transition Graphs
Nguyen-Khang Le, Quan Minh Bui, Minh Ngoc Nguyen, Hiep Nguyen, Trung Vo · Jun 3, 2025

Web Browsing

Web applications are critical to modern software ecosystems, yet ensuring their reliability remains challenging due to the complexity and dynamic nature of web interfaces.
Esoteric Language Models: Bridging Autoregressive and Masked Diffusion LLMs
Subham Sekhar Sahoo, Zhihan Yang, Yash Akhauri, Johnna Liu, Deepansha Singh · Jun 2, 2025

Diffusion-based language models offer a compelling alternative to autoregressive (AR) models by enabling parallel and controllable generation.
iQUEST: An Iterative Question-Guided Framework for Knowledge Base Question Answering
Shuai Wang, Yinan Yu · Jun 2, 2025

Long Horizon

Detailed experiments demonstrate the consistent improvement delivered by iQUEST across four benchmark datasets and four LLMs.
Synthesis of discrete-continuous quantum circuits with multimodal diffusion models
Florian Fürrutter, Zohim Chandani, Ikko Hamamura, Hans J. Briegel, Gorka Muñoz-Gil · Jun 2, 2025

We benchmark the model over different experiments, analyzing the method's accuracy across varying qubit counts and circuit depths, showcasing the ability of the method to outperform existing approaches in gate counts and under noisy conditi

Recent Daily Archives

fortnight-2026-f04 (335) week-2026-w08 (288) quarter-2026-q1 (732) month-2026-02 (682) fortnight-2026-f05 (323) week-2026-w09 (323) 2026-02-23 (56) 2026-02-25 (89) 2026-02-24 (118) month-2026-01 (50) 2026-02-20 (37) week-2026-w07 (47) 2026-02-26 (60) quarter-2025-q4 (124) 2026-02-18 (56) 2026-02-21 (22) 2026-02-17 (54) 2026-02-19 (60) fortnight-2026-f02 (27) 2026-02-22 (23) 2026-02-16 (36) month-2025-10 (69) quarter-2025-q2 (78) fortnight-2026-f01 (16) fortnight-2026-f03 (34) quarter-2025-q3 (90) week-2026-w06 (23) month-2025-12 (30) fortnight-2025-f22 (32) quarter-2025-q1 (35) month-2025-06 (39) month-2025-09 (44) month-2025-11 (25) week-2026-w03 (15) week-2026-w04 (12) 2026-02-13 (8) fortnight-2025-f21 (32) 2026-02-15 (7) fortnight-2025-f20 (34) week-2025-w39 (21)

HFEPX Fortnight Archive: 2025-F12

Research Narrative

Why This Matters For Eval Research

Protocol Takeaways

Benchmark Interpretation

Metric Interpretation

Researcher Checklist

Suggested Reading Order

Known Limitations

Research Utility Links

Papers Published On This Date

Recent Daily Archives