HFEPX Weekly Archive: 2025-W23

Updated from current HFEPX corpus (Feb 27, 2026). 19 papers are grouped in this weekly page. Common evaluation modes: Automatic Metrics, Simulation Env. Most common rater population: Domain Experts. Common annotation unit: Trajectory. Frequently cited benchmark: Retrieval. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. The newest paper in this set is from Jun 8, 2025.

Papers: 19 · Last published: Jun 8, 2025

Research Narrative

Updated from current HFEPX corpus (Feb 27, 2026). This page tracks 19 papers for HFEPX Weekly Archive: 2025-W23. Dominant protocol signals include automatic metrics, simulation environments, and human evaluation, with frequent benchmark focus on Retrieval and AIME and metric focus on accuracy and cost. Use the grounded sections below to prioritize reproducible protocol choices, benchmark-matched comparisons, and judge-vs-human evaluation checks.

Why This Matters For Eval Research

10.5% of papers report explicit human-feedback signals, led by critique/edit feedback.

Evidence: Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback, A dependently-typed calculus of event telicity and culminativity, DesignBench: A Comprehensive Benchmark for MLLM-based Front-end Code Generation, Simple Yet Effective: Extracting Private Data Across Clients in Federated Fine-Tuning of Large Language Models

Automatic metrics appear in 89.5% of papers in this hub.

Evidence: A dependently-typed calculus of event telicity and culminativity, DesignBench: A Comprehensive Benchmark for MLLM-based Front-end Code Generation, Simple Yet Effective: Extracting Private Data Across Clients in Federated Fine-Tuning of Large Language Models, Cross-lingual Collapse: How Language-Centric Foundation Models Shape Reasoning in Large Language Models

Retrieval is a recurring benchmark anchor for cross-paper comparisons on this page.

Evidence: Resisting Contextual Interference in RAG via Parametric-Knowledge Reinforcement, A dependently-typed calculus of event telicity and culminativity, DesignBench: A Comprehensive Benchmark for MLLM-based Front-end Code Generation, Simple Yet Effective: Extracting Private Data Across Clients in Federated Fine-Tuning of Large Language Models

Protocol Takeaways

Quality-control reporting is sparse in this slice; prioritize papers with explicit calibration or adjudication steps.

Evidence: A dependently-typed calculus of event telicity and culminativity, DesignBench: A Comprehensive Benchmark for MLLM-based Front-end Code Generation, Simple Yet Effective: Extracting Private Data Across Clients in Federated Fine-Tuning of Large Language Models, Cross-lingual Collapse: How Language-Centric Foundation Models Shape Reasoning in Large Language Models

Raters are mostly domain experts, and annotation is commonly trajectory-level; use this to scope replication staffing.

Evidence: HSSBench: Benchmarking Humanities and Social Sciences Ability for Multimodal Large Language Models, EuroGEST: Investigating gender stereotypes in multilingual language models, A dependently-typed calculus of event telicity and culminativity, DesignBench: A Comprehensive Benchmark for MLLM-based Front-end Code Generation

Pair this hub with llm_as_judge pages to benchmark automated-vs-human evaluation tradeoffs.

Evidence: EuroGEST: Investigating gender stereotypes in multilingual language models, A dependently-typed calculus of event telicity and culminativity, DesignBench: A Comprehensive Benchmark for MLLM-based Front-end Code Generation, Simple Yet Effective: Extracting Private Data Across Clients in Federated Fine-Tuning of Large Language Models

Benchmark Interpretation

Retrieval appears in 10.5% of hub papers (2/19); use this cohort for benchmark-matched comparisons.
AIME appears in 5.3% of hub papers (1/19); use this cohort for benchmark-matched comparisons.

Metric Interpretation

accuracy is reported in 21.1% of hub papers (4/19); compare with a secondary metric before ranking methods.
cost is reported in 10.5% of hub papers (2/19); compare with a secondary metric before ranking methods.
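
The percentages quoted in the benchmark and metric interpretation lines follow from paper counts over the 19-paper hub. A minimal Python sketch of that arithmetic, assuming one-decimal rounding; the helper name `coverage_pct` and the `mentions` dictionary are hypothetical and not part of HFEPX:

```python
def coverage_pct(count: int, total: int) -> float:
    """Return count/total as a percentage rounded to one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100 * count / total, 1)


TOTAL_PAPERS = 19  # hub size stated on this page

# Paper counts taken from the interpretation lines above.
mentions = {"Retrieval": 2, "AIME": 1, "accuracy": 4, "cost": 2}

coverage = {name: coverage_pct(n, TOTAL_PAPERS) for name, n in mentions.items()}

for name, pct in coverage.items():
    print(f"{name}: {pct}% of hub papers ({mentions[name]}/{TOTAL_PAPERS})")
```

With only 19 papers, each additional paper moves a share by roughly 5 percentage points, so small differences between cohorts should be read cautiously.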

Researcher Checklist

Close gap on Papers with explicit human feedback. Coverage is a replication risk (10.5% vs 45% target).
Close gap on Papers reporting quality controls. Coverage is a replication risk (0% vs 30% target).
Tighten coverage on Papers naming benchmarks/datasets. Coverage is usable but incomplete (31.6% vs 35% target).
Maintain strength on Papers naming evaluation metrics. Coverage is strong (47.4% vs 35% target).
Close gap on Papers with known rater population. Coverage is a replication risk (10.5% vs 35% target).
Close gap on Papers with known annotation unit. Coverage is a replication risk (10.5% vs 35% target).
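
The checklist labels compare observed coverage with a target. The page does not state the cutoffs behind each label, so the sketch below is only one rule that happens to reproduce the six labels above: the function name `coverage_status` and the `risk_ratio` parameter (set to half the target) are assumptions, not the HFEPX implementation.

```python
def coverage_status(coverage: float, target: float, risk_ratio: float = 0.5) -> str:
    """Label coverage relative to a target (both in percent).

    risk_ratio is a hypothetical cutoff: coverage below
    risk_ratio * target is labeled a replication risk.
    """
    if coverage >= target:
        return "strong"
    if coverage >= risk_ratio * target:
        return "usable but incomplete"
    return "replication risk"


# (coverage %, target %) pairs copied from the checklist above.
checklist = {
    "explicit human feedback": (10.5, 45),
    "quality controls": (0.0, 30),
    "benchmarks/datasets": (31.6, 35),
    "evaluation metrics": (47.4, 35),
    "known rater population": (10.5, 35),
    "known annotation unit": (10.5, 35),
}

statuses = {item: coverage_status(c, t) for item, (c, t) in checklist.items()}

for item, status in statuses.items():
    print(f"{item}: {status}")
```

Any `risk_ratio` between roughly 0.3 and 0.9 would give the same labels for these six rows, so the value 0.5 should not be read as the actual threshold.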

Known Limitations

No papers report quality controls (0% coverage); prioritize calibration/adjudication evidence.
Rater population is under-specified (10.5% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Links

Human Eval Protocols - Surfaces human-rating workflows for rubric and annotator quality analysis.
Benchmark Slice: Retrieval - Prioritizes benchmark-specific protocol comparisons.
Metric Slice: accuracy - Finds papers where reported metrics are directly comparable.
Recent High-Signal Papers - Keeps the hub connected to the latest HFEPX corpus updates.

human_eval vs automatic_metrics

both=1, left_only=0, right_only=16

1 paper uses both Human Eval and Automatic Metrics.

automatic_metrics vs simulation_env

both=1, left_only=16, right_only=2

1 paper uses both Automatic Metrics and Simulation Env.

simulation_env vs human_eval

both=0, left_only=3, right_only=1

0 papers use both Simulation Env and Human Eval.
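
The both/left_only/right_only triples above are ordinary set-overlap counts, and they imply 17 papers using automatic metrics, 3 using simulation environments, and 1 using human evaluation. A minimal Python sketch, assuming each evaluation mode is stored as a set of paper IDs; the IDs `p01` to `p19` are placeholders chosen only to match those sizes and do not identify which papers fall in which set:

```python
def overlap_counts(left: set, right: set) -> dict:
    """Count items in both sets, only in the left set, and only in the right set."""
    return {
        "both": len(left & right),
        "left_only": len(left - right),
        "right_only": len(right - left),
    }


# Placeholder paper IDs (hypothetical), sized to match the counts on this page.
automatic_metrics = {f"p{i:02d}" for i in range(1, 18)}  # 17 papers
human_eval = {"p01"}                                      # 1 paper
simulation_env = {"p17", "p18", "p19"}                    # 3 papers

human_vs_auto = overlap_counts(human_eval, automatic_metrics)
auto_vs_sim = overlap_counts(automatic_metrics, simulation_env)
sim_vs_human = overlap_counts(simulation_env, human_eval)

print("human_eval vs automatic_metrics:", human_vs_auto)
print("automatic_metrics vs simulation_env:", auto_vs_sim)
print("simulation_env vs human_eval:", sim_vs_human)
```

The union of the three placeholder sets has 19 members, matching the hub size.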

Benchmark Brief

Retrieval

Coverage: 2 papers (10.5%)

Examples: When to use Graphs in RAG: A Comprehensive Analysis for Graph Retrieval-Augmented Generation, Resisting Contextual Interference in RAG via Parametric-Knowledge Reinforcement

AIME

Coverage: 1 paper (5.3%)

Examples: Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback

DesignBench

Coverage: 1 paper (5.3%)

Examples: DesignBench: A Comprehensive Benchmark for MLLM-based Front-end Code Generation

Metric Brief

accuracy

Coverage: 4 papers (21.1%)

Examples: Cross-lingual Collapse: How Language-Centric Foundation Models Shape Reasoning in Large Language Models, Resisting Contextual Interference in RAG via Parametric-Knowledge Reinforcement, EuroGEST: Investigating gender stereotypes in multilingual language models

cost

Coverage: 2 papers (10.5%)

Examples: Improving Data Efficiency for LLM Reinforcement Fine-tuning Through Difficulty-targeted Online Data Selection and Rollout Replay, "Don't Do That!": Guiding Embodied Systems through Large Language Model-based Constraint Generation

perplexity

Coverage: 2 papers (10.5%)

Examples: Watermarking Degrades Alignment in Language Models: Analysis and Mitigation, Esoteric Language Models: Bridging Autoregressive and Masked Diffusion LLMs

Most Cited In This Hub

Fast path to methods with the strongest citation traction in this scope.

Papers: A dependently-typed calculus of event telicity and culminativity, DesignBench: A Comprehensive Benchmark for MLLM-based Front-end Code Generation, Simple Yet Effective: Extracting Private Data Across Clients in Federated Fine-Tuning of Large Language Models

Most Recent

Fast path to latest protocol changes and newly published evaluation setups.

Best Protocol Detail

Papers with explicit rater/unit metadata and quality-control signals for reproducibility.

Papers Published This Week

A dependently-typed calculus of event telicity and culminativity
Pavel Kovalev, Carlo Angiuli · Jun 8, 2025

We present a dependently-typed cross-linguistic framework for analyzing the telicity and culminativity of events, accompanied by examples of using our framework to model English sentences.

DesignBench: A Comprehensive Benchmark for MLLM-based Front-end Code Generation
Jingyu Xiao, Man Ho Lam, Ming Wang, Yuxuan Wan, Junliang Liu · Jun 6, 2025

However, existing front-end UI code generation benchmarks have the following limitations: (1) While framework-based development becomes predominant in modern front-end programming, current benchmarks fail to incorporate mainstream developme

Simple Yet Effective: Extracting Private Data Across Clients in Federated Fine-Tuning of Large Language Models
Yingqi Hu, Zhuo Zhang, Jingyuan Zhang, Jinghua Wang, Qifan Wang · Jun 6, 2025

These findings highlight concrete privacy risks in FedLLMs and establish a benchmark and evaluation framework for future research on privacy-preserving federated learning.

Cross-lingual Collapse: How Language-Centric Foundation Models Shape Reasoning in Large Language Models
Cheonbok Park, Jeonghoon Kim, Joosung Lee, Sanghwan Bae, Jaegul Choo · Jun 6, 2025

Reinforcement learning with verifiable reward (RLVR) has been instrumental in eliciting strong reasoning capabilities from large language models (LLMs) via long chains of thought (CoT).

When to use Graphs in RAG: A Comprehensive Analysis for Graph Retrieval-Augmented Generation
Zhishang Xiang, Chuanjie Wu, Qinggang Zhang, Shengyuan Chen, Zijin Hong · Jun 6, 2025

To address this, we propose GraphRAG-Bench, a comprehensive benchmark designed to evaluate GraphRAG models on both hierarchical knowledge retrieval and deep contextual reasoning.

Voice Impression Control in Zero-Shot TTS
Kenichi Fujita, Shota Horiguchi, Yusuke Ijima · Jun 6, 2025

The results of both objective and subjective evaluations have demonstrated our method's effectiveness in impression control.

Improving Data Efficiency for LLM Reinforcement Fine-tuning Through Difficulty-targeted Online Data Selection and Rollout Replay
Yifan Sun, Jingyan Shen, Yibin Wang, Tianyu Chen, Zhendong Wang · Jun 5, 2025

Reinforcement learning (RL) has become an effective approach for fine-tuning large language models (LLMs), particularly to enhance their reasoning capabilities.

Resisting Contextual Interference in RAG via Parametric-Knowledge Reinforcement
Chenyu Lin, Yilin Wen, Du Su, Hexiang Tan, Fei Sun · Jun 5, 2025

Retrieval-augmented generation (RAG) improves performance on knowledge-intensive tasks but can be derailed by wrong, irrelevant, or conflicting retrieved text, causing models to rely on inaccurate evidence and cascade errors.

Sensory-Motor Control with Large Language Models via Iterative Policy Refinement
Jônata Tyska Carvalho, Stefano Nolfi · Jun 5, 2025

We propose a method that enables large language models (LLMs) to control embodied agents through the generation of control policies that directly map continuous observation vectors to continuous action vectors.

"Don't Do That!": Guiding Embodied Systems through Large Language Model-based Constraint Generation
Amin Seffo, Aladin Djuhera, Masataro Asai, Holger Boche · Jun 4, 2025

Web Browsing

Recent advancements in large language models (LLMs) have spurred interest in robotic navigation that incorporates complex spatial, mathematical, and conditional constraints from natural language into the planning problem.

Watermarking Degrades Alignment in Language Models: Analysis and Mitigation
Apurv Verma, NhatHai Phan, Shubhendu Trivedi · Jun 4, 2025

In practice, sampling as few as two to four candidates largely restores unwatermarked alignment performance in truthfulness, safety, and helpfulness, without hurting watermark detection.

Toward Beginner-Friendly LLMs for Language Learning: Controlling Difficulty in Conversation
Meiqing Jin, Liam Dugan, Chris Callison-Burch · Jun 4, 2025

We further introduce a new token-level evaluation metric, Token Miss Rate (TMR), that quantifies the proportion of incomprehensible tokens per utterance and correlates strongly with human judgments.

HSSBench: Benchmarking Humanities and Social Sciences Ability for Multimodal Large Language Models
Zhaolu Kang, Junhao Gong, Jiaxu Yan, Wanke Xia, Yian Wang · Jun 4, 2025

Expert Verification

However, current benchmarks for evaluating MLLMs primarily emphasize general knowledge and vertical step-by-step reasoning typical of STEM disciplines, while overlooking the distinct needs and potential of the Humanities and Social Sciences

EuroGEST: Investigating gender stereotypes in multilingual language models
Jacqueline Rowe, Mateusz Klimaszewski, Liane Guillou, Shannon Vallor, Alexandra Birch · Jun 4, 2025

Large language models increasingly support multiple languages, yet most benchmarks for gender bias remain English-centric.

Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback
Xiaoying Zhang, Yipeng Zhang, Hao Sun, Kaituo Feng, Chaochao Lu · Jun 3, 2025

Critique Edit

Recent advances in reinforcement learning (RL) using numerical rewards have significantly enhanced the complex reasoning capabilities of large language models (LLMs).

Automated Web Application Testing: End-to-End Test Case Generation with Large Language Models and Screen Transition Graphs
Nguyen-Khang Le, Quan Minh Bui, Minh Ngoc Nguyen, Hiep Nguyen, Trung Vo · Jun 3, 2025

Web Browsing

Web applications are critical to modern software ecosystems, yet ensuring their reliability remains challenging due to the complexity and dynamic nature of web interfaces.

Esoteric Language Models: Bridging Autoregressive and Masked Diffusion LLMs
Subham Sekhar Sahoo, Zhihan Yang, Yash Akhauri, Johnna Liu, Deepansha Singh · Jun 2, 2025

Diffusion-based language models offer a compelling alternative to autoregressive (AR) models by enabling parallel and controllable generation.

iQUEST: An Iterative Question-Guided Framework for Knowledge Base Question Answering
Shuai Wang, Yinan Yu · Jun 2, 2025

Long Horizon

Detailed experiments demonstrate the consistent improvement delivered by iQUEST across four benchmark datasets and four LLMs.

Synthesis of discrete-continuous quantum circuits with multimodal diffusion models
Florian Fürrutter, Zohim Chandani, Ikko Hamamura, Hans J. Briegel, Gorka Muñoz-Gil · Jun 2, 2025

We benchmark the model over different experiments, analyzing the method's accuracy across varying qubit counts and circuit depths, showcasing the ability of the method to outperform existing approaches in gate counts and under noisy conditi
