Daily Archive

HFEPX Fortnight Archive: 2025-F11

Updated from current HFEPX corpus (Feb 27, 2026). 18 papers are grouped in this daily page. Common evaluation modes: Automatic Metrics, Simulation Env. Frequently cited benchmark: Retrieval. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from May 30, 2025.

Papers: 18 Last published: May 30, 2025 Global RSS

Research Narrative

Grounded narrative Model: deterministic-grounded Source: persisted

Updated from current HFEPX corpus (Feb 27, 2026). This page tracks 18 papers for HFEPX Fortnight Archive: 2025-F11. Dominant protocol signals include automatic metrics, simulation environments, with frequent benchmark focus on Retrieval, Rtc-Bench and metric focus on accuracy, context length. Use the grounded sections below to prioritize reproducible protocol choices, benchmark-matched comparisons, and judge-vs-human evaluation checks.

Why This Matters For Eval Research

27.8% of papers report explicit human-feedback signals, led by pairwise preferences.

Evidence: Counting trees: A treebank-driven exploration of syntactic variation in speech and writing across languages , Can LLMs Simulate Human Behavioral Variability? A Case Study in the Phonemic Fluency Task , VerifyBench: Benchmarking Reference-based Reward Systems for Large Language Models , When Large Multimodal Models Confront Evolving Knowledge: Challenges and Explorations
automatic metrics appears in 88.9% of papers in this hub.

Evidence: When Large Multimodal Models Confront Evolving Knowledge: Challenges and Explorations , Bayesian Attention Mechanism: A Probabilistic Framework for Positional Encoding and Context Length Extrapolation , Counting trees: A treebank-driven exploration of syntactic variation in speech and writing across languages , PonderLM: Pretraining Language Models to Ponder in Continuous Space
Retrieval is a recurring benchmark anchor for cross-paper comparisons in this page.

Evidence: Bayesian Attention Mechanism: A Probabilistic Framework for Positional Encoding and Context Length Extrapolation , Can LLMs Simulate Human Behavioral Variability? A Case Study in the Phonemic Fluency Task , Entailed Opinion Matters: Improving the Fact-Checking Performance of Language Models by Relying on their Entailment Ability , When Large Multimodal Models Confront Evolving Knowledge: Challenges and Explorations

Protocol Takeaways

Quality-control reporting is sparse in this slice; prioritize papers with explicit calibration or adjudication steps.

Evidence: When Large Multimodal Models Confront Evolving Knowledge: Challenges and Explorations , Bayesian Attention Mechanism: A Probabilistic Framework for Positional Encoding and Context Length Extrapolation , Counting trees: A treebank-driven exploration of syntactic variation in speech and writing across languages , RedTeamCUA: Realistic Adversarial Testing of Computer-Use Agents in Hybrid Web-OS Environments
Stratify by benchmark (Retrieval vs Rtc-Bench) before comparing methods.

Evidence: When Large Multimodal Models Confront Evolving Knowledge: Challenges and Explorations , Bayesian Attention Mechanism: A Probabilistic Framework for Positional Encoding and Context Length Extrapolation , Counting trees: A treebank-driven exploration of syntactic variation in speech and writing across languages , RedTeamCUA: Realistic Adversarial Testing of Computer-Use Agents in Hybrid Web-OS Environments
Track metric sensitivity by reporting both accuracy and context length.

Evidence: When Large Multimodal Models Confront Evolving Knowledge: Challenges and Explorations , Bayesian Attention Mechanism: A Probabilistic Framework for Positional Encoding and Context Length Extrapolation , Counting trees: A treebank-driven exploration of syntactic variation in speech and writing across languages , RedTeamCUA: Realistic Adversarial Testing of Computer-Use Agents in Hybrid Web-OS Environments

Benchmark Interpretation

Retrieval appears in 16.7% of hub papers (3/18); use this cohort for benchmark-matched comparisons.
Rtc-Bench appears in 5.6% of hub papers (1/18); use this cohort for benchmark-matched comparisons.

Metric Interpretation

accuracy is reported in 38.9% of hub papers (7/18); compare with a secondary metric before ranking methods.
context length is reported in 11.1% of hub papers (2/18); compare with a secondary metric before ranking methods.

Researcher Checklist

Tighten coverage on Papers with explicit human feedback. Coverage is usable but incomplete (27.8% vs 45% target).
Close gap on Papers reporting quality controls. Coverage is a replication risk (0% vs 30% target).
Tighten coverage on Papers naming benchmarks/datasets. Coverage is usable but incomplete (27.8% vs 35% target).
Maintain strength on Papers naming evaluation metrics. Coverage is strong (55.6% vs 35% target).
Close gap on Papers with known rater population. Coverage is a replication risk (0% vs 35% target).
Close gap on Papers with known annotation unit. Coverage is a replication risk (0% vs 35% target).

Papers with explicit human feedback

Coverage is usable but incomplete (27.8% vs 45% target).

Papers reporting quality controls

Coverage is a replication risk (0% vs 30% target).

Papers naming benchmarks/datasets

Coverage is usable but incomplete (27.8% vs 35% target).

Papers naming evaluation metrics

Coverage is strong (55.6% vs 35% target).

Papers with known rater population

Coverage is a replication risk (0% vs 35% target).

Papers with known annotation unit

Coverage is a replication risk (0% vs 35% target).

Known Limitations

Only 0% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (0% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Links

Benchmark Slice: Retrieval - Prioritizes benchmark-specific protocol comparisons.
Metric Slice: accuracy - Finds papers where reported metrics are directly comparable.
Recent High-Signal Papers - Keeps the hub connected to the latest HFEPX corpus updates.

automatic_metrics vs simulation_env

both=1, left_only=15, right_only=2

1 papers use both Automatic Metrics and Simulation Env.

Benchmark Brief

Retrieval

Coverage: 3 papers (16.7%)

3 papers (16.7%) mention Retrieval.

Examples: Bayesian Attention Mechanism: A Probabilistic Framework for Positional Encoding and Context Length Extrapolation , Can LLMs Simulate Human Behavioral Variability? A Case Study in the Phonemic Fluency Task , Entailed Opinion Matters: Improving the Fact-Checking Performance of Language Models by Relying on their Entailment Ability

Benchmark Brief

Rtc-Bench

Coverage: 1 papers (5.6%)

1 papers (5.6%) mention Rtc-Bench.

Examples: RedTeamCUA: Realistic Adversarial Testing of Computer-Use Agents in Hybrid Web-OS Environments

Benchmark Brief

Verifybench

Coverage: 1 papers (5.6%)

1 papers (5.6%) mention Verifybench.

Examples: VerifyBench: Benchmarking Reference-based Reward Systems for Large Language Models

Metric Brief

accuracy

Coverage: 7 papers (38.9%)

7 papers (38.9%) mention accuracy.

Examples: Bayesian Attention Mechanism: A Probabilistic Framework for Positional Encoding and Context Length Extrapolation , HoloLLM: Multisensory Foundation Model for Language-Grounded Human Sensing and Reasoning , On the Design of KL-Regularized Policy Gradient Algorithms for LLM Reasoning

Metric Brief

context length

Coverage: 2 papers (11.1%)

2 papers (11.1%) mention context length.

Examples: Bayesian Attention Mechanism: A Probabilistic Framework for Positional Encoding and Context Length Extrapolation , On the Design of KL-Regularized Policy Gradient Algorithms for LLM Reasoning

Metric Brief

jailbreak success rate

Coverage: 1 papers (5.6%)

1 papers (5.6%) mention jailbreak success rate.

Examples: RedTeamCUA: Realistic Adversarial Testing of Computer-Use Agents in Hybrid Web-OS Environments

Most Cited In This Hub

Fast path to methods with the strongest citation traction in this scope.

Papers: When Large Multimodal Models Confront Evolving Knowledge: Challenges and Explorations , Bayesian Attention Mechanism: A Probabilistic Framework for Positional Encoding and Context Length Extrapolation , Counting trees: A treebank-driven exploration of syntactic variation in speech and writing across languages

Most Recent

Fast path to latest protocol changes and newly published evaluation setups.

Best Protocol Detail

Papers with explicit rater/unit metadata and quality-control signals for reproducibility.

Papers Published On This Date

When Large Multimodal Models Confront Evolving Knowledge: Challenges and Explorations
Kailin Jiang, Yuntao Du, Yukai Ding, Yuchen Ren, Ning Jiang · May 30, 2025

To address this, we first propose a pipeline to construct MMEVOKE, a benchmark for evaluating LMMs' ability in multimodal evolving knowledge injection.
Bayesian Attention Mechanism: A Probabilistic Framework for Positional Encoding and Context Length Extrapolation
Arthur S. Bianchessi, Yasmin C. Aguirre, Rodrigo C. Barros, Lucas S. Kupssinskü · May 28, 2025

However, existing PE methods lack theoretical clarity and rely on limited evaluation metrics to substantiate their extrapolation claims.
Counting trees: A treebank-driven exploration of syntactic variation in speech and writing across languages
Kaja Dobrovoljc · May 28, 2025

Pairwise Preference

Results show that, across both languages, spoken corpora contain fewer and less diverse syntactic structures than their written counterparts, with consistent cross-linguistic preferences for certain structural types across modalities.
RedTeamCUA: Realistic Adversarial Testing of Computer-Use Agents in Hybrid Web-OS Environments
Zeyi Liao, Jaylen Jones, Linxi Jiang, Yuting Ning, Eric Fosler-Lussier · May 28, 2025

Red Team Web Browsing

Computer-use agents (CUAs) promise to automate complex tasks across operating systems (OS) and the web, but remain vulnerable to indirect prompt injection.
PonderLM: Pretraining Language Models to Ponder in Continuous Space
Boyi Zeng, Shixiang Song, Siyuan Huang, Yixuan Wang, He Li · May 27, 2025

Humans ponder before articulating complex sentence elements, enabling deeper cognitive processing through focused effort.
FinTagging: Benchmarking LLMs for Extracting and Structuring Financial Information
Yan Wang, Lingfei Qian, Xueqing Peng, Yang Ren, Keyi Wang · May 27, 2025

Existing benchmarks oversimplify this task as flat, single step classification over small subsets of concepts, ignoring the hierarchical semantics of the taxonomy and the structured nature of financial documents.
Knowledge Fusion of Large Language Models Via Modular SkillPacks
Guodong Du, Zhuo Li, Xuanning Zhou, Junlin Li, Zesheng Shi · May 24, 2025

Cross-capability transfer is a key challenge in large language model (LLM) research, with applications in multi-task integration, model compression, and continual learning.
HoloLLM: Multisensory Foundation Model for Language-Grounded Human Sensing and Reasoning
Chuhao Zhou, Jianfei Yang · May 23, 2025

Embodied agents operating in smart homes must understand human behavior through diverse sensory inputs and communicate via natural language.
On the Design of KL-Regularized Policy Gradient Algorithms for LLM Reasoning
Yifan Zhang, Yifeng Liu, Huizhuo Yuan, Yang Yuan, Quanquan Gu · May 23, 2025

On mathematical reasoning benchmarks (AIME24, AIME25), RPG-REINFORCE with RPG-Style Clip improves accuracy by up to $+6$ absolute percentage points over DAPO.
Refusal Direction is Universal Across Safety-Aligned Languages
Xinpeng Wang, Mingyang Wang, Yihong Liu, Hinrich Schütze, Barbara Plank · May 22, 2025

Red Team

Refusal mechanisms in large language models (LLMs) are essential for ensuring safety.
Accidental Vulnerability: Factors in Fine-Tuning that Shift Model Safeguards
Punya Syon Pandey, Samuel Simko, Kellin Pelrine, Zhijing Jin · May 22, 2025

As large language models (LLMs) gain popularity, their vulnerability to adversarial attacks emerges as a primary concern.
Can LLMs Simulate Human Behavioral Variability? A Case Study in the Phonemic Fluency Task
Mengyang Qiu, Zoe Brisebois, Siena Sun · May 22, 2025

Pairwise Preference

Large language models (LLMs) are increasingly explored as substitutes for human participants in cognitive tasks, but their ability to simulate human behavioral variability remains unclear.
VerifyBench: Benchmarking Reference-based Reward Systems for Large Language Models
Yuchen Yan, Jin Jiang, Zhenbang Ren, Yijun Li, Xudong Cai · May 21, 2025

Pairwise Preference

However, existing reward benchmarks focus on preference comparisons between responses rather than evaluating verification against ground truth references, leaving a critical gap in our ability to evaluate verification systems used in reason
Entailed Opinion Matters: Improving the Fact-Checking Performance of Language Models by Relying on their Entailment Ability
Gaurav Kumar, Ayush Garg, Debajyoti Mazumder, Aditya Kishore, Babu kumar · May 21, 2025

Automated fact-checking has been a challenging task for the research community.
Language Models use Lookbacks to Track Beliefs
Nikhil Prakash, Natalie Shapira, Arnab Sen Sharma, Christoph Riedl, Yonatan Belinkov · May 20, 2025

How do language models (LMs) represent characters' beliefs, especially when those beliefs may differ from reality?
Towards Reliable Proof Generation with LLMs: A Neuro-Symbolic Approach
Oren Sultan, Eitan Stern, Dafna Shahaf · May 20, 2025

Large language models (LLMs) struggle with formal domains that require rigorous logical deduction and symbolic reasoning, such as mathematical proof generation.
What if Deception Cannot be Detected? A Cross-Linguistic Study on the Limits of Deception Detection from Text
Aswathy Velutharambath, Kai Sassenberg, Roman Klinger · May 19, 2025

We further benchmark against other English deception datasets following similar data collection protocols.
Complexity counts: global and local perspectives on Indo-Aryan numeral systems
Chundra Cathcart · May 19, 2025

The numeral systems of Indo-Aryan languages such as Hindi, Gujarati, and Bengali are highly unusual in that unlike most numeral systems (e.g., those of English, Chinese, etc.), forms referring to 1--99 are highly non-transparent and are can

Recent Daily Archives

fortnight-2026-f04 (335) week-2026-w08 (288) quarter-2026-q1 (732) month-2026-02 (682) fortnight-2026-f05 (323) week-2026-w09 (323) 2026-02-23 (56) 2026-02-25 (89) 2026-02-24 (118) month-2026-01 (50) 2026-02-20 (37) week-2026-w07 (47) 2026-02-26 (60) quarter-2025-q4 (124) 2026-02-18 (56) 2026-02-21 (22) 2026-02-17 (54) 2026-02-19 (60) fortnight-2026-f02 (27) 2026-02-22 (23) 2026-02-16 (36) month-2025-10 (69) quarter-2025-q2 (78) fortnight-2026-f01 (16) fortnight-2026-f03 (34) quarter-2025-q3 (90) week-2026-w06 (23) month-2025-12 (30) fortnight-2025-f22 (32) quarter-2025-q1 (35) month-2025-06 (39) month-2025-09 (44) month-2025-11 (25) week-2026-w03 (15) week-2026-w04 (12) 2026-02-13 (8) fortnight-2025-f21 (32) 2026-02-15 (7) fortnight-2025-f20 (34) fortnight-2025-f12 (29)

HFEPX Fortnight Archive: 2025-F11

Research Narrative

Why This Matters For Eval Research

Protocol Takeaways

Benchmark Interpretation

Metric Interpretation

Researcher Checklist

Suggested Reading Order

Known Limitations

Research Utility Links

Papers Published On This Date

Recent Daily Archives