HFEPX Weekly Archive: 2025-W44

Updated from current HFEPX corpus (Feb 27, 2026). 18 papers are grouped in this weekly page. Common evaluation modes: Automatic Metrics, Simulation Env. Most common rater population: Domain Experts. Common annotation unit: Pairwise. Frequent quality control: Inter Annotator Agreement Reported. Frequently cited benchmark: afri-semeval. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. The newest paper in this set is from Oct 31, 2025.

Papers: 18 · Last published: Oct 31, 2025

Research Narrative

Grounded narrative · Model: deterministic-grounded · Source: persisted

Updated from current HFEPX corpus (Feb 27, 2026). This page tracks 18 papers for HFEPX Weekly Archive: 2025-W44. Dominant protocol signals include automatic metrics, simulation environments, and human evaluation, with frequent benchmark focus on afri-semeval and APPS and metric focus on accuracy and agreement. Use the grounded sections below to prioritize reproducible protocol choices, benchmark-matched comparisons, and judge-vs-human evaluation checks.

Why This Matters For Eval Research

22.2% of papers report explicit human-feedback signals, led by pairwise preferences.

Evidence: BEAT: Visual Backdoor Attacks on VLM-based Embodied Agents via Contrastive Trigger Learning; Incentivizing Agentic Reasoning in LLM Judges via Tool-Integrated Reinforcement Learning; When Distributions Shifts: Causal Generalization for Low-Resource Languages; Beyond a Million Tokens: Benchmarking and Enhancing Long-Term Memory in LLMs

Automatic metrics appear in 83.3% of papers in this hub.

Evidence: BEAT: Visual Backdoor Attacks on VLM-based Embodied Agents via Contrastive Trigger Learning; When Distributions Shifts: Causal Generalization for Low-Resource Languages; Beyond a Million Tokens: Benchmarking and Enhancing Long-Term Memory in LLMs; Probability Distributions Computed by Autoregressive Transformers

afri-semeval is the listed benchmark anchor for cross-paper comparisons on this page, although it appears in only 1 of 18 papers.

Evidence: When Distributions Shifts: Causal Generalization for Low-Resource Languages; BEAT: Visual Backdoor Attacks on VLM-based Embodied Agents via Contrastive Trigger Learning; Beyond a Million Tokens: Benchmarking and Enhancing Long-Term Memory in LLMs; Probability Distributions Computed by Autoregressive Transformers

Protocol Takeaways

The most common quality-control signal is inter-annotator agreement reporting (5.6% of papers).

Evidence: Beyond Understanding: Evaluating the Pragmatic Gap in LLMs' Cultural Processing of Figurative Language; BEAT: Visual Backdoor Attacks on VLM-based Embodied Agents via Contrastive Trigger Learning; When Distributions Shifts: Causal Generalization for Low-Resource Languages; Beyond a Million Tokens: Benchmarking and Enhancing Long-Term Memory in LLMs

Where rater context is reported, it is mostly domain experts, and the most common annotation unit is pairwise; use this to scope replication staffing.

Evidence: Supervised Reinforcement Learning: From Expert Trajectories to Step-wise Reasoning; BEAT: Visual Backdoor Attacks on VLM-based Embodied Agents via Contrastive Trigger Learning; When Distributions Shifts: Causal Generalization for Low-Resource Languages; Beyond a Million Tokens: Benchmarking and Enhancing Long-Term Memory in LLMs

Pair this hub with llm_as_judge pages to benchmark automated-vs-human evaluation tradeoffs.

Evidence: Beyond Understanding: Evaluating the Pragmatic Gap in LLMs' Cultural Processing of Figurative Language; Incentivizing Agentic Reasoning in LLM Judges via Tool-Integrated Reinforcement Learning; BEAT: Visual Backdoor Attacks on VLM-based Embodied Agents via Contrastive Trigger Learning; When Distributions Shifts: Causal Generalization for Low-Resource Languages

Benchmark Interpretation

afri-semeval appears in 5.6% of hub papers (1/18); use this cohort for benchmark-matched comparisons.
APPS appears in 5.6% of hub papers (1/18); use this cohort for benchmark-matched comparisons.

Metric Interpretation

accuracy is reported in 22.2% of hub papers (4/18); compare with a secondary metric before ranking methods.
agreement is reported in 11.1% of hub papers (2/18); compare with a secondary metric before ranking methods.
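
The percentages in the benchmark and metric interpretation lines follow from paper counts over the 18-paper hub. A minimal sketch of that arithmetic, assuming one-decimal rounding; the helper name share and the dictionary layout are hypothetical and not part of HFEPX:

```python
def share(count: int, total: int) -> float:
    """Percentage of `total` represented by `count`, rounded to one decimal."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100 * count / total, 1)


HUB_PAPERS = 18  # papers tracked on this page

# Mention counts as stated on this page (benchmarks and metrics).
mentions = {"afri-semeval": 1, "APPS": 1, "accuracy": 4, "agreement": 2}

shares = {name: share(n, HUB_PAPERS) for name, n in mentions.items()}

for name, pct in shares.items():
    print(f"{name}: {pct}% ({mentions[name]}/{HUB_PAPERS})")
```

With cohorts this small (one to four papers), a single added or removed paper moves the share by about 5.6 points, which is one reason to read these figures alongside the raw counts.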

Researcher Checklist

Close gap on Papers with explicit human feedback. Coverage is a replication risk (22.2% vs 45% target).
Close gap on Papers reporting quality controls. Coverage is a replication risk (5.6% vs 30% target).
Close gap on Papers naming benchmarks/datasets. Coverage is a replication risk (16.7% vs 35% target).
Maintain strength on Papers naming evaluation metrics. Coverage is strong (50% vs 35% target).
Close gap on Papers with known rater population. Coverage is a replication risk (5.6% vs 35% target).
Close gap on Papers with known annotation unit. Coverage is a replication risk (11.1% vs 35% target).
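
The checklist compares an observed coverage percentage against a target for each item. The sketch below ranks the items by shortfall. It assumes, without confirmation from the source, that an item is labeled "strong" when coverage meets or exceeds the target and "replication risk" otherwise; the Coverage class and the shortened labels are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Coverage:
    label: str
    observed_pct: float
    target_pct: float

    @property
    def gap(self) -> float:
        """Percentage points still missing to reach the target (negative if exceeded)."""
        return round(self.target_pct - self.observed_pct, 1)

    @property
    def status(self) -> str:
        # Assumed rule: meeting the target counts as strong coverage.
        return "strong" if self.observed_pct >= self.target_pct else "replication risk"


# Observed and target percentages as listed in the checklist.
CHECKLIST = [
    Coverage("explicit human feedback", 22.2, 45.0),
    Coverage("quality controls reported", 5.6, 30.0),
    Coverage("benchmarks/datasets named", 16.7, 35.0),
    Coverage("evaluation metrics named", 50.0, 35.0),
    Coverage("known rater population", 5.6, 35.0),
    Coverage("known annotation unit", 11.1, 35.0),
]

# Largest shortfall first, so the biggest gaps surface at the top.
ranked = sorted(CHECKLIST, key=lambda c: c.gap, reverse=True)

for item in ranked:
    print(f"{item.label}: {item.status}, gap {item.gap:+.1f} points")
```

Under that assumed rule, rater population has the largest shortfall (29.4 points) and evaluation metrics is the only item above target.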

Known Limitations

Only 5.6% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (5.6% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Links

Human Eval Protocols - Surfaces human-rating workflows for rubric and annotator quality analysis.
Benchmark Slice: afri-semeval - Prioritizes benchmark-specific protocol comparisons.
Metric Slice: accuracy - Finds papers where reported metrics are directly comparable.
IAA-Reported Evaluations - Highlights evaluations that explicitly report inter-annotator agreement.
Recent High-Signal Papers - Keeps the hub connected to the latest HFEPX corpus updates.

human_eval vs automatic_metrics

both=1, left_only=1, right_only=14

1 paper uses both Human Eval and Automatic Metrics.

automatic_metrics vs simulation_env

both=1, left_only=14, right_only=2

1 paper uses both Automatic Metrics and Simulation Env.

simulation_env vs human_eval

both=0, left_only=3, right_only=2

0 papers use both Simulation Env and Human Eval.
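
The both / left_only / right_only counts in the three comparisons above are plain set operations over the papers tagged with each evaluation mode. The sketch below reproduces them with placeholder paper IDs (p01 to p18). The IDs and the assignment of particular papers to modes are hypothetical; they are chosen only so the totals match the page (2 human-eval, 15 automatic-metrics, and 3 simulation-env papers).

```python
def overlap(left: set[str], right: set[str]) -> dict[str, int]:
    """Count papers tagged with both modes, only the left mode, or only the right mode."""
    return {
        "both": len(left & right),
        "left_only": len(left - right),
        "right_only": len(right - left),
    }


# Placeholder IDs; the real paper-to-mode mapping is not given on this page.
ids = [f"p{i:02d}" for i in range(1, 19)]

human_eval = {"p01", "p02"}
automatic_metrics = {"p01"} | set(ids[2:16])  # p01 plus p03..p16, 15 papers
simulation_env = {"p16", "p17", "p18"}

print("human_eval vs automatic_metrics:", overlap(human_eval, automatic_metrics))
print("automatic_metrics vs simulation_env:", overlap(automatic_metrics, simulation_env))
print("simulation_env vs human_eval:", overlap(simulation_env, human_eval))

# Inclusion-exclusion check: 2 + 15 + 3 - 1 - 1 - 0 = 18.
tagged = human_eval | automatic_metrics | simulation_env
print("papers with at least one of the three modes:", len(tagged))
```

Because the simulation_env and human_eval sets do not overlap, no paper can carry all three tags, so the inclusion-exclusion sum of 18 means the reported counts are consistent with every paper in the hub having at least one of these three modes.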

Benchmark Brief

afri-semeval

Coverage: 1 paper (5.6%)

1 paper (5.6%) mentions afri-semeval.

Examples: When Distributions Shifts: Causal Generalization for Low-Resource Languages

Benchmark Brief

APPS

Coverage: 1 paper (5.6%)

1 paper (5.6%) mentions APPS.

Examples: The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution

Benchmark Brief

Retrieval

Coverage: 1 paper (5.6%)

1 paper (5.6%) mentions Retrieval.

Examples: Beyond a Million Tokens: Benchmarking and Enhancing Long-Term Memory in LLMs

Metric Brief

accuracy

Coverage: 4 papers (22.2%)

4 papers (22.2%) mention accuracy.

Examples: BEAT: Visual Backdoor Attacks on VLM-based Embodied Agents via Contrastive Trigger Learning; From Medical Records to Diagnostic Dialogues: A Clinical-Grounded Approach and Dataset for Psychiatric Comorbidity; Repurposing Synthetic Data for Fine-grained Search Agent Supervision

Metric Brief

agreement

Coverage: 2 papers (11.1%)

2 papers (11.1%) mention agreement.

Examples: Through the Judge's Eyes: Inferred Thinking Traces Improve Reliability of LLM Raters; Beyond Understanding: Evaluating the Pragmatic Gap in LLMs' Cultural Processing of Figurative Language

Metric Brief

success rate

Coverage: 2 papers (11.1%)

2 papers (11.1%) mention success rate.

Examples: Reasoning Up the Instruction Ladder for Controllable Language Models; The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution

Most Cited In This Hub

Fast path to methods with the strongest citation traction in this scope.

Papers: BEAT: Visual Backdoor Attacks on VLM-based Embodied Agents via Contrastive Trigger Learning; When Distributions Shifts: Causal Generalization for Low-Resource Languages; Beyond a Million Tokens: Benchmarking and Enhancing Long-Term Memory in LLMs

Most Recent

Fast path to latest protocol changes and newly published evaluation setups.

Best Protocol Detail

Papers with explicit rater/unit metadata and quality-control signals for reproducibility.

Papers Published This Week

BEAT: Visual Backdoor Attacks on VLM-based Embodied Agents via Contrastive Trigger Learning
Qiusi Zhan, Hyeonjeong Ha, Rui Yang, Sirui Xu, Hanyang Chen · Oct 31, 2025

Pairwise Preference · Long Horizon

Recent advances in Vision-Language Models (VLMs) have propelled embodied agents by enabling direct perception, reasoning, and planning task-oriented actions from visual inputs.

When Distributions Shifts: Causal Generalization for Low-Resource Languages
Mahi Aliyu Aminu, Chisom Chibuike, Fatimo Adebanjo, Omokolade Awosanya, Samuel Oyeneye · Oct 31, 2025

Machine learning models often fail under distribution shifts, a problem exacerbated in low-resource settings where limited data restricts robust generalization.

Beyond a Million Tokens: Benchmarking and Enhancing Long-Term Memory in LLMs
Mohammad Tavakoli, Alireza Salemi, Carrie Ye, Mohamed Abdalla, Hamed Zamani · Oct 31, 2025

Evaluating the abilities of large language models (LLMs) for tasks that require long-term memory and thus long-context reasoning, for example in conversational settings, is hampered by the existing benchmarks, which often lack narrative coh

Probability Distributions Computed by Autoregressive Transformers
Andy Yang, Anej Svete, Jiaoda Li, Anthony Widjaja Lin, Jonathan Rawski · Oct 31, 2025

Most expressivity results for transformers treat them as language recognizers (which accept or reject strings), and not as they are used in practice, as language models (which generate strings autoregressively and probabilistically).

Reasoning Up the Instruction Ladder for Controllable Language Models
Zishuo Zheng, Vidhisha Balachandran, Chan Young Park, Faeze Brahman, Sachin Kumar · Oct 30, 2025

Red Team

Our finetuned models achieve consistent improvements on instruction following and instruction hierarchy benchmarks, achieving roughly a 20% improvement on the IHEval conflict setup.

Temporal Sparse Autoencoders: Leveraging the Sequential Nature of Language for Interpretability
Usha Bhalla, Alex Oesterling, Claudio Mayrink Verdun, Himabindu Lakkaraju, Flavio P. Calmon · Oct 30, 2025

Translating the internal representations and computations of models into concepts that humans can understand is a key goal of interpretability.

LLMs Process Lists With General Filter Heads
Arnab Sen Sharma, Giordano Rogers, Natalie Shapira, David Bau · Oct 30, 2025

Our results reveal that transformer LMs can develop human-interpretable implementations of abstract computational operations that generalize in ways that are surprisingly similar to strategies used in traditional functional programming patt

Inference-Cost-Aware Dynamic Tree Construction for Efficient Inference in Large Language Models
Yinrong Hong, Zhiquan Tan, Kai Hu · Oct 30, 2025

Large Language Models (LLMs) face significant inference latency challenges stemming from their autoregressive design and large size.

Supervised Reinforcement Learning: From Expert Trajectories to Step-wise Reasoning
Yihe Deng, I-Hung Hsu, Jun Yan, Zifeng Wang, Rujun Han · Oct 29, 2025

Demonstrations · Long Horizon

Beyond reasoning benchmarks, SRL generalizes effectively to agentic software engineering tasks, establishing it as a robust and versatile training framework for reasoning-oriented LLMs.

Through the Judge's Eyes: Inferred Thinking Traces Improve Reliability of LLM Raters
Xingjian Zhang, Tianhong Gao, Suliang Jin, Tianhao Wang, Teng Ye · Oct 29, 2025

Large language models (LLMs) are increasingly used as raters for evaluation tasks.

The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution
Junlong Li, Wenshuo Zhao, Jian Zhao, Weihao Zeng, Haoze Wu · Oct 29, 2025

Long Horizon

Real-world language agents must handle complex, multi-step workflows across diverse Apps.

From Medical Records to Diagnostic Dialogues: A Clinical-Grounded Approach and Dataset for Psychiatric Comorbidity
Tianxi Wan, Jiaming Luo, Siyuan Chen, Kunyao Lan, Jianhua Chen · Oct 29, 2025

Multi Agent

To address this, we develop a novel approach integrating synthetic patient electronic medical record (EMR) construction and multi-agent diagnostic dialogue generation.

World Simulation with Video Foundation Models for Physical AI
NVIDIA: Arslan Ali, Junjie Bai, Maciej Bala · Oct 28, 2025

Long Horizon

These capabilities enable more reliable synthetic data generation, policy evaluation, and closed-loop simulation for robotics and autonomous systems.

Do Large Language Models Grasp The Grammar? Evidence from Grammar-Book-Guided Probing in Luxembourgish
Lujun Li, Yewei Song, Lama Sleem, Yiqun Wang, Yangjie Xu · Oct 28, 2025

In natural language processing, there remains a notable scarcity of grammar focused evaluation protocols, a gap that is even more pronounced for low-resource languages.

Repurposing Synthetic Data for Fine-grained Search Agent Supervision
Yida Zhao, Kuan Li, Xixi Wu, Liwen Zhang, Dingchu Zhang · Oct 28, 2025

LLM-based search agents are increasingly trained on entity-centric synthetic data to solve complex, knowledge-intensive tasks.

Beyond Understanding: Evaluating the Pragmatic Gap in LLMs' Cultural Processing of Figurative Language
Mena Attia, Aashiq Muhamed, Mai Alkhamissi, Thamar Solorio, Mona Diab · Oct 27, 2025

We present a comprehensive evaluation of the ability of large language models (LLMs) to process culturally grounded language, specifically to understand and pragmatically use figurative expressions that encode local knowledge and cultural n

A Survey of Data Agents: Emerging Paradigm or Overstated Hype?
Yizhang Zhu, Liangwei Wang, Chenyu Yang, Xiaotian Lin, Boyan Li · Oct 27, 2025

The rapid advancement of large language models (LLMs) has spurred the emergence of data agents, autonomous systems designed to orchestrate Data + AI ecosystems for tackling complex data-related tasks.

Incentivizing Agentic Reasoning in LLM Judges via Tool-Integrated Reinforcement Learning
Ran Xu, Jingjing Chen, Jiayu Ye, Yu Wu, Jun Yan · Oct 27, 2025

Pairwise Preference

Large Language Models (LLMs) are widely used as judges to evaluate response quality, providing a scalable alternative to human evaluation.
