
Weekly Archive

HFEPX Weekly Archive: 2026-W05

Updated from the current HFEPX corpus (Feb 27, 2026). This weekly page groups 11 papers. Common evaluation modes: Automatic Metrics, Simulation Env. Most common rater population: Domain Experts. Common annotation unit: Freeform. Frequently cited benchmark: ALFWorld. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling-design decisions before running new eval experiments. The newest paper in this set is from Feb 1, 2026.

Papers: 11 · Last published: Feb 1, 2026

Research Narrative

Grounded narrative · Model: deterministic-grounded · Source: persisted

Updated from the current HFEPX corpus (Feb 27, 2026). This page tracks 11 papers for HFEPX Weekly Archive 2026-W05. Dominant protocol signals are automatic metrics and simulation environments, with frequent benchmark focus on ALFWorld and Amo-Bench, and metric focus on accuracy and coherence. Use the grounded sections below to prioritize reproducible protocol choices, benchmark-matched comparisons, and judge-vs-human evaluation checks.

Why This Matters For Eval Research

Protocol Takeaways

Benchmark Interpretation

  • ALFWorld appears in 9.1% of hub papers (1/11); use this cohort for benchmark-matched comparisons.
  • Amo-Bench appears in 9.1% of hub papers (1/11); use this cohort for benchmark-matched comparisons.

Metric Interpretation

  • accuracy is reported in 27.3% of hub papers (3/11); compare with a secondary metric before ranking methods.
  • coherence is reported in 9.1% of hub papers (1/11); compare with a secondary metric before ranking methods.
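The advice above, to compare with a secondary metric before ranking methods, can be made concrete as an agreement check: only declare a winner when both metrics order the two methods the same way. This is a minimal sketch; the method labels and scores are hypothetical, not taken from any paper in this hub.

```python
# Rank two methods only when a primary metric (e.g. accuracy) and a
# secondary metric (e.g. coherence) agree on the ordering; otherwise
# flag the comparison as inconclusive. Scores are illustrative.
def compare(primary_a: float, primary_b: float,
            secondary_a: float, secondary_b: float) -> str:
    """Return 'A', 'B', or 'inconclusive' based on metric agreement."""
    p = (primary_a > primary_b) - (primary_a < primary_b)    # sign of primary gap
    s = (secondary_a > secondary_b) - (secondary_a < secondary_b)
    if p == s and p != 0:
        return "A" if p > 0 else "B"
    return "inconclusive"

print(compare(0.82, 0.79, 0.74, 0.70))  # both metrics favor A -> "A"
print(compare(0.82, 0.79, 0.61, 0.68))  # metrics disagree -> "inconclusive"
```

The point of the check is to avoid ranking on a single noisy signal; a disagreement between metrics is itself a finding worth reporting.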

Researcher Checklist

  • Close gap on Papers with explicit human feedback. Coverage is a replication risk (0% vs 45% target).
  • Close gap on Papers reporting quality controls. Coverage is a replication risk (0% vs 30% target).
  • Tighten coverage on Papers naming benchmarks/datasets. Coverage is usable but incomplete (27.3% vs 35% target).
  • Maintain strength on Papers naming evaluation metrics. Coverage is strong (45.5% vs 35% target).
  • Close gap on Papers with known rater population. Coverage is a replication risk (9.1% vs 35% target).
  • Close gap on Papers with known annotation unit. Coverage is a replication risk (18.2% vs 35% target).
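The checklist verdicts above can be reproduced from the coverage figures on this page. The numbers are copied from the checklist itself; the banding rule (a gap of at most 10 points counts as "usable but incomplete") is an assumption about how the hub classifies coverage, not its documented logic.

```python
# Classify protocol-reporting coverage against the hub's targets.
# Figures are copied from this page's checklist; the 10-point
# "usable but incomplete" band is an assumed rule.
COVERAGE = {
    "explicit human feedback":    (0.0, 45.0),
    "quality controls":           (0.0, 30.0),
    "named benchmarks/datasets":  (27.3, 35.0),
    "named evaluation metrics":   (45.5, 35.0),
    "known rater population":     (9.1, 35.0),
    "known annotation unit":      (18.2, 35.0),
}

def classify(actual: float, target: float) -> str:
    """Map a coverage figure to the labels used on this page."""
    if actual >= target:
        return "strong"
    if target - actual <= 10.0:
        return "usable but incomplete"
    return "replication risk"

for name, (actual, target) in COVERAGE.items():
    print(f"{name}: {classify(actual, target)} ({actual}% vs {target}% target)")
```

Running this reproduces the six verdicts in the checklist: one "strong", one "usable but incomplete", and four "replication risk" items.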


Suggested Reading Order

  1. What If We Allocate Test-Time Compute Adaptively?

    Start here for detailed protocol reporting, including rater and quality-control evidence.

  2. From Associations to Activations: Comparing Behavioral and Hidden-State Semantic Geometry in LLMs

    Also offers detailed protocol reporting, including rater and quality-control evidence.

  3. Unmasking Reasoning Processes: A Process-aware Benchmark for Evaluating Structural Mathematical Reasoning in LLMs

    Also offers detailed protocol reporting, including rater and quality-control evidence.

  4. KBVQ-MoE: KLT-guided SVD with Bias-Corrected Vector Quantization for MoE Large Language Models

    Adds automatic metrics for broader coverage within this hub.

  5. Embodied Task Planning via Graph-Informed Action Generation with Large Language Model

    Adds simulation environments for broader coverage within this hub.

  6. Indic-TunedLens: Interpreting Multilingual Models in Indian Languages

    Adds automatic metrics for broader coverage within this hub.

  7. INSURE-Dial: A Phase-Aware Conversational Dataset & Benchmark for Compliance Verification and Phase Detection

    Adds automatic metrics for broader coverage within this hub.

  8. Understanding LLM Failures: A Multi-Tape Turing Machine Analysis of Systematic Errors in Language Model Reasoning

    Adds automatic metrics for broader coverage within this hub.

Known Limitations

  • No papers (0%) report quality controls; prioritize calibration/adjudication evidence.
  • Rater population is under-specified (9.1% coverage).
  • Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Links

automatic_metrics vs simulation_env

both=1, left_only=9, right_only=1

1 paper uses both Automatic Metrics and Simulation Env.
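The both/left_only/right_only split above is a standard indicator-style partition of two cohorts. A minimal sketch with Python sets follows; the paper IDs are hypothetical placeholders, since this page does not name which papers fall in each bucket.

```python
# Partition two cohorts into both / left_only / right_only, matching
# the counts reported above (both=1, left_only=9, right_only=1).
# Paper IDs are placeholders, not real paper identifiers.
automatic_metrics = {f"paper_{i}" for i in range(10)}  # 10 papers use automatic metrics
simulation_env = {"paper_0", "paper_10"}               # 2 papers use simulation environments

both = automatic_metrics & simulation_env        # intersection
left_only = automatic_metrics - simulation_env   # automatic metrics only
right_only = simulation_env - automatic_metrics  # simulation env only

print(len(both), len(left_only), len(right_only))  # 1 9 1
```

The same split is what pandas produces with `merge(..., how="outer", indicator=True)`, which is likely where the left_only/right_only naming comes from.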

Benchmark Brief

ALFWorld

Coverage: 1 paper (9.1%)

1 paper (9.1%) mentions ALFWorld.

Examples: Embodied Task Planning via Graph-Informed Action Generation with Large Language Model

Benchmark Brief

Amo-Bench

Coverage: 1 paper (9.1%)

1 paper (9.1%) mentions Amo-Bench.

Examples: What If We Allocate Test-Time Compute Adaptively?

Benchmark Brief

MATH-500

Coverage: 1 paper (9.1%)

1 paper (9.1%) mentions MATH-500.

Examples: What If We Allocate Test-Time Compute Adaptively?

Metric Brief

coherence

Coverage: 1 paper (9.1%)

1 paper (9.1%) mentions coherence.

Examples: Embodied Task Planning via Graph-Informed Action Generation with Large Language Model

Metric Brief

cost

Coverage: 1 paper (9.1%)

1 paper (9.1%) mentions cost.

Examples: Embodied Task Planning via Graph-Informed Action Generation with Large Language Model
