HFEPX Archive Slice
HFEPX Daily Papers for 2026-05-27
Daily archive slice for 2026-05-27 from the HFEPX corpus. Updated from current HFEPX corpus (2026-06-01); covers 3 papers from 2026-05-27.
Use this archive page for time-slice monitoring: what changed in evaluation methods, metrics, and protocol quality this period. Quality band: Developing.
High-Signal Coverage
100.0%
3 / 3 papers are not low-signal flagged.
Benchmark Anchors
66.7%
Papers with benchmark/dataset mentions in extraction output.
Metric Anchors
33.3%
Papers with reported metric mentions in extraction output.
Primary action: Use this slice as early signal only; benchmark/metric anchoring is limited for rigorous period-over-period claims.
Ranked by protocol completeness and evidence density for faster period-over-period review.
May 27, 2026 · Citations: 0 · Score: 7.5
Eval: Automatic Metrics · Metrics: Recall
May 27, 2026 · Citations: 0 · Score: 4.0
Eval: Not reported · Metrics: Not reported
May 27, 2026 · Citations: 0 · Score: 2.5
Eval: Not reported · Metrics: Not reported
Quickly compare method ingredients across this archive slice.
| Paper | Eval Modes | Benchmarks | Metrics | Quality Controls |
|---|---|---|---|---|
| Ask Now, Use Later: Benchmarking the Proactivity Gap in Long-Lived LLM Agents (May 27, 2026) | Automatic Metrics | Atrbench | Recall | Not reported |
| The Importance of Being Statistically Earnest: A Critical Re-evaluation of GSM-Symbolic (May 27, 2026) | Not reported | GSM8K | Not reported | Not reported |
| GraphLit: Learning Text-Enriched Dynamic Character Network Representations for Literary Study (May 27, 2026) | Not reported | Not reported | Not reported | Not reported |
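The benchmark and metric anchor percentages reported for this slice can be reproduced from the table by counting the rows where a field is not "Not reported". The sketch below is a minimal illustration, not the HFEPX pipeline: the `ROWS` structure, the field names, and the `coverage` helper are assumptions introduced here, with values transcribed from the table.

```python
# Rows transcribed from the comparison table above.
# Field names and this data layout are illustrative assumptions.
NOT_REPORTED = "Not reported"

ROWS = [
    {"paper": "Ask Now, Use Later", "eval_modes": "Automatic Metrics",
     "benchmarks": "Atrbench", "metrics": "Recall",
     "quality_controls": NOT_REPORTED},
    {"paper": "The Importance of Being Statistically Earnest",
     "eval_modes": NOT_REPORTED, "benchmarks": "GSM8K",
     "metrics": NOT_REPORTED, "quality_controls": NOT_REPORTED},
    {"paper": "GraphLit", "eval_modes": NOT_REPORTED,
     "benchmarks": NOT_REPORTED, "metrics": NOT_REPORTED,
     "quality_controls": NOT_REPORTED},
]


def coverage(rows, field):
    """Return (papers reporting the field, total papers, percentage)."""
    reported = sum(1 for row in rows if row[field] != NOT_REPORTED)
    total = len(rows)
    return reported, total, round(100.0 * reported / total, 1)


if __name__ == "__main__":
    for field in ("benchmarks", "metrics", "quality_controls"):
        reported, total, pct = coverage(ROWS, field)
        print(f"{field}: {reported} / {total} = {pct}%")
```

With three papers in the slice, a single paper moves any of these rates by about 33 percentage points, which is why the page treats the slice as early signal only.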
Moderate: Human feedback
Human feedback is present in 1 of 3 papers.
Gap: Quality controls
Quality controls are present in 0 of 3 papers.
Strong: Benchmarks
Benchmarks are present in 2 of 3 papers.
Moderate: Metrics
Metrics are present in 1 of 3 papers.
Gap: Known rater population
Known rater population is present in 0 of 3 papers.
Gap: Known annotation unit
Known annotation unit is present in 0 of 3 papers.
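The Strong / Moderate / Gap labels above appear to follow the share of papers in which an ingredient is present. The page does not state its cutoffs, so the sketch below is only one rule consistent with the three cases shown here (0 of 3 is Gap, 1 of 3 is Moderate, 2 of 3 is Strong). The `band` function and its `strong_at` threshold of 0.5 are hypothetical.

```python
def band(present, total, strong_at=0.5):
    """Label an ingredient by how many papers include it.

    The 0.5 cutoff is an assumption chosen to match the labels shown
    on this page. It is not a documented HFEPX threshold.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if present == 0:
        return "Gap"
    share = present / total
    return "Strong" if share >= strong_at else "Moderate"


if __name__ == "__main__":
    # Counts taken from the summary lines above.
    counts = {
        "Human feedback": 1,
        "Quality controls": 0,
        "Benchmarks": 2,
        "Metrics": 1,
        "Known rater population": 0,
        "Known annotation unit": 0,
    }
    for name, present in counts.items():
        print(f"{band(present, 3)}: {name} ({present} of 3)")
```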
The Importance of Being Statistically Earnest: A Critical Re-evaluation of GSM-Symbolic
Dominika Agnieszka Długosz, Arlindo Oliveira, Natalia Díaz-Rodríguez · May 27, 2026 · Citations: 0
The GSM-Symbolic benchmark (Mirzadeh et al., 2025) reported consistent performance drops across 25 Large Language Models (LLMs) when tested on template-generated variants of GSM8K problems, concluding that the models lack genuine reasoning…
GraphLit: Learning Text-Enriched Dynamic Character Network Representations for Literary Study
Gaspard Michel, Elena V. Epure, Romain Hennequin, Christophe Cerisara, Mirella Lapata · May 27, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Ask Now, Use Later: Benchmarking the Proactivity Gap in Long-Lived LLM Agents
Bin Wu, Guanyun Zou, Bingbing Wang, Huan Zhao, Chuan Shi · May 27, 2026 · Citations: 0
A long-lived LLM agent, such as OpenClaw, earns its value by acting on a user's preferences and constraints across sessions, not just the current request.