HFEPX Archive Slice

HFEPX Daily Papers for 2026-05-27

Daily archive slice for 2026-05-27 from the HFEPX corpus. Updated from the current HFEPX corpus (2026-06-01); covers 3 papers from 2026-05-27.

Papers: 3 · Last published: May 27, 2026

Researcher Quick Triage

Use this archive page for time-slice monitoring: what changed in evaluation methods, metrics, and protocol quality during this period. Quality band: Developing.

High-Signal Coverage

100.0%

3 / 3 papers are not flagged as low-signal.

Benchmark Anchors

66.7%

Papers with benchmark/dataset mentions in extraction output.

Metric Anchors

33.3%

Papers with reported metric mentions in extraction output.

0 papers report explicit quality controls for this archive period.
Prioritize papers with both benchmark and metric anchors for reliable longitudinal comparisons.
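The anchor percentages above are simple ratios over the papers in the slice. A minimal sketch of that arithmetic, and of the "both anchors" filter, follows. The `Paper` class and the `coverage` and `both_anchors` helpers are hypothetical names used only for illustration (they are not part of HFEPX); the benchmark and metric values are copied from the protocol matrix on this page, with "Not reported" represented as `None`.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Paper:
    # Hypothetical record type; fields mirror the protocol matrix columns.
    title: str
    benchmark: Optional[str]  # None stands for "Not reported"
    metric: Optional[str]     # None stands for "Not reported"


# Values taken from this page's protocol matrix.
PAPERS = [
    Paper("Ask Now, Use Later", "Atrbench", "Recall"),
    Paper("The Importance of Being Statistically Earnest", "GSM8K", None),
    Paper("GraphLit", None, None),
]


def coverage(papers: List[Paper], field: str) -> float:
    """Percentage of papers whose given field is reported, to one decimal."""
    if not papers:
        return 0.0
    hits = sum(1 for p in papers if getattr(p, field) is not None)
    return round(100.0 * hits / len(papers), 1)


def both_anchors(papers: List[Paper]) -> List[str]:
    """Titles of papers that report both a benchmark and a metric."""
    return [p.title for p in papers if p.benchmark and p.metric]


print(coverage(PAPERS, "benchmark"))  # 66.7
print(coverage(PAPERS, "metric"))     # 33.3
print(both_anchors(PAPERS))           # ['Ask Now, Use Later']
```

Under these assumptions, only one paper in this slice qualifies for longitudinal comparison, which is consistent with the "early signal only" guidance.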

Primary action: Use this slice as early signal only; benchmark/metric anchoring is limited for rigorous period-over-period claims.


Why This Time Slice Matters

Use this archive slice to monitor protocol drift and shifts in evaluation methods for 2026-05-27.

Protocol Takeaways For This Period

Evaluation modes for this slice cluster around automatic_metrics.

Start Here (Highest-Signal Papers In This Slice)

Ranked by protocol completeness and evidence density for faster period-over-period review.

Ask Now, Use Later: Benchmarking the Proactivity Gap in Long-Lived LLM Agents
May 27, 2026 · Citations: 0 · Score: 7.5
Eval: Automatic Metrics · Metrics: Recall

The Importance of Being Statistically Earnest: A Critical Re-evaluation of GSM-Symbolic
May 27, 2026 · Citations: 0 · Score: 4.0
Eval: Not reported · Metrics: Not reported

GraphLit: Learning Text-Enriched Dynamic Character Network Representations for Literary Study
May 27, 2026 · Citations: 0 · Score: 2.5
Eval: Not reported · Metrics: Not reported

Protocol Matrix (Top 10)

Quickly compare method ingredients across this archive slice.

Paper	Eval Modes	Benchmarks	Metrics	Quality Controls
Ask Now, Use Later: Benchmarking the Proactivity Gap in Long-Lived LLM Agents May 27, 2026	Automatic Metrics	Atrbench	Recall	Not reported
The Importance of Being Statistically Earnest: A Critical Re-evaluation of GSM-Symbolic May 27, 2026	Not reported	GSM8K	Not reported	Not reported
GraphLit: Learning Text-Enriched Dynamic Character Network Representations for Literary Study May 27, 2026	Not reported	Not reported	Not reported	Not reported

Researcher Workflow (Detailed)

Checklist

Moderate: Human feedback
Human feedback is present in 1 of 3 papers.

Gap: Quality controls
Quality controls are present in 0 of 3 papers.

Strong: Benchmarks
Benchmarks are present in 2 of 3 papers.

Moderate: Metrics
Metrics are present in 1 of 3 papers.

Gap: Known rater population
A known rater population is present in 0 of 3 papers.

Gap: Known annotation unit
A known annotation unit is present in 0 of 3 papers.

Strengths

Benchmarks are present in 2 of 3 papers.

Known Gaps

Quality controls are present in 0 of 3 papers.
A known rater population is present in 0 of 3 papers.
A known annotation unit is present in 0 of 3 papers.

Suggested Next Analyses

Compare 2026-05-27 against neighboring archive slices to flag protocol drift.
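One way to make the slice comparison concrete is to diff per-field coverage between two slices and flag large changes. The sketch below is an assumption about how "protocol drift" could be operationalized, not a description of how HFEPX computes it. The `coverage_drift` helper and the 20-point threshold are hypothetical; the `CURRENT` values come from this page, while the `NEIGHBOR` values are made-up placeholders and not real data from any adjacent slice.

```python
from typing import Dict


def coverage_drift(
    current: Dict[str, float],
    neighbor: Dict[str, float],
    threshold: float = 20.0,  # assumed cutoff in percentage points
) -> Dict[str, float]:
    """Return fields whose coverage changed by at least `threshold` points.

    Values are current minus neighbor, so a negative number means
    coverage is lower in the current slice.
    """
    flagged = {}
    for field, value in current.items():
        if field not in neighbor:
            continue  # skip fields the neighboring slice does not report
        delta = round(value - neighbor[field], 1)
        if abs(delta) >= threshold:
            flagged[field] = delta
    return flagged


# Coverage for 2026-05-27, taken from this page.
CURRENT = {"benchmarks": 66.7, "metrics": 33.3, "quality_controls": 0.0}

# Placeholder numbers for a neighboring slice (illustrative only, NOT real data).
NEIGHBOR = {"benchmarks": 60.0, "metrics": 70.0, "quality_controls": 0.0}

print(coverage_drift(CURRENT, NEIGHBOR))  # {'metrics': -36.7}
```

With only 3 papers in this slice, a single paper moves coverage by about 33 points, so any threshold should be read alongside the paper count.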

Recommended Queries

Browse all HFEPX daily archives

Known Limitations

This synthetic archive page is generated on-demand from extraction data because no cached payload was available for 2026-05-27.

Research Utility Snapshot (Detailed)

Evaluation Modes

Automatic Metrics (1)

Top Metrics

Recall (1)

Top Benchmarks

Atrbench (1)
GSM8K (1)

Quality Controls

None reported (0 of 3 papers).

Papers In This Archive Slice

The Importance of Being Statistically Earnest: A Critical Re-evaluation of GSM-Symbolic
Dominika Agnieszka Długosz, Arlindo Oliveira, Natalia Díaz-Rodríguez · May 27, 2026 · Citations: 0

The GSM-Symbolic benchmark (Mirzadeh et al., 2025) reported consistent performance drops across 25 Large Language Models (LLMs) when tested on template-generated variants of GSM8K problems, concluding that the models lack genuine reasoning…

GraphLit: Learning Text-Enriched Dynamic Character Network Representations for Literary Study
Gaspard Michel, Elena V. Epure, Romain Hennequin, Christophe Cerisara, Mirella Lapata · May 27, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

Ask Now, Use Later: Benchmarking the Proactivity Gap in Long-Lived LLM Agents
Bin Wu, Guanyun Zou, Bingbing Wang, Huan Zhao, Chuan Shi · May 27, 2026 · Citations: 0

Pairwise Preference

A long-lived LLM agent, such as OpenClaw, earns its value by acting on a user's preferences and constraints across sessions, not just the current request.
