HFEPX Archive Slice
HFEPX Daily Papers for 2026-05-17
Daily archive slice for 2026-05-17 from the HFEPX corpus. Updated from current HFEPX corpus (2026-06-07); covers 9 papers from 2026-05-17.
Use this archive page for time-slice monitoring (what changed in evaluation methods, metrics, and protocol quality this period). Quality band: Medium.
High-Signal Coverage
100.0%
9 of 9 papers are not flagged as low-signal.
Benchmark Anchors
22.2%
Papers with benchmark/dataset mentions in extraction output.
Metric Anchors
44.4%
Papers with reported metric mentions in extraction output.
Primary action: Use this slice as early signal only; benchmark/metric anchoring is limited for rigorous period-over-period claims.
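The coverage figures above are consistent with simple count-over-total ratios for the 9 papers in this slice. The following is a minimal sketch of that arithmetic, assuming the percentages are computed as count / total and rounded to one decimal place; the helper name `coverage_pct` is hypothetical and the page does not document its actual computation. The counts are the ones reported elsewhere on this page (9 high-signal papers, 2 with benchmark mentions, 4 with metric mentions).

```python
def coverage_pct(count: int, total: int) -> float:
    """Return count/total as a percentage rounded to one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100.0 * count / total, 1)


# Counts as reported on this page; the rounding rule is an assumption.
TOTAL_PAPERS = 9
counts = {
    "High-Signal Coverage": 9,
    "Benchmark Anchors": 2,
    "Metric Anchors": 4,
}

rates = {name: coverage_pct(n, TOTAL_PAPERS) for name, n in counts.items()}

for name, pct in rates.items():
    print(f"{name}: {pct:.1f}%")
```

With only 9 papers, each paper shifts a rate by about 11 percentage points, which is one reason to treat period-over-period differences in this slice with caution.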
Ranked by protocol completeness and evidence density for faster period-over-period review.
May 17, 2026 · Citations: 0 · Score: 6.5
Eval: Automatic Metrics · Metrics: MRR
May 17, 2026 · Citations: 0 · Score: 6.0
Eval: Simulation Env · Metrics: Not reported
May 17, 2026 · Citations: 0 · Score: 5.0
Eval: Automatic Metrics · Metrics: Relevance
May 17, 2026 · Citations: 0 · Score: 5.0
Eval: Automatic Metrics · Metrics: Cost, Inference cost
May 17, 2026 · Citations: 0 · Score: 5.0
Eval: Automatic Metrics · Metrics: Accuracy, Cost
May 17, 2026 · Citations: 0 · Score: 2.5
Eval: Not reported · Metrics: Not reported
Quickly compare method ingredients across this archive slice.
Gap: Human feedback
Human feedback is present in 1 of 9 papers.
Gap: Quality controls
Quality controls are present in 0 of 9 papers.
Gap: Benchmarks
Benchmarks are present in 2 of 9 papers.
Moderate: Metrics
Metrics are present in 4 of 9 papers.
Moderate: Known rater population
Known rater population is present in 2 of 9 papers.
Moderate: Known annotation unit
Known annotation unit is present in 2 of 9 papers.
Kaavya Chaparala, Thomas Thebaud, Jesús Villalba López, Laureano Moro-Velazquez, Peter Viechnicki · May 17, 2026 · Citations: 0
There are not enough established benchmarks for the task of speech summarization.
Saksham Sahai Srivastava · May 17, 2026 · Citations: 0
Long-horizon LLM agents rely on persistent memory to support interactions across sessions, yet existing memory systems often retrieve context using semantic similarity or broad history inclusion, treating retrieved memories as uniformly…
Volodymyr Ovcharov · May 17, 2026 · Citations: 0
We test this assumption longitudinally by constructing UA-StatuteRetrieval, a benchmark that measures co-citation predictability across 20 annual snapshots (2007-2026) of 396 million codex citations from 101 million Ukrainian court…
Sahar Abdelnabi, Eugene Bagdasarian · May 17, 2026 · Citations: 0
Prompt injection is the most critical vulnerability in deployed AI agents.
Shahriar Kabir Nahin, Hadi Askari, Muhao Chen, Anshuman Chhabra · May 17, 2026 · Citations: 0
The rapid growth of online video platforms and AI-generated content has made reliable video guardrails a key challenge for safety and real-world deployment.
Ori Bar Joseph, Smadar Arvatz, Noam Kayzer, Dan Revital, Sarel Weinberger · May 17, 2026 · Citations: 0
Routing improvements correlate with consistent downstream benchmark gains, positioning routing entropy and expert specialization as principled diagnostics for multilingual capacity in MoE systems.
Minghao Tian, Yunfei Xie, Chen Wei · May 17, 2026 · Citations: 0
Across five language models and multiple math reasoning benchmarks, Mu-GRPO matches or exceeds the performance of standard GRPO while achieving around 2x speedup in wall-clock training time, establishing a substantially improved…
Ethan Tang · May 17, 2026 · Citations: 0
Recent work has fine-tuned language models on chess data and reported high benchmark scores as evidence that the resulting models can understand the rules of chess, play full chess games at a professional level, or generate human-readable…
Yuxuan Lu, Ziyi Wang, Yingzhou Lu, Yisi Sang, Jiri Gesi · May 17, 2026 · Citations: 0
Training tool-calling agents requires large-scale trajectory data with verifiable labels, yet existing approaches either synthesize environments that diverge from real API behavior or generate tasks without ground-truth outcomes for…