
HFEPX Archive Slice

HFEPX Daily Archive: 2025-10-13

Updated from the current HFEPX corpus (Apr 9, 2026). This daily page groups 9 papers. Common evaluation modes: Simulation Env, Automatic Metrics. Most common rater population: Domain Experts. Common annotation unit: Trajectory. Frequently cited benchmark: APPS. Common metric signal: coherence. Use this page to compare protocol setup, judge behavior, and labeling-design decisions before running new eval experiments. The newest paper in this set is from Oct 13, 2025.

Papers: 9 · Last published: Oct 13, 2025

Researcher Quick Triage

Use this archive page for time-slice monitoring: what changed in evaluation methods, metrics, and protocol quality this period. Quality band: Medium.

High-Signal Coverage

100.0%: 9 of 9 papers are not flagged as low-signal.

Benchmark Anchors

11.1%: papers with benchmark/dataset mentions in extraction output.

Metric Anchors

33.3%: papers with reported metric mentions in extraction output.

  • 0 papers report explicit quality controls for this archive period.
  • Prioritize papers with both benchmark and metric anchors for reliable longitudinal comparisons.

Primary action: treat this slice as an early signal only; benchmark/metric anchoring is too limited to support rigorous period-over-period claims.
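
As a worked version of that triage rule, here is a minimal sketch assuming each paper arrives as a dict from the extraction pipeline; the field names (`benchmarks`, `metrics`, `low_signal`) and the sample records are hypothetical, not the real schema:

```python
# Hypothetical extraction records; field names are assumptions, not the real schema.
papers = [
    {"title": "R-WoM", "benchmarks": ["WebArena", "OSWorld"], "metrics": [], "low_signal": False},
    {"title": "StoryBox", "benchmarks": [], "metrics": ["coherence"], "low_signal": False},
    {"title": "DropVLA", "benchmarks": [], "metrics": ["success rate"], "low_signal": False},
]

def has_both_anchors(paper: dict) -> bool:
    """True when a paper names at least one benchmark AND at least one metric."""
    return bool(paper["benchmarks"]) and bool(paper["metrics"])

# Papers safe to use for period-over-period comparisons.
anchored = [p for p in papers if has_both_anchors(p) and not p["low_signal"]]
print([p["title"] for p in anchored])  # [] for this slice: no paper has both anchors
```

For this slice the filter returns an empty list, which is exactly why the page flags anchoring as limited.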

Why This Time Slice Matters

  • Simulation environments appear in 33.3% of papers in this hub.
  • APPS is a recurring benchmark anchor for cross-paper comparisons on this page.
  • Long-horizon tasks appear in 11.1% of papers, indicating demand for agentic evaluation.

Protocol Takeaways For This Period

  • Quality-control reporting is sparse in this slice; prioritize papers with explicit calibration or adjudication steps.
  • Raters are mostly domain experts, and annotation is commonly at the trajectory level; use this to scope replication staffing.
  • Stratify by benchmark (APPS vs OSWorld) before comparing methods; a sketch follows this list.
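
A minimal sketch of that benchmark stratification, assuming per-paper results are collected into a pandas DataFrame; the `benchmark`, `method`, and `score` columns and their values are illustrative only:

```python
import pandas as pd

# Hypothetical results table; column names and values are illustrative only.
results = pd.DataFrame(
    {
        "benchmark": ["APPS", "APPS", "OSWorld", "OSWorld"],
        "method": ["A", "B", "A", "B"],
        "score": [0.42, 0.38, 0.17, 0.22],
    }
)

# Compare methods within each benchmark stratum, never across strata:
# APPS and OSWorld measure different capabilities, so pooled averages mislead.
for benchmark, stratum in results.groupby("benchmark"):
    ranked = stratum.sort_values("score", ascending=False)
    print(benchmark, ranked[["method", "score"]].to_dict("records"))
```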

Start Here (Highest-Signal Papers In This Slice)

Ranked by protocol completeness and evidence density for faster period-over-period review; the matrix below follows this ordering.

Protocol Matrix (All 9 Papers)

Quickly compare method ingredients across this archive slice.

All papers in this slice are dated Oct 13, 2025.

| Paper | Eval Modes | Benchmarks | Metrics | Quality Controls |
|---|---|---|---|---|
| R-WoM: Retrieval-augmented World Model For Computer-use Agents | Simulation Env | WebArena, OSWorld | Not reported | Not reported |
| StoryBox: Collaborative Multi-Agent Simulation for Hybrid Bottom-Up Long-Form Story Generation Using Large Language Models | Simulation Env | Not reported | Coherence | Not reported |
| DropVLA: An Action-Level Backdoor Attack on Vision-Language-Action Models | Automatic Metrics | Not reported | Success rate, Jailbreak success rate | Not reported |
| ShishuLM: Achieving Optimal and Efficient Parameterization with Low Attention Transformer Models | Not reported | Not reported | Latency, Throughput | Not reported |
| SAGE: A Top-Down Bottom-Up Knowledge-Grounded User Simulator for Multi-turn AGent Evaluation | Simulation Env | Not reported | Not reported | Not reported |
| Qubit-centric Transformer for Surface Code Decoding | Not reported | Not reported | Not reported | Not reported |
| Unlocking the Potential of Diffusion Language Models through Template Infilling | Not reported | Not reported | Not reported | Not reported |
| CNSocialDepress: A Chinese Social Media Dataset for Depression Risk Detection and Structured Analysis | Not reported | Not reported | Not reported | Not reported |
| From Prompts to Packets: A View from the Network on ChatGPT, Copilot, and Gemini | Not reported | Not reported | Not reported | Not reported |
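
To reproduce a matrix like this from your own extraction output, here is a sketch under the assumption that each paper is a dict with optional list-valued fields; all field names and the sample record are hypothetical:

```python
# Sketch: render one protocol-matrix row per paper, "Not reported" for empty fields.
FIELDS = ["eval_modes", "benchmarks", "metrics", "quality_controls"]

def matrix_row(paper: dict) -> list[str]:
    """Build a display row: title first, then one cell per protocol field."""
    cells = [paper["title"]]
    for field in FIELDS:
        values = paper.get(field) or []
        cells.append(", ".join(values) if values else "Not reported")
    return cells

paper = {"title": "R-WoM", "eval_modes": ["Simulation Env"], "benchmarks": ["WebArena", "OSWorld"]}
print(matrix_row(paper))
# ['R-WoM', 'Simulation Env', 'WebArena, OSWorld', 'Not reported', 'Not reported']
```
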
Researcher Workflow (Detailed)

Checklist

  • Gap: papers with explicit human feedback. Coverage is a replication risk (0% vs 45% target).
  • Gap: papers reporting quality controls. Coverage is a replication risk (0% vs 30% target).
  • Moderate: papers naming benchmarks/datasets. Coverage is usable but incomplete (22.2% vs 35% target).
  • Gap: papers naming evaluation metrics. Coverage is a replication risk (11.1% vs 35% target).
  • Gap: papers with known rater population. Coverage is a replication risk (11.1% vs 35% target).
  • Gap: papers with known annotation unit. Coverage is a replication risk (11.1% vs 35% target).
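
The Gap/Moderate banding above can be recomputed mechanically. Below is a sketch using the coverage and target figures quoted in this checklist; the banding thresholds (at least half the target for Moderate) are an assumption that happens to reproduce the labels above:

```python
# Coverage figures and targets as quoted in the checklist above (percent).
CHECKS = {
    "explicit human feedback": (0.0, 45.0),
    "quality controls": (0.0, 30.0),
    "benchmarks/datasets named": (22.2, 35.0),
    "evaluation metrics named": (11.1, 35.0),
    "rater population known": (11.1, 35.0),
    "annotation unit known": (11.1, 35.0),
}

def band(coverage: float, target: float) -> str:
    """Assumed rule: >= target is OK, >= half the target is Moderate, else Gap."""
    if coverage >= target:
        return "OK"
    if coverage >= target / 2:
        return "Moderate"
    return "Gap"

for name, (coverage, target) in CHECKS.items():
    print(f"{band(coverage, target):8s} {name}: {coverage}% vs {target}% target")
```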

Strengths

  • This hub still surfaces a concentrated paper set for protocol triage and replication planning.

Known Gaps

  • No papers in this slice (0%) report quality controls; prioritize calibration/adjudication evidence.
  • Rater population is under-specified (11.1% coverage).
  • Annotation unit is under-specified (11.1% coverage).

Suggested Next Analyses

  • Stratify by benchmark (APPS vs OSWorld) before comparing methods.

Known Limitations

  • Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Snapshot (Detailed)

Evaluation Modes

  • Simulation Env (3)
  • Automatic Metrics (1)

Top Metrics

  • Coherence (1)
  • Success rate (1)
  • Jailbreak success rate (1)
  • Latency (1)
  • Throughput (1)

Top Benchmarks

  • APPS (1)
  • OSWorld (1)
  • WebArena (1)
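
Tallies like these are plain frequency counts over the extraction output; a minimal sketch with `collections.Counter`, where the `benchmarks` field and the sample records are assumptions:

```python
from collections import Counter

# Hypothetical extraction output: benchmark mentions per paper.
papers = [
    {"benchmarks": ["WebArena", "OSWorld"]},
    {"benchmarks": ["APPS"]},
    {"benchmarks": []},
]

# Count each benchmark at most once per paper that mentions it.
tally = Counter(b for p in papers for b in set(p["benchmarks"]))
for benchmark, count in tally.most_common():
    print(f"{benchmark} ({count})")  # e.g. WebArena (1), OSWorld (1), APPS (1)
```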

Quality Controls

  • None reported (0 of 9 papers in this slice).
