HFEPX Hub

CS.LG + Long Horizon Papers

Updated from current HFEPX corpus (Feb 27, 2026). 27 papers are grouped in this hub page. Common evaluation modes: Automatic Metrics, Simulation Env. Most common rater population: Domain Experts. Common annotation unit: Trajectory. Frequent quality control: Inter Annotator Agreement Reported. Frequently cited benchmark: WebShop. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Feb 26, 2026.

Papers: 27 · Last published: Feb 26, 2026

CS.LG · Long Horizon

Research Narrative

Grounded narrative · Model: deterministic-grounded · Source: persisted

Updated from current HFEPX corpus (Feb 27, 2026). This page tracks 27 papers for CS.LG + Long Horizon Papers. Dominant protocol signals are automatic metrics and simulation environments, with frequent benchmark focus on WebShop and ALFWorld and metric focus on accuracy and cost. Use the grounded sections below to prioritize reproducible protocol choices, benchmark-matched comparisons, and judge-vs-human evaluation checks.

Why This Matters For Eval Research

18.5% of papers report explicit human-feedback signals, led by demonstration data.

Evidence: Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training, How Do Latent Reasoning Methods Perform Under Weak and Strong Supervision?, GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL, SWE-Protégé: Learning to Selectively Collaborate With an Expert Unlocks Small Language Models as Software Engineering Agents

Automatic metrics appear in 70.4% of papers in this hub.

Evidence: Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training, How Do Latent Reasoning Methods Perform Under Weak and Strong Supervision?, GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL, SWE-Protégé: Learning to Selectively Collaborate With an Expert Unlocks Small Language Models as Software Engineering Agents

WebShop is a recurring benchmark anchor for cross-paper comparisons on this page.

Evidence: SELAUR: Self Evolving LLM Agent via Uncertainty-aware Rewards, Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training, How Do Latent Reasoning Methods Perform Under Weak and Strong Supervision?, GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL

Protocol Takeaways

The most common quality-control signal is inter-annotator agreement reporting (3.7% of papers).

Evidence: GATES: Self-Distillation under Privileged Context with Consensus Gating, Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training, How Do Latent Reasoning Methods Perform Under Weak and Strong Supervision?, GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL

Raters are mostly domain experts, and annotation is commonly at the trajectory level; use this to scope replication staffing.

Evidence: SWE-Protégé: Learning to Selectively Collaborate With an Expert Unlocks Small Language Models as Software Engineering Agents, UI-Venus-1.5 Technical Report, Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training, How Do Latent Reasoning Methods Perform Under Weak and Strong Supervision?

Stratify by benchmark (WebShop vs ALFWorld) before comparing methods.

Evidence: Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training, How Do Latent Reasoning Methods Perform Under Weak and Strong Supervision?, GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL, SWE-Protégé: Learning to Selectively Collaborate With an Expert Unlocks Small Language Models as Software Engineering Agents
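
One way to act on the stratification advice is to group papers into per-benchmark cohorts before any comparison. The sketch below is illustrative only: the paper-to-benchmark mapping is copied from the Benchmark Brief entries on this page (with shortened titles), and the helper name `cohorts` is hypothetical, not an HFEPX function.

```python
from collections import defaultdict

# Paper -> benchmarks, taken from the Benchmark Brief entries on this page.
# Titles are shortened for readability.
paper_benchmarks = {
    "SELAUR": ["WebShop", "ALFWorld"],
    "TSR": ["WebShop"],
    "UI-Venus-1.5": ["APPS"],
}

def cohorts(mapping):
    """Invert paper -> benchmarks into benchmark -> sorted list of papers."""
    by_benchmark = defaultdict(list)
    for paper, benchmarks in mapping.items():
        for benchmark in benchmarks:
            by_benchmark[benchmark].append(paper)
    return {b: sorted(p) for b, p in by_benchmark.items()}

groups = cohorts(paper_benchmarks)
# Only cohorts with at least two papers support a within-benchmark comparison.
comparable = {b: p for b, p in groups.items() if len(p) >= 2}
print(comparable)
```

With the mapping above, only the WebShop cohort contains more than one paper, so it is the only benchmark-matched comparison this hub currently supports.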

Benchmark Interpretation

WebShop appears in 7.4% of hub papers (2/27); use this cohort for benchmark-matched comparisons.
ALFWorld appears in 3.7% of hub papers (1/27); use this cohort for benchmark-matched comparisons.

Metric Interpretation

accuracy is reported in 25.9% of hub papers (7/27); compare with a secondary metric before ranking methods.
cost is reported in 11.1% of hub papers (3/27); compare with a secondary metric before ranking methods.
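
The percentages in the two sections above follow from simple counts over the 27 hub papers. A minimal sketch of that arithmetic, assuming the counts quoted on this page; the function name `share` is hypothetical.

```python
HUB_PAPERS = 27  # total papers in this hub, as stated on this page

def share(count: int, total: int = HUB_PAPERS) -> float:
    """Return count/total as a percentage rounded to one decimal place."""
    return round(100 * count / total, 1)

# Counts quoted in the Benchmark and Metric Interpretation sections.
mentions = {"WebShop": 2, "ALFWorld": 1, "accuracy": 7, "cost": 3}
shares = {name: share(n) for name, n in mentions.items()}
print(shares)
```

Recomputing the shares this way is a quick consistency check when the paper count changes between corpus updates.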

Researcher Checklist

Close gap on Papers with explicit human feedback. Coverage is a replication risk (18.5% vs 45% target).
Close gap on Papers reporting quality controls. Coverage is a replication risk (3.7% vs 30% target).
Tighten coverage on Papers naming benchmarks/datasets. Coverage is usable but incomplete (25.9% vs 35% target).
Maintain strength on Papers naming evaluation metrics. Coverage is strong (51.9% vs 35% target).
Close gap on Papers with known rater population. Coverage is a replication risk (14.8% vs 35% target).
Maintain strength on Papers with known annotation unit. Coverage is strong (37% vs 35% target).
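
To prioritize the checklist, the coverage figures can be ranked by how far each falls below its target. The sketch below uses only the coverage and target values listed above; the `shortfall` helper and the ranking rule are assumptions for illustration, not the rule HFEPX uses to assign its labels.

```python
# (coverage %, target %) pairs copied from the checklist above.
checklist = {
    "explicit human feedback": (18.5, 45.0),
    "quality controls": (3.7, 30.0),
    "benchmarks/datasets": (25.9, 35.0),
    "evaluation metrics": (51.9, 35.0),
    "known rater population": (14.8, 35.0),
    "known annotation unit": (37.0, 35.0),
}

def shortfall(coverage: float, target: float) -> float:
    """Percentage points by which coverage misses the target (0.0 if met)."""
    return round(max(0.0, target - coverage), 1)

gaps = {name: shortfall(c, t) for name, (c, t) in checklist.items()}
# Largest shortfall first, so the biggest gaps surface at the top.
ranked = sorted(gaps, key=gaps.get, reverse=True)
for name in ranked:
    if gaps[name] > 0:
        print(f"{name}: {gaps[name]} pp below target")
```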

Known Limitations

Only 3.7% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (14.8% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Links

Benchmark Slice: WebShop - Prioritizes benchmark-specific protocol comparisons.
Metric Slice: accuracy - Finds papers where reported metrics are directly comparable.
IAA-Reported Evaluations - Highlights evaluations that explicitly report inter-annotator agreement.
Recent High-Signal Papers - Keeps the hub connected to the latest HFEPX corpus updates.

Automatic Metrics vs Simulation Env

Both: 0 · Automatic Metrics only: 19 · Simulation Env only: 8

0 papers use both Automatic Metrics and Simulation Env.
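
An overlap breakdown of this kind can be reproduced with set operations over the papers tagged with each evaluation mode. The sketch below is a minimal illustration on a small subset of papers whose tags appear in the Top Papers list (titles shortened); the function name `overlap` and its output keys are assumptions.

```python
def overlap(left: set, right: set) -> dict:
    """Count items in both sets, only in the left set, and only in the right set."""
    return {
        "both": len(left & right),
        "left_only": len(left - right),
        "right_only": len(right - left),
    }

# Illustrative subset only; the full hub has 19 and 8 papers respectively.
automatic_metrics = {"Search-P1", "GATES", "SELAUR"}
simulation_env = {"LiLo-VLA", "ToolMATH"}

counts = overlap(automatic_metrics, simulation_env)
print(counts)
```

The three counts always sum to the size of the union, which for the full hub gives 0 + 19 + 8 = 27 papers.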

Benchmark Brief

WebShop

Coverage: 2 papers (7.4%)

2 papers (7.4%) mention WebShop.

Examples: SELAUR: Self Evolving LLM Agent via Uncertainty-aware Rewards, TSR: Trajectory-Search Rollouts for Multi-Turn RL of LLM Agents

Benchmark Brief

ALFWorld

Coverage: 1 paper (3.7%)

1 paper (3.7%) mentions ALFWorld.

Examples: SELAUR: Self Evolving LLM Agent via Uncertainty-aware Rewards

Benchmark Brief

APPS

Coverage: 1 paper (3.7%)

1 paper (3.7%) mentions APPS.

Examples: UI-Venus-1.5 Technical Report

Metric Brief

accuracy

Coverage: 7 papers (25.9%)

7 papers (25.9%) mention accuracy.

Examples: Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training, How Do Latent Reasoning Methods Perform Under Weak and Strong Supervision?, GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL

Metric Brief

cost

Coverage: 3 papers (11.1%)

3 papers (11.1%) mention cost.

Examples: How Do Latent Reasoning Methods Perform Under Weak and Strong Supervision?, SWE-Protégé: Learning to Selectively Collaborate With an Expert Unlocks Small Language Models as Software Engineering Agents, Sink-Aware Pruning for Diffusion Language Models

Metric Brief

latency

Coverage: 3 papers (11.1%)

3 papers (11.1%) mention latency.

Examples: SWE-Protégé: Learning to Selectively Collaborate With an Expert Unlocks Small Language Models as Software Engineering Agents, AceGRPO: Adaptive Curriculum Enhanced Group Relative Policy Optimization for Autonomous Machine Learning Engineering, Fast-ThinkAct: Efficient Vision-Language-Action Reasoning via Verbalizable Latent Planning

Most Cited In This Hub

Fast path to methods with the strongest citation traction in this scope.

Papers: Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training, How Do Latent Reasoning Methods Perform Under Weak and Strong Supervision?, GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL

Most Recent

Fast path to latest protocol changes and newly published evaluation setups.

Best Protocol Detail

Papers with explicit rater/unit metadata and quality-control signals for reproducibility.

Top Papers

Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training
Tianle Xia, Ming Xu, Lingxiang Hu, Yiding Sun, Wenwei Li · Feb 26, 2026 · Citations: 0

Automatic Metrics Long Horizon

Agentic RAG addresses this by enabling LLMs to dynamically decide when and what to retrieve, but current RL-based training methods suffer from sparse outcome rewards that discard intermediate signals and low sample efficiency where failed s
How Do Latent Reasoning Methods Perform Under Weak and Strong Supervision?
Yingqian Cui, Zhenwei Dai, Bing He, Zhan Shi, Hui Liu · Feb 25, 2026 · Citations: 0

Automatic Metrics Long Horizon

Latent reasoning has been recently proposed as a reasoning paradigm and performs multi-step reasoning through generating steps in the latent space instead of the textual space.
GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL
Rui Yang, Qianhui Wu, Zhaoyang Wang, Hanyang Chen, Ke Yang · Feb 25, 2026 · Citations: 0

Automatic Metrics Long Horizon

Open-source native GUI agents still lag behind closed-source systems on long-horizon navigation tasks.
SWE-Protégé: Learning to Selectively Collaborate With an Expert Unlocks Small Language Models as Software Engineering Agents
Patrick Tser Jern Kon, Archana Pradeep, Ang Chen, Alexander P. Ellis, Warren Hunt · Feb 25, 2026 · Citations: 0

Automatic Metrics Long Horizon

Our approach combines supervised fine-tuning on expert-augmented trajectories with agentic reinforcement learning that explicitly discourages degenerative looping and unproductive expert collaboration.
LiLo-VLA: Compositional Long-Horizon Manipulation via Linked Object-Centric Policies
Yue Yang, Shuo Cheng, Yu Fang, Homanga Bharadhwaj, Mingyu Ding · Feb 25, 2026 · Citations: 0

Simulation Env Long Horizon

We introduce a 21-task simulation benchmark consisting of two challenging suites: LIBERO-Long++ and Ultra-Long.
Provably Safe Generative Sampling with Constricting Barrier Functions
Darshan Gadginmath, Ahmed Allibhoy, Fabio Pasqualetti · Feb 24, 2026 · Citations: 0

Automatic Metrics Long Horizon

However, a critical gap remains for their deployment in safety-critical domains: the lack of formal guarantees that generated samples will satisfy hard constraints.
Learning from Trials and Errors: Reflective Test-Time Planning for Embodied LLMs
Yining Hong, Huang Huang, Manling Li, Li Fei-Fei, Jiajun Wu · Feb 24, 2026 · Citations: 0

Automatic Metrics Long Horizon

Drawing upon human reflective practitioners, we introduce Reflective Test-Time Planning, which integrates two modes of reflection: reflection-in-action, where the agent uses test-time scaling to generate and score multiple candidat
SELAUR: Self Evolving LLM Agent via Uncertainty-aware Rewards
Dengjia Zhang, Xiaoou Liu, Lu Cheng, Yaqing Wang, Kenton Murray · Feb 24, 2026 · Citations: 0

Automatic Metrics Long Horizon

Large language models (LLMs) are increasingly deployed as multi-step decision-making agents, where effective reward design is essential for guiding learning.
ToolMATH: A Math Tool Benchmark for Realistic Long-Horizon Multi-Tool Reasoning
Hyeonje Choi, Jeongsoo Lee, Hyojun Lee, Jay-Yoon Lee · Feb 24, 2026 · Citations: 0

Simulation Env Long Horizon

We introduce ToolMATH, a math-grounded benchmark that evaluates tool-augmented language models in realistic multi-tool environments where the output depends on calling schema-specified tools and sustaining multi-step execution.
GATES: Self-Distillation under Privileged Context with Consensus Gating
Alex Stein, Furong Huang, Tom Goldstein · Feb 24, 2026 · Citations: 0

Automatic Metrics Long Horizon

Held-out in-domain accuracy under asymmetric evaluation improves from 46.0% to 62.0%, and average (maj@8) accuracy on public document-free math benchmarks improves from 20.2% to 35.4%.
Capable but Unreliable: Canonical Path Deviation as a Causal Mechanism of Agent Failure in Long-Horizon Tasks
Wilson Y. Lee · Feb 22, 2026 · Citations: 0

Automatic Metrics Long Horizon

Why do language agents fail on tasks they are capable of solving?
Hierarchical Reward Design from Language: Enhancing Alignment of Agent Behavior with Human Specifications
Zhiqin Qian, Ryan Diaz, Sangwon Seo, Vaibhav Unhelkar · Feb 20, 2026 · Citations: 0

Pairwise Preference Automatic Metrics Long Horizon

When training artificial intelligence (AI) to perform tasks, humans often care not only about whether a task is completed but also how it is performed.
Sink-Aware Pruning for Diffusion Language Models
Aidar Myrzakhan, Tianyi Li, Bowei Guo, Shengkun Tang, Zhiqiang Shen · Feb 19, 2026 · Citations: 0

Automatic Metrics Long Horizon

Diffusion Language Models (DLMs) incur high inference cost due to iterative denoising, motivating efficient pruning.
GLM-5: from Vibe Coding to Agentic Engineering
GLM-5-Team, Aohan Zeng, Xin Lv, Zhenyu Hou · Feb 17, 2026 · Citations: 0

Automatic Metrics Long Horizon

We present GLM-5, a next-generation foundation model designed to transition the paradigm of vibe coding to agentic engineering.
Think like a Scientist: Physics-guided LLM Agent for Equation Discovery
Jianke Yang, Ohm Venkatachalam, Mohammad Kianezhad, Sharvaree Vadgama, Rose Yu · Feb 12, 2026 · Citations: 0

Automatic Metrics Long Horizon

We introduce KeplerAgent, an agentic framework that explicitly follows this scientific reasoning process.
TSR: Trajectory-Search Rollouts for Multi-Turn RL of LLM Agents
Aladin Djuhera, Swanand Ravindra Kadhe, Farhan Ahmed, Heiko Ludwig, Holger Boche · Feb 12, 2026 · Citations: 0

Simulation Env Long Horizon

Advances in large language models (LLMs) are driving a shift toward using reinforcement learning (RL) to train agents from iterative, multi-turn interactions across tasks.
UI-Venus-1.5 Technical Report
Venus Team, Changlong Gao, Zhangxuan Gu, Yulin Liu, Xinyu Qiu · Feb 9, 2026 · Citations: 0

Simulation Env Long Horizon

GUI agents have emerged as a powerful paradigm for automating interactions in digital environments, yet achieving both broad generality and consistently strong task performance remains challenging.
AceGRPO: Adaptive Curriculum Enhanced Group Relative Policy Optimization for Autonomous Machine Learning Engineering
Yuzhu Cai, Zexi Liu, Xinyu Zhu, Cheng Wang, Siheng Chen · Feb 8, 2026 · Citations: 0

Automatic Metrics Long Horizon

Autonomous Machine Learning Engineering (MLE) requires agents to perform sustained, iterative optimization over long horizons.
APEX-Agents
Bertie Vidgen, Austin Mann, Abby Fennelly, John Wright Stanly, Lucas Rothman · Jan 20, 2026 · Citations: 0

Rubric Rating Expert Verification Simulation Env Long Horizon

We introduce the AI Productivity Index for Agents (APEX-Agents), a benchmark for assessing whether AI agents can execute long-horizon, cross-application tasks created by investment banking analysts, management consultants, and corporate law
Orthogonalized Policy Optimization: Policy Optimization as Orthogonal Projection in Hilbert Space
Wang Zixian · Jan 18, 2026 · Citations: 0

Automatic Metrics Long Horizon

Experiments on MATH benchmarks show that the Hilbert projection formulation prevents gradient saturation typical of KL-constrained methods.
Fast-ThinkAct: Efficient Vision-Language-Action Reasoning via Verbalizable Latent Planning
Chi-Pin Huang, Yunze Man, Zhiding Yu, Min-Hung Chen, Jan Kautz · Jan 14, 2026 · Citations: 0

Pairwise Preference Simulation Env Long Horizon

Fast-ThinkAct learns to reason efficiently with latent CoTs by distilling from a teacher, driven by a preference-guided objective to align manipulation trajectories that transfers both linguistic and visual planning capabilities for embodie
Conflict-Aware Fusion: Resolving Logic Inertia in Large Language Models via Structured Cognitive Priors
Qiming Bao, Xiaoxuan Fu, Michael Witbrock · Dec 6, 2025 · Citations: 0

Automatic Metrics Long Horizon

We present a controlled evaluation framework consisting of four stress tests: (1) rule deletion (redundant vs.
Stabilizing Off-Policy Training for Long-Horizon LLM Agent via Turn-Level Importance Sampling and Clipping-Triggered Normalization
Chenliang Li, Adel Elmahdy, Alex Boyd, Zhongruo Wang, Siliang Zeng · Nov 25, 2025 · Citations: 0

Automatic Metrics Long Horizon

Reinforcement learning (RL) algorithms such as PPO and GRPO are widely used to train large language models (LLMs) for multi-turn agentic tasks.
Supervised Reinforcement Learning: From Expert Trajectories to Step-wise Reasoning
Yihe Deng, I-Hung Hsu, Jun Yan, Zifeng Wang, Rujun Han · Oct 29, 2025 · Citations: 0

Demonstrations Automatic Metrics Long Horizon

Beyond reasoning benchmarks, SRL generalizes effectively to agentic software engineering tasks, establishing it as a robust and versatile training framework for reasoning-oriented LLMs.
World Simulation with Video Foundation Models for Physical AI
NVIDIA, Arslan Ali, Junjie Bai, Maciej Bala · Oct 28, 2025 · Citations: 0

Simulation Env Long Horizon

These capabilities enable more reliable synthetic data generation, policy evaluation, and closed-loop simulation for robotics and autonomous systems.
MoMaGen: Generating Demonstrations under Soft and Hard Constraints for Multi-Step Bimanual Mobile Manipulation
Chengshu Li, Mengdi Xu, Arpit Bahety, Hang Yin, Yunfan Jiang · Oct 21, 2025 · Citations: 0

Demonstrations Simulation Env Long Horizon

Imitation learning from large-scale, diverse human demonstrations has been shown to be effective for training robots, but collecting such data is costly and time-consuming.
RHYTHM: Reasoning with Hierarchical Temporal Tokenization for Human Mobility
Haoyu He, Haozheng Luo, Yan Chen, Qi R. Wang · Sep 27, 2025 · Citations: 0

Automatic Metrics Long Horizon

Predicting human mobility is inherently challenging due to complex long-range dependencies and multi-scale periodic behaviors.
