HFEPX Hub

Long Horizon + General (Last 30 Days)

Updated from current HFEPX corpus (Mar 1, 2026). 24 papers are grouped in this hub page.

Read Full Context

Updated from current HFEPX corpus (Mar 1, 2026). 24 papers are grouped in this hub page. Common evaluation modes: Automatic Metrics, Simulation Env. Most common rater population: Domain Experts. Common annotation unit: Trajectory. Frequent quality control: Calibration. Frequently cited benchmark: ALFWorld. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Feb 24, 2026.

Papers: 24 Last published: Feb 24, 2026 Global RSS Tag RSS

Long HorizonGeneralLast 30d

Researcher Quick Triage

This hub is best used for protocol triage and replication planning from abstract-level evidence. Quality band: Developing .

All Sampled Papers (24) Replication-Ready Only (3)

High-Signal Coverage

100.0%

24 / 24 sampled papers are not low-signal flagged.

Replication-Ready Set

Benchmark + metric + eval mode explicitly present.

Judge/Human Comparability

Papers containing both `human_eval` and `llm_as_judge`.

3 papers are replication-ready (benchmark + metric + explicit evaluation mode).
0 papers support judge-vs-human agreement analysis.
1 papers report explicit quality controls (calibration/adjudication/IAA).

Primary action: Start with the top 2 papers in “Start Here”, then validate assumptions in the protocol matrix.

Why This Matters (Expanded)

Why This Matters For Eval Research

8.3% of papers report explicit human-feedback signals, led by pairwise preferences.
automatic metrics appears in 70.8% of papers in this hub.
ALFWorld is a recurring benchmark anchor for cross-paper comparisons in this page.

Protocol Notes (Expanded)

Protocol Takeaways

Most common quality-control signal is rater calibration (4.2% of papers).
Rater context is mostly domain experts, and annotation is commonly trajectory-level annotation; use this to scope replication staffing.
Stratify by benchmark (ALFWorld vs Ama-Bench) before comparing methods.

Benchmark Interpretation

ALFWorld appears in 4.2% of hub papers (1/24); use this cohort for benchmark-matched comparisons.
Ama-Bench appears in 4.2% of hub papers (1/24); use this cohort for benchmark-matched comparisons.

Metric Interpretation

accuracy is reported in 41.7% of hub papers (10/24); compare with a secondary metric before ranking methods.
cost is reported in 20.8% of hub papers (5/24); compare with a secondary metric before ranking methods.

Researcher Checklist (Expanded)

Researcher Checklist

Gap: Papers with explicit human feedback

Coverage is a replication risk (8.3% vs 45% target).
Gap: Papers reporting quality controls

Coverage is a replication risk (4.2% vs 30% target).
Moderate: Papers naming benchmarks/datasets

Coverage is usable but incomplete (25% vs 35% target).
Strong: Papers naming evaluation metrics

Coverage is strong (70.8% vs 35% target).
Gap: Papers with known rater population

Coverage is a replication risk (4.2% vs 35% target).
Moderate: Papers with known annotation unit

Coverage is usable but incomplete (33.3% vs 35% target).

Strengths

Agentic evaluation appears in 100% of papers.

Known Gaps

Only 4.2% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (4.2% coverage).

Suggested Next Analyses

Stratify by benchmark (ALFWorld vs Ama-Bench) before comparing methods.
Track metric sensitivity by reporting both accuracy and cost.
Add inter-annotator agreement checks when reproducing these protocols.

Recommended Queries (Expanded)

Recommended Queries

Benchmark Slice: ALFWorld Metric Slice: accuracy Recent High-Signal Papers

Start Here (Best First 6)

Ranked for protocol completeness (human signal, benchmark + metric anchors, quality controls, and judge/human overlap).

AMA-Bench: Evaluating Long-Horizon Memory for Agentic Applications
Feb 26, 2026 · Citations: 0 · Score: 6.0

HF: Not reported · Eval: Automatic Metrics · Benchmark: Ama Bench · Metric: Accuracy
D-COT: Disciplined Chain-of-Thought Learning for Efficient Reasoning in Small Language Models
Feb 25, 2026 · Citations: 0 · Score: 6.0

HF: Not reported · Eval: Automatic Metrics · Benchmark: MMLU · Metric: Accuracy
Search More, Think Less: Rethinking Long-Horizon Agentic Search for Efficiency and Generalization
Feb 26, 2026 · Citations: 0 · Score: 6.0

HF: Not reported · Eval: Automatic Metrics · Benchmark: GAIA · Metric: Accuracy
PMG: Parameterized Motion Generator for Human-like Locomotion Control
Feb 13, 2026 · Citations: 0 · Score: 5.5

HF: Not reported · Eval: Automatic Metrics · Benchmark: Not Reported · Metric: Calibration
SELAUR: Self Evolving LLM Agent via Uncertainty-aware Rewards
Feb 24, 2026 · Citations: 0 · Score: 4.5

HF: Not reported · Eval: Simulation Env · Benchmark: ALFWorld · Metric: Not Reported
Mobile-Agent-v3.5: Multi-platform Fundamental GUI Agents
Feb 15, 2026 · Citations: 0 · Score: 4.5

HF: Not reported · Eval: Simulation Env · Benchmark: WebArena · Metric: Not Reported

Protocol Matrix (Top 12)

Use this to quickly compare protocol ingredients instead of scanning long prose.

Paper	HF Signal	Eval Modes	Benchmarks	Metrics	QC
AMA-Bench: Evaluating Long-Horizon Memory for Agentic Applications Feb 26, 2026	No Not Reported	Automatic Metrics	Ama Bench	Accuracy	Not Reported
D-COT: Disciplined Chain-of-Thought Learning for Efficient Reasoning in Small Language Models Feb 25, 2026	No Not Reported	Automatic Metrics	MMLU , MMLU Pro	Accuracy	Not Reported
Search More, Think Less: Rethinking Long-Horizon Agentic Search for Efficiency and Generalization Feb 26, 2026	No Not Reported	Automatic Metrics	GAIA , BrowseComp	Accuracy , Latency	Not Reported
PMG: Parameterized Motion Generator for Human-like Locomotion Control Feb 13, 2026	No Not Reported	Automatic Metrics	Not Reported	Calibration	Calibration
SELAUR: Self Evolving LLM Agent via Uncertainty-aware Rewards Feb 24, 2026	No Not Reported	Simulation Env	ALFWorld , WebShop	Not Reported	Not Reported
Mobile-Agent-v3.5: Multi-platform Fundamental GUI Agents Feb 15, 2026	No Not Reported	Simulation Env	WebArena , OSWorld	Not Reported	Not Reported
ARLArena: A Unified Framework for Stable Agentic Reinforcement Learning Feb 25, 2026	No Not Reported	Simulation Env	Arlarena	Not Reported	Not Reported
Learning to Reason for Multi-Step Retrieval of Personal Context in Personalized Question Answering Feb 22, 2026	Yes Pairwise Preference	Not Reported	Not Reported	Not Reported	Not Reported
Hierarchical Reward Design from Language: Enhancing Alignment of Agent Behavior with Human Specifications Feb 20, 2026	Yes Pairwise Preference	Not Reported	Not Reported	Not Reported	Not Reported
LiLo-VLA: Compositional Long-Horizon Manipulation via Linked Object-Centric Policies Feb 25, 2026	No Not Reported	Simulation Env	Not Reported	Success rate	Not Reported
Beyond Words: Evaluating and Bridging Epistemic Divergence in User-Agent Interaction via Theory of Mind Feb 14, 2026	No Not Reported	Automatic Metrics	Not Reported	Task success	Not Reported
Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training Feb 26, 2026	No Not Reported	Automatic Metrics	Not Reported	Accuracy	Not Reported

Protocol Diff (Top Papers)

Fast side-by-side comparison for the highest-ranked papers in this hub.

Signal	AMA-Bench: Evaluating Long-Horizon Memory for Agent…	D-COT: Disciplined Chain-of-Thought Learning for Ef…	Search More, Think Less: Rethinking Long-Horizon Ag…
Human Feedback	Not reported	Not reported	Not reported
Evaluation Modes	Automatic Metrics	Automatic Metrics	Automatic Metrics
Benchmarks	Ama Bench	MMLU, MMLU Pro	GAIA, BrowseComp
Metrics	Accuracy	Accuracy	Accuracy, Latency
Quality Controls	Not reported	Not reported	Not reported
Rater Population	Domain Experts	Unknown	Unknown
Annotation Unit	Unknown	Trajectory	Unknown

Research Utility Snapshot

Human Feedback Mix

Pairwise Preference (2)

Evaluation Modes

Automatic Metrics (17)
Simulation Env (5)

Top Benchmarks

ALFWorld (1)
Ama Bench (1)
Arlarena (1)
BrowseComp (1)

Top Metrics

Accuracy (10)
Cost (5)
Latency (3)
Inference cost (2)

Rater Population Mix

Domain Experts (1)

Quality Controls

Calibration (1)

Coverage diagnostics (sample-based): human-feedback 8.3% · benchmarks 25.0% · metrics 66.7% · quality controls 4.2%.

Top Papers

SELAUR: Self Evolving LLM Agent via Uncertainty-aware Rewards
Dengjia Zhang, Xiaoou Liu, Lu Cheng, Yaqing Wang, Kenton Murray · Feb 24, 2026 · Citations: 0

Simulation Env Long Horizon

Large language models (LLMs) are increasingly deployed as multi-step decision-making agents, where effective reward design is essential for guiding learning.
Mobile-Agent-v3.5: Multi-platform Fundamental GUI Agents
Haiyang Xu, Xi Zhang, Haowei Liu, Junyang Wang, Zhaozai Zhu · Feb 15, 2026 · Citations: 0

Simulation Env Long Horizon

The paper introduces GUI-Owl-1.5, the latest native GUI agent model that features instruct/thinking variants in multiple sizes (2B/4B/8B/32B/235B) and supports a range of platforms (desktop, mobile, browser, and more) to enable cloud-edge…
PMG: Parameterized Motion Generator for Human-like Locomotion Control
Chenxi Han, Yuheng Min, Zihao Huang, Ao Hong, Hang Liu · Feb 13, 2026 · Citations: 0

Automatic Metrics Long Horizon

Recent advances in data-driven reinforcement learning and motion tracking have substantially improved humanoid locomotion, yet critical practical challenges remain.
AMA-Bench: Evaluating Long-Horizon Memory for Agentic Applications
Yujie Zhao, Boqin Yuan, Junbo Huang, Haocheng Yuan, Zhongming Yu · Feb 26, 2026 · Citations: 0

Automatic Metrics Long Horizon

To bridge this gap, we introduce AMA-Bench (Agent Memory with Any length), which evaluates long-horizon memory for LLMs in real agentic applications.
D-COT: Disciplined Chain-of-Thought Learning for Efficient Reasoning in Small Language Models
Shunsuke Ubukata · Feb 25, 2026 · Citations: 0

Automatic Metrics Long Horizon

In this study, we propose Disciplined Chain-of-Thought (D-CoT), a novel framework that enforces a structured reasoning process using control tags -- such as <TEMP_LOW> for fact-checking and <TEMP_HIGH> for multi-perspective exploration --…
ARLArena: A Unified Framework for Stable Agentic Reinforcement Learning
Xiaoxuan Wang, Han Zhang, Haixin Wang, Yidan Shi, Ruoyan Li · Feb 25, 2026 · Citations: 0

Simulation Env Long Horizon

Agentic reinforcement learning (ARL) has rapidly gained attention as a promising paradigm for training agents to solve complex, multi-step interactive tasks.
Learning to Reason for Multi-Step Retrieval of Personal Context in Personalized Question Answering
Maryam Amirizaniani, Alireza Salemi, Hamed Zamani · Feb 22, 2026 · Citations: 0

Pairwise Preference Long Horizon

Personalization in Question Answering (QA) requires answers that are both accurate and aligned with users' background, preferences, and historical context.
LiLo-VLA: Compositional Long-Horizon Manipulation via Linked Object-Centric Policies
Yue Yang, Shuo Cheng, Yu Fang, Homanga Bharadhwaj, Mingyu Ding · Feb 25, 2026 · Citations: 0

Simulation Env Long Horizon

We introduce a 21-task simulation benchmark consisting of two challenging suites: LIBERO-Long++ and Ultra-Long.
Hierarchical Reward Design from Language: Enhancing Alignment of Agent Behavior with Human Specifications
Zhiqin Qian, Ryan Diaz, Sangwon Seo, Vaibhav Unhelkar · Feb 20, 2026 · Citations: 0

Pairwise Preference Long Horizon

When training artificial intelligence (AI) to perform tasks, humans often care not only about whether a task is completed but also how it is performed.
Search More, Think Less: Rethinking Long-Horizon Agentic Search for Efficiency and Generalization
Qianben Chen, Tianrui Qin, King Zhu, Qiexiang Wang, Chengjun Yu · Feb 26, 2026 · Citations: 0

Automatic Metrics Long Horizon

Recent deep research agents primarily improve performance by scaling reasoning depth, but this leads to high inference cost and latency in search-intensive scenarios.
Efficient Hierarchical Any-Angle Path Planning on Multi-Resolution 3D Grids
Victor Reijgwart, Cesar Cadena, Roland Siegwart, Lionel Ott · Feb 24, 2026 · Citations: 0

Simulation Env Long Horizon

Hierarchical, multi-resolution volumetric mapping approaches are widely used to represent large and complex environments as they can efficiently capture their occupancy and connectivity information.
Beyond Words: Evaluating and Bridging Epistemic Divergence in User-Agent Interaction via Theory of Mind
Minyuan Ruan, Ziyue Wang, Kaiming Liu, Yunghwei Lai, Peng Li · Feb 14, 2026 · Citations: 0

Automatic Metrics Long Horizon

Large Language Models (LLMs) have developed rapidly and are widely applied to both general-purpose and professional tasks to assist human users.
ICON: Indirect Prompt Injection Defense for Agents based on Inference-Time Correction
Che Wang, Fuyao Zhang, Jiaming Zhang, Ziqi Zhang, Yinghui Wang · Feb 24, 2026 · Citations: 0

Automatic Metrics Long Horizon

Large Language Model (LLM) agents are susceptible to Indirect Prompt Injection (IPI) attacks, where malicious instructions in retrieved content hijack the agent's execution.
Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training
Tianle Xia, Ming Xu, Lingxiang Hu, Yiding Sun, Wenwei Li · Feb 26, 2026 · Citations: 0

Automatic Metrics Long Horizon

We propose Search-P1, a framework that introduces path-centric reward shaping for agentic RAG training, comprising two key components: (1) Path-Centric Reward, which evaluates the structural quality of reasoning trajectories through…
VIGiA: Instructional Video Guidance via Dialogue Reasoning and Retrieval
Diogo Glória-Silva, David Semedo, João Maglhães · Feb 22, 2026 · Citations: 0

Automatic Metrics Long Horizon

Our evaluation shows that VIGiA outperforms existing state-of-the-art models on all tasks in a conversational plan guidance setting, reaching over 90\% accuracy on plan-aware VQA.
DeepPresenter: Environment-Grounded Reflection for Agentic Presentation Generation
Hao Zheng, Guozhao Mo, Xinru Yan, Qianhao Yuan, Wenkai Zhang · Feb 26, 2026 · Citations: 0

Automatic Metrics Long Horizon

However, existing presentation agents often rely on predefined workflows and fixed templates.
Replacing Multi-Step Assembly of Data Preparation Pipelines with One-Step LLM Pipeline Generation for Table QA
Fengyu Li, Junhao Zhu, Kaishi Song, Lu Chen, Zhongming Yao · Feb 26, 2026 · Citations: 0

Automatic Metrics Long Horizon

Experiments on two benchmark datasets show that, with the same LLM backbone, Operation-R1 achieves average absolute accuracy gains of 9.55 and 6.08 percentage points over multi-step preparation baselines, with 79\% table compression and a…
How Do Latent Reasoning Methods Perform Under Weak and Strong Supervision?
Yingqian Cui, Zhenwei Dai, Bing He, Zhan Shi, Hui Liu · Feb 25, 2026 · Citations: 0

Automatic Metrics Long Horizon

First, we observe pervasive shortcut behavior, where they achieve high accuracy without relying on latent reasoning.
Hierarchical LLM-Based Multi-Agent Framework with Prompt Optimization for Multi-Robot Task Planning
Tomoya Kawabe, Rin Takano · Feb 25, 2026 · Citations: 0

Automatic Metrics Long Horizon

We present a hierarchical multi-agent LLM-based planner with prompt optimization: an upper layer decomposes tasks and assigns them to lower-layer agents, which generate PDDL problems solved by a classical planner.
Anatomy of Agentic Memory: Taxonomy and Empirical Analysis of Evaluation and System Limitations
Dongming Jiang, Yi Li, Songtao Wei, Jinxin Yang, Ayushi Kishore · Feb 22, 2026 · Citations: 0

Automatic Metrics Long Horizon

Agentic memory systems enable large language model (LLM) agents to maintain state across long interactions, supporting long-horizon reasoning and personalization beyond fixed context windows.
TabAgent: A Framework for Replacing Agentic Generative Components with Tabular-Textual Classifiers
Ido Levy, Eilam Shapira, Yinon Goldshtein, Avi Yaeli, Nir Mashkif · Feb 18, 2026 · Citations: 0

Automatic Metrics Long Horizon

We propose TabAgent, a framework for replacing generative decision components in closed-set selection tasks with a compact textual-tabular classifier trained on execution traces.
A Geometric Analysis of Small-sized Language Model Hallucinations
Emanuele Ricco, Elia Onofri, Lorenzo Cima, Stefano Cresci, Roberto Di Pietro · Feb 16, 2026 · Citations: 0

Automatic Metrics Long Horizon

Hallucinations -- fluent but factually incorrect responses -- pose a major challenge to the reliability of language models, especially in multi-step or agentic settings.
Think like a Scientist: Physics-guided LLM Agent for Equation Discovery
Jianke Yang, Ohm Venkatachalam, Mohammad Kianezhad, Sharvaree Vadgama, Rose Yu · Feb 12, 2026 · Citations: 0

Automatic Metrics Long Horizon

We introduce KeplerAgent, an agentic framework that explicitly follows this scientific reasoning process.
Provably Safe Generative Sampling with Constricting Barrier Functions
Darshan Gadginmath, Ahmed Allibhoy, Fabio Pasqualetti · Feb 24, 2026 · Citations: 0

Automatic Metrics Long Horizon

However, a critical gap remains for their deployment in safety-critical domains: the lack of formal guarantees that generated samples will satisfy hard constraints.

Related Hubs

Need human evaluators for your AI research? Scale annotation with expert AI Trainers.

Post a Job Get a Quote