HFEPX Hub

Long Horizon Papers

Updated from current HFEPX corpus (Apr 12, 2026). 324 papers are grouped in this hub page.

Common evaluation modes: Automatic Metrics, Simulation Env. Most common rater population: Domain Experts. Common annotation unit: Trajectory. Frequent quality control: Calibration. Frequently cited benchmark: GSM8K. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. The newest paper in this set is from Feb 15, 2026.

Papers: 324 · Last published: Feb 15, 2026

Long Horizon

Researcher Quick Triage

This hub is best used for protocol triage and replication planning from abstract-level evidence. Quality band: High.

Analysis blocks below are computed from the currently loaded sample (60 of 324 total papers in this hub).


High-Signal Coverage: 100.0% (60 / 60 sampled papers are not low-signal flagged).
Replication-Ready Set: benchmark + metric + eval mode explicitly present.
Judge/Human Comparability: papers containing both `human_eval` and `llm_as_judge`.

13 papers are replication-ready (benchmark + metric + explicit evaluation mode).
1 paper supports judge-vs-human agreement analysis.
6 papers report explicit quality controls (calibration/adjudication/IAA).
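The three triage counts above follow from simple field checks on each paper. A minimal sketch of those checks, assuming a hypothetical `PaperRecord` structure (the field names are illustrative and are not the hub's actual schema); the two sample rows are transcribed from the protocol matrix on this page:

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class PaperRecord:
    """Hypothetical record; field names are illustrative, not the hub's schema."""
    title: str
    benchmarks: Tuple[str, ...] = ()
    metrics: Tuple[str, ...] = ()
    eval_modes: Tuple[str, ...] = ()
    quality_controls: Tuple[str, ...] = ()


def is_replication_ready(paper: PaperRecord) -> bool:
    """Benchmark + metric + explicit evaluation mode are all present."""
    return bool(paper.benchmarks and paper.metrics and paper.eval_modes)


def supports_judge_human_comparison(paper: PaperRecord) -> bool:
    """Both human_eval and llm_as_judge appear among the evaluation modes."""
    return {"human_eval", "llm_as_judge"} <= set(paper.eval_modes)


def reports_quality_controls(paper: PaperRecord) -> bool:
    """At least one quality control (calibration, adjudication, IAA) is listed."""
    return bool(paper.quality_controls)


# Two rows transcribed from the protocol matrix on this page.
agenther = PaperRecord(
    title="AgentHER",
    benchmarks=("WebArena", "ToolBench"),
    metrics=("Precision", "Pass@1"),
    eval_modes=("human_eval", "llm_as_judge"),
)
raspref = PaperRecord(title="RASPRef", benchmarks=("GSM8K",))

for paper in (agenther, raspref):
    print(
        paper.title,
        is_replication_ready(paper),
        supports_judge_human_comparison(paper),
        reports_quality_controls(paper),
    )
```

Keeping the three checks as separate predicates makes it easy to count each cohort independently, since a paper can be replication-ready without reporting any quality control.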

Primary action: Start with the top 2 papers in “Start Here”, then validate assumptions in the protocol matrix.


Why This Matters For Eval Research

23.2% of papers report explicit human-feedback signals, led by pairwise preferences.
Automatic metrics appear in 45.4% of papers in this hub.
GSM8K is a recurring benchmark anchor for cross-paper comparisons on this page.

Protocol Takeaways

1 sampled paper reports both human evaluation and LLM-as-judge, supporting direct agreement checks.
The most common quality-control signal is rater calibration (1.2% of papers).
Raters are mostly domain experts, and annotation is commonly trajectory-level; use this to scope replication staffing.

Benchmark Interpretation

GSM8K appears in 4.3% of hub papers (9/324); use this cohort for benchmark-matched comparisons.
OSWorld appears in 2.4% of hub papers (5/324); use this cohort for benchmark-matched comparisons.

Metric Interpretation

Accuracy is reported in 37.4% of hub papers (79/324); compare with a secondary metric before ranking methods.
Cost is reported in 17.1% of hub papers (36/324); compare with a secondary metric before ranking methods.

Researcher Checklist

Gap: Papers with explicit human feedback
Coverage is a replication risk (23.2% vs 45% target).

Gap: Papers reporting quality controls
Coverage is a replication risk (2.8% vs 30% target).

Moderate: Papers naming benchmarks/datasets
Coverage is usable but incomplete (31.8% vs 35% target).

Strong: Papers naming evaluation metrics
Coverage is strong (76.3% vs 35% target).

Gap: Papers with known rater population
Coverage is a replication risk (10% vs 35% target).

Strong: Papers with known annotation unit
Coverage is strong (51.2% vs 35% target).
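The checklist labels each item by comparing observed coverage with a target. The page does not state the exact cutoffs, so the sketch below is an assumption: it treats coverage at or above the target as "Strong", coverage at or above 80% of the target as "Moderate", and anything lower as "Gap". The function name `coverage_band` and the `moderate_ratio` parameter are hypothetical; the 0.8 value was chosen only because it reproduces the six labels listed above.

```python
def coverage_band(coverage_pct: float, target_pct: float, moderate_ratio: float = 0.8) -> str:
    """Classify coverage against a target.

    The moderate_ratio cutoff is an assumption, not a documented rule of the hub.
    """
    if target_pct <= 0:
        raise ValueError("target_pct must be positive")
    ratio = coverage_pct / target_pct
    if ratio >= 1.0:
        return "Strong"
    if ratio >= moderate_ratio:
        return "Moderate"
    return "Gap"


# (coverage %, target %) pairs copied from the checklist on this page.
checklist = {
    "explicit human feedback": (23.2, 45.0),
    "quality controls": (2.8, 30.0),
    "benchmarks/datasets": (31.8, 35.0),
    "evaluation metrics": (76.3, 35.0),
    "known rater population": (10.0, 35.0),
    "known annotation unit": (51.2, 35.0),
}

for item, (coverage, target) in checklist.items():
    print(f"{coverage_band(coverage, target)}: {item} ({coverage}% vs {target}% target)")
```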

Strengths

Contains both human-eval and LLM-as-judge protocols for head-to-head methodology comparison.
Agentic evaluation appears in 100% of papers.

Known Gaps

Only 2.8% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (10% coverage).
LLM-as-judge appears without enough inter-annotator agreement reporting.

Suggested Next Analyses

Compare papers that report both human_eval and llm_as_judge to quantify judge-human agreement drift.
Stratify by benchmark (GSM8K vs OSWorld) before comparing methods.
Track metric sensitivity by reporting both accuracy and cost.
Add inter-annotator agreement checks when reproducing these protocols.
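One common way to quantify judge-human agreement is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The page does not name a specific statistic, so kappa is a choice made here for illustration. A minimal sketch, assuming each trajectory receives one categorical label from a human rater and one from an LLM judge; the labels in the usage example are made up:

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(human: Sequence[Hashable], judge: Sequence[Hashable]) -> float:
    """Cohen's kappa between two raters who labeled the same items."""
    if len(human) != len(judge) or len(human) == 0:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(human)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    human_counts = Counter(human)
    judge_counts = Counter(judge)
    labels = set(human_counts) | set(judge_counts)
    expected = sum(human_counts[label] * judge_counts[label] for label in labels) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; treat as full agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical trajectory-level labels, for illustration only.
human_labels = ["pass", "pass", "fail", "fail"]
judge_labels = ["pass", "fail", "fail", "fail"]
print(cohens_kappa(human_labels, judge_labels))
```

Computing the statistic separately per benchmark slice (for example GSM8K vs OSWorld) is one way to see whether agreement shifts across task types.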

Recommended Queries

Judge vs Human Agreement
Benchmark Slice: GSM8K
Metric Slice: accuracy
Recent High-Signal Papers

Start with These 3

Use these when you need one protocol anchor, one benchmark anchor, and one recent comparison point before reading the wider hub.

Strongest protocol reference

AgentHER: Hindsight Experience Replay for LLM Agent Trajectory Relabe…

Highest protocol score with explicit human/eval signal plus WebArena.

Strongest benchmark reference

TraceSafe: A Systematic Assessment of LLM Guardrails on Multi-Step To…

Tracesafe-Bench with accuracy gives a fast comparison anchor.

Strongest recent paper

AD-Bench: A Real-World, Trajectory-Aware Advertising Analytics Benchm…

Useful for current practice scanning; published Feb 15, 2026.

Start Here (Best First 6)

Ranked for protocol completeness (human signal, benchmark + metric anchors, quality controls, and judge/human overlap).

AgentHER: Hindsight Experience Replay for LLM Agent Trajectory Relabeling
Mar 22, 2026 · Citations: 0 · Score: 10.0
HF: Demonstrations · Eval: Human Eval, LLM As Judge · Benchmark: WebArena · Metric: Precision

TraceSafe: A Systematic Assessment of LLM Guardrails on Multi-Step Tool-Calling Trajectories
Apr 8, 2026 · Citations: 0 · Score: 8.0
HF: Red Team · Eval: Automatic Metrics · Benchmark: Tracesafe Bench · Metric: Accuracy

AD-Bench: A Real-World, Trajectory-Aware Advertising Analytics Benchmark for LLM Agents
Feb 15, 2026 · Citations: 0 · Score: 7.5
HF: Expert Verification · Eval: Simulation Env · Benchmark: Ad Bench · Metric: Pass@1

\$OneMillion-Bench: How Far are Language Agents from Human Experts?
Mar 9, 2026 · Citations: 0 · Score: 7.5
HF: Rubric Rating · Eval: Automatic Metrics · Benchmark: Onemillion Bench · Metric: Accuracy

PEARL: Self-Evolving Assistant for Time Management with Reinforcement Learning
Jan 17, 2026 · Citations: 0 · Score: 7.5
HF: Pairwise Preference · Eval: Automatic Metrics · Benchmark: Calconflictbench · Metric: Error rate

Entropy trajectory shape predicts LLM reasoning reliability: A diagnostic study of uncertainty dynamics in chain-of-thought
Mar 19, 2026 · Citations: 0 · Score: 7.5
HF: Not reported · Eval: Automatic Metrics · Benchmark: GSM8K · Metric: Accuracy

Protocol Matrix (Top 12)

Use this to quickly compare protocol ingredients instead of scanning long prose.

Paper	HF Signal	Eval Modes	Benchmarks	Metrics	QC
AgentHER: Hindsight Experience Replay for LLM Agent Trajectory Relabeling (Mar 22, 2026)	Yes (Demonstrations)	Human Eval, LLM As Judge	WebArena, ToolBench	Precision, Pass@1	Not Reported
TraceSafe: A Systematic Assessment of LLM Guardrails on Multi-Step Tool-Calling Trajectories (Apr 8, 2026)	Yes (Red Team)	Automatic Metrics	Tracesafe Bench	Accuracy	Not Reported
AD-Bench: A Real-World, Trajectory-Aware Advertising Analytics Benchmark for LLM Agents (Feb 15, 2026)	Yes (Expert Verification)	Simulation Env	Ad Bench	Pass@1, Pass@3	Not Reported
\$OneMillion-Bench: How Far are Language Agents from Human Experts? (Mar 9, 2026)	Yes (Rubric Rating)	Automatic Metrics	Onemillion Bench	Accuracy, Coherence	Not Reported
PEARL: Self-Evolving Assistant for Time Management with Reinforcement Learning (Jan 17, 2026)	Yes (Pairwise Preference)	Automatic Metrics	Calconflictbench	Error rate	Not Reported
Entropy trajectory shape predicts LLM reasoning reliability: A diagnostic study of uncertainty dynamics in chain-of-thought (Mar 19, 2026)	No	Automatic Metrics	GSM8K	Accuracy, Calibration error	Calibration
Guideline-Grounded Evidence Accumulation for High-Stakes Agent Verification (Mar 3, 2026)	Yes (Expert Verification)	Automatic Metrics	Not Reported	Brier score, AUROC	Calibration
From Days to Minutes: An Autonomous AI Agent Achieves Reliable Clinical Triage in Remote Patient Monitoring (Mar 10, 2026)	Yes (Expert Verification)	Automatic Metrics	Not Reported	Accuracy, Kappa	Adjudication
Let's Think in Two Steps: Mitigating Agreement Bias in MLLMs with Self-Grounded Verification (Jul 15, 2025)	Yes (Pairwise Preference)	Automatic Metrics, Simulation Env	VisualWebArena, OSWorld	Accuracy	Not Reported
When Users Change Their Mind: Evaluating Interruptible Agents in Long-Horizon Web Navigation (Apr 1, 2026)	Yes (Critique Edit)	Simulation Env	WebArena, Interruptbench	Not Reported	Not Reported
RASPRef: Retrieval-Augmented Self-Supervised Prompt Refinement for Large Reasoning Models (Mar 27, 2026)	Yes (Critique Edit)	Not Reported	GSM8K	Not Reported	Not Reported
Kernel-Smith: A Unified Recipe for Evolutionary Kernel Optimization (Mar 30, 2026)	Yes (Critique Edit)	Not Reported	Kernelbench	Not Reported	Not Reported

Protocol Diff (Top Papers)

Fast side-by-side comparison for the highest-ranked papers in this hub.

Signal	AgentHER: Hindsight Experience Replay for LLM Agent…	TraceSafe: A Systematic Assessment of LLM Guardrail…	AD-Bench: A Real-World, Trajectory-Aware Advertisin…
Human Feedback	Demonstrations	Red Team	Expert Verification
Evaluation Modes	Human Eval, LLM As Judge	Automatic Metrics	Simulation Env
Benchmarks	WebArena, ToolBench	Tracesafe Bench	Ad Bench
Metrics	Precision, Pass@1	Accuracy	Pass@1, Pass@3
Quality Controls	Not reported	Not reported	Not reported
Rater Population	Unknown	Unknown	Domain Experts
Annotation Unit	Trajectory	Trajectory	Trajectory

Research Utility Snapshot

Human Feedback Mix

Pairwise Preference (23)
Demonstrations (10)
Rubric Rating (8)
Expert Verification (6)

Evaluation Modes

Automatic Metrics (147)
Simulation Env (42)
LLM As Judge (7)
Human Eval (5)

Top Benchmarks

GSM8K (9)
OSWorld (5)
ALFWorld (4)
WebArena (4)

Top Metrics

Accuracy (79)
Cost (36)
Latency (15)
Recall (12)

Rater Population Mix

Domain Experts (21)

Quality Controls

Calibration (4)
Adjudication (2)

Coverage diagnostics (sample-based): human-feedback 63.3% · benchmarks 43.3% · metrics 58.3% · quality controls 10.0%.
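The counts in the snapshot above can be converted to shares once a denominator is fixed. The page does not state which base each percentage uses, so the sketch below takes the total as a parameter; `share_pct` is a hypothetical helper name. With the hub total of 324 as the base, it reproduces the 45.4% (automatic metrics) and 1.2% (calibration) figures quoted earlier on this page; other figures may rest on a different base.

```python
def share_pct(count: int, total: int, digits: int = 1) -> float:
    """Percentage of `total` represented by `count`, rounded to `digits` places."""
    if total <= 0:
        raise ValueError("total must be positive")
    if count < 0:
        raise ValueError("count must be non-negative")
    return round(100.0 * count / total, digits)


HUB_TOTAL = 324  # papers grouped in this hub, as stated at the top of the page

# Counts copied from the Evaluation Modes and Quality Controls lists.
print("Automatic Metrics:", share_pct(147, HUB_TOTAL), "%")
print("Calibration:", share_pct(4, HUB_TOTAL), "%")
```

Passing the sample size (60) instead of the hub total gives sample-based shares, which is why sample-based and hub-level percentages should not be compared directly.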

Top Papers

AD-Bench: A Real-World, Trajectory-Aware Advertising Analytics Benchmark for LLM Agents
Lingxiang Hu, Yiding Sun, Tianle Xia, Wenwei Li, Ming Xu · Feb 15, 2026 · Citations: 0

Expert Verification Simulation Env Long Horizon

While Large Language Model (LLM) agents have achieved remarkable progress in complex reasoning tasks, evaluating their performance in real-world environments has become a critical problem.
AgentHER: Hindsight Experience Replay for LLM Agent Trajectory Relabeling
Liang Ding · Mar 22, 2026 · Citations: 0

Demonstrations Human Eval LLM As Judge Long Horizon

LLM agents fail on the majority of real-world tasks -- GPT-4o succeeds on fewer than 15% of WebArena navigation tasks and below 55% pass@1 on ToolBench (Zhou et al., 2024; Qin et al., 2024) -- yet every failed trajectory is routinely…
Let's Think in Two Steps: Mitigating Agreement Bias in MLLMs with Self-Grounded Verification
Moises Andrade, Joonhyuk Cha, Brandon Ho, Vriksha Srihari, Karmesh Yadav · Jul 15, 2025 · Citations: 0

Pairwise Preference Automatic Metrics Simulation Env Long Horizon

We evaluate MLLM verifiers across web navigation, computer use, and robotics, spanning 13+ models, 28+ designs, and thousands of trajectories from diverse agents.
Guideline-Grounded Evidence Accumulation for High-Stakes Agent Verification
Yichi Zhang, Nabeel Seedat, Yinpeng Dong, Peng Cui, Jun Zhu · Mar 3, 2026 · Citations: 0

Expert Verification Automatic Metrics Long Horizon

As LLM-powered agents have been used for high-stakes decision-making, such as clinical diagnosis, it becomes critical to develop reliable verification of their decisions to facilitate trustworthy deployment.
\$OneMillion-Bench: How Far are Language Agents from Human Experts?
Qianyu Yang, Yang Liu, Jiaqi Li, Jun Bai, Hao Chen · Mar 9, 2026 · Citations: 0

Rubric Rating Automatic Metrics Tool Use

To this end, we introduce \$OneMillion-Bench, a benchmark of 400 expert-curated tasks spanning Law, Finance, Industry, Healthcare, and Natural Science, built to evaluate agents across economically consequential scenarios.
From Days to Minutes: An Autonomous AI Agent Achieves Reliable Clinical Triage in Remote Patient Monitoring
Seunghwan Kim, Tiffany H. Kung, Heena Verma, Dilan Edirisinghe, Kaveh Sedehi · Mar 10, 2026 · Citations: 0

Expert Verification Automatic Metrics Long Horizon

Results: Against a human majority-vote standard (N=467), the agent achieved 95.8% emergency sensitivity and 88.5% sensitivity for all actionable alerts (85.7% specificity).
TraceSafe: A Systematic Assessment of LLM Guardrails on Multi-Step Tool-Calling Trajectories
Yen-Shan Chen, Sian-Yao Huang, Cheng-Lin Yang, Yun-Nung Chen · Apr 8, 2026 · Citations: 0

Red Team Automatic Metrics Long Horizon

As large language models (LLMs) evolve from static chatbots into autonomous agents, the primary vulnerability surface shifts from final outputs to intermediate execution traces.
PEARL: Self-Evolving Assistant for Time Management with Reinforcement Learning
Bingxuan Li, Jeonghwan Kim, Cheng Qian, Xiusi Chen, Eitan Anzenberg · Jan 17, 2026 · Citations: 0

Pairwise Preference Automatic Metrics Long Horizon

To enable a systematic study of this question, we introduce CalConflictBench, a benchmark for long-horizon calendar conflict resolution.
When Users Change Their Mind: Evaluating Interruptible Agents in Long-Horizon Web Navigation
Henry Peng Zou, Chunyu Miao, Wei-Chieh Huang, Yankai Chen, Yue Zhou · Apr 1, 2026 · Citations: 0

Critique Edit Simulation Env Long Horizon

As LLM agents transition from short, static problem solving to executing complex, long-horizon tasks in dynamic environments, the ability to handle user interruptions, such as adding requirement or revising goals, during mid-task execution…
LifeSim: Long-Horizon User Life Simulator for Personalized Assistant Evaluation
Feiyu Duan, Xuanjing Huang, Zhongyu Wei · Mar 12, 2026 · Citations: 0

Pairwise Preference Simulation Env Long Horizon

However, existing benchmarks for personalized assistants remain misaligned with real-world user-assistant interactions, failing to capture the complexity of external contexts and users' cognitive states.
Step-CoT: Stepwise Visual Chain-of-Thought for Medical Visual Question Answering
Lin Fan, Yafei Ou, Zhipeng Deng, Pengyu Dai, Hou Chongxian · Mar 14, 2026 · Citations: 0

Expert Verification Automatic Metrics Long Horizon

Benchmark: github.com/hahaha111111/Step-CoT.
Decoupling Strategy and Execution in Task-Focused Dialogue via Goal-Oriented Preference Optimization
Jingyi Xu, Xingyu Ren, Zhoupeng Shou, Yumeng Zhang, Zhiqiang You · Jan 24, 2026 · Citations: 0

Pairwise Preference Automatic Metrics Long Horizon

To address this, we propose Goal-Oriented Preference Optimization (GOPO), a hierarchical reinforcement learning framework that decouples strategy planning from response generation via an Expert Agent and a Customer Service Agent.
APEX-Agents
Bertie Vidgen, Austin Mann, Abby Fennelly, John Wright Stanly, Lucas Rothman · Jan 20, 2026 · Citations: 0

Rubric Rating Expert Verification Automatic Metrics Long Horizon

We introduce the AI Productivity Index for Agents (APEX-Agents), a benchmark for assessing whether AI agents can execute long-horizon, cross-application tasks created by investment banking analysts, management consultants, and corporate…
Fast-ThinkAct: Efficient Vision-Language-Action Reasoning via Verbalizable Latent Planning
Chi-Pin Huang, Yunze Man, Zhiding Yu, Min-Hung Chen, Jan Kautz · Jan 14, 2026 · Citations: 0

Pairwise Preference Simulation Env Long Horizon

Fast-ThinkAct learns to reason efficiently with latent CoTs by distilling from a teacher, driven by a preference-guided objective to align manipulation trajectories that transfers both linguistic and visual planning capabilities for embodie
SOLE-R1: Video-Language Reasoning as the Sole Reward for On-Robot Reinforcement Learning
Philip Schroeder, Thomas Weng, Karl Schmeckpeper, Eric Rosen, Stephen Hart · Mar 30, 2026 · Citations: 0

Demonstrations Simulation Env Long Horizon

To address this limitation, we introduce SOLE-R1 (Self-Observing LEarner), a video-language reasoning model explicitly designed to serve as the sole reward signal for online RL.
Build, Judge, Optimize: A Blueprint for Continuous Improvement of Multi-Agent Consumer Assistants
Alejandro Breen Herrera, Aayush Sheth, Steven G. Xu, Zhucheng Zhan, Charles Wright · Mar 3, 2026 · Citations: 0

Pairwise Preference Rubric Rating LLM As Judge Simulation Env Long Horizon

Conversational shopping assistants (CSAs) represent a compelling application of agentic AI, but moving from prototype to production reveals two underexplored challenges: how to evaluate multi-turn interactions and how to optimize tightly…
RAPTOR: A Foundation Policy for Quadrotor Control
Jonas Eschmann, Dario Albani, Giuseppe Loianno · Sep 15, 2025 · Citations: 0

Demonstrations Simulation Env Long Horizon

Humans are remarkably data-efficient when adapting to new unseen conditions, like driving a new car.
Error Notebook-Guided, Training-Free Part Retrieval in 3D CAD Assemblies via Vision-Language Models
Yunqing Liu, Nan Zhang, Zhiming Tan · Sep 1, 2025 · Citations: 0

Pairwise Preference Automatic Metrics Long Horizon

We additionally contribute a CAD dataset with human preference annotations.
Signals: Trajectory Sampling and Triage for Agentic Interactions
Shuguang Chen, Adil Hafeez, Salman Paracha · Apr 1, 2026 · Citations: 0

Pairwise Preference Automatic Metrics Long Horizon

We propose a lightweight, signal-based framework for triaging agentic interaction trajectories.
Learning When to Act: Interval-Aware Reinforcement Learning with Predictive Temporal Structure
Davide Di Gioia · Mar 23, 2026 · Citations: 0

Pairwise Preference Automatic Metrics Long Horizon

Autonomous agents operating in continuous environments must decide not only what to do, but when to act.
MIND: Unified Inquiry and Diagnosis RL with Criteria Grounded Clinical Supports for Psychiatric Consultation
Guoyi Li, Shihao Xu, Jiatong Ma, Yunyun Han, Jianhua Chen · Mar 4, 2026 · Citations: 0

Rubric Rating Automatic Metrics Long Horizon

Large language models (LLMs) have advanced medical dialogue systems, yet psychiatric consultation poses substantially higher demands due to subjective ambiguity and comorbidity complexity: an agent must continuously extract…
Mastering Multi-Drone Volleyball through Hierarchical Co-Self-Play Reinforcement Learning
Ruize Zhang, Sirui Xiang, Zelai Xu, Feng Gao, Shilong Ji · May 7, 2025 · Citations: 0

Demonstrations Automatic Metrics Long Horizon

The task is turn-based, multi-agent, and physically grounded, posing significant challenges due to its long-horizon dependencies, tight inter-agent coupling, and the underactuated dynamics of quadrotors.
Beyond Rows to Reasoning: Agentic Retrieval for Multimodal Spreadsheet Understanding and Editing
Anmol Gulati, Sahil Sen, Waqar Sarguroh, Kevin Paul · Mar 6, 2026 · Citations: 0

Human Eval Automatic Metrics Long Horizon

We introduce Beyond Rows to Reasoning (BRTR), a multimodal agentic framework for spreadsheet understanding that replaces single-pass retrieval with an iterative tool-calling loop, supporting end-to-end Excel workflows from complex analysis…
Do Compact SSL Backbones Matter for Audio Deepfake Detection? A Controlled Study with RAPTOR
Ajinkya Kulkarni, Sandipana Dowerah, Atharva Kulkarni, Tanel Alumäe, Mathew Magimai Doss · Mar 6, 2026 · Citations: 0

Pairwise Preference Long Horizon

We present RAPTOR, Representation Aware Pairwise-gated Transformer for Out-of-domain Recognition, a controlled study of compact SSL backbones from the HuBERT and WavLM within a unified pairwise-gated fusion detector, evaluated across 14…
Entropy trajectory shape predicts LLM reasoning reliability: A diagnostic study of uncertainty dynamics in chain-of-thought
Xinghao Zhao · Mar 19, 2026 · Citations: 0

Automatic Metrics Long Horizon

Chain-of-thought (CoT) reasoning improves LLM accuracy, yet detecting failures cheaply remains elusive.
ReDAct: Uncertainty-Aware Deferral for LLM Agents
Dzianis Piatrashyn, Nikita Kotelevskii, Kirill Grishchenkov, Nikita Glazkov, Ivan Nasonov · Apr 8, 2026 · Citations: 0

Simulation Env Long Horizon

Recently, LLM-based agents have become increasingly popular across many applications, including complex sequential decision-making problems.
Embodied Task Planning via Graph-Informed Action Generation with Large Language Model
Xiang Li, Ning Yan, Masood Mortazavi · Jan 29, 2026 · Citations: 0

Simulation Env Long Horizon

We propose GiG, a novel planning framework that structures embodied agents' memory using a Graph-in-Graph architecture.
RASPRef: Retrieval-Augmented Self-Supervised Prompt Refinement for Large Reasoning Models
Rahul Soni · Mar 27, 2026 · Citations: 0

Critique Edit Long Horizon

Recent reasoning-focused language models such as DeepSeek R1 and OpenAI o1 have demonstrated strong performance on structured reasoning benchmarks including GSM8K, MATH, and multi-hop question answering tasks.
KLong: Training LLM Agent for Extremely Long-horizon Tasks
Yue Liu, Yingwei Ma, Yibo Miao, Yanhao Li, Yuchong Xie · Feb 19, 2026 · Citations: 0

Rubric Rating Long Horizon

Then, we introduce Research-Factory, an automated pipeline that generates high-quality training data by collecting research papers and constructing evaluation rubrics.
Watch and Learn: Learning to Use Computers from Online Videos
Chan Hee Song, Yiwen Song, Palash Goyal, Yu Su, Oriana Riva · Oct 6, 2025 · Citations: 0

Demonstrations Long Horizon

Computer-using agents (CUAs) must plan task workflows across diverse and evolving applications, yet progress is limited by the lack of large-scale, high-quality training data.
Efficient Agent Training for Computer Use
Yanheng He, Jiahe Jin, Pengfei Liu · May 20, 2025 · Citations: 0

Demonstrations Long Horizon

We introduce PC Agent-E, an efficient agent training framework that significantly reduces reliance on large-scale human demonstrations.
From Control to Foresight: Simulation as a New Paradigm for Human-Agent Collaboration
Gaole He, Brian Y. Lim · Mar 12, 2026 · Citations: 0

Pairwise Preference Simulation Env Long Horizon

Large Language Models (LLMs) are increasingly used to power autonomous agents for complex, multi-step tasks.
MoMaGen: Generating Demonstrations under Soft and Hard Constraints for Multi-Step Bimanual Mobile Manipulation
Chengshu Li, Mengdi Xu, Arpit Bahety, Hang Yin, Yunfan Jiang · Oct 21, 2025 · Citations: 0

Demonstrations Simulation Env Long Horizon

Imitation learning from large-scale, diverse human demonstrations has been shown to be effective for training robots, but collecting such data is costly and time-consuming.
Tutoring Large Language Models to be Domain-adaptive, Precise, and Safe
Somnath Banerjee · Feb 14, 2026 · Citations: 0

Pairwise Preference Long Horizon

The methodological trajectory moves from classical supervised adaptation for task-specific demands to decoding-time alignment for safety, finally leveraging human feedback and preference modeling to achieve sociolinguistic acuity.
DeceptGuard: A Constitutional Oversight Framework For Detecting Deception in LLM Agents
Snehasis Mukhopadhyay · Mar 14, 2026 · Citations: 0

Automatic Metrics Simulation Env Long Horizon

We introduce DECEPTGUARD, a unified framework that systematically compares three monitoring regimes: black-box monitors (actions and outputs only), CoT-aware monitors (additionally observing the agent's chain-of-thought reasoning trace),…
LIT-RAGBench: Benchmarking Generator Capabilities of Large Language Models in Retrieval-Augmented Generation
Koki Itai, Shunichi Hasegawa, Yuta Yamamoto, Gouki Minegishi, Masaki Otsuki · Mar 6, 2026 · Citations: 0

LLM As Judge Automatic Metrics Long Horizon

To bridge the gap between existing evaluations and practical use, we introduce LIT-RAGBench (the Logic, Integration, Table, Reasoning, and Abstention RAG Generator Benchmark), which defines five categories: Integration, Reasoning, Logic,…
HUMORCHAIN: Theory-Guided Multi-Stage Reasoning for Interpretable Multimodal Humor Generation
Jiajun Zhang, Shijia Luo, Ruikang Zhang, Qi Su · Nov 21, 2025 · Citations: 0

Pairwise Preference Automatic Metrics Long Horizon

Humor, as both a creative human activity and a social binding mechanism, has long posed a major challenge for AI generation.
BEAT: Visual Backdoor Attacks on VLM-based Embodied Agents via Contrastive Trigger Learning
Qiusi Zhan, Hyeonjeong Ha, Rui Yang, Sirui Xu, Hanyang Chen · Oct 31, 2025 · Citations: 0

Pairwise Preference Automatic Metrics Long Horizon

We introduce BEAT, the first framework to inject such visual backdoors into VLM-based embodied agents using objects in the environments as triggers.
FrameRef: A Framing Dataset and Simulation Testbed for Modeling Bounded Rational Information Health
Victor De Lima, Jiqun Liu, Grace Hui Yang · Feb 17, 2026 · Citations: 0

Human Eval Simulation Env Long Horizon

Within this framework, we construct framing-sensitive agent personas by fine-tuning language models with framing-conditioned loss attenuation, inducing targeted biases while preserving overall task competence.
FrontierFinance: A Long-Horizon Computer-Use Benchmark of Real-World Financial Tasks
Michael Krumdick, Varshini Reddy, Shivani Chaudhary, William Day, Maarij Ahmed · Apr 7, 2026 · Citations: 0

Rubric Rating Long Horizon

To address this, we introduce FrontierFinance, a long-horizon benchmark of 25 complex financial modeling tasks across five core finance models, requiring an average of over 18 hours of skilled human labor per task to complete.
TREX: Trajectory Explanations for Multi-Objective Reinforcement Learning
Dilina Rajapakse, Juan C. Rosero, Ivana Dusparic · Mar 23, 2026 · Citations: 0

Pairwise Preference Long Horizon

Multi-Objective Reinforcement Learning (MORL) addresses this limitation by enabling agents to optimize several objectives simultaneously, explicitly reasoning about trade-offs between them.
MedXIAOHE: A Comprehensive Recipe for Building Medical MLLMs
Baorong Shi, Bo Cui, Boyuan Jiang, Deli Yu, Fang Qian · Feb 13, 2026 · Citations: 0

Pairwise Preference Rubric Rating Long Horizon

MedXIAOHE achieves state-of-the-art performance across diverse medical benchmarks and surpasses leading closed-source multimodal systems on multiple capabilities.
DEER: A Benchmark for Evaluating Deep Research Agents on Expert Report Generation
Janghoon Han, Heegyu Kim, Changho Lee, Dahm Lee, Min Hyung Park · Dec 19, 2025 · Citations: 0

Rubric Rating Expert Verification Long Horizon

However, evaluating such reports remains challenging: report quality is multifaceted, making it difficult to determine what to assess and by what criteria; LLM-based judges may miss errors that require domain expertise to identify; and…
Supervised Reinforcement Learning: From Expert Trajectories to Step-wise Reasoning
Yihe Deng, I-Hung Hsu, Jun Yan, Zifeng Wang, Rujun Han · Oct 29, 2025 · Citations: 0

Demonstrations Long Horizon

Beyond reasoning benchmarks, SRL generalizes effectively to agentic software engineering tasks, establishing it as a robust and versatile training framework for reasoning-oriented LLMs.
DataSTORM: Deep Research on Large-Scale Databases using Exploratory Data Analysis and Data Storytelling
Shicheng Liu, Yucheng Jiang, Sajid Farook, Camila Nicollier Sanchez, David Fernando Castro Pena · Apr 7, 2026 · Citations: 0

Human Eval Long Horizon

Deep research with Large Language Model (LLM) agents is emerging as a powerful paradigm for multi-step information discovery, synthesis, and analysis.
Kernel-Smith: A Unified Recipe for Evolutionary Kernel Optimization
He Du, Qiming Ge, Jiakai Hu, Aijun Yang, Zheng Cai · Mar 30, 2026 · Citations: 0

Critique Edit Long Horizon

We present Kernel-Smith, a framework for high-performance GPU kernel and operator generation that combines a stable evaluation-driven evolutionary agent with an evolution-oriented post-training recipe.
Toward Ultra-Long-Horizon Agentic Science: Cognitive Accumulation for Machine Learning Engineering
Xinyu Zhu, Yuzhu Cai, Zexi Liu, Bingyang Zheng, Cheng Wang · Jan 15, 2026 · Citations: 0

Simulation Env Long Horizon

The advancement of artificial intelligence toward agentic science is currently bottlenecked by the challenge of ultra-long-horizon autonomy, the ability to sustain strategic coherence and iterative correction over experimental cycles spanni
MAPO: Mixed Advantage Policy Optimization for Long-Horizon Multi-Turn Dialogue
Naifan Zhang, Ruihan Sun, Jinwei Su, Hengjie Yang, Zhengyuan Pan · Mar 6, 2026 · Citations: 0

LLM As Judge Simulation Env Long Horizon

We propose a critic-free and efficient RL algorithm named MAPO that leverages dense process feedback from a judge model and propagates long-horizon effects through Monte Carlo returns.
Mind the Shift: Decoding Monetary Policy Stance from FOMC Statements with Large Language Models
Yixuan Tang, Yi Yang · Mar 15, 2026 · Citations: 0

LLM As Judge Automatic Metrics Long Horizon

Across four LLM backbones, DCS consistently outperforms supervised probes and LLM-as-judge baselines, achieving up to 71.1% accuracy on sentence-level hawkish--dovish classification.
JAWS: Enhancing Long-term Rollout of Neural PDE Solvers via Spatially-Adaptive Jacobian Regularization
Fengxiang Nie, Yasuhiro Suzuki · Mar 4, 2026 · Citations: 0

Automatic Metrics Simulation Env Long Horizon

Experiments demonstrate that JAWS serves as an effective spectral pre-conditioner for trajectory optimization, allowing short-horizon, memory-efficient training to match the accuracy of long-horizon baselines.
The Trinity of Consistency as a Defining Principle for General World Models
Jingxuan Wei, Siyuan Li, Yuhang Xu, Zheng Sun, Junjie Jiang · Feb 26, 2026 · Citations: 0

Simulation Env Long Horizon

To complement this conceptual framework, we introduce CoW-Bench, a benchmark centered on multi-frame reasoning and generation scenarios.
Hierarchy-of-Groups Policy Optimization for Long-Horizon Agentic Tasks
Shuo He, Lang Feng, Qi Wei, Xin Cheng, Lei Feng · Feb 26, 2026 · Citations: 0

Simulation Env Long Horizon

Group-based reinforcement learning (RL), such as GRPO, has advanced the capabilities of large language models on long-horizon agentic tasks.
SELAUR: Self Evolving LLM Agent via Uncertainty-aware Rewards
Dengjia Zhang, Xiaoou Liu, Lu Cheng, Yaqing Wang, Kenton Murray · Feb 24, 2026 · Citations: 0

Simulation Env Long Horizon

Large language models (LLMs) are increasingly deployed as multi-step decision-making agents, where effective reward design is essential for guiding learning.
Mobile-Agent-v3.5: Multi-platform Fundamental GUI Agents
Haiyang Xu, Xi Zhang, Haowei Liu, Junyang Wang, Zhaozai Zhu · Feb 15, 2026 · Citations: 0

Simulation Env Long Horizon

The paper introduces GUI-Owl-1.5, the latest native GUI agent model that features instruct/thinking variants in multiple sizes (2B/4B/8B/32B/235B) and supports a range of platforms (desktop, mobile, browser, and more) to enable cloud-edge…
LEXam: Benchmarking Legal Reasoning on 340 Law Exams
Yu Fan, Jingwei Ni, Jakob Merane, Yang Tian, Yoan Hermstrüwer · May 19, 2025 · Citations: 0

LLM As Judge Automatic Metrics Long Horizon

To address this, we introduce LEXam, a novel benchmark derived from 340 law exams spanning 116 law school courses across a range of subjects and degree levels.
PMG: Parameterized Motion Generator for Human-like Locomotion Control
Chenxi Han, Yuheng Min, Zihao Huang, Ao Hong, Hang Liu · Feb 13, 2026 · Citations: 0

Automatic Metrics Long Horizon

Recent advances in data-driven reinforcement learning and motion tracking have substantially improved humanoid locomotion, yet critical practical challenges remain.
Reasoning or Rhetoric? An Empirical Analysis of Moral Reasoning Explanations in Large Language Models
Aryan Kasat, Smriti Singh, Aman Chadha, Vinija Jain · Mar 23, 2026 · Citations: 0

LLM As Judge Long Horizon

Using an LLM-as-judge scoring pipeline validated across three judge models, we classify more than 600 responses from 13 LLMs spanning a range of architectures, parameter scales, and training regimes across six classical moral dilemmas, and…
Spatio-Temporal Attention Enhanced Multi-Agent DRL for UAV-Assisted Wireless Networks with Limited Communications
Che Chen, Lanhua Li, Shimin Gong, Yu Zhao, Yuming Fang · Mar 23, 2026 · Citations: 0

Simulation Env Long Horizon

To maximize the overall throughput, we first propose a delay-tolerant multi-agent deep reinforcement learning (MADRL) algorithm that integrates a delay-penalized reward to encourage information sharing among UAVs, while jointly optimizing…
AMemGym: Interactive Memory Benchmarking for Assistants in Long-Horizon Conversations
Cheng Jiayang, Dongyu Ru, Lin Qiu, Yiyang Li, Xuezhi Cao · Mar 2, 2026 · Citations: 0

Simulation Env Long Horizon

Long-horizon interactions between users and LLM-based assistants necessitate effective memory management, yet current approaches face challenges in training and evaluation of memory.
Self-Correcting VLA: Online Action Refinement via Sparse World Imagination
Chenyv Liu, Wentao Tan, Lei Zhu, Fengling Li, Jingjing Li · Feb 25, 2026 · Citations: 0

Simulation Env Long Horizon

Reinforcement learning enhances physical grounding through exploration yet typically relies on external reward signals that remain isolated from the agent's internal states.
