Metric Hub

Cost + Long Horizon Metric Papers

Updated from the current HFEPX corpus (Feb 27, 2026). This metric page groups 12 papers. Common evaluation modes: Automatic Metrics, Simulation Env. Most common rater population: Domain Experts. Common annotation unit: Trajectory. Frequently cited benchmark: ALFWorld. Common metric signal: cost. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. The newest paper in this set is from Feb 26, 2026.

Papers: 12 · Last published: Feb 26, 2026

Research Narrative

Grounded narrative · Model: deterministic-grounded · Source: persisted

Updated from the current HFEPX corpus (Feb 27, 2026). This page tracks 12 papers for Cost + Long Horizon Metric Papers. The dominant protocol signals are automatic metrics and simulation environments; the benchmarks named are ALFWorld and BrowseComp (one paper each), and the main metrics are cost and accuracy. Use the grounded sections below to prioritize reproducible protocol choices, benchmark-matched comparisons, and judge-vs-human evaluation checks.

Why This Matters For Eval Research

Automatic metrics appear in 83.3% of papers in this hub (10/12).

Evidence: Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching; Replacing Multi-Step Assembly of Data Preparation Pipelines with One-Step LLM Pipeline Generation for Table QA; Search More, Think Less: Rethinking Long-Horizon Agentic Search for Efficiency and Generalization; How Do Latent Reasoning Methods Perform Under Weak and Strong Supervision?

ALFWorld is the benchmark anchor for cross-paper comparisons on this page, although only one hub paper (1/12) names it.

Evidence: Embodied Task Planning via Graph-Informed Action Generation with Large Language Model; Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching; Replacing Multi-Step Assembly of Data Preparation Pipelines with One-Step LLM Pipeline Generation for Table QA; Search More, Think Less: Rethinking Long-Horizon Agentic Search for Efficiency and Generalization

Long-horizon tasks appear in 100% of papers, indicating demand for agentic evaluation.

Evidence: Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching; Replacing Multi-Step Assembly of Data Preparation Pipelines with One-Step LLM Pipeline Generation for Table QA; Search More, Think Less: Rethinking Long-Horizon Agentic Search for Efficiency and Generalization; How Do Latent Reasoning Methods Perform Under Weak and Strong Supervision?

Protocol Takeaways

No paper in this slice reports quality controls (0% coverage); prioritize papers with explicit calibration or adjudication steps.

Evidence: Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching; Replacing Multi-Step Assembly of Data Preparation Pipelines with One-Step LLM Pipeline Generation for Table QA; Search More, Think Less: Rethinking Long-Horizon Agentic Search for Efficiency and Generalization; How Do Latent Reasoning Methods Perform Under Weak and Strong Supervision?

Where the rater population is reported (8.3% of papers), raters are domain experts, and annotation is most often at the trajectory level; use this to scope replication staffing.

Evidence: SWE-Protégé: Learning to Selectively Collaborate With an Expert Unlocks Small Language Models as Software Engineering Agents; Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching; Replacing Multi-Step Assembly of Data Preparation Pipelines with One-Step LLM Pipeline Generation for Table QA; Search More, Think Less: Rethinking Long-Horizon Agentic Search for Efficiency and Generalization

Stratify by benchmark (ALFWorld vs BrowseComp) before comparing methods.

Evidence: Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching; Replacing Multi-Step Assembly of Data Preparation Pipelines with One-Step LLM Pipeline Generation for Table QA; Search More, Think Less: Rethinking Long-Horizon Agentic Search for Efficiency and Generalization; How Do Latent Reasoning Methods Perform Under Weak and Strong Supervision?
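
The stratification advice above can be sketched in a few lines of Python. This is a minimal illustration, not the hub's own tooling: the method names, scores, and costs are placeholder values invented for the example, and `stratify_by_benchmark` is a hypothetical helper. The point is only that methods are ranked inside one benchmark at a time and never across benchmarks.

```python
from collections import defaultdict

# Hypothetical result rows: (method, benchmark, success_rate, cost_usd).
# All numbers are placeholders for illustration, not values from any paper.
results = [
    ("method_a", "ALFWorld", 0.72, 1.40),
    ("method_b", "ALFWorld", 0.68, 0.90),
    ("method_a", "BrowseComp", 0.31, 3.10),
    ("method_c", "BrowseComp", 0.35, 2.20),
]


def stratify_by_benchmark(rows):
    """Group rows by benchmark and rank methods within each group only."""
    strata = defaultdict(list)
    for method, benchmark, score, cost in rows:
        strata[benchmark].append((method, score, cost))
    # Sort each stratum by success rate, best first.
    return {
        benchmark: sorted(group, key=lambda row: row[1], reverse=True)
        for benchmark, group in strata.items()
    }


strata = stratify_by_benchmark(results)
for benchmark, ranked in strata.items():
    print(benchmark, [method for method, _, _ in ranked])
```

Keeping cost in each row means a within-benchmark ranking can later be re-sorted by cost without mixing scores from different task distributions.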

Benchmark Interpretation

ALFWorld appears in 8.3% of hub papers (1/12); use this cohort for benchmark-matched comparisons.
BrowseComp appears in 8.3% of hub papers (1/12); use this cohort for benchmark-matched comparisons.

Metric Interpretation

cost is reported in 100% of hub papers (12/12); compare with a secondary metric before ranking methods.
accuracy is reported in 41.7% of hub papers (5/12); compare with a secondary metric before ranking methods.
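
One way to act on the "compare with a secondary metric" advice is to keep only methods that are not dominated on both cost and accuracy, instead of ranking on cost alone. The sketch below assumes lower cost and higher accuracy are better; the method names and numbers are hypothetical, and `pareto_front` is an illustrative helper rather than anything the hub provides.

```python
def pareto_front(points):
    """Return names of methods not dominated on (lower cost, higher accuracy).

    A method is dominated if another method is at least as good on both
    metrics and strictly better on at least one.
    """
    front = []
    for name, cost, acc in points:
        dominated = any(
            (c <= cost and a >= acc) and (c < cost or a > acc)
            for other, c, a in points
            if other != name
        )
        if not dominated:
            front.append(name)
    return front


# Hypothetical (method, cost, accuracy) triples; placeholder values only.
points = [
    ("method_a", 1.0, 0.60),
    ("method_b", 2.0, 0.75),
    ("method_c", 2.5, 0.70),  # costs more than method_b and is less accurate
    ("method_d", 0.8, 0.40),
]

front = pareto_front(points)
print(front)
```

Here `method_c` drops out because `method_b` is both cheaper and more accurate, while the other three each represent a distinct cost-accuracy trade-off.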

Abstract Evidence Highlights

Direct snippets from paper abstracts to ground protocol and benchmark interpretation.

Protocol Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching

Human-eval abstract signal: Reasoning with large language models often benefits from generating multiple chains-of-thought, but existing aggregation strategies are typically trajectory-level (e.g., selecting the best trace or voting on the final answer), discarding useful intermediate work from...

Protocol Replacing Multi-Step Assembly of Data Preparation Pipelines with One-Step LLM Pipeline Generation for Table QA

Human-eval abstract signal: Table Question Answering (TQA) aims to answer natural language questions over structured tables.

Benchmark Embodied Task Planning via Graph-Informed Action Generation with Large Language Model

ALFWorld benchmark signal: We evaluate our framework on three embodied planning benchmarks-Robotouille Synchronous, Robotouille Asynchronous, and ALFWorld.

Benchmark Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching

ALFWorld benchmark signal: Across math reasoning benchmarks, we find that step-level recombination is most beneficial on harder problems, and ablations highlight the importance of the final AR solver in converting stitched but imperfect rationales into accurate answers.

Metric Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching

cost metric signal: Using low-confidence diffusion sampling with parallel, independent rollouts, our training-free framework improves average accuracy by up to 23.8% across six math and coding tasks.

Metric Replacing Multi-Step Assembly of Data Preparation Pipelines with One-Step LLM Pipeline Generation for Table QA

cost metric signal: Experiments on two benchmark datasets show that, with the same LLM backbone, Operation-R1 achieves average absolute accuracy gains of 9.55 and 6.08 percentage points over multi-step preparation baselines, with 79% table compression and a...

Protocol Search More, Think Less: Rethinking Long-Horizon Agentic Search for Efficiency and Generalization

Protocol abstract signal: Recent deep research agents primarily improve performance by scaling reasoning depth, but this leads to high inference cost and latency in search-intensive scenarios.

Protocol How Do Latent Reasoning Methods Perform Under Weak and Strong Supervision?

Protocol abstract signal: Latent reasoning has been recently proposed as a reasoning paradigm and performs multi-step reasoning through generating steps in the latent space instead of the textual space.

Researcher Checklist

Close gap on Papers with explicit human feedback. Coverage is a replication risk (0% vs 45% target).
Close gap on Papers reporting quality controls. Coverage is a replication risk (0% vs 30% target).
Tighten coverage on Papers naming benchmarks/datasets. Coverage is usable but incomplete (25% vs 35% target).
Maintain strength on Papers naming evaluation metrics. Coverage is strong (100% vs 35% target).
Close gap on Papers with known rater population. Coverage is a replication risk (8.3% vs 35% target).
Maintain strength on Papers with known annotation unit. Coverage is strong (50% vs 35% target).
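
The checklist labels can be reproduced with a simple threshold rule. The rule below is an assumption chosen because it matches the six rows listed above; the page does not state how the labels are actually assigned, and `coverage_status` is a hypothetical helper.

```python
def coverage_status(coverage_pct, target_pct):
    """Label coverage against a target.

    Assumed rule (not documented by the hub): at or above target is
    "strong", at least half the target is "usable but incomplete",
    anything lower is "replication risk".
    """
    if coverage_pct >= target_pct:
        return "strong"
    if coverage_pct >= 0.5 * target_pct:
        return "usable but incomplete"
    return "replication risk"


# (check name, coverage %, target %) as listed in the checklist.
checks = [
    ("explicit human feedback", 0.0, 45.0),
    ("quality controls", 0.0, 30.0),
    ("benchmarks/datasets named", 25.0, 35.0),
    ("evaluation metrics named", 100.0, 35.0),
    ("known rater population", 8.3, 35.0),
    ("known annotation unit", 50.0, 35.0),
]

statuses = {name: coverage_status(cov, target) for name, cov, target in checks}
for name, status in statuses.items():
    print(f"{name}: {status}")
```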

Known Limitations

No papers (0%) report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (8.3% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Links

Benchmark Slice: ALFWorld - Prioritizes benchmark-specific protocol comparisons.
Metric Slice: cost - Finds papers where reported metrics are directly comparable.
Recent High-Signal Papers - Keeps the hub connected to the latest HFEPX corpus updates.

automatic_metrics vs simulation_env

both=0, automatic_metrics only=10, simulation_env only=2

No paper uses both Automatic Metrics and Simulation Env; 10 use only Automatic Metrics and 2 use only Simulation Env.
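
The overlap counts above can be recomputed from per-paper evaluation-mode tags. The sketch below uses the tags shown in the paper list on this page; the dictionary keys are shortened paper titles, and `mode_overlap` is a hypothetical helper, not part of HFEPX.

```python
# Evaluation-mode tags per paper, keyed by a shortened title.
modes = {
    "Reward-Guided Stitching": {"automatic_metrics"},
    "One-Step LLM Pipeline Generation for Table QA": {"automatic_metrics"},
    "Search More, Think Less": {"automatic_metrics"},
    "Latent Reasoning Under Weak and Strong Supervision": {"automatic_metrics"},
    "SWE-Protégé": {"automatic_metrics"},
    "Heart Failure Telemonitoring": {"automatic_metrics"},
    "Cross-Modal Neuron Sharing": {"automatic_metrics"},
    "Watermarking LLM Agent Trajectories": {"automatic_metrics"},
    "Sink-Aware Pruning": {"automatic_metrics"},
    "TabAgent": {"automatic_metrics"},
    "Embodied Task Planning": {"simulation_env"},
    "Aerial Vision-Language Navigation": {"simulation_env"},
}


def mode_overlap(paper_modes, left, right):
    """Count papers tagged with both modes, only the left, or only the right."""
    counts = {"both": 0, "left_only": 0, "right_only": 0}
    for tags in paper_modes.values():
        if left in tags and right in tags:
            counts["both"] += 1
        elif left in tags:
            counts["left_only"] += 1
        elif right in tags:
            counts["right_only"] += 1
    return counts


overlap = mode_overlap(modes, "automatic_metrics", "simulation_env")
print(overlap)
```

With zero papers in the intersection, no paper in this hub lets a reader compare an automatic metric against a simulation-environment outcome on the same system.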

Benchmark Brief

ALFWorld

Coverage: 1 paper (8.3%) mentions ALFWorld.

Examples: Embodied Task Planning via Graph-Informed Action Generation with Large Language Model

Benchmark Brief

BrowseComp

Coverage: 1 paper (8.3%) mentions BrowseComp.

Examples: Search More, Think Less: Rethinking Long-Horizon Agentic Search for Efficiency and Generalization

Benchmark Brief

GAIA

Coverage: 1 paper (8.3%) mentions GAIA.

Examples: Search More, Think Less: Rethinking Long-Horizon Agentic Search for Efficiency and Generalization

Metric Brief

cost

Coverage: 12 papers (100%) mention cost.

Examples: Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching; Replacing Multi-Step Assembly of Data Preparation Pipelines with One-Step LLM Pipeline Generation for Table QA; Search More, Think Less: Rethinking Long-Horizon Agentic Search for Efficiency and Generalization

Metric Brief

accuracy

Coverage: 5 papers (41.7%) mention accuracy.

Metric Brief

latency

Coverage: 4 papers (33.3%) mention latency.

Examples: Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching; Search More, Think Less: Rethinking Long-Horizon Agentic Search for Efficiency and Generalization; SWE-Protégé: Learning to Selectively Collaborate With an Expert Unlocks Small Language Models as Software Engineering Agents

Most Cited In This Hub

Fast path to methods with the strongest citation traction in this scope.

Papers: Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching; Replacing Multi-Step Assembly of Data Preparation Pipelines with One-Step LLM Pipeline Generation for Table QA; Search More, Think Less: Rethinking Long-Horizon Agentic Search for Efficiency and Generalization

Most Recent

Fast path to latest protocol changes and newly published evaluation setups.

Best Protocol Detail

Papers with explicit rater/unit metadata and quality-control signals for reproducibility.

Top Papers Reporting This Metric

Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching
Roy Miles, Aysim Toker, Andreea-Maria Oncescu, Songcen Xu, Jiankang Deng · Feb 26, 2026

Automatic Metrics Math, Coding

This modular pipeline separates exploration (diffusion) from evaluation and solution synthesis, avoiding monolithic unified hybrids while preserving broad search.
Replacing Multi-Step Assembly of Data Preparation Pipelines with One-Step LLM Pipeline Generation for Table QA
Fengyu Li, Junhao Zhu, Kaishi Song, Lu Chen, Zhongming Yao · Feb 26, 2026

Automatic Metrics General

Experiments on two benchmark datasets show that, with the same LLM backbone, Operation-R1 achieves average absolute accuracy gains of 9.55 and 6.08 percentage points over multi-step preparation baselines, with 79% table compression and a 2
Search More, Think Less: Rethinking Long-Horizon Agentic Search for Efficiency and Generalization
Qianben Chen, Tianrui Qin, King Zhu, Qiexiang Wang, Chengjun Yu · Feb 26, 2026

Automatic Metrics General

Recent deep research agents primarily improve performance by scaling reasoning depth, but this leads to high inference cost and latency in search-intensive scenarios.
How Do Latent Reasoning Methods Perform Under Weak and Strong Supervision?
Yingqian Cui, Zhenwei Dai, Bing He, Zhan Shi, Hui Liu · Feb 25, 2026

Automatic Metrics Coding

Latent reasoning has been recently proposed as a reasoning paradigm and performs multi-step reasoning through generating steps in the latent space instead of the textual space.
SWE-Protégé: Learning to Selectively Collaborate With an Expert Unlocks Small Language Models as Software Engineering Agents
Patrick Tser Jern Kon, Archana Pradeep, Ang Chen, Alexander P. Ellis, Warren Hunt · Feb 25, 2026

Automatic Metrics Coding

Our approach combines supervised fine-tuning on expert-augmented trajectories with agentic reinforcement learning that explicitly discourages degenerative looping and unproductive expert collaboration.
Continuous Telemonitoring of Heart Failure using Personalised Speech Dynamics
Yue Pan, Xingyao Wang, Hanyue Zhang, Liwei Liu, Changxin Li · Feb 23, 2026

Automatic Metrics Medicine, Coding

The model's high sensitivity was further corroborated by additional follow-up data, confirming its efficacy in predicting HF deterioration and its potential to secure patient safety in remote, home-based settings.
Do LLMs and VLMs Share Neurons for Inference? Evidence and Mechanisms of Cross-Modal Transfer
Chenhang Cui, An Zhang, Yuxin Chen, Gelei Deng, Jingnan Zheng · Feb 22, 2026

Automatic Metrics Math, Coding

Across diverse mathematics and perception benchmarks, SNRF consistently enhances LVLM inference performance while preserving perceptual capabilities.
Watermarking LLM Agent Trajectories
Wenlong Meng, Chen Gong, Terry Yue Zhuo, Fan Zhang, Kecen Li · Feb 21, 2026

Automatic Metrics Math, Coding

LLM agents rely heavily on high-quality trajectory data to guide their problem-solving behaviors, yet producing such data requires substantial task design, high-capacity model generation, and manual filtering.
Sink-Aware Pruning for Diffusion Language Models
Aidar Myrzakhan, Tianyi Li, Bowei Guo, Shengkun Tang, Zhiqiang Shen · Feb 19, 2026

Automatic Metrics Coding

Diffusion Language Models (DLMs) incur high inference cost due to iterative denoising, motivating efficient pruning.
TabAgent: A Framework for Replacing Agentic Generative Components with Tabular-Textual Classifiers
Ido Levy, Eilam Shapira, Yinon Goldshtein, Avi Yaeli, Nir Mashkif · Feb 18, 2026

Automatic Metrics General

Agentic systems, AI architectures that autonomously execute multi-step workflows to achieve complex goals, are often built using repeated large language model (LLM) calls for closed-set decision tasks such as routing, shortlisting, gating,
Embodied Task Planning via Graph-Informed Action Generation with Large Language Model
Xiang Li, Ning Yan, Masood Mortazavi · Jan 29, 2026

Simulation Env Coding

While Large Language Models (LLMs) have demonstrated strong zero-shot reasoning capabilities, their deployment as embodied agents still faces fundamental challenges in long-horizon planning.
Aerial Vision-Language Navigation with a Unified Framework for Spatial, Temporal and Embodied Reasoning
Huilin Xu, Zhuoyang Liu, Yixiang Luomei, Feng Xu · Dec 9, 2025

Simulation Env General

Extensive experiments on the AerialVLN and OpenFly benchmark validate the effectiveness of our method.
