
Metric Hub

Cost + Simulation Env Metric Papers

Updated from the current HFEPX corpus (Feb 27, 2026). This metric page groups 12 papers. Common evaluation modes: Simulation Env, Automatic Metrics. Most common rater population: Domain Experts. Common annotation unit: Freeform. Frequently cited benchmark: Retrieval. Common metric signal: cost. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. The newest paper in this set is from Feb 24, 2026.

Papers: 12 · Last published: Feb 24, 2026

Research Narrative

Grounded narrative · Model: deterministic-grounded · Source: persisted

Updated from the current HFEPX corpus (Feb 27, 2026). This page tracks 12 papers for Cost + Simulation Env Metric Papers. Dominant protocol signals include simulation environments, automatic metrics, and LLM-as-judge, with frequent benchmark focus on Retrieval and ALFWorld, and metric focus on cost and accuracy. Use the grounded sections below to prioritize reproducible protocol choices, benchmark-matched comparisons, and judge-vs-human evaluation checks.

Why This Matters For Eval Research

Protocol Takeaways

Benchmark Interpretation

  • Retrieval appears in 16.7% of hub papers (2/12); use this cohort for benchmark-matched comparisons.
  • ALFWorld appears in 8.3% of hub papers (1/12); use this cohort for benchmark-matched comparisons.

Metric Interpretation

  • cost is reported in 100% of hub papers (12/12); compare with a secondary metric before ranking methods.
  • accuracy is reported in 16.7% of hub papers (2/12); compare with a secondary metric before ranking methods.

Abstract Evidence Highlights

Direct snippets from paper abstracts to ground protocol and benchmark interpretation.

Human-eval abstract signal: Off-dynamics offline reinforcement learning (RL) aims to learn a policy for a target domain using limited target data and abundant source data collected under different transition dynamics.

Human-eval abstract signal: LLMs are increasingly being used for complex problems which are not necessarily resolved in a single response, but require interacting with an environment to acquire information.

LLM-judge abstract signal: We evaluate EpidemIQs across several different epidemic scenarios, measuring computational cost, workflow reliability, task success rate, and LLM-as-Judge and human expert reviews to estimate the overall quality and technical correctness of the generated results.

Retrieval benchmark signal: We evaluate our framework on three embodied planning benchmarks: Robotouille Synchronous, Robotouille Asynchronous, and ALFWorld.

Protocol abstract signal: We introduce the Massive Audio Embedding Benchmark (MAEB), a large-scale benchmark covering 30 tasks across speech, music, environmental sounds, and cross-modal audio-text reasoning in 100+ languages.

Protocol abstract signal: To reach frontier-level intelligence, we design a scalable reinforcement learning framework that combines verifiable signals with preference feedback, while remaining stable under large-scale off-policy training, enabling consistent self-improvement across mathematics, code, and tool use.

Protocol abstract signal: Model collaboration -- systems where multiple language models (LMs) collaborate -- combines the strengths of diverse models with cost in loading multiple LMs.

Protocol abstract signal: While Large Language Models (LLMs) have demonstrated strong zero-shot reasoning capabilities, their deployment as embodied agents still faces fundamental challenges in long-horizon planning.

Researcher Checklist

  • Close gap on Papers with explicit human feedback. Coverage is a replication risk (16.7% vs 45% target).
  • Close gap on Papers reporting quality controls. Coverage is a replication risk (0% vs 30% target).
  • Tighten coverage on Papers naming benchmarks/datasets. Coverage is usable but incomplete (33.3% vs 35% target).
  • Maintain strength on Papers naming evaluation metrics. Coverage is strong (100% vs 35% target).
  • Close gap on Papers with known rater population. Coverage is a replication risk (16.7% vs 35% target).
  • Close gap on Papers with known annotation unit. Coverage is a replication risk (16.7% vs 35% target).
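The checklist labels above ("replication risk", "usable but incomplete", "strong") appear to follow a simple threshold rule on coverage versus target. A minimal sketch of one rule consistent with the labels shown; the 5-percentage-point margin is an assumption on my part, not documented by the hub:

```python
# Sketch: triage a coverage figure against its target, reproducing the
# checklist labels above. The 5-point margin is an assumed cutoff between
# "usable but incomplete" and "replication risk".

def triage(coverage: float, target: float, margin: float = 5.0) -> str:
    """Classify coverage (in percent) relative to a target (in percent)."""
    if coverage >= target:
        return "strong"
    if target - coverage <= margin:
        return "usable but incomplete"
    return "replication risk"

# Figures taken from the checklist above.
for name, cov, tgt in [
    ("explicit human feedback", 16.7, 45),
    ("quality controls", 0, 30),
    ("benchmarks/datasets named", 33.3, 35),
    ("evaluation metrics named", 100, 35),
]:
    print(f"{name}: {triage(cov, tgt)}")
```

With these inputs the rule reproduces the labels in the checklist: the 33.3% vs 35% item falls inside the margin and reads "usable but incomplete", while the larger gaps read "replication risk".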


Suggested Reading Order

  1. Localized Dynamics-Aware Domain Adaption for Off-Dynamics Offline Reinforcement Learning

    Start here for detailed protocol reporting, including rater and quality-control evidence.

  2. Calibrate-Then-Act: Cost-Aware Exploration in LLM Agents

    Continues the detailed protocol reporting, including rater and quality-control evidence.

  3. MAEB: Massive Audio Embedding Benchmark

    Continues the detailed protocol reporting, including rater and quality-control evidence.

  4. EpidemIQs: Prompt-to-Paper LLM Agents for Epidemic Modeling and Analysis

    Include an LLM-as-judge paper to assess judge design and agreement assumptions.

  5. Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active Parameters

    Adds simulation environments with pairwise preferences for broader coverage within this hub.

  6. The Single-Multi Evolution Loop for Self-Improving Model Collaboration Systems

    Adds simulation environments for broader coverage within this hub.

  7. Embodied Task Planning via Graph-Informed Action Generation with Large Language Model

    Adds simulation environments for broader coverage within this hub.

  8. DIAL: Direct Iterative Adversarial Learning for Realistic Multi-Turn Dialogue Simulation

    Adds automatic metrics for broader coverage within this hub.

Known Limitations

  • No papers (0%) report quality controls; prioritize calibration/adjudication evidence.
  • Rater population is under-specified (16.7% coverage).
  • Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Links

llm_as_judge vs automatic_metrics

both=0, left_only=1, right_only=2

0 papers use both LLM-as-Judge and Automatic Metrics.

simulation_env vs automatic_metrics

both=2, left_only=10, right_only=0

2 papers use both Simulation Env and Automatic Metrics.

simulation_env vs llm_as_judge

both=1, left_only=11, right_only=0

1 paper uses both Simulation Env and LLM-as-Judge.
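The protocol-overlap rows above (both / left_only / right_only) can be derived by set arithmetic over per-paper protocol tags. A minimal sketch; the paper IDs and tag assignments are hypothetical, chosen only so the output mirrors the simulation_env vs llm_as_judge row:

```python
# Sketch: compute both / left_only / right_only overlap counts from
# per-paper protocol tag sets. Paper IDs below are hypothetical.

def overlap(left: set, right: set) -> dict:
    """Count papers tagged with both protocols, only the left, or only the right."""
    return {
        "both": len(left & right),
        "left_only": len(left - right),
        "right_only": len(right - left),
    }

# Hypothetical 12-paper hub: all papers use simulation environments,
# one of them also uses LLM-as-judge.
simulation_env = {f"paper_{i}" for i in range(12)}
llm_as_judge = {"paper_4"}

print(overlap(simulation_env, llm_as_judge))
# -> {'both': 1, 'left_only': 11, 'right_only': 0}
```

The same function reproduces the other rows given the corresponding tag sets (e.g. an automatic-metrics set of two papers, both inside the simulation-env set, yields both=2, left_only=10, right_only=0).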

Benchmark Brief

Retrieval

Coverage: 2 papers (16.7%) mention Retrieval.

Examples: Calibrate-Then-Act: Cost-Aware Exploration in LLM Agents, Embodied Task Planning via Graph-Informed Action Generation with Large Language Model

Benchmark Brief

ALFWorld

Coverage: 1 paper (8.3%) mentions ALFWorld.

Examples: Embodied Task Planning via Graph-Informed Action Generation with Large Language Model

Benchmark Brief

BrowseComp

Coverage: 1 paper (8.3%) mentions BrowseComp.

Examples: Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active Parameters

Metric Brief

coherence

Coverage: 1 paper (8.3%) mentions coherence.

Examples: Embodied Task Planning via Graph-Informed Action Generation with Large Language Model
