
Metric Hub

Cost + Simulation Env Metric Papers

Updated from the current HFEPX corpus (Feb 27, 2026). This metric page groups 12 papers. Common evaluation modes: Simulation Env, Automatic Metrics. Most common rater population: Domain Experts. Common annotation unit: Freeform. Frequently cited benchmark: Retrieval. Common metric signal: cost. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. The newest paper in this set is from Feb 24, 2026.

Papers: 12 · Last published: Feb 24, 2026

Research Narrative

Grounded narrative · Model: deterministic-grounded · Source: persisted

Updated from the current HFEPX corpus (Feb 27, 2026). This page tracks 12 papers for Cost + Simulation Env Metric Papers. Dominant protocol signals include simulation environments, automatic metrics, and LLM-as-judge, with frequent benchmark focus on Retrieval and ALFWorld, and metric focus on cost and accuracy. Use the grounded sections below to prioritize reproducible protocol choices, benchmark-matched comparisons, and judge-vs-human evaluation checks.

Why This Matters For Eval Research

Protocol Takeaways

Benchmark Interpretation

  • Retrieval appears in 16.7% of hub papers (2/12); use this cohort for benchmark-matched comparisons.
  • ALFWorld appears in 8.3% of hub papers (1/12); use this cohort for benchmark-matched comparisons.

Metric Interpretation

  • cost is reported in 100% of hub papers (12/12); compare with a secondary metric before ranking methods.
  • accuracy is reported in 16.7% of hub papers (2/12); compare with a secondary metric before ranking methods.

Abstract Evidence Highlights

Direct snippets from paper abstracts to ground protocol and benchmark interpretation.

Human-eval abstract signal: Off-dynamics offline reinforcement learning (RL) aims to learn a policy for a target domain using limited target data and abundant source data collected under different transition dynamics.

Human-eval abstract signal: LLMs are increasingly being used for complex problems which are not necessarily resolved in a single response, but require interacting with an environment to acquire information.

LLM-judge abstract signal: We evaluate EpidemIQs across several different epidemic scenarios, measuring computational cost, workflow reliability, task success rate, and LLM-as-Judge and human expert reviews to estimate the overall quality and technical correctness of the generated results.

Retrieval benchmark signal: We evaluate our framework on three embodied planning benchmarks: Robotouille Synchronous, Robotouille Asynchronous, and ALFWorld.

Protocol abstract signal: We introduce the Massive Audio Embedding Benchmark (MAEB), a large-scale benchmark covering 30 tasks across speech, music, environmental sounds, and cross-modal audio-text reasoning in 100+ languages.

Protocol abstract signal: To reach frontier-level intelligence, we design a scalable reinforcement learning framework that combines verifiable signals with preference feedback, while remaining stable under large-scale off-policy training, enabling consistent self-improvement across mathematics, code, and tool use.

Protocol abstract signal: Model collaboration -- systems where multiple language models (LMs) collaborate -- combines the strengths of diverse models with cost in loading multiple LMs.

Protocol abstract signal: While Large Language Models (LLMs) have demonstrated strong zero-shot reasoning capabilities, their deployment as embodied agents still faces fundamental challenges in long-horizon planning.

Researcher Checklist

  • Close gap on Papers with explicit human feedback. Coverage is a replication risk (16.7% vs 45% target).
  • Close gap on Papers reporting quality controls. Coverage is a replication risk (0% vs 30% target).
  • Tighten coverage on Papers naming benchmarks/datasets. Coverage is usable but incomplete (33.3% vs 35% target).
  • Maintain strength on Papers naming evaluation metrics. Coverage is strong (100% vs 35% target).
  • Close gap on Papers with known rater population. Coverage is a replication risk (16.7% vs 35% target).
  • Close gap on Papers with known annotation unit. Coverage is a replication risk (16.7% vs 35% target).
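The checklist labels above ("replication risk", "usable but incomplete", "strong") appear to follow a simple threshold rule on coverage versus target. A minimal sketch of one rule consistent with the labels shown; the 5-percentage-point margin is an assumption on my part, not documented by the hub:

```python
# Sketch: triage a coverage figure against its target, reproducing the
# checklist labels above. The 5-point margin is an assumed cutoff between
# "usable but incomplete" and "replication risk".

def triage(coverage: float, target: float, margin: float = 5.0) -> str:
    """Classify coverage (in percent) relative to a target (in percent)."""
    if coverage >= target:
        return "strong"
    if target - coverage <= margin:
        return "usable but incomplete"
    return "replication risk"

# Figures taken from the checklist above.
for name, cov, tgt in [
    ("explicit human feedback", 16.7, 45),
    ("quality controls", 0, 30),
    ("benchmarks/datasets named", 33.3, 35),
    ("evaluation metrics named", 100, 35),
]:
    print(f"{name}: {triage(cov, tgt)}")
```

With these inputs the rule reproduces the labels in the checklist: the 33.3% vs 35% item falls inside the margin and reads "usable but incomplete", while the larger gaps read "replication risk".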


Suggested Reading Order

  1. Localized Dynamics-Aware Domain Adaption for Off-Dynamics Offline Reinforcement Learning

    Start here for detailed protocol reporting, including rater and quality-control evidence.

  2. Calibrate-Then-Act: Cost-Aware Exploration in LLM Agents

    Continues the detailed protocol reporting, including rater and quality-control evidence.

  3. MAEB: Massive Audio Embedding Benchmark

    Continues the detailed protocol reporting, including rater and quality-control evidence.

  4. EpidemIQs: Prompt-to-Paper LLM Agents for Epidemic Modeling and Analysis

    Include an LLM-as-judge paper to assess judge design and agreement assumptions.

  5. Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active Parameters

    Adds simulation environments with pairwise preferences for broader coverage within this hub.

  6. The Single-Multi Evolution Loop for Self-Improving Model Collaboration Systems

    Adds simulation environments for broader coverage within this hub.

  7. Embodied Task Planning via Graph-Informed Action Generation with Large Language Model

    Adds simulation environments for broader coverage within this hub.

  8. DIAL: Direct Iterative Adversarial Learning for Realistic Multi-Turn Dialogue Simulation

    Adds automatic metrics for broader coverage within this hub.

Known Limitations

  • No papers (0%) report quality controls; prioritize calibration/adjudication evidence.
  • Rater population is under-specified (16.7% coverage).
  • Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Links

llm_as_judge vs automatic_metrics

both=0, left_only=1, right_only=2

0 papers use both LLM-as-Judge and Automatic Metrics.

simulation_env vs automatic_metrics

both=2, left_only=10, right_only=0

2 papers use both Simulation Env and Automatic Metrics.

simulation_env vs llm_as_judge

both=1, left_only=11, right_only=0

1 paper uses both Simulation Env and LLM-as-Judge.
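The protocol-overlap rows above (both / left_only / right_only) can be derived by set arithmetic over per-paper protocol tags. A minimal sketch; the paper IDs and tag assignments are hypothetical, chosen only so the output mirrors the simulation_env vs llm_as_judge row:

```python
# Sketch: compute both / left_only / right_only overlap counts from
# per-paper protocol tag sets. Paper IDs below are hypothetical.

def overlap(left: set, right: set) -> dict:
    """Count papers tagged with both protocols, only the left, or only the right."""
    return {
        "both": len(left & right),
        "left_only": len(left - right),
        "right_only": len(right - left),
    }

# Hypothetical 12-paper hub: all papers use simulation environments,
# one of them also uses LLM-as-judge.
simulation_env = {f"paper_{i}" for i in range(12)}
llm_as_judge = {"paper_4"}

print(overlap(simulation_env, llm_as_judge))
# -> {'both': 1, 'left_only': 11, 'right_only': 0}
```

The same function reproduces the other rows given the corresponding tag sets (e.g. an automatic-metrics set of two papers, both inside the simulation-env set, yields both=2, left_only=10, right_only=0).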

Benchmark Brief

Retrieval

Coverage: 2 papers (16.7%) mention Retrieval.

Examples: Calibrate-Then-Act: Cost-Aware Exploration in LLM Agents, Embodied Task Planning via Graph-Informed Action Generation with Large Language Model

Benchmark Brief

ALFWorld

Coverage: 1 paper (8.3%) mentions ALFWorld.

Examples: Embodied Task Planning via Graph-Informed Action Generation with Large Language Model

Benchmark Brief

BrowseComp

Coverage: 1 paper (8.3%) mentions BrowseComp.

Examples: Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active Parameters

Metric Brief

coherence

Coverage: 1 paper (8.3%) mentions coherence.

Examples: Embodied Task Planning via Graph-Informed Action Generation with Large Language Model
