
HFEPX Hub

General + Simulation Env Papers

Updated from the current HFEPX corpus (Feb 27, 2026). This hub page groups 67 papers. Common evaluation modes: Simulation Env and Automatic Metrics. Most common rater population: Domain Experts. Most common annotation unit: Trajectory. Most frequent quality control: Inter-Annotator Agreement Reported. Most frequently cited benchmark: Retrieval. Most common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. The newest paper in this set is from Feb 26, 2026.

Papers: 67 · Last published: Feb 26, 2026
Tags: General, Simulation Env

Research Narrative

Grounded narrative · Model: deterministic-grounded · Source: persisted

Updated from the current HFEPX corpus (Feb 27, 2026). This page tracks 67 papers for General + Simulation Env. Dominant protocol signals include simulation environments, automatic metrics, and LLM-as-judge; benchmark focus falls most often on Retrieval and APPS, and metric focus on accuracy and cost. Use the grounded sections below to prioritize reproducible protocol choices, benchmark-matched comparisons, and judge-vs-human evaluation checks.

Why This Matters For Eval Research

Protocol Takeaways

Benchmark Interpretation

  • Retrieval appears in 9% of hub papers (6/67); use this cohort for benchmark-matched comparisons.
  • APPS appears in 1.5% of hub papers (1/67); use this cohort for benchmark-matched comparisons.

Metric Interpretation

  • accuracy is reported in 11.9% of hub papers (8/67); compare with a secondary metric before ranking methods.
  • cost is reported in 10.4% of hub papers (7/67); compare with a secondary metric before ranking methods.

Researcher Checklist

  • Close the gap on papers with explicit human feedback: coverage is a replication risk (14.9% vs 45% target).
  • Close the gap on papers reporting quality controls: coverage is a replication risk (1.5% vs 30% target).
  • Tighten coverage of papers naming benchmarks/datasets: coverage is usable but incomplete (26.9% vs 35% target).
  • Maintain strength on papers naming evaluation metrics: coverage is strong (41.8% vs 35% target).
  • Close the gap on papers with known rater population: coverage is a replication risk (6% vs 35% target).
  • Close the gap on papers with known annotation unit: coverage is a replication risk (13.4% vs 35% target).
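The checklist labels follow directly from comparing coverage against each target. A minimal sketch of one plausible labeling rule, assuming coverage at or above target reads as "strong", coverage within roughly 75% of target as "usable but incomplete", and anything lower as "a replication risk" (the 0.75 cutoff is an assumption chosen to match the labels above, not the hub's documented rule):

```python
def classify(coverage_pct: float, target_pct: float) -> str:
    """Map a coverage/target pair to the hub's qualitative label."""
    ratio = coverage_pct / target_pct
    if ratio >= 1.0:
        return "strong"
    if ratio >= 0.75:  # assumed threshold; matches the labels in this hub
        return "usable but incomplete"
    return "a replication risk"

# Coverage figures taken from the checklist above.
checklist = [
    ("explicit human feedback", 14.9, 45),
    ("quality controls", 1.5, 30),
    ("named benchmarks/datasets", 26.9, 35),
    ("named evaluation metrics", 41.8, 35),
    ("known rater population", 6.0, 35),
    ("known annotation unit", 13.4, 35),
]

for name, cov, tgt in checklist:
    print(f"{name}: {classify(cov, tgt)} ({cov}% vs {tgt}% target)")
```

Run against the checklist figures, this rule reproduces all six labels, which is a useful sanity check when re-deriving the page from a corpus snapshot.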


Suggested Reading Order

  1. CiteLLM: An Agentic Platform for Trustworthy Scientific Reference Discovery

     Start here for detailed protocol reporting, including rater and quality-control evidence.

  2. Towards Simulating Social Media Users with LLMs: Evaluating the Operational Validity of Conditioned Comment Prediction

     Start here for detailed protocol reporting, including rater and quality-control evidence.

  3. Retrieval Challenges in Low-Resource Public Service Information: A Case Study on Food Pantry Access

     Start here for detailed protocol reporting, including rater and quality-control evidence.

  4. PreScience: A Benchmark for Forecasting Scientific Contributions

     Include a human-eval paper to anchor calibration against automated judge settings.

  5. ARLArena: A Unified Framework for Stable Agentic Reinforcement Learning

     Adds simulation environments for broader coverage within this hub.

  6. LiLo-VLA: Compositional Long-Horizon Manipulation via Linked Object-Centric Policies

     Adds simulation environments for broader coverage within this hub.

  7. A Knowledge-Driven Approach to Music Segmentation, Music Source Separation and Cinematic Audio Source Separation

     Adds simulation environments for broader coverage within this hub.

  8. Efficient Hierarchical Any-Angle Path Planning on Multi-Resolution 3D Grids

     Adds simulation environments for broader coverage within this hub.

Known Limitations

  • Only 1.5% of papers report quality controls; prioritize calibration/adjudication evidence.
  • Rater population is under-specified (6% coverage).
  • Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Links

human_eval vs llm_as_judge

both=0, left_only=1, right_only=2

0 papers use both Human Eval and LLM-as-Judge.

human_eval vs automatic_metrics

both=0, left_only=1, right_only=9

0 papers use both Human Eval and Automatic Metrics.

llm_as_judge vs automatic_metrics

both=0, left_only=2, right_only=9

0 papers use both LLM-as-Judge and Automatic Metrics.
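The both/left_only/right_only counts above are simple set overlaps over per-paper protocol tags. A minimal sketch, assuming each evaluation mode is represented as a set of paper IDs (the IDs below are hypothetical, chosen only to reproduce the human_eval vs llm_as_judge line):

```python
def overlap(left: set, right: set) -> dict:
    """Count papers tagged with both modes, only the left, or only the right."""
    return {
        "both": len(left & right),
        "left_only": len(left - right),
        "right_only": len(right - left),
    }

# Hypothetical paper IDs; sized to match both=0, left_only=1, right_only=2.
human_eval = {"paper_04"}
llm_as_judge = {"paper_11", "paper_23"}

print(overlap(human_eval, llm_as_judge))
# With these example sets: {'both': 0, 'left_only': 1, 'right_only': 2}
```

A "both" count of zero, as in all three comparisons here, means no paper in this hub pairs the two modes, so judge-vs-human calibration checks cannot be run within this cohort alone.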

Benchmark Brief

APPS

Coverage: 1 paper (1.5%)

Examples: The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution

Benchmark Brief

ARLArena

Coverage: 1 paper (1.5%)

Examples: ARLArena: A Unified Framework for Stable Agentic Reinforcement Learning

Top Papers

Related Hubs