HFEPX Hub

Simulation Env + General (Last 30 Days)

Updated from current HFEPX corpus (Apr 27, 2026). 11 papers are grouped in this hub page.

Read Full Context

Updated from current HFEPX corpus (Apr 27, 2026). 11 papers are grouped in this hub page. Common evaluation modes: Simulation Env, Automatic Metrics. Most common rater population: Domain Experts. Common annotation unit: Trajectory. Frequent quality control: Calibration. Frequently cited benchmark: ALFWorld. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Mar 30, 2026.

Papers: 11 Last published: Mar 30, 2026 Global RSS Tag RSS

Simulation EnvGeneralLast 30d

Researcher Quick Triage

This hub is best used for protocol triage and replication planning from abstract-level evidence. Quality band: Developing .

All Sampled Papers (11) Replication-Ready Only (1)

High-Signal Coverage

100.0%

11 / 11 sampled papers are not low-signal flagged.

Replication-Ready Set

Benchmark + metric + eval mode explicitly present.

Judge/Human Comparability

Papers containing both `human_eval` and `llm_as_judge`.

1 papers are replication-ready (benchmark + metric + explicit evaluation mode).
0 papers support judge-vs-human agreement analysis.
1 papers report explicit quality controls (calibration/adjudication/IAA).

Primary action: Use this page for scouting only; collect additional papers before attempting replication-critical comparisons.

Need evaluators for this research workflow?

Post a Job →

Why This Matters For Eval Research

9.1% of papers report explicit human-feedback signals, led by demonstration data.
simulation environments appears in 100% of papers in this hub.
ALFWorld is a recurring benchmark anchor for cross-paper comparisons in this page.

Protocol Takeaways

Most common quality-control signal is rater calibration (9.1% of papers).
Rater context is mostly domain experts, and annotation is commonly trajectory-level annotation; use this to scope replication staffing.
Pair this hub with llm_as_judge pages to benchmark automated-vs-human evaluation tradeoffs.

Benchmark Interpretation

ALFWorld appears in 9.1% of hub papers (1/11); use this cohort for benchmark-matched comparisons.
Occubench appears in 9.1% of hub papers (1/11); use this cohort for benchmark-matched comparisons.

Metric Interpretation

accuracy is reported in 36.4% of hub papers (4/11); compare with a secondary metric before ranking methods.
cost is reported in 9.1% of hub papers (1/11); compare with a secondary metric before ranking methods.

Researcher Checklist (Expanded)

Researcher Checklist

Gap: Papers with explicit human feedback

Coverage is a replication risk (9.1% vs 45% target).
Gap: Papers reporting quality controls

Coverage is a replication risk (9.1% vs 30% target).
Gap: Papers naming benchmarks/datasets

Coverage is a replication risk (18.2% vs 35% target).
Strong: Papers naming evaluation metrics

Coverage is strong (54.5% vs 35% target).
Gap: Papers with known rater population

Coverage is a replication risk (18.2% vs 35% target).
Moderate: Papers with known annotation unit

Coverage is usable but incomplete (27.3% vs 35% target).

Strengths

Agentic evaluation appears in 72.7% of papers.

Known Gaps

Only 9.1% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (18.2% coverage).
Benchmark coverage is thin (18.2% of papers mention benchmarks/datasets).

Suggested Next Analyses

Pair this hub with llm_as_judge pages to benchmark automated-vs-human evaluation tradeoffs.
Stratify by benchmark (ALFWorld vs Occubench) before comparing methods.
Track metric sensitivity by reporting both accuracy and cost.
Add inter-annotator agreement checks when reproducing these protocols.

Recommended Queries (Expanded)

Recommended Queries

Human Eval Protocols Benchmark Slice: ALFWorld Metric Slice: accuracy Recent High-Signal Papers

Start with These 3

Use these when you need one protocol anchor, one benchmark anchor, and one recent comparison point before reading the wider hub.

Strongest protocol reference

ReDAct: Uncertainty-Aware Deferral for LLM Agents

Highest protocol score with explicit human/eval signal plus ALFWorld.

Strongest benchmark reference

SEAL: An Open, Auditable, and Fair Data Generation Framework for AI-N…

Reported benchmark with accuracy gives a fast comparison anchor.

Strongest recent paper

SOLE-R1: Video-Language Reasoning as the Sole Reward for On-Robot Rei…

Useful for current practice scanning; published Mar 30, 2026.

Start Here (Best First 6)

Ranked for protocol completeness (human signal, benchmark + metric anchors, quality controls, and judge/human overlap).

ReDAct: Uncertainty-Aware Deferral for LLM Agents
Apr 8, 2026 · Citations: 0 · Score: 5.5

HF: Not reported · Eval: Simulation Env · Benchmark: ALFWorld · Metric: Cost
SEAL: An Open, Auditable, and Fair Data Generation Framework for AI-Native 6G Networks
Apr 2, 2026 · Citations: 0 · Score: 5.0

HF: Not reported · Eval: Automatic Metrics, Simulation Env · Benchmark: Not Reported · Metric: Accuracy
SOLE-R1: Video-Language Reasoning as the Sole Reward for On-Robot Reinforcement Learning
Mar 30, 2026 · Citations: 0 · Score: 4.0

HF: Demonstrations · Eval: Simulation Env · Benchmark: Not Reported · Metric: Not Reported
OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language Environment Simulation
Apr 13, 2026 · Citations: 0 · Score: 4.0

HF: Not reported · Eval: Simulation Env · Benchmark: Occubench · Metric: Not Reported
Wiggle and Go! System Identification for Zero-Shot Dynamic Rope Manipulation
Apr 23, 2026 · Citations: 0 · Score: 3.5

HF: Not reported · Eval: Automatic Metrics, Simulation Env · Benchmark: Not Reported · Metric: Accuracy
Social Dynamics as Critical Vulnerabilities that Undermine Objective Decision-Making in LLM Collectives
Apr 7, 2026 · Citations: 0 · Score: 3.5

HF: Not reported · Eval: Automatic Metrics, Simulation Env · Benchmark: Not Reported · Metric: Accuracy

Protocol Matrix (Top 12)

Use this to quickly compare protocol ingredients instead of scanning long prose.

Paper	HF Signal	Eval Modes	Benchmarks	Metrics	QC
ReDAct: Uncertainty-Aware Deferral for LLM Agents Apr 8, 2026	No Not Reported	Simulation Env	ALFWorld	Token cost	Not Reported
SEAL: An Open, Auditable, and Fair Data Generation Framework for AI-Native 6G Networks Apr 2, 2026	No Not Reported	Automatic Metrics , Simulation Env	Not Reported	Accuracy	Calibration
SOLE-R1: Video-Language Reasoning as the Sole Reward for On-Robot Reinforcement Learning Mar 30, 2026	Yes Demonstrations	Simulation Env	Not Reported	Not Reported	Not Reported
OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language Environment Simulation Apr 13, 2026	No Not Reported	Simulation Env	Occubench	Not Reported	Not Reported
Wiggle and Go! System Identification for Zero-Shot Dynamic Rope Manipulation Apr 23, 2026	No Not Reported	Automatic Metrics , Simulation Env	Not Reported	Accuracy	Not Reported
Social Dynamics as Critical Vulnerabilities that Undermine Objective Decision-Making in LLM Collectives Apr 7, 2026	No Not Reported	Automatic Metrics , Simulation Env	Not Reported	Accuracy	Not Reported
ActionParty: Multi-Subject Action Binding in Generative Video Games Apr 2, 2026	No Not Reported	Automatic Metrics , Simulation Env	Not Reported	Accuracy	Not Reported
Learning to Play Blackjack: A Curriculum Learning Perspective Mar 31, 2026	No Not Reported	Automatic Metrics , Simulation Env	Not Reported	Win rate	Not Reported
Towards Real-world Human Behavior Simulation: Benchmarking Large Language Models on Long-horizon, Cross-scenario, Heterogeneous Behavior Traces Apr 9, 2026	No Not Reported	Simulation Env	Not Reported	Not Reported	Not Reported
Heterogeneous Debate Engine: Identity-Grounded Cognitive Architecture for Resilient LLM-Based Ethical Tutoring Mar 28, 2026	No Not Reported	Simulation Env	Not Reported	Not Reported	Not Reported
Sell More, Play Less: Benchmarking LLM Realistic Selling Skill Apr 8, 2026	No Not Reported	Human Eval , Simulation Env	Not Reported	Not Reported	Not Reported

Protocol Diff (Top Papers)

Fast side-by-side comparison for the highest-ranked papers in this hub.

Signal	ReDAct: Uncertainty-Aware Deferral for LLM Agents	SEAL: An Open, Auditable, and Fair Data Generation…	SOLE-R1: Video-Language Reasoning as the Sole Rewar…
Human Feedback	Not reported	Not reported	Demonstrations
Evaluation Modes	Simulation Env	Automatic Metrics, Simulation Env	Simulation Env
Benchmarks	ALFWorld	Not reported	Not reported
Metrics	Token cost	Accuracy	Not reported
Quality Controls	Not reported	Calibration	Not reported
Rater Population	Unknown	Unknown	Unknown
Annotation Unit	Trajectory	Unknown	Trajectory

Research Utility Snapshot

Human Feedback Mix

Demonstrations (1)

Evaluation Modes

Simulation Env (11)
Automatic Metrics (5)
Human Eval (1)

Top Benchmarks

ALFWorld (1)
Occubench (1)

Top Metrics

Accuracy (4)
Cost (1)
Latency (1)
Token cost (1)

Rater Population Mix

Domain Experts (1)
Mixed (1)

Quality Controls

Calibration (1)

Coverage diagnostics (sample-based): human-feedback 9.1% · benchmarks 18.2% · metrics 54.5% · quality controls 9.1%.

Top Papers

SOLE-R1: Video-Language Reasoning as the Sole Reward for On-Robot Reinforcement Learning
Philip Schroeder, Thomas Weng, Karl Schmeckpeper, Eric Rosen, Stephen Hart · Mar 30, 2026 · Citations: 0

Demonstrations Simulation Env Long Horizon

To address this limitation, we introduce SOLE-R1 (Self-Observing LEarner), a video-language reasoning model explicitly designed to serve as the sole reward signal for online RL.
ReDAct: Uncertainty-Aware Deferral for LLM Agents
Dzianis Piatrashyn, Nikita Kotelevskii, Kirill Grishchenkov, Nikita Glazkov, Ivan Nasonov · Apr 8, 2026 · Citations: 0

Simulation Env Long Horizon

Recently, LLM-based agents have become increasingly popular across many applications, including complex sequential decision-making problems.
Wiggle and Go! System Identification for Zero-Shot Dynamic Rope Manipulation
Arthur Jakobsson, Abhinav Mahajan, Karthik Pullalarevu, Krishna Suresh, Yunchao Yao · Apr 23, 2026 · Citations: 0

Automatic MetricsSimulation Env Long Horizon

To mitigate this, we present a novel approach that leverages learned simulation priors to inform goal-conditioned dynamic manipulation of ropes for efficient and accurate task execution.
Social Dynamics as Critical Vulnerabilities that Undermine Objective Decision-Making in LLM Collectives
Changgeon Ko, Jisu Shin, Hoyun Song, Huije Lee, Eui Jun Hwang · Apr 7, 2026 · Citations: 0

Automatic MetricsSimulation Env Multi Agent

Large language model (LLM) agents are increasingly acting as human delegates in multi-agent environments, where a representative agent integrates diverse peer perspectives to make a final decision.
SEAL: An Open, Auditable, and Fair Data Generation Framework for AI-Native 6G Networks
Sunder Ali Khowaja, Kapal Dev, Engin Zeydan, Madhusanka Liyanage · Apr 2, 2026 · Citations: 0

Automatic MetricsSimulation Env

In this regard, we propose the Synthetic Data Generation with Ethics Audit Loop (SEAL) framework, which extends baseline modular pipelines with an Ethical and Regulatory Compliance by Design (ERCD) module and a Federated Learning (FL)…
OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language Environment Simulation
Xiaomeng Hu, Yinger Zhang, Fei Huang, Jianhong Tu, Yang Su · Apr 13, 2026 · Citations: 0

Simulation Env Multi Agent

We introduce OccuBench, a benchmark covering 100 real-world professional task scenarios across 10 industry categories and 65 specialized domains, enabled by Language Environment Simulators (LESs) that simulate domain-specific environments…
ActionParty: Multi-Subject Action Binding in Generative Video Games
Alexander Pondaven, Ziyi Wu, Igor Gilitschenski, Philip Torr, Sergey Tulyakov · Apr 2, 2026 · Citations: 0

Automatic MetricsSimulation Env Multi Agent

However, these models are largely restricted to single-agent settings, failing to control multiple agents simultaneously in a scene.
Towards Real-world Human Behavior Simulation: Benchmarking Large Language Models on Long-horizon, Cross-scenario, Heterogeneous Behavior Traces
Jiawei Chen, Ruoxi Xu, Boxi Cao, Ruotong Pan, Yunfei Zhang · Apr 9, 2026 · Citations: 0

Simulation Env Long Horizon

However, existing benchmarks remain constrained to isolated scenarios, narrow action spaces, or synthetic data, failing to capture the holistic nature of authentic human behavior.
Heterogeneous Debate Engine: Identity-Grounded Cognitive Architecture for Resilient LLM-Based Ethical Tutoring
Jakub Masłowski, Jarosław A. Chudziak · Mar 28, 2026 · Citations: 0

Simulation Env Multi Agent

Large Language Models (LLMs) are being increasingly used as autonomous agents in complex reasoning tasks, opening the niche for dialectical interactions.
Learning to Play Blackjack: A Curriculum Learning Perspective
Amirreza Alasti, Efe Erdal, Yücel Celik, Theresa Eimer · Mar 31, 2026 · Citations: 0

Automatic MetricsSimulation Env

We propose a novel framework that uses a Large Language Model (LLM) to dynamically generate a curriculum over available actions, enabling the agent to incorporate each action individually.
Sell More, Play Less: Benchmarking LLM Realistic Selling Skill
Xuanbo Su, Wenhao Hu, Haibo Su, Yunzhang Chen, Le Zhan · Apr 8, 2026 · Citations: 0

Human EvalSimulation Env

We introduce SalesLLM benchmark, a bilingual (ZH/EN) benchmark derived from realistic applications covering Financial Services and Consumer Goods, built from 30,074 scripted configurations and 1,805 curated multi-turn scenarios with…

Related Hubs

Get Started

Join the #1 Platform for AI Training Talent

Where top AI builders and expert AI Trainers connect to build the future of AI.

Self-Service

Post a Job

Post your project and get a shortlist of qualified AI Trainers and Data Labelers. Hire and manage your team in the tools you already use.

Create Account & Post a Job

Managed Service

For Large Projects

Done-for-You

We recruit, onboard, and manage a dedicated team inside your tools. End-to-end operations for large or complex projects.

Learn About Managed Service

For Freelancers

Join as an AI Trainer

Find AI training and data labeling projects across platforms, all in one place. One profile, one application process, more opportunities.

Join Now