HFEPX Hub

Simulation Env + Coding Papers

Updated from current HFEPX corpus (Apr 12, 2026). 25 papers are grouped in this hub page.

Read Full Context

Updated from current HFEPX corpus (Apr 12, 2026). 25 papers are grouped in this hub page. Common evaluation modes: Simulation Env, Automatic Metrics. Most common rater population: Domain Experts. Common annotation unit: Trajectory. Frequent quality control: Adjudication. Frequently cited benchmark: Ad-Bench. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Feb 15, 2026.

Papers: 25 Last published: Feb 15, 2026 Global RSS Tag RSS

Simulation EnvCoding

Researcher Quick Triage

This hub is best used for protocol triage and replication planning from abstract-level evidence. Quality band: Medium .

All Sampled Papers (25) Replication-Ready Only (5)

High-Signal Coverage

100.0%

25 / 25 sampled papers are not low-signal flagged.

Replication-Ready Set

Benchmark + metric + eval mode explicitly present.

Judge/Human Comparability

Papers containing both `human_eval` and `llm_as_judge`.

5 papers are replication-ready (benchmark + metric + explicit evaluation mode).
0 papers support judge-vs-human agreement analysis.
1 papers report explicit quality controls (calibration/adjudication/IAA).

Primary action: Start with the top 2 papers in “Start Here”, then validate assumptions in the protocol matrix.

Need evaluators for this research workflow?

Post a Job →

Why This Matters For Eval Research

36% of papers report explicit human-feedback signals, led by critique/edit feedback.
simulation environments appears in 100% of papers in this hub.
Ad-Bench is a recurring benchmark anchor for cross-paper comparisons in this page.

Protocol Takeaways

Most common quality-control signal is adjudication (4% of papers).
Rater context is mostly domain experts, and annotation is commonly trajectory-level annotation; use this to scope replication staffing.
Pair this hub with llm_as_judge pages to benchmark automated-vs-human evaluation tradeoffs.

Benchmark Interpretation

Ad-Bench appears in 4% of hub papers (1/25); use this cohort for benchmark-matched comparisons.
AIME appears in 4% of hub papers (1/25); use this cohort for benchmark-matched comparisons.

Metric Interpretation

accuracy is reported in 28% of hub papers (7/25); compare with a secondary metric before ranking methods.
success rate is reported in 12% of hub papers (3/25); compare with a secondary metric before ranking methods.

Researcher Checklist (Expanded)

Researcher Checklist

Moderate: Papers with explicit human feedback

Coverage is usable but incomplete (36% vs 45% target).
Gap: Papers reporting quality controls

Coverage is a replication risk (4% vs 30% target).
Moderate: Papers naming benchmarks/datasets

Coverage is usable but incomplete (32% vs 35% target).
Strong: Papers naming evaluation metrics

Coverage is strong (68% vs 35% target).
Gap: Papers with known rater population

Coverage is a replication risk (16% vs 35% target).
Strong: Papers with known annotation unit

Coverage is strong (36% vs 35% target).

Strengths

Agentic evaluation appears in 64% of papers.

Known Gaps

Only 4% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (16% coverage).

Suggested Next Analyses

Pair this hub with llm_as_judge pages to benchmark automated-vs-human evaluation tradeoffs.
Stratify by benchmark (Ad-Bench vs AIME) before comparing methods.
Track metric sensitivity by reporting both accuracy and success rate.
Add inter-annotator agreement checks when reproducing these protocols.

Recommended Queries (Expanded)

Recommended Queries

Human Eval Protocols Benchmark Slice: Ad-Bench Metric Slice: accuracy Recent High-Signal Papers

Start with These 3

Use these when you need one protocol anchor, one benchmark anchor, and one recent comparison point before reading the wider hub.

Strongest protocol reference

AD-Bench: A Real-World, Trajectory-Aware Advertising Analytics Benchm…

Highest protocol score with explicit human/eval signal plus Ad-Bench.

Strongest benchmark reference

AJAR: Adaptive Jailbreak Architecture for Red-teaming

Harmbench with success rate gives a fast comparison anchor.

Strongest recent paper

Let's Think in Two Steps: Mitigating Agreement Bias in MLLMs with Sel…

Useful for current practice scanning; published Jul 15, 2025.

Start Here (Best First 6)

Ranked for protocol completeness (human signal, benchmark + metric anchors, quality controls, and judge/human overlap).

AD-Bench: A Real-World, Trajectory-Aware Advertising Analytics Benchmark for LLM Agents
Feb 15, 2026 · Citations: 0 · Score: 7.5

HF: Expert Verification · Eval: Simulation Env · Benchmark: Ad Bench · Metric: Pass@1
AJAR: Adaptive Jailbreak Architecture for Red-teaming
Jan 16, 2026 · Citations: 0 · Score: 7.5

HF: Red Team · Eval: Simulation Env · Benchmark: Harmbench · Metric: Success rate
Let's Think in Two Steps: Mitigating Agreement Bias in MLLMs with Self-Grounded Verification
Jul 15, 2025 · Citations: 0 · Score: 6.5

HF: Pairwise Preference · Eval: Automatic Metrics, Simulation Env · Benchmark: VisualWebArena · Metric: Accuracy
Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation
Oct 5, 2025 · Citations: 0 · Score: 6.5

HF: Rubric Rating · Eval: Automatic Metrics, Simulation Env · Benchmark: AIME · Metric: Accuracy
When Users Change Their Mind: Evaluating Interruptible Agents in Long-Horizon Web Navigation
Apr 1, 2026 · Citations: 0 · Score: 6.5

HF: Critique Edit · Eval: Simulation Env · Benchmark: WebArena · Metric: Not Reported
VehicleMemBench: An Executable Benchmark for Multi-User Long-Term Memory in In-Vehicle Agents
Mar 25, 2026 · Citations: 0 · Score: 6.5

HF: Pairwise Preference · Eval: Simulation Env · Benchmark: Vehiclemembench · Metric: Not Reported

Protocol Matrix (Top 12)

Use this to quickly compare protocol ingredients instead of scanning long prose.

Paper	HF Signal	Eval Modes	Benchmarks	Metrics	QC
AD-Bench: A Real-World, Trajectory-Aware Advertising Analytics Benchmark for LLM Agents Feb 15, 2026	Yes Expert Verification	Simulation Env	Ad Bench	Pass@1 , Pass@3	Not Reported
AJAR: Adaptive Jailbreak Architecture for Red-teaming Jan 16, 2026	Yes Red Team	Simulation Env	Harmbench	Success rate , Jailbreak success rate	Not Reported
Let's Think in Two Steps: Mitigating Agreement Bias in MLLMs with Self-Grounded Verification Jul 15, 2025	Yes Pairwise Preference	Automatic Metrics , Simulation Env	VisualWebArena , OSWorld	Accuracy	Not Reported
Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation Oct 5, 2025	Yes Rubric Rating	Automatic Metrics , Simulation Env	AIME	Accuracy , Pass@k	Not Reported
When Users Change Their Mind: Evaluating Interruptible Agents in Long-Horizon Web Navigation Apr 1, 2026	Yes Critique Edit	Simulation Env	WebArena , Interruptbench	Not Reported	Not Reported
VehicleMemBench: An Executable Benchmark for Multi-User Long-Term Memory in In-Vehicle Agents Mar 25, 2026	Yes Pairwise Preference	Simulation Env	Vehiclemembench	Not Reported	Not Reported
PRBench: End-to-end Paper Reproduction in Physics Research Mar 29, 2026	Yes Rubric Rating , Expert Verification	Automatic Metrics , Simulation Env	Not Reported	Accuracy , Success rate	Not Reported
LUDOBENCH: Evaluating LLM Behavioural Decision-Making Through Spot-Based Board Game Scenarios in Ludo Apr 7, 2026	No Not Reported	Simulation Env	Ludobench	Dice	Not Reported
From Pixels to Policies: Reinforcing Spatial Reasoning in Language Models for Content-Aware Layout Design Feb 14, 2026	Yes Critique Edit	Simulation Env	Not Reported	Latency	Not Reported
Hierarchy-of-Groups Policy Optimization for Long-Horizon Agentic Tasks Feb 26, 2026	No Not Reported	Simulation Env	ALFWorld , WebShop	Not Reported	Not Reported
FrameRef: A Framing Dataset and Simulation Testbed for Modeling Bounded Rational Information Health Feb 17, 2026	No Not Reported	Human Eval , Simulation Env	Not Reported	Not Reported	Adjudication
JAWS: Enhancing Long-term Rollout of Neural PDE Solvers via Spatially-Adaptive Jacobian Regularization Mar 4, 2026	No Not Reported	Automatic Metrics , Simulation Env	Not Reported	Accuracy	Not Reported

Protocol Diff (Top Papers)

Fast side-by-side comparison for the highest-ranked papers in this hub.

Signal	AD-Bench: A Real-World, Trajectory-Aware Advertisin…	AJAR: Adaptive Jailbreak Architecture for Red-teami…	Let's Think in Two Steps: Mitigating Agreement Bias…
Human Feedback	Expert Verification	Red Team	Pairwise Preference
Evaluation Modes	Simulation Env	Simulation Env	Automatic Metrics, Simulation Env
Benchmarks	Ad Bench	Harmbench	VisualWebArena, OSWorld
Metrics	Pass@1, Pass@3	Success rate, Jailbreak success rate	Accuracy
Quality Controls	Not reported	Not reported	Not reported
Rater Population	Domain Experts	Unknown	Unknown
Annotation Unit	Trajectory	Unknown	Trajectory

Research Utility Snapshot

Human Feedback Mix

Critique Edit (2)
Expert Verification (2)
Pairwise Preference (2)
Rubric Rating (2)

Evaluation Modes

Simulation Env (25)
Automatic Metrics (8)
Human Eval (1)

Top Benchmarks

Ad Bench (1)
AIME (1)
ALFWorld (1)
Harmbench (1)

Top Metrics

Accuracy (7)
Success rate (3)
Pass@1 (2)
Agreement (1)

Rater Population Mix

Domain Experts (4)

Quality Controls

Adjudication (1)

Coverage diagnostics (sample-based): human-feedback 36.0% · benchmarks 32.0% · metrics 68.0% · quality controls 4.0%.

Top Papers

AD-Bench: A Real-World, Trajectory-Aware Advertising Analytics Benchmark for LLM Agents
Lingxiang Hu, Yiding Sun, Tianle Xia, Wenwei Li, Ming Xu · Feb 15, 2026 · Citations: 0

Expert Verification Simulation Env Long Horizon

While Large Language Model (LLM) agents have achieved remarkable progress in complex reasoning tasks, evaluating their performance in real-world environments has become a critical problem.
Let's Think in Two Steps: Mitigating Agreement Bias in MLLMs with Self-Grounded Verification
Moises Andrade, Joonhyuk Cha, Brandon Ho, Vriksha Srihari, Karmesh Yadav · Jul 15, 2025 · Citations: 0

Pairwise Preference Automatic MetricsSimulation Env Long Horizon

We evaluate MLLM verifiers across web navigation, computer use, and robotics, spanning 13+ models, 28+ designs, and thousands of trajectories from diverse agents.
Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation
Mohsen Hariri, Amirhossein Samandar, Michael Hinczewski, Vipin Chaudhary · Oct 5, 2025 · Citations: 0

Rubric Rating Automatic MetricsSimulation Env

We present a principled Bayesian evaluation framework that replaces Pass@k and average accuracy over N trials (avg@N) with posterior estimates of a model's underlying success probability and credible intervals, yielding stable rankings and…
PRBench: End-to-end Paper Reproduction in Physics Research
Shi Qiu, Junyi Deng, Yiwei Deng, Haoran Dong, Jieyu Fu · Mar 29, 2026 · Citations: 0

Rubric RatingExpert Verification Automatic MetricsSimulation Env

We introduce PRBench, a benchmark of 30 expert-curated tasks spanning 11 subfields of physics.
When Users Change Their Mind: Evaluating Interruptible Agents in Long-Horizon Web Navigation
Henry Peng Zou, Chunyu Miao, Wei-Chieh Huang, Yankai Chen, Yue Zhou · Apr 1, 2026 · Citations: 0

Critique Edit Simulation Env Long Horizon

As LLM agents transition from short, static problem solving to executing complex, long-horizon tasks in dynamic environments, the ability to handle user interruptions, such as adding requirement or revising goals, during mid-task execution…
VehicleMemBench: An Executable Benchmark for Multi-User Long-Term Memory in In-Vehicle Agents
Yuhao Chen, Yi Xu, Xinyun Ding, Xiang Fang, Shuochen Liu · Mar 25, 2026 · Citations: 0

Pairwise Preference Simulation Env Tool Use

With the growing demand for intelligent in-vehicle experiences, vehicle-based agents are evolving from simple assistants to long-term companions.
AJAR: Adaptive Jailbreak Architecture for Red-teaming
Yipu Dou, Wang Yang · Jan 16, 2026 · Citations: 0

Red Team Simulation Env

Large language model (LLM) safety evaluation is moving from content moderation to action security as modern systems gain persistent state, tool access, and autonomous control loops.
From Pixels to Policies: Reinforcing Spatial Reasoning in Language Models for Content-Aware Layout Design
Sha Li, Stefano Petrangeli, Yu Shen, Xiang Chen · Feb 14, 2026 · Citations: 0

Critique Edit Simulation Env

We introduce LaySPA, a reinforcement learning framework that equips large language models (LLMs) with explicit and interpretable spatial reasoning for content-aware graphic layout design.
FrameRef: A Framing Dataset and Simulation Testbed for Modeling Bounded Rational Information Health
Victor De Lima, Jiqun Liu, Grace Hui Yang · Feb 17, 2026 · Citations: 0

Human EvalSimulation Env Long Horizon

Within this framework, we construct framing-sensitive agent personas by fine-tuning language models with framing-conditioned loss attenuation, inducing targeted biases while preserving overall task competence.
LUDOBENCH: Evaluating LLM Behavioural Decision-Making Through Spot-Based Board Game Scenarios in Ludo
Ojas Jain, Dhruv Kumar · Apr 7, 2026 · Citations: 0

Simulation Env Multi Agent

We introduce LudoBench, a benchmark for evaluating LLM strategic reasoning in Ludo, a stochastic multi-agent board game whose dice mechanics, piece capture, safe-square navigation, and home-path progression introduce meaningful planning…
On Discovering Algorithms for Adversarial Imitation Learning
Shashank Reddy Chirra, Jayden Teoh, Praveen Paruchuri, Pradeep Varakantham · Oct 1, 2025 · Citations: 0

Demonstrations Simulation Env

RA functions in AIL are typically derived from divergence minimization objectives, relying heavily on human design and ingenuity.
JAWS: Enhancing Long-term Rollout of Neural PDE Solvers via Spatially-Adaptive Jacobian Regularization
Fengxiang Nie, Yasuhiro Suzuki · Mar 4, 2026 · Citations: 0

Automatic MetricsSimulation Env Long Horizon

Experiments demonstrate that JAWS serves as an effective spectral pre-conditioner for trajectory optimization, allowing short-horizon, memory-efficient training to match the accuracy of long-horizon baselines.
Hierarchy-of-Groups Policy Optimization for Long-Horizon Agentic Tasks
Shuo He, Lang Feng, Qi Wei, Xin Cheng, Lei Feng · Feb 26, 2026 · Citations: 0

Simulation Env Long Horizon

Group-based reinforcement learning (RL), such as GRPO, has advanced the capabilities of large language models on long-horizon agentic tasks.
Self-Correcting VLA: Online Action Refinement via Sparse World Imagination
Chenyv Liu, Wentao Tan, Lei Zhu, Fengling Li, Jingjing Li · Feb 25, 2026 · Citations: 0

Simulation Env Long Horizon

Reinforcement learning enhances physical grounding through exploration yet typically relies on external reward signals that remain isolated from the agent's internal states.
Lookahead Tree-Based Rollouts for Enhanced Trajectory-Level Exploration in Reinforcement Learning with Verifiable Rewards
Shangyu Xing, Siyuan Wang, Chenyuan Yang, Xinyu Dai, Xiang Ren · Oct 28, 2025 · Citations: 0

Simulation Env Long Horizon

To address this limitation, we propose Lookahead Tree-Based Rollouts (LATR), a novel rollout strategy designed to explicitly promotes trajectory-level diversity by enforcing branching into different candidate tokens likely to yield distinct…
Mixed-Initiative Dialog for Human-Robot Collaborative Manipulation
Albert Yu, Chengshu Li, Luca Macesanu, Arnav Balaji, Ruchira Ray · Aug 7, 2025 · Citations: 0

Simulation Env Long Horizon

Effective robotic systems for long-horizon human-robot collaboration must adapt to a wide range of human partners, whose physical behavior, willingness to assist, and understanding of the robot's capabilities may change over time.
"Don't Do That!": Guiding Embodied Systems through Large Language Model-based Constraint Generation
Amin Seffo, Aladin Djuhera, Masataro Asai, Holger Boche · Jun 4, 2025 · Citations: 0

Simulation Env Web Browsing

Recent advancements in large language models (LLMs) have spurred interest in robotic navigation that incorporates complex spatial, mathematical, and conditional constraints from natural language into the planning problem.
Multi-Agent Environments for Vehicle Routing Problems
Ricardo Gama, Ricardo Cunha, Daniel Fuertes, Carlos R. del-Blanco, Hugo L. Fernandes · Nov 21, 2024 · Citations: 0

Simulation Env Multi Agent

Here, we propose MAEnvs4VRP library, a unified framework for multi-agent vehicle routing environments that supports classical, dynamic, stochastic, and multi-task problem variants within a single modular design.
KBVQ-MoE: KLT-guided SVD with Bias-Corrected Vector Quantization for MoE Large Language Models
Zukang Xu, Zhixiong Zhao, Xing Hu, Zhixuan Chen, Dawei Yang · Jan 30, 2026 · Citations: 0

Automatic MetricsSimulation Env

Mixture of Experts (MoE) models have achieved great success by significantly improving performance while maintaining computational efficiency through sparse expert activation.
MALLVI: A Multi-Agent Framework for Integrated Generalized Robotics Manipulation
Iman Ahmadi, Mehrshad Taji, Arad Mahdinezhad Kashani, AmirHossein Jadidi, Saina Kashani · Feb 18, 2026 · Citations: 0

Simulation Env Multi Agent

MALLVI presents a Multi Agent Large Language and Vision framework that enables closed-loop feedback driven robotic manipulation.
OR-Agent: Bridging Evolutionary Search and Structured Research for Automated Algorithm Discovery
Qi Liu, Ruochen Hao, Can Li, Wanjing Ma · Feb 14, 2026 · Citations: 0

Simulation Env Multi Agent

We present OR-Agent, a configurable multi-agent research framework designed for automated exploration in rich experimental environments.
World Simulation with Video Foundation Models for Physical AI
NVIDIA, :, Arslan Ali, Junjie Bai, Maciej Bala · Oct 28, 2025 · Citations: 0

Simulation Env Long Horizon

These capabilities enable more reliable synthetic data generation, policy evaluation, and closed-loop simulation for robotics and autonomous systems.
Counterfactual Simulation Training for Chain-of-Thought Faithfulness
Peter Hase, Christopher Potts · Feb 24, 2026 · Citations: 0

Automatic MetricsSimulation Env

In this paper, we introduce a training method called Counterfactual Simulation Training (CST), which aims to improve CoT faithfulness by rewarding CoTs that enable a simulator to accurately predict a model's outputs over counterfactual…
HumanLLM: Benchmarking and Improving LLM Anthropomorphism via Human Cognitive Patterns
Xintao Wang, Jian Yang, Weiyuan Li, Rui Xie, Jen-tse Huang · Jan 15, 2026 · Citations: 0

Automatic MetricsSimulation Env

We present HumanLLM, a framework treating psychological patterns as interacting causal forces.
Persona-driven Simulation of Voting Behavior in the European Parliament with Large Language Models
Maximilian Kreutner, Marlene Lutz, Markus Strohmaier · Jun 13, 2025 · Citations: 0

Automatic MetricsSimulation Env

We evaluate whether predictions are stable in response to counterfactual arguments, different persona prompts, and generation methods.

Related Hubs

Get Started

Join the #1 Platform for AI Training Talent

Where top AI builders and expert AI Trainers connect to build the future of AI.

Self-Service

Post a Job

Post your project and get a shortlist of qualified AI Trainers and Data Labelers. Hire and manage your team in the tools you already use.

Create Account & Post a Job

Managed Service

For Large Projects

Done-for-You

We recruit, onboard, and manage a dedicated team inside your tools. End-to-end operations for large or complex projects.

Learn About Managed Service

For Freelancers

Join as an AI Trainer

Find AI training and data labeling projects across platforms, all in one place. One profile, one application process, more opportunities.

Join Now