HFEPX Hub

Simulation Env Papers (Last 45 Days)

Updated from current HFEPX corpus (Apr 17, 2026). 69 papers are grouped in this hub page. Common evaluation modes: Simulation Env, Automatic Metrics. Most common rater population: Domain Experts. Common annotation unit: Trajectory. Frequent quality control: Calibration. Frequently cited benchmark: ALFWorld. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Mar 22, 2026.

Papers: 69 · Last published: Mar 22, 2026

Researcher Quick Triage

This hub is best used for protocol triage and replication planning from abstract-level evidence. Quality band: Medium.

Analysis blocks below are computed from the currently loaded sample (59 of 69 total papers in this hub).

All Sampled Papers (59) · Replication-Ready Only (6)

High-Signal Coverage

100.0%

59 / 59 sampled papers are not low-signal flagged.

Replication-Ready Set

Benchmark + metric + eval mode explicitly present.

Judge/Human Comparability

Papers containing both `human_eval` and `llm_as_judge`.

6 papers are replication-ready (benchmark + metric + explicit evaluation mode).
1 paper supports judge-vs-human agreement analysis.
2 papers report explicit quality controls (calibration/adjudication/IAA).

Primary action: Start with the top 2 papers in “Start Here”, then validate assumptions in the protocol matrix.
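
The two triage filters described above (replication-ready, and judge/human comparability) can be sketched as simple predicates. This is a minimal illustration, not the hub's actual implementation: the `PaperRecord` class, its field names, and the `simulation_env` slug are assumptions, and the three sample rows are abbreviated from the Protocol Matrix on this page.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PaperRecord:
    """Hypothetical record; field names are illustrative, not the hub's schema."""
    title: str
    eval_modes: List[str] = field(default_factory=list)
    benchmarks: List[str] = field(default_factory=list)
    metrics: List[str] = field(default_factory=list)


def is_replication_ready(paper: PaperRecord) -> bool:
    # Benchmark + metric + evaluation mode must all be explicitly present.
    return bool(paper.benchmarks) and bool(paper.metrics) and bool(paper.eval_modes)


def supports_judge_vs_human(paper: PaperRecord) -> bool:
    # Requires both `human_eval` and `llm_as_judge` among the eval modes.
    return {"human_eval", "llm_as_judge"} <= set(paper.eval_modes)


papers = [
    PaperRecord("AgentHER", ["human_eval", "llm_as_judge"],
                ["WebArena", "ToolBench"], ["precision", "pass@1"]),
    PaperRecord("ReDAct", ["simulation_env"], ["ALFWorld"], ["token cost"]),
    # Metrics are "Not Reported" for this row, so the list is left empty.
    PaperRecord("LifeSim", ["simulation_env"], ["Lifesim Eval"], []),
]

ready = [p.title for p in papers if is_replication_ready(p)]
comparable = [p.title for p in papers if supports_judge_vs_human(p)]
print(ready, comparable)
```

Whether the hub applies additional conditions before flagging a paper as replication-ready is not stated here, so treat the predicate as a lower bound on the real rule.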

Why This Matters For Eval Research

27.7% of papers report explicit human-feedback signals, led by demonstration data.
Simulation environments appear in 68.1% of papers in this hub.
ALFWorld is a recurring benchmark anchor for cross-paper comparisons on this page.

Protocol Takeaways

1 sampled paper reports both human evaluation and LLM-as-judge, supporting direct agreement checks.
The most common quality-control signal is rater calibration (1.4% of papers).
Raters are mostly domain experts, and annotation is commonly trajectory-level; use this to scope replication staffing.

Benchmark Interpretation

ALFWorld appears in 2.9% of hub papers (2/69); use this cohort for benchmark-matched comparisons.
WebArena appears in 2.9% of hub papers (2/69); use this cohort for benchmark-matched comparisons.

Metric Interpretation

Accuracy is reported in 17.4% of hub papers (12/69); compare with a secondary metric before ranking methods.
Success rate is reported in 4.3% of hub papers (3/69); compare with a secondary metric before ranking methods.

Researcher Checklist

Moderate: Papers with explicit human feedback
Coverage is usable but incomplete (27.7% vs 45% target).

Gap: Papers reporting quality controls
Coverage is a replication risk (2.1% vs 30% target).

Moderate: Papers naming benchmarks/datasets
Coverage is usable but incomplete (27.7% vs 35% target).

Strong: Papers naming evaluation metrics
Coverage is strong (48.9% vs 35% target).

Gap: Papers with known rater population
Coverage is a replication risk (8.5% vs 35% target).

Moderate: Papers with known annotation unit
Coverage is usable but incomplete (23.4% vs 35% target).
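
The checklist labels above compare each coverage figure against a target. The page does not state the rule behind the Strong / Moderate / Gap bands, so the sketch below assumes one: "Strong" at or above target, "Gap" below half the target, "Moderate" in between. The `gap_ratio=0.5` cutoff and the function name are assumptions; the cutoff happens to reproduce the six labels listed here but may not be the hub's real threshold.

```python
def coverage_status(coverage_pct: float, target_pct: float, gap_ratio: float = 0.5) -> str:
    """Classify coverage against a target (gap_ratio is an assumed cutoff)."""
    if coverage_pct >= target_pct:
        return "Strong"
    if coverage_pct >= gap_ratio * target_pct:
        return "Moderate"
    return "Gap"


# (checklist item, coverage %, target %) as listed in the checklist above.
checklist = [
    ("explicit human feedback", 27.7, 45),
    ("quality controls", 2.1, 30),
    ("benchmarks/datasets", 27.7, 35),
    ("evaluation metrics", 48.9, 35),
    ("known rater population", 8.5, 35),
    ("known annotation unit", 23.4, 35),
]

labels = {name: coverage_status(cov, target) for name, cov, target in checklist}
for name, status in labels.items():
    print(f"{status}: {name}")
```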

Strengths

Contains both human-eval and LLM-as-judge protocols for head-to-head methodology comparison.
Agentic evaluation appears in 70.2% of papers.

Known Gaps

Only 2.1% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (8.5% coverage).
Annotation unit is under-specified (23.4% coverage).

Suggested Next Analyses

Compare papers that report both human_eval and llm_as_judge to quantify judge-human agreement drift.
Stratify by benchmark (ALFWorld vs WebArena) before comparing methods.
Track metric sensitivity by reporting both accuracy and success rate.
Add inter-annotator agreement checks when reproducing these protocols.
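
For the agreement analyses suggested above (judge vs human, or annotator vs annotator), one common chance-corrected statistic is Cohen's kappa. The sketch below is a plain-Python illustration; the page does not say which agreement statistic any listed paper uses, and the `human` and `judge` label lists are made-up toy data, not results from the hub.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items where both raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under independence, from each rater's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; treat as full agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical toy verdicts on four trajectories (not data from any paper here).
human = ["pass", "pass", "fail", "fail"]
judge = ["pass", "fail", "fail", "fail"]
kappa = cohens_kappa(human, judge)
print(kappa)  # 0.5: observed agreement 0.75, expected agreement 0.5
```

Tracking this value per benchmark or per model version is one way to make "agreement drift" concrete, since raw percent agreement can look high when one label dominates.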

Recommended Queries

Judge vs Human Agreement · Benchmark Slice: ALFWorld · Metric Slice: accuracy · Recent High-Signal Papers

Start with These 3

Use these when you need one protocol anchor, one benchmark anchor, and one recent comparison point before reading the wider hub.

Strongest protocol reference

AgentHER: Hindsight Experience Replay for LLM Agent Trajectory Relabe…

Highest protocol score with explicit human/eval signal plus WebArena.

Strongest benchmark reference

When Users Change Their Mind: Evaluating Interruptible Agents in Long…

WebArena gives a fast comparison anchor.

Strongest recent paper

VehicleMemBench: An Executable Benchmark for Multi-User Long-Term Mem…

Useful for current practice scanning; published Mar 25, 2026.

Start Here (Best First 6)

Ranked for protocol completeness (human signal, benchmark + metric anchors, quality controls, and judge/human overlap).

AgentHER: Hindsight Experience Replay for LLM Agent Trajectory Relabeling
Mar 22, 2026 · Citations: 0 · Score: 10.0
HF: Demonstrations · Eval: Human Eval, LLM As Judge · Benchmark: WebArena · Metric: Precision

When Users Change Their Mind: Evaluating Interruptible Agents in Long-Horizon Web Navigation
Apr 1, 2026 · Citations: 0 · Score: 6.5
HF: Critique Edit · Eval: Simulation Env · Benchmark: WebArena · Metric: Not Reported

VehicleMemBench: An Executable Benchmark for Multi-User Long-Term Memory in In-Vehicle Agents
Mar 25, 2026 · Citations: 0 · Score: 6.5
HF: Pairwise Preference · Eval: Simulation Env · Benchmark: VehicleMemBench · Metric: Not Reported

Meanings and Measurements: Multi-Agent Probabilistic Grounding for Vision-Language Navigation
Mar 19, 2026 · Citations: 0 · Score: 6.5
HF: Demonstrations · Eval: Simulation Env · Benchmark: Mapg Bench · Metric: Not Reported

PRBench: End-to-end Paper Reproduction in Physics Research
Mar 29, 2026 · Citations: 0 · Score: 6.0
HF: Rubric Rating, Expert Verification · Eval: Automatic Metrics, Simulation Env · Benchmark: Not Reported · Metric: Accuracy

LifeSim: Long-Horizon User Life Simulator for Personalized Assistant Evaluation
Mar 12, 2026 · Citations: 0 · Score: 6.0
HF: Pairwise Preference · Eval: Simulation Env · Benchmark: Lifesim Eval · Metric: Not Reported

Protocol Matrix (Top 12)

Use this to quickly compare protocol ingredients instead of scanning long prose.

Paper (published)	HF Signal	Eval Modes	Benchmarks	Metrics	QC
AgentHER: Hindsight Experience Replay for LLM Agent Trajectory Relabeling (Mar 22, 2026)	Yes: Demonstrations	Human Eval, LLM As Judge	WebArena, ToolBench	Precision, Pass@1	Not Reported
When Users Change Their Mind: Evaluating Interruptible Agents in Long-Horizon Web Navigation (Apr 1, 2026)	Yes: Critique Edit	Simulation Env	WebArena, Interruptbench	Not Reported	Not Reported
VehicleMemBench: An Executable Benchmark for Multi-User Long-Term Memory in In-Vehicle Agents (Mar 25, 2026)	Yes: Pairwise Preference	Simulation Env	VehicleMemBench	Not Reported	Not Reported
Meanings and Measurements: Multi-Agent Probabilistic Grounding for Vision-Language Navigation (Mar 19, 2026)	Yes: Demonstrations	Simulation Env	Mapg Bench	Not Reported	Not Reported
PRBench: End-to-end Paper Reproduction in Physics Research (Mar 29, 2026)	Yes: Rubric Rating, Expert Verification	Automatic Metrics, Simulation Env	Not Reported	Accuracy, Success rate	Not Reported
LifeSim: Long-Horizon User Life Simulator for Personalized Assistant Evaluation (Mar 12, 2026)	Yes: Pairwise Preference	Simulation Env	Lifesim Eval	Not Reported	Not Reported
ReDAct: Uncertainty-Aware Deferral for LLM Agents (Apr 8, 2026)	No	Simulation Env	ALFWorld	Token cost	Not Reported
Dual-Modality Multi-Stage Adversarial Safety Training: Robustifying Multimodal Web Agents Against Cross-Modal Attacks (Mar 4, 2026)	Yes: Demonstrations	Simulation Env	MiniWoB++	Not Reported	Not Reported
Do No Harm: Exposing Hidden Vulnerabilities of LLMs via Persona-based Client Simulation Attack in Psychological Counseling (Apr 6, 2026)	Yes: Red Team	Simulation Env	Not Reported	Perplexity	Not Reported
LUDOBENCH: Evaluating LLM Behavioural Decision-Making Through Spot-Based Board Game Scenarios in Ludo (Apr 7, 2026)	No	Simulation Env	LudoBench	Dice	Not Reported
Red-Teaming Vision-Language-Action Models via Quality Diversity Prompt Generation for Robust Robot Policies (Mar 12, 2026)	Yes: Red Team	Simulation Env	Not Reported	Task success	Not Reported
DeceptGuard: A Constitutional Oversight Framework For Detecting Deception in LLM Agents (Mar 14, 2026)	No	Automatic Metrics, Simulation Env	Deceptarena	Faithfulness	Not Reported
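
Rows of the matrix above can be grouped into benchmark-matched cohorts before comparing methods. The sketch below is one possible way to do that; the `rows` list is a hand-abbreviated subset of the matrix (short titles, benchmark column only), and `group_by_benchmark` is a hypothetical helper name.

```python
from collections import defaultdict

# (short title, benchmarks) abbreviated from the Protocol Matrix rows.
rows = [
    ("AgentHER", ["WebArena", "ToolBench"]),
    ("When Users Change Their Mind", ["WebArena", "Interruptbench"]),
    ("ReDAct", ["ALFWorld"]),
    ("Dual-Modality Multi-Stage Adversarial Safety Training", ["MiniWoB++"]),
    ("PRBench", []),  # benchmark "Not Reported" in the matrix
]


def group_by_benchmark(rows):
    """Map each benchmark name to the titles that report it."""
    cohorts = defaultdict(list)
    for title, benchmarks in rows:
        for name in benchmarks or ["Not Reported"]:
            cohorts[name].append(title)
    return dict(cohorts)


cohorts = group_by_benchmark(rows)
# Keep only named benchmarks shared by at least two papers.
matched = {
    name: titles
    for name, titles in cohorts.items()
    if name != "Not Reported" and len(titles) >= 2
}
print(matched)
```

In this subset only WebArena yields a cohort of two, which is why cross-paper comparisons on this page need to be read with the small cohort sizes in mind.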

Protocol Diff (Top Papers)

Fast side-by-side comparison for the highest-ranked papers in this hub.

Signal	AgentHER: Hindsight Experience Replay for LLM Agent…	When Users Change Their Mind: Evaluating Interrupti…	VehicleMemBench: An Executable Benchmark for Multi-…
Human Feedback	Demonstrations	Critique Edit	Pairwise Preference
Evaluation Modes	Human Eval, LLM As Judge	Simulation Env	Simulation Env
Benchmarks	WebArena, ToolBench	WebArena, Interruptbench	VehicleMemBench
Metrics	Precision, Pass@1	Not Reported	Not Reported
Quality Controls	Not Reported	Not Reported	Not Reported
Rater Population	Unknown	Unknown	Unknown
Annotation Unit	Trajectory	Unknown	Unknown

Research Utility Snapshot

Human Feedback Mix

Demonstrations (4)
Pairwise Preference (4)
Rubric Rating (3)
Red Team (2)

Evaluation Modes

Simulation Env (47)
Automatic Metrics (14)
LLM As Judge (5)
Human Eval (2)

Top Benchmarks

ALFWorld (2)
WebArena (2)
BIRD (1)
Deceptarena (1)

Top Metrics

Accuracy (12)
Success rate (3)
Cost (2)
Dice (1)

Rater Population Mix

Domain Experts (3)
Mixed (1)

Quality Controls

Calibration (1)

Coverage diagnostics (sample-based): human-feedback 23.7% · benchmarks 22.0% · metrics 39.0% · quality controls 3.4%.

Top Papers

AgentHER: Hindsight Experience Replay for LLM Agent Trajectory Relabeling
Liang Ding · Mar 22, 2026 · Citations: 0

Demonstrations · Human Eval, LLM As Judge · Long Horizon

LLM agents fail on the majority of real-world tasks -- GPT-4o succeeds on fewer than 15% of WebArena navigation tasks and below 55% pass@1 on ToolBench (Zhou et al., 2024; Qin et al., 2024) -- yet every failed trajectory is routinely…
PRBench: End-to-end Paper Reproduction in Physics Research
Shi Qiu, Junyi Deng, Yiwei Deng, Haoran Dong, Jieyu Fu · Mar 29, 2026 · Citations: 0

Rubric Rating, Expert Verification · Automatic Metrics, Simulation Env

We introduce PRBench, a benchmark of 30 expert-curated tasks spanning 11 subfields of physics.
When Users Change Their Mind: Evaluating Interruptible Agents in Long-Horizon Web Navigation
Henry Peng Zou, Chunyu Miao, Wei-Chieh Huang, Yankai Chen, Yue Zhou · Apr 1, 2026 · Citations: 0

Critique Edit · Simulation Env · Long Horizon

As LLM agents transition from short, static problem solving to executing complex, long-horizon tasks in dynamic environments, the ability to handle user interruptions, such as adding requirement or revising goals, during mid-task execution…
VehicleMemBench: An Executable Benchmark for Multi-User Long-Term Memory in In-Vehicle Agents
Yuhao Chen, Yi Xu, Xinyun Ding, Xiang Fang, Shuochen Liu · Mar 25, 2026 · Citations: 0

Pairwise Preference · Simulation Env · Tool Use

With the growing demand for intelligent in-vehicle experiences, vehicle-based agents are evolving from simple assistants to long-term companions.
Meanings and Measurements: Multi-Agent Probabilistic Grounding for Vision-Language Navigation
Swagat Padhan, Lakshya Jain, Bhavya Minesh Shah, Omkar Patil, Thao Nguyen · Mar 19, 2026 · Citations: 0

Demonstrations · Simulation Env · Multi Agent

To address this limitation, we propose MAPG (Multi-Agent Probabilistic Grounding), an agentic framework that decomposes language queries into structured subcomponents and queries a VLM to ground each component.
LifeSim: Long-Horizon User Life Simulator for Personalized Assistant Evaluation
Feiyu Duan, Xuanjing Huang, Zhongyu Wei · Mar 12, 2026 · Citations: 0

Pairwise Preference · Simulation Env · Long Horizon

However, existing benchmarks for personalized assistants remain misaligned with real-world user-assistant interactions, failing to capture the complexity of external contexts and users' cognitive states.
SOLE-R1: Video-Language Reasoning as the Sole Reward for On-Robot Reinforcement Learning
Philip Schroeder, Thomas Weng, Karl Schmeckpeper, Eric Rosen, Stephen Hart · Mar 30, 2026 · Citations: 0

Demonstrations · Simulation Env · Long Horizon

To address this limitation, we introduce SOLE-R1 (Self-Observing LEarner), a video-language reasoning model explicitly designed to serve as the sole reward signal for online RL.
I Can't Believe It's Corrupt: Evaluating Corruption in Multi-Agent Governance Systems
Vedanta S P, Ponnurangam Kumaraguru · Mar 19, 2026 · Citations: 0

Rubric Rating · Simulation Env · Multi Agent

Large language models are increasingly proposed as autonomous agents for high-stakes public workflows, yet we lack systematic evidence about whether they would follow institutional rules when granted authority.
Build, Judge, Optimize: A Blueprint for Continuous Improvement of Multi-Agent Consumer Assistants
Alejandro Breen Herrera, Aayush Sheth, Steven G. Xu, Zhucheng Zhan, Charles Wright · Mar 3, 2026 · Citations: 0

Pairwise Preference, Rubric Rating · LLM As Judge, Simulation Env · Long Horizon

Conversational shopping assistants (CSAs) represent a compelling application of agentic AI, but moving from prototype to production reveals two underexplored challenges: how to evaluate multi-turn interactions and how to optimize tightly…
ReDAct: Uncertainty-Aware Deferral for LLM Agents
Dzianis Piatrashyn, Nikita Kotelevskii, Kirill Grishchenkov, Nikita Glazkov, Ivan Nasonov · Apr 8, 2026 · Citations: 0

Simulation Env · Long Horizon

Recently, LLM-based agents have become increasingly popular across many applications, including complex sequential decision-making problems.
From Control to Foresight: Simulation as a New Paradigm for Human-Agent Collaboration
Gaole He, Brian Y. Lim · Mar 12, 2026 · Citations: 0

Pairwise Preference · Simulation Env · Long Horizon

Large Language Models (LLMs) are increasingly used to power autonomous agents for complex, multi-step tasks.
Dual-Modality Multi-Stage Adversarial Safety Training: Robustifying Multimodal Web Agents Against Cross-Modal Attacks
Haoyu Liu, Dingcheng Li, Lukas Rutishauser, Zeyu Zheng · Mar 4, 2026 · Citations: 0

Demonstrations · Simulation Env

Multimodal web agents that process both screenshots and accessibility trees are increasingly deployed to interact with web interfaces, yet their dual-stream architecture opens an underexplored attack surface: an adversary who injects…
Do No Harm: Exposing Hidden Vulnerabilities of LLMs via Persona-based Client Simulation Attack in Psychological Counseling
Qingyang Xu, Yaling Shen, Stephanie Fong, Zimu Wang, Yiwen Jiang · Apr 6, 2026 · Citations: 0

Red Team · Simulation Env

The increasing use of large language models (LLMs) in mental healthcare raises safety concerns in high-stakes therapeutic interactions.
Red-Teaming Vision-Language-Action Models via Quality Diversity Prompt Generation for Robust Robot Policies
Siddharth Srikanth, Freddie Liang, Ya-Chuan Hsu, Varun Bhatt, Shihan Zhao · Mar 12, 2026 · Citations: 0

Red Team · Simulation Env

Our results across multiple simulation benchmarks show that Q-DIG finds more diverse and meaningful failure modes compared to baseline methods, and that fine-tuning VLAs on the generated instructions improves task success rates.
DeceptGuard: A Constitutional Oversight Framework For Detecting Deception in LLM Agents
Snehasis Mukhopadhyay · Mar 14, 2026 · Citations: 0

Automatic Metrics, Simulation Env · Long Horizon

We introduce DECEPTGUARD, a unified framework that systematically compares three monitoring regimes: black-box monitors (actions and outputs only), CoT-aware monitors (additionally observing the agent's chain-of-thought reasoning trace),…
BEACON: Language-Conditioned Navigation Affordance Prediction under Occlusion
Xinyu Gao, Gang Chen, Javier Alonso-Mora · Mar 10, 2026 · Citations: 0

Automatic Metrics, Simulation Env · Web Browsing

As a result, they struggle to infer target locations in occluded regions, typically caused by furniture or moving humans.
LUDOBENCH: Evaluating LLM Behavioural Decision-Making Through Spot-Based Board Game Scenarios in Ludo
Ojas Jain, Dhruv Kumar · Apr 7, 2026 · Citations: 0

Simulation Env · Multi Agent

We introduce LudoBench, a benchmark for evaluating LLM strategic reasoning in Ludo, a stochastic multi-agent board game whose dice mechanics, piece capture, safe-square navigation, and home-path progression introduce meaningful planning…
MAPO: Mixed Advantage Policy Optimization for Long-Horizon Multi-Turn Dialogue
Naifan Zhang, Ruihan Sun, Jinwei Su, Hengjie Yang, Zhengyuan Pan · Mar 6, 2026 · Citations: 0

LLM As Judge, Simulation Env · Long Horizon

We propose a critic-free and efficient RL algorithm named MAPO that leverages dense process feedback from a judge model and propagates long-horizon effects through Monte Carlo returns.
Social Dynamics as Critical Vulnerabilities that Undermine Objective Decision-Making in LLM Collectives
Changgeon Ko, Jisu Shin, Hoyun Song, Huije Lee, Eui Jun Hwang · Apr 7, 2026 · Citations: 0

Automatic Metrics, Simulation Env · Multi Agent

Large language model (LLM) agents are increasingly acting as human delegates in multi-agent environments, where a representative agent integrates diverse peer perspectives to make a final decision.
JAWS: Enhancing Long-term Rollout of Neural PDE Solvers via Spatially-Adaptive Jacobian Regularization
Fengxiang Nie, Yasuhiro Suzuki · Mar 4, 2026 · Citations: 0

Automatic Metrics, Simulation Env · Long Horizon

Experiments demonstrate that JAWS serves as an effective spectral pre-conditioner for trajectory optimization, allowing short-horizon, memory-efficient training to match the accuracy of long-horizon baselines.
Spatio-Temporal Attention Enhanced Multi-Agent DRL for UAV-Assisted Wireless Networks with Limited Communications
Che Chen, Lanhua Li, Shimin Gong, Yu Zhao, Yuming Fang · Mar 23, 2026 · Citations: 0

Simulation Env · Long Horizon

To maximize the overall throughput, we first propose a delay-tolerant multi-agent deep reinforcement learning (MADRL) algorithm that integrates a delay-penalized reward to encourage information sharing among UAVs, while jointly optimizing…
SEAL: An Open, Auditable, and Fair Data Generation Framework for AI-Native 6G Networks
Sunder Ali Khowaja, Kapal Dev, Engin Zeydan, Madhusanka Liyanage · Apr 2, 2026 · Citations: 0

Automatic Metrics, Simulation Env

In this regard, we propose the Synthetic Data Generation with Ethics Audit Loop (SEAL) framework, which extends baseline modular pipelines with an Ethical and Regulatory Compliance by Design (ERCD) module and a Federated Learning (FL)…
OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language Environment Simulation
Xiaomeng Hu, Yinger Zhang, Fei Huang, Jianhong Tu, Yang Su · Apr 13, 2026 · Citations: 0

Simulation Env · Multi Agent

We introduce OccuBench, a benchmark covering 100 real-world professional task scenarios across 10 industry categories and 65 specialized domains, enabled by Language Environment Simulators (LESs) that simulate domain-specific environments…
ActionParty: Multi-Subject Action Binding in Generative Video Games
Alexander Pondaven, Ziyi Wu, Igor Gilitschenski, Philip Torr, Sergey Tulyakov · Apr 2, 2026 · Citations: 0

Automatic Metrics, Simulation Env · Multi Agent

However, these models are largely restricted to single-agent settings, failing to control multiple agents simultaneously in a scene.
MolQuest: A Benchmark for Agentic Evaluation of Abductive Reasoning in Chemical Structure Elucidation
Taolin Han, Shuang Wu, Jinghang Wang, Yuhao Zhou, Renquan Lv · Mar 26, 2026 · Citations: 0

Automatic Metrics, Simulation Env · Long Horizon

Current scientific evaluation benchmarks predominantly rely on static, single-turn Question Answering (QA) formats, which are inadequate for measuring model performance in complex scientific tasks that require multi-step iteration and…
Mind over Space: Can Multimodal Large Language Models Mentally Navigate?
Qihui Zhu, Shouwei Ruan, Xiao Yang, Hao Jiang, Yao Huang · Mar 23, 2026 · Citations: 0

Automatic Metrics, Simulation Env · Web Browsing

Despite the widespread adoption of MLLMs in embodied agents, their capabilities remain largely confined to reactive planning from immediate observations, consistently failing in spatial reasoning across extensive spatiotemporal scales.
Integrating Deep RL and Bayesian Inference for ObjectNav in Mobile Robotics
João Castelo-Branco, José Santos-Victor, Alexandre Bernardino · Mar 26, 2026 · Citations: 0

Simulation Env · Web Browsing

Autonomous object search is challenging for mobile robots operating in indoor environments due to partial observability, perceptual uncertainty, and the need to trade off exploration and navigation efficiency.
Reward Prediction with Factorized World States
Yijun Shen, Delong Chen, Xianming Hu, Jiaming Mi, Hongbo Zhao · Mar 10, 2026 · Citations: 0

LLM As Judge, Simulation Env

We evaluate on RewardPrediction, a new benchmark dataset spanning five diverse domains and comprising 2,454 unique action-observation trajectories with step-wise ground-truth rewards.
Contextualized Privacy Defense for LLM Agents
Yule Wen, Yanzhe Zhang, Jianxun Lian, Xiaoyuan Yi, Xing Xie · Mar 3, 2026 · Citations: 0

Simulation Env · Long Horizon

LLM agents increasingly act on users' personal information, yet existing privacy defenses remain limited in both design and adaptability.
CCD-CBT: Multi-Agent Therapeutic Interaction for CBT Guided by Cognitive Conceptualization Diagram
Chang Liu, Changsheng Ma, Yongfeng Tao, Bin Hu, Minqiang Yang · Apr 8, 2026 · Citations: 0

Simulation Env · Multi Agent

However, existing methods often rely on static cognitive profiles and omniscient single-agent simulation, failing to capture the dynamic, information-asymmetric nature of real therapy.
RADIUS: Ranking, Distribution, and Significance - A Comprehensive Alignment Suite for Survey Simulation
Weronika Łajewska, Paul Missault, George Davidson, Saab Mansour · Mar 19, 2026 · Citations: 0

Automatic Metrics, Simulation Env

Simulation of surveys using LLMs is emerging as a powerful application for generating human-like responses at scale.
Towards Real-world Human Behavior Simulation: Benchmarking Large Language Models on Long-horizon, Cross-scenario, Heterogeneous Behavior Traces
Jiawei Chen, Ruoxi Xu, Boxi Cao, Ruotong Pan, Yunfei Zhang · Apr 9, 2026 · Citations: 0

Simulation Env · Long Horizon

However, existing benchmarks remain constrained to isolated scenarios, narrow action spaces, or synthetic data, failing to capture the holistic nature of authentic human behavior.
Strategic Persuasion with Trait-Conditioned Multi-Agent Systems for Iterative Legal Argumentation
Philipp D. Siedler · Apr 8, 2026 · Citations: 0

Simulation Env · Multi Agent

We present the Strategic Courtroom Framework, a multi-agent simulation environment in which prosecution and defense teams composed of trait-conditioned Large Language Model (LLM) agents engage in iterative, round-based legal argumentation.
From High-Dimensional Spaces to Verifiable ODD Coverage for Safety-Critical AI-based Systems
Thomas Stefani, Johann Maximilian Christensen, Elena Hoemann, Frank Köster, Sven Hallerbach · Apr 2, 2026 · Citations: 0

Simulation Env · Long Horizon

While Artificial Intelligence (AI) offers transformative potential for operational performance, its deployment in safety-critical domains such as aviation requires strict adherence to rigorous certification standards.
Heterogeneous Debate Engine: Identity-Grounded Cognitive Architecture for Resilient LLM-Based Ethical Tutoring
Jakub Masłowski, Jarosław A. Chudziak · Mar 28, 2026 · Citations: 0

Simulation Env · Multi Agent

Large Language Models (LLMs) are being increasingly used as autonomous agents in complex reasoning tasks, opening the niche for dialectical interactions.
GameplayQA: A Benchmarking Framework for Decision-Dense POV-Synced Multi-Video Understanding of 3D Virtual Agents
Yunzhe Wang, Runhui Xu, Kexin Zheng, Tianyi Zhang, Jayavibhav Niranjan Kogundi · Mar 25, 2026 · Citations: 0

Simulation Env · Multi Agent

Multimodal LLMs are increasingly deployed as perceptual backbones for autonomous agents in 3D environments, from robotics to virtual worlds.
GRACE: A Unified 2D Multi-Robot Path Planning Simulator & Benchmark for Grid, Roadmap, And Continuous Environments
Chuanlong Zang, Anna Mannucci, Isabelle Barz, Philipp Schillinger, Florian Lier · Mar 11, 2026 · Citations: 0

Simulation Env · Multi Agent

Advancing Multi-Agent Pathfinding (MAPF) and Multi-Robot Motion Planning (MRMP) requires platforms that enable transparent, reproducible comparisons across modeling choices.
Influencing LLM Multi-Agent Dialogue via Policy-Parameterized Prompts
Hongbo Bo, Jingyu Hu, Weiru Liu · Mar 10, 2026 · Citations: 0

Simulation Env · Multi Agent

Large Language Models (LLMs) have emerged as a new paradigm for multi-agent systems.
Alignment Backfire: Language-Dependent Reversal of Safety Interventions Across 16 Languages in LLM Multi-Agent Systems
Hiroki Fukui · Mar 5, 2026 · Citations: 0

Simulation Env · Multi Agent

We report four preregistered studies (1,584 multi-agent simulations across 16 languages and three model families) demonstrating that alignment interventions in large language models produce a structurally analogous phenomenon: surface…
HACHIMI: Scalable and Controllable Student Persona Generation via Orchestrated Agents
Yilin Jiang, Fei Tan, Xuanyu Yin, Jing Leng, Aimin Zhou · Mar 5, 2026 · Citations: 0

Simulation Env · Multi Agent

We formalize this as Theory-Aligned and Distribution-Controllable Persona Generation (TAD-PG) and introduce HACHIMI, a multi-agent Propose-Validate-Revise framework that generates theory-aligned, quota-controlled personas.
Learning to Play Blackjack: A Curriculum Learning Perspective
Amirreza Alasti, Efe Erdal, Yücel Celik, Theresa Eimer · Mar 31, 2026 · Citations: 0

Automatic Metrics, Simulation Env

We propose a novel framework that uses a Large Language Model (LLM) to dynamically generate a curriculum over available actions, enabling the agent to incorporate each action individually.
Does Explanation Correctness Matter? Linking Computational XAI Evaluation to Human Understanding
Gregor Baer, Chao Zhang, Isel Grau, Pieter Van Gorp · Mar 26, 2026 · Citations: 0

Automatic Metrics, Simulation Env

Higher correctness is assumed to produce better human understanding, but this link has not been tested experimentally with controlled levels.
LED: A Benchmark for Evaluating Layout Error Detection in Document Analysis
Inbum Heo, Taewook Hwang, Jeesu Jung, Sangkeun Jung · Mar 18, 2026 · Citations: 0

Automatic Metrics, Simulation Env

To overcome this limitation, we propose Layout Error Detection (LED), a benchmark that evaluates structural reasoning in DLA predictions beyond surface-level accuracy.
Belief-Sim: Towards Belief-Driven Simulation of Demographic Misinformation Susceptibility
Angana Borah, Zohaib Khan, Rada Mihalcea, Verónica Pérez-Rosas · Mar 3, 2026 · Citations: 0

Automatic Metrics, Simulation Env

As Large Language Models (LLMs) are increasingly used to simulate human behaviors, we investigate whether they can simulate demographic misinformation susceptibility, treating beliefs as a primary driving factor.
Box Maze: A Process-Control Architecture for Reliable LLM Reasoning
Zou Qiang · Mar 19, 2026 · Citations: 0

Simulation Env

Existing safety approaches -- such as reinforcement learning from human feedback (RLHF) and output filtering -- primarily operate at the behavioral level and may lack explicit architectural mechanisms for enforcing reasoning process…
Sell More, Play Less: Benchmarking LLM Realistic Selling Skill
Xuanbo Su, Wenhao Hu, Haibo Su, Yunzhang Chen, Le Zhan · Apr 8, 2026 · Citations: 0

Human Eval, Simulation Env

We introduce SalesLLM benchmark, a bilingual (ZH/EN) benchmark derived from realistic applications covering Financial Services and Consumer Goods, built from 30,074 scripted configurations and 1,805 curated multi-turn scenarios with…
Eval4Sim: An Evaluation Framework for Persona Simulation
Eliseo Bao, Anxo Perez, Xi Wang, Javier Parapar · Mar 3, 2026 · Citations: 0

LLM As Judge, Simulation Env

Large Language Model (LLM) personas with explicit specifications of attributes, background, and behavioural tendencies are increasingly used to simulate human conversations for tasks such as user modeling, social reasoning, and behavioural…
Beyond Prompt: Fine-grained Simulation of Cognitively Impaired Standardized Patients via Stochastic Steering
Weikang Zhang, Zimo Zhu, Zhichuan Yang, Chen Huang, Wenqiang Lei · Apr 14, 2026 · Citations: 0

Simulation Env

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Arch: An AI-Native Hardware Description Language for Register-Transfer Clocked Hardware Design
Shuqing Zhao · Apr 7, 2026 · Citations: 0

Simulation Env

We present case studies of an 8-way set-associative L1 data cache and a synthesizable PG021-compatible AXI DMA controller (with Yosys and OpenSTA results on Sky130), and compare Arch to SystemVerilog, VHDL, Chisel, Bluespec, and other…
Bridging Natural Language and Microgrid Dynamics: A Context-Aware Simulator and Dataset
Tinko Sebastian Bartels, Ruixiang Wu, Xinyu Lu, Yikai Lu, Fanzeng Xia · Apr 7, 2026 · Citations: 0

Simulation Env

Traditional energy management relies heavily on numerical time series, thereby neglecting the significant predictive power embedded in human-generated context (e.g., event schedules, system logs, user intentions).
InfBaGel: Human-Object-Scene Interaction Generation with Dynamic Perception and Iterative Refinement
Yude Zou, Junji Gong, Xing Gao, Zixuan Li, Tianxing Chen · Apr 6, 2026 · Citations: 0

Simulation Env

Human-object-scene interactions (HOSI) generation has broad applications in embodied AI, simulation, and animation.
A Quantum Search Approach to Magic Square Constraint Problems with Classical Benchmarking
Rituparna R, Harsha Varthini, Aswani Kumar Cherukuri · Apr 6, 2026 · Citations: 0

Simulation Env

Rather than integrating classical and quantum solvers in an iterative loop, this work uses the classical component for structured initialisation and the quantum component for search, and benchmarks the quantum approach against classical…
Adversarial Camouflage
Paweł Borsukiewicz, Daniele Lunghi, Melissa Tessa, Jacques Klein, Tegawendé F. Bissyandé · Mar 23, 2026 · Citations: 0

Simulation Env

Optimized patterns, once found, are projected onto semantically valid facial regions for evaluation.
Sim-to-Real of Humanoid Locomotion Policies via Joint Torque Space Perturbation Injection
Junhyeok Rui Cha, Woohyun Cha, Jaeyong Shin, Donghyeon Kim, Jaeheung Park · Mar 23, 2026 · Citations: 0

Simulation Env

Experimental results demonstrate that the proposed approach enables humanoid locomotion policies to achieve superior robustness against complex, unseen reality gaps in both simulation and real-world deployment.
On the Number of Conditional Independence Tests in Constraint-based Causal Discovery
Marc Franquesa Monés, Jiaqi Zhang, Caroline Uhler · Mar 23, 2026 · Citations: 0

Simulation Env

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
What Do World Models Learn in RL? Probing Latent Representations in Learned Environment Simulators
Xinyu Zhang · Mar 23, 2026 · Citations: 0

Simulation Env

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Evaluating Game Difficulty in Tetris Block Puzzle
Chun-Jui Wang, Jian-Ting Guo, Hung Guei, Chung-Chin Shih, Ti-Rong Wu · Mar 19, 2026 · Citations: 0

Simulation Env

Inspired by prior work that uses AlphaZero as a strong evaluator for chess variants, we study difficulty in this domain using Stochastic Gumbel AlphaZero (SGAZ), a budget-aware planning agent for stochastic environments.
Unmasking Algorithmic Bias in Predictive Policing: A GAN-Based Simulation Framework with Multi-City Temporal Analysis
Pronob Kumar Barman, Pronoy Kumar Barman · Mar 19, 2026 · Citations: 0

Simulation Env

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Improving moment tensor solutions under Earth structure uncertainty with simulation-based inference
A. A. Saoulis, T. -S. Pham, A. M. G. Ferreira · Mar 19, 2026 · Citations: 0

Simulation Env

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
