HFEPX Hub

Simulation Env + Multi Agent Papers

Updated from current HFEPX corpus (Feb 27, 2026). 12 papers are grouped in this hub page. Common evaluation modes: Simulation Env, Llm As Judge. Most common rater population: Domain Experts. Common annotation unit: Freeform. Frequently cited benchmark: Lawbench. Common metric signal: cost. Newest paper in this set is from Feb 24, 2026.

Papers: 12 · Last published: Feb 24, 2026

Simulation Env Multi Agent

Research Narrative

Grounded narrative Model: deterministic-grounded

This page covers 12 papers tagged Simulation Env + Multi Agent, updated from the current HFEPX corpus (Feb 27, 2026). Common evaluation modes are Simulation Env and Llm As Judge, with benchmark emphasis on Lawbench and Visualwebarena. Use the anchored takeaways below to compare protocol choices and identify papers with stronger evidence depth.

Why This Matters For Eval Research

Evaluation emphasis: Simulation Env and Llm As Judge appear frequently in this slice.
Evidence: Cooperative-Competitive Team Play of Real-World Craft Robots, Architecting AgentOS: From Token-Level Context to Emergent System-Level Intelligence, MALLVI: A Multi-Agent Framework for Integrated Generalized Robotics Manipulation

Benchmark concentration: Lawbench and Visualwebarena dominate, which helps control cross-paper variance.
Evidence: Architecting AgentOS: From Token-Level Context to Emergent System-Level Intelligence, MALLVI: A Multi-Agent Framework for Integrated Generalized Robotics Manipulation, World-Model-Augmented Web Agents with Action Correction

Metric concentration: cost and success rate are repeatedly reported in this group.
Evidence: MALLVI: A Multi-Agent Framework for Integrated Generalized Robotics Manipulation, World-Model-Augmented Web Agents with Action Correction, Colosseum: Auditing Collusion in Cooperative Multi-Agent Systems

Protocol Takeaways

Pair this hub with a human_eval-heavy hub to validate judge-model calibration.
Evidence: World-Model-Augmented Web Agents with Action Correction, Colosseum: Auditing Collusion in Cooperative Multi-Agent Systems, Does Socialization Emerge in AI Agent Society? A Case Study of Moltbook

Stratify by benchmark (Lawbench vs Visualwebarena) before comparing methods.
Evidence: Colosseum: Auditing Collusion in Cooperative Multi-Agent Systems, Does Socialization Emerge in AI Agent Society? A Case Study of Moltbook, OR-Agent: Bridging Evolutionary Search and Structured Research for Automated Algorithm Discovery

Explicit human feedback appears in approximately 16.7% of papers in this set (2 of 12).
Evidence: Does Socialization Emerge in AI Agent Society? A Case Study of Moltbook, OR-Agent: Bridging Evolutionary Search and Structured Research for Automated Algorithm Discovery, Multimodal Multi-Agent Empowered Legal Judgment Prediction
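
The stratification takeaway above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the record layout, the `rank_within_benchmark` helper, the method names, and the scores are hypothetical placeholders, not results from any paper in this hub. It only shows the idea of ranking methods inside each benchmark rather than across benchmarks.

```python
from collections import defaultdict

# Hypothetical records: method names and scores are placeholders,
# not values reported by any paper on this page.
records = [
    {"method": "method_a", "benchmark": "Lawbench", "score": 0.60},
    {"method": "method_b", "benchmark": "Lawbench", "score": 0.70},
    {"method": "method_a", "benchmark": "Visualwebarena", "score": 0.30},
    {"method": "method_b", "benchmark": "Visualwebarena", "score": 0.20},
]


def rank_within_benchmark(rows, metric="score"):
    """Group rows by benchmark, then rank methods inside each group (best first)."""
    strata = defaultdict(list)
    for row in rows:
        strata[row["benchmark"]].append(row)
    return {
        bench: [r["method"] for r in sorted(group, key=lambda r: r[metric], reverse=True)]
        for bench, group in strata.items()
    }


rankings = rank_within_benchmark(records)
for bench, order in rankings.items():
    print(bench, order)
```

In this placeholder data the two benchmarks rank the methods in opposite orders, which is the kind of disagreement a pooled comparison would hide.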

Benchmark Interpretation

Lawbench appears as a recurring benchmark anchor in this page.
1 paper (8.3%) mentions Lawbench.
Most common evaluation mode: Simulation Env.

Metric Interpretation

Cost is a commonly reported metric and should be paired with protocol context before ranking methods.
1 paper (8.3%) mentions cost.
Most common evaluation modes: Llm As Judge, Simulation Env.

Researcher Checklist

Papers with explicit human feedback: Coverage is a replication risk (16.7% vs 45% target).
Papers reporting quality controls: Coverage is a replication risk (0% vs 30% target).
Papers naming benchmarks/datasets: Coverage is a replication risk (16.7% vs 35% target).
Papers naming evaluation metrics: Coverage is a replication risk (8.3% vs 35% target).
Papers with known rater population: Coverage is a replication risk (16.7% vs 35% target).
Papers with known annotation unit: Coverage is a replication risk (8.3% vs 35% target).
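
The checklist percentages can be reproduced with a short Python sketch. The paper counts below are inferred from the reported percentages and the 12-paper total (for example, 16.7% of 12 is 2 papers), and the rule that coverage below target counts as a replication risk is an assumption about how the page derives its labels. The `coverage_report` helper is hypothetical.

```python
TOTAL_PAPERS = 12

# (papers matching, target %) per checklist item.
# Counts are inferred from the percentages on this page, not read from the corpus.
checklist = {
    "explicit human feedback": (2, 45.0),
    "quality controls": (0, 30.0),
    "benchmarks/datasets": (2, 35.0),
    "evaluation metrics": (1, 35.0),
    "known rater population": (2, 35.0),
    "known annotation unit": (1, 35.0),
}


def coverage_report(items, total):
    """Return {item: (coverage %, below_target)} with coverage rounded to one decimal."""
    report = {}
    for name, (count, target) in items.items():
        pct = round(100 * count / total, 1)
        # Assumed rule: coverage under the target is flagged as a replication risk.
        report[name] = (pct, pct < target)
    return report


report = coverage_report(checklist, TOTAL_PAPERS)
for name, (pct, at_risk) in report.items():
    label = "replication risk" if at_risk else "ok"
    print(f"{name}: {pct}% ({label})")
```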

Known Limitations

Narrative synthesis is grounded in metadata and abstracts only; full-paper method details may be missing.
Extraction fields are conservative and can under-report implicit protocol details.
Cross-page comparisons should control for benchmark and metric mismatch.

Research Utility Links

LLM-as-Judge Protocols - Finds judge-based evaluation setups to compare calibration and drift risks.
Benchmark Slice: Lawbench - Prioritizes benchmark-specific protocol comparisons.
Metric Slice: cost - Finds papers where reported metrics are directly comparable.
Recent High-Signal Papers - Keeps the hub connected to the latest HFEPX corpus updates.

simulation_env vs llm_as_judge

both=2, simulation_env only=10, llm_as_judge only=0

2 papers use both Simulation Env and Llm As Judge; the other 10 use Simulation Env only.
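
The overlap counts above can be recomputed from the evaluation-mode tags shown in the Top Papers list. The following Python sketch is an illustration only: the dictionary keys are shortened paper titles chosen here for readability, and the `overlap` helper is a hypothetical name, not part of any HFEPX tooling.

```python
SIM, JUDGE = "simulation_env", "llm_as_judge"

# Evaluation-mode tags per paper, transcribed from the Top Papers list.
# Keys are shortened titles used only for this example.
paper_tags = {
    "Craft Robots": {SIM},
    "AgentOS": {SIM},
    "MALLVI": {SIM},
    "World-Model Web Agents": {SIM, JUDGE},
    "Colosseum": {SIM},
    "Moltbook": {SIM},
    "OR-Agent": {SIM},
    "Legal Judgment Prediction": {SIM},
    "SPACeR": {SIM},
    "EpidemIQs": {SIM, JUDGE},
    "Collaborative Document Editing": {SIM},
    "DDMAC-CTDE": {SIM},
}


def overlap(tags_by_paper, left, right):
    """Count papers carrying both tags, only the left tag, or only the right tag."""
    counts = {"both": 0, "left_only": 0, "right_only": 0}
    for tags in tags_by_paper.values():
        if left in tags and right in tags:
            counts["both"] += 1
        elif left in tags:
            counts["left_only"] += 1
        elif right in tags:
            counts["right_only"] += 1
    return counts


counts = overlap(paper_tags, SIM, JUDGE)
print(counts)  # {'both': 2, 'left_only': 10, 'right_only': 0}
```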

Top Papers

Cooperative-Competitive Team Play of Real-World Craft Robots
Rui Zhao, Xihui Li, Yizheng Zhang, Yuzhen Liu, Zhong Zhang · Feb 24, 2026 · Citations: 0

Simulation Env Multi Agent

Multi-agent deep Reinforcement Learning (RL) has made significant progress in developing intelligent game-playing agents in recent years.

Architecting AgentOS: From Token-Level Context to Emergent System-Level Intelligence
ChengYou Li, XiaoDong Liu, XiangBao Meng, XinYu Zhao · Feb 24, 2026 · Citations: 0

Simulation Env Multi Agent

The paradigm of Large Language Models is undergoing a fundamental transition from static inference engines to dynamic autonomous cognitive systems. While current research primarily focuses on scaling context windows or optimizing prompt engi

MALLVI: A Multi-Agent Framework for Integrated Generalized Robotics Manipulation
Iman Ahmadi, Mehrshad Taji, Arad Mahdinezhad Kashani, AmirHossein Jadidi, Saina Kashani · Feb 18, 2026 · Citations: 0

Simulation Env Multi Agent

MALLVI presents a Multi Agent Large Language and Vision framework that enables closed-loop feedback driven robotic manipulation.

World-Model-Augmented Web Agents with Action Correction
Zhouzhou Shen, Xueyu Hu, Xiyun Li, Tianqing Fang, Juncheng Li · Feb 17, 2026 · Citations: 0

Llm As Judge Simulation Env Multi Agent

Web agents based on large language models have demonstrated promising capability in automating web tasks.

Colosseum: Auditing Collusion in Cooperative Multi-Agent Systems
Mason Nakamura, Abhinav Kumar, Saswat Das, Sahar Abdelnabi, Saaduddin Mahmud · Feb 16, 2026 · Citations: 0

Simulation Env Multi Agent

Multi-agent systems, where LLM agents communicate through free-form language, enable sophisticated coordination for solving complex cooperative tasks.

Does Socialization Emerge in AI Agent Society? A Case Study of Moltbook
Ming Li, Xirui Li, Tianyi Zhou · Feb 15, 2026 · Citations: 0

Simulation Env Multi Agent

As large language model agents increasingly populate networked environments, a fundamental question arises: do artificial intelligence (AI) agent societies undergo convergence dynamics similar to human social systems?

OR-Agent: Bridging Evolutionary Search and Structured Research for Automated Algorithm Discovery
Qi Liu, Ruochen Hao, Can Li, Wanjing Ma · Feb 14, 2026 · Citations: 0

Simulation Env Multi Agent

We present OR-Agent, a configurable multi-agent research framework designed for automated exploration in rich experimental environments.

Multimodal Multi-Agent Empowered Legal Judgment Prediction
Zhaolu Kang, Junhao Gong, Qingxi Chen, Hao Zhang, Jiaxin Liu · Jan 19, 2026 · Citations: 0

Simulation Env Multi Agent

Furthermore, we build JurisMM, a large dataset with over 100,000 recent Chinese judicial records, including both text and multimodal video-text data, enabling comprehensive evaluation.

SPACeR: Self-Play Anchoring with Centralized Reference Models
Wei-Jer Chang, Akshay Rangesh, Kevin Joseph, Matthew Strong, Masayoshi Tomizuka · Oct 20, 2025 · Citations: 0

Demonstrations Simulation Env Multi Agent

Developing autonomous vehicles (AVs) requires not only safety and efficiency, but also realistic, human-like behaviors that are socially aware and predictable.

EpidemIQs: Prompt-to-Paper LLM Agents for Epidemic Modeling and Analysis
Mohammad Hossein Samaei, Faryad Darabi Sahneh, Lee W. Cohnstaedt, Caterina Scoglio · Sep 24, 2025 · Citations: 0

Expert Verification Llm As Judge Simulation Env Multi Agent

We introduce EpidemIQs, a novel multi-agent LLM framework that integrates user inputs and autonomously conducts literature review, analytical derivation, network modeling, mechanistic modeling, stochastic simulations, data visualization and

Collaborative Document Editing with Multiple Users and AI Agents
Florian Lehmann, Krystsina Shauchenka, Daniel Buschek · Sep 15, 2025 · Citations: 0

Simulation Env Multi Agent

We propose integrating AI agents directly into collaborative writing environments.

Multi-agent deep reinforcement learning with centralized training and decentralized execution for transportation infrastructure management
M. Saifullah, K. G. Papakonstantinou, A. Bhattacharya, S. M. Stoffels, C. P. Andriotis · Jan 23, 2024 · Citations: 0

Simulation Env Multi Agent

To tackle the high dimensionality of state and action spaces, we propose DDMAC-CTDE, a Deep Decentralized Multi-Agent Actor-Critic (DDMAC) reinforcement learning architecture with Centralized Training and Decentralized Execution (CTDE).