HFEPX Hub

General + Simulation Env Papers

Updated from current HFEPX corpus (Feb 27, 2026). 67 papers are grouped in this hub page. Common evaluation modes: Simulation Env, Automatic Metrics. Most common rater population: Domain Experts. Common annotation unit: Trajectory. Frequent quality control: Inter Annotator Agreement Reported. Frequently cited benchmark: Retrieval. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Feb 26, 2026.

Papers: 67 Last published: Feb 26, 2026 Global RSS Tag RSS

GeneralSimulation Env

Research Narrative

Grounded narrative Model: deterministic-grounded Source: persisted

Updated from current HFEPX corpus (Feb 27, 2026). This page tracks 67 papers for General + Simulation Env Papers. Dominant protocol signals include simulation environments, automatic metrics, LLM-as-judge, with frequent benchmark focus on Retrieval, APPS and metric focus on accuracy, cost. Use the grounded sections below to prioritize reproducible protocol choices, benchmark-matched comparisons, and judge-vs-human evaluation checks.

Why This Matters For Eval Research

14.9% of papers report explicit human-feedback signals, led by pairwise preferences.

Evidence: CiteLLM: An Agentic Platform for Trustworthy Scientific Reference Discovery , Towards Simulating Social Media Users with LLMs: Evaluating the Operational Validity of Conditioned Comment Prediction , Retrieval Challenges in Low-Resource Public Service Information: A Case Study on Food Pantry Access , ARLArena: A Unified Framework for Stable Agentic Reinforcement Learning
simulation environments appears in 100% of papers in this hub.

Evidence: CiteLLM: An Agentic Platform for Trustworthy Scientific Reference Discovery , Towards Simulating Social Media Users with LLMs: Evaluating the Operational Validity of Conditioned Comment Prediction , Retrieval Challenges in Low-Resource Public Service Information: A Case Study on Food Pantry Access , ARLArena: A Unified Framework for Stable Agentic Reinforcement Learning
Retrieval is a recurring benchmark anchor for cross-paper comparisons in this page.

Evidence: Retrieval Challenges in Low-Resource Public Service Information: A Case Study on Food Pantry Access , InterviewSim: A Scalable Framework for Interview-Grounded Personality Simulation , CiteLLM: An Agentic Platform for Trustworthy Scientific Reference Discovery , Towards Simulating Social Media Users with LLMs: Evaluating the Operational Validity of Conditioned Comment Prediction

Protocol Takeaways

Most common quality-control signal is inter-annotator agreement reporting (1.5% of papers).

Evidence: PreScience: A Benchmark for Forecasting Scientific Contributions , CiteLLM: An Agentic Platform for Trustworthy Scientific Reference Discovery , Towards Simulating Social Media Users with LLMs: Evaluating the Operational Validity of Conditioned Comment Prediction , Retrieval Challenges in Low-Resource Public Service Information: A Case Study on Food Pantry Access
Rater context is mostly domain experts, and annotation is commonly trajectory-level annotation; use this to scope replication staffing.

Evidence: CiteLLM: An Agentic Platform for Trustworthy Scientific Reference Discovery , Towards Simulating Social Media Users with LLMs: Evaluating the Operational Validity of Conditioned Comment Prediction , Retrieval Challenges in Low-Resource Public Service Information: A Case Study on Food Pantry Access , ARLArena: A Unified Framework for Stable Agentic Reinforcement Learning
Compare papers that report both human_eval and llm_as_judge to quantify judge-human agreement drift.

Evidence: PreScience: A Benchmark for Forecasting Scientific Contributions , CiteLLM: An Agentic Platform for Trustworthy Scientific Reference Discovery , Towards Simulating Social Media Users with LLMs: Evaluating the Operational Validity of Conditioned Comment Prediction , Retrieval Challenges in Low-Resource Public Service Information: A Case Study on Food Pantry Access

Benchmark Interpretation

Retrieval appears in 9% of hub papers (6/67); use this cohort for benchmark-matched comparisons.
APPS appears in 1.5% of hub papers (1/67); use this cohort for benchmark-matched comparisons.

Metric Interpretation

accuracy is reported in 11.9% of hub papers (8/67); compare with a secondary metric before ranking methods.
cost is reported in 10.4% of hub papers (7/67); compare with a secondary metric before ranking methods.

Researcher Checklist

Close gap on Papers with explicit human feedback. Coverage is a replication risk (14.9% vs 45% target).
Close gap on Papers reporting quality controls. Coverage is a replication risk (1.5% vs 30% target).
Tighten coverage on Papers naming benchmarks/datasets. Coverage is usable but incomplete (26.9% vs 35% target).
Maintain strength on Papers naming evaluation metrics. Coverage is strong (41.8% vs 35% target).
Close gap on Papers with known rater population. Coverage is a replication risk (6% vs 35% target).
Close gap on Papers with known annotation unit. Coverage is a replication risk (13.4% vs 35% target).

Papers with explicit human feedback

Coverage is a replication risk (14.9% vs 45% target).

Papers reporting quality controls

Coverage is a replication risk (1.5% vs 30% target).

Papers naming benchmarks/datasets

Coverage is usable but incomplete (26.9% vs 35% target).

Papers naming evaluation metrics

Coverage is strong (41.8% vs 35% target).

Papers with known rater population

Coverage is a replication risk (6% vs 35% target).

Papers with known annotation unit

Coverage is a replication risk (13.4% vs 35% target).

Known Limitations

Only 1.5% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (6% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Links

Judge vs Human Agreement - Compares papers that evaluate with both human raters and LLM judges.
Benchmark Slice: Retrieval - Prioritizes benchmark-specific protocol comparisons.
Metric Slice: accuracy - Finds papers where reported metrics are directly comparable.
IAA-Reported Evaluations - Highlights evaluations that explicitly report inter-annotator agreement.
Recent High-Signal Papers - Keeps the hub connected to the latest HFEPX corpus updates.

human_eval vs llm_as_judge

both=0, left_only=1, right_only=2

0 papers use both Human Eval and Llm As Judge.

human_eval vs automatic_metrics

both=0, left_only=1, right_only=9

0 papers use both Human Eval and Automatic Metrics.

llm_as_judge vs automatic_metrics

both=0, left_only=2, right_only=9

0 papers use both Llm As Judge and Automatic Metrics.

Benchmark Brief

Retrieval

Coverage: 6 papers (9%)

6 papers (9%) mention Retrieval.

Examples: Retrieval Challenges in Low-Resource Public Service Information: A Case Study on Food Pantry Access , InterviewSim: A Scalable Framework for Interview-Grounded Personality Simulation , Multimodal Peer Review Simulation with Actionable To-Do Recommendations for Community-Aware Manuscript Revisions

Benchmark Brief

APPS

Coverage: 1 papers (1.5%)

1 papers (1.5%) mention APPS.

Examples: The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution

Benchmark Brief

Arlarena

Coverage: 1 papers (1.5%)

1 papers (1.5%) mention Arlarena.

Examples: ARLArena: A Unified Framework for Stable Agentic Reinforcement Learning

Metric Brief

accuracy

Coverage: 8 papers (11.9%)

8 papers (11.9%) mention accuracy.

Examples: Context-Aware Mapping of 2D Drawing Annotations to 3D CAD Features Using LLM-Assisted Reasoning for Manufacturing Automation , Towards interpretable models for language proficiency assessment: Predicting the CEFR level of Estonian learner texts , BrowseComp-$V^3$: A Visual, Vertical, and Verifiable Benchmark for Multimodal Browsing Agents

Metric Brief

cost

Coverage: 7 papers (10.4%)

7 papers (10.4%) mention cost.

Examples: Localized Dynamics-Aware Domain Adaption for Off-Dynamics Offline Reinforcement Learning , The Single-Multi Evolution Loop for Self-Improving Model Collaboration Systems , DIAL: Direct Iterative Adversarial Learning for Realistic Multi-Turn Dialogue Simulation

Metric Brief

success rate

Coverage: 4 papers (6%)

4 papers (6%) mention success rate.

Examples: LiLo-VLA: Compositional Long-Horizon Manipulation via Linked Object-Centric Policies , AdapTools: Adaptive Tool-based Indirect Prompt Injection Attacks on Agentic LLMs , The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution

Most Cited In This Hub

Fast path to methods with the strongest citation traction in this scope.

Papers: CiteLLM: An Agentic Platform for Trustworthy Scientific Reference Discovery , Towards Simulating Social Media Users with LLMs: Evaluating the Operational Validity of Conditioned Comment Prediction , Retrieval Challenges in Low-Resource Public Service Information: A Case Study on Food Pantry Access

Most Recent

Fast path to latest protocol changes and newly published evaluation setups.

Best Protocol Detail

Papers with explicit rater/unit metadata and quality-control signals for reproducibility.

Top Papers

CiteLLM: An Agentic Platform for Trustworthy Scientific Reference Discovery
Mengze Hong, Di Jiang, Chen Jason Zhang, Zichang Guo, Yawen Li · Feb 26, 2026 · Citations: 0

Simulation Env

In this work, we present CiteLLM, a specialized agentic platform designed to enable trustworthy reference discovery for grounding author-drafted claims and statements.
Towards Simulating Social Media Users with LLMs: Evaluating the Operational Validity of Conditioned Comment Prediction
Nils Schwager, Simon Münker, Alistair Plum, Achim Rettinger · Feb 26, 2026 · Citations: 0

Simulation Env

This framework enables a rigorous evaluation of current LLM capabilities with respect to the simulation of social media user behavior.
Retrieval Challenges in Low-Resource Public Service Information: A Case Study on Food Pantry Access
Touseef Hasan, Laila Cure, Souvika Sarkar · Feb 25, 2026 · Citations: 0

Simulation Env

We conduct a pilot evaluation study using community-sourced queries to examine system behavior in realistic scenarios.
ARLArena: A Unified Framework for Stable Agentic Reinforcement Learning
Xiaoxuan Wang, Han Zhang, Haixin Wang, Yidan Shi, Ruoyan Li · Feb 25, 2026 · Citations: 0

Simulation Env Long Horizon

Agentic reinforcement learning (ARL) has rapidly gained attention as a promising paradigm for training agents to solve complex, multi-step interactive tasks.
LiLo-VLA: Compositional Long-Horizon Manipulation via Linked Object-Centric Policies
Yue Yang, Shuo Cheng, Yu Fang, Homanga Bharadhwaj, Mingyu Ding · Feb 25, 2026 · Citations: 0

Simulation Env Long Horizon

We introduce a 21-task simulation benchmark consisting of two challenging suites: LIBERO-Long++ and Ultra-Long.
A Knowledge-Driven Approach to Music Segmentation, Music Source Separation and Cinematic Audio Source Separation
Chun-wei Ho, Sabato Marco Siniscalchi, Kai Li, Chin-Hui Lee · Feb 25, 2026 · Citations: 0

Simulation Env

Evaluation on simulation data shows that score-guided learning achieves very good music segmentation and separation results.
Efficient Hierarchical Any-Angle Path Planning on Multi-Resolution 3D Grids
Victor Reijgwart, Cesar Cadena, Roland Siegwart, Lionel Ott · Feb 24, 2026 · Citations: 0

Simulation Env Long Horizon

Hierarchical, multi-resolution volumetric mapping approaches are widely used to represent large and complex environments as they can efficiently capture their occupancy and connectivity information.
Cooperative-Competitive Team Play of Real-World Craft Robots
Rui Zhao, Xihui Li, Yizheng Zhang, Yuzhen Liu, Zhong Zhang · Feb 24, 2026 · Citations: 0

Simulation Env Multi Agent

Multi-agent deep Reinforcement Learning (RL) has made significant progress in developing intelligent game-playing agents in recent years.
Localized Dynamics-Aware Domain Adaption for Off-Dynamics Offline Reinforcement Learning
Zhangjie Xia, Yu Yang, Pan Xu · Feb 24, 2026 · Citations: 0

Simulation Env

Off-dynamics offline reinforcement learning (RL) aims to learn a policy for a target domain using limited target data and abundant source data collected under different transition dynamics.
Training-Free Intelligibility-Guided Observation Addition for Noisy ASR
Haoyang Li, Changsong Liu, Wei Rao, Hao Shi, Sakriani Sakti · Feb 24, 2026 · Citations: 0

Simulation Env

Automatic speech recognition (ASR) degrades severely in noisy environments.
Architecting AgentOS: From Token-Level Context to Emergent System-Level Intelligence
ChengYou Li, XiaoDong Liu, XiangBao Meng, XinYu Zhao · Feb 24, 2026 · Citations: 0

Simulation Env Multi Agent

The paradigm of Large Language Models is undergoing a fundamental transition from static inference engines to dynamic autonomous cognitive systems.While current research primarily focuses on scaling context windows or optimizing prompt engi
POMDPPlanners: Open-Source Package for POMDP Planning
Yaacov Pariente, Vadim Indelman · Feb 24, 2026 · Citations: 0

Simulation Env

We present POMDPPlanners, an open-source Python package for empirical evaluation of Partially Observable Markov Decision Process (POMDP) planning algorithms.
AdapTools: Adaptive Tool-based Indirect Prompt Injection Attacks on Agentic LLMs
Che Wang, Jiaming Zhang, Ziqi Zhang, Zijie Wang, Yinghui Wang · Feb 24, 2026 · Citations: 0

Simulation Env

The integration of external data services (e.g., Model Context Protocol, MCP) has made large language model-based agents increasingly powerful for complex task execution.
PreScience: A Benchmark for Forecasting Scientific Contributions
Anirudh Ajith, Amanpreet Singh, Jay DeYoung, Nadav Kunievsky, Austin C. Kozlowski · Feb 24, 2026 · Citations: 0

Human EvalSimulation Env

We introduce PreScience -- a scientific forecasting benchmark that decomposes the research process into four interdependent generative tasks: collaborator prediction, prior work selection, contribution generation, and impact prediction.
InterviewSim: A Scalable Framework for Interview-Grounded Personality Simulation
Yu Li, Pranav Narayanan Venkit, Yada Pruksachatkun, Chien-Sheng Wu · Feb 23, 2026 · Citations: 0

Simulation Env

Existing evaluation approaches rely on demographic surveys, personality questionnaires, or short AI-led interviews as proxies, but lack direct assessment against what individuals actually said.
Contextual Safety Reasoning and Grounding for Open-World Robots
Zachary Ravichandran, David Snyder, Alexander Robey, Hamed Hassani, Vijay Kumar · Feb 23, 2026 · Citations: 0

Simulation Env Web Browsing

Traditional safety approaches enforce fixed constraints in user-specified contexts, limiting their ability to handle the open-ended contextual variability of real-world deployment.
Context-Aware Mapping of 2D Drawing Annotations to 3D CAD Features Using LLM-Assisted Reasoning for Manufacturing Automation
Muhammad Tayyab Khan, Lequn Chen, Wenhe Feng, Seung Ki Moon · Feb 20, 2026 · Citations: 0

Automatic MetricsSimulation Env

When deterministic scoring cannot resolve an ambiguity, the system escalates to multimodal and constrained large-language-model reasoning, followed by a single human-in-the-loop (HITL) review step.
The Statistical Signature of LLMs
Ortal Hadad, Edoardo Loru, Jacopo Nudo, Niccolò Di Marco, Matteo Cinelli · Feb 20, 2026 · Citations: 0

Simulation Env

We analyze compression behavior across three progressively more complex information ecosystems: controlled human-LLM continuations, generative mediation of a knowledge infrastructure (Wikipedia vs.
Neural Synchrony Between Socially Interacting Language Models
Zhining Zhang, Wentao Zhu, Chi Han, Yizhou Wang, Heng Ji · Feb 19, 2026 · Citations: 0

Simulation Env

Neuroscience has uncovered a fundamental mechanism of our social nature: human brain activity becomes synchronized with others in many social contexts involving interaction.
HQFS: Hybrid Quantum Classical Financial Security with VQC Forecasting, QUBO Annealing, and Audit-Ready Post-Quantum Signing
Srikumar Nayak · Feb 19, 2026 · Citations: 0

Simulation Env

Here's the corrected paragraph with all punctuation and formatting issues fixed: Financial risk systems usually follow a two-step routine: a model predicts return or risk, and then an optimizer makes a decision such as a portfolio rebalance
ConvApparel: A Benchmark Dataset and Validation Framework for User Simulators in Conversational Recommenders
Ofer Meshi, Krisztian Balog, Sally Goldman, Avi Caciularu, Guy Tennenholtz · Feb 18, 2026 · Citations: 0

Simulation Env

We introduce ConvApparel, a new dataset of human-AI conversations designed to address this gap.
SimToolReal: An Object-Centric Policy for Zero-Shot Dexterous Tool Manipulation
Kushal Kedia, Tyler Ga Wei Lum, Jeannette Bohg, C. Karen Liu · Feb 18, 2026 · Citations: 0

Simulation Env

The ability to manipulate tools significantly expands the set of tasks a robot can perform.
MemoryArena: Benchmarking Agent Memory in Interdependent Multi-Session Agentic Tasks
Zexue He, Yu Wang, Churan Zhi, Yuanzhe Hu, Tzu-Ping Chen · Feb 18, 2026 · Citations: 0

Pairwise Preference Simulation Env Web Browsing

Existing evaluations of agents with memory typically assess memorization and action in isolation.
World-Model-Augmented Web Agents with Action Correction
Zhouzhou Shen, Xueyu Hu, Xiyun Li, Tianqing Fang, Juncheng Li · Feb 17, 2026 · Citations: 0

Llm As JudgeSimulation Env Multi Agent

Web agents based on large language models have demonstrated promising capability in automating web tasks.
Colosseum: Auditing Collusion in Cooperative Multi-Agent Systems
Mason Nakamura, Abhinav Kumar, Saswat Das, Sahar Abdelnabi, Saaduddin Mahmud · Feb 16, 2026 · Citations: 0

Simulation Env Multi Agent

Multi-agent systems, where LLM agents communicate through free-form language, enable sophisticated coordination for solving complex cooperative tasks.
OpaqueToolsBench: Learning Nuances of Tool Behavior Through Interaction
Skyler Hallinan, Thejas Venkatesh, Xiang Ren, Sai Praneeth Karimireddy, Ashwin Paranjape · Feb 16, 2026 · Citations: 0

Simulation Env Tool Use

Tool-calling is essential for Large Language Model (LLM) agents to complete real-world tasks.
A Comparative Analysis of Social Network Topology in Reddit and Moltbook
Yiming Zhu, Gareth Tyson, Pan Hui · Feb 14, 2026 · Citations: 0

Simulation Env

Recent advances in agent-mediated systems have enabled a new paradigm of social network simulation, where AI agents interact with human-like autonomy.
Towards interpretable models for language proficiency assessment: Predicting the CEFR level of Estonian learner texts
Kais Allkivi · Feb 13, 2026 · Citations: 0

Automatic MetricsSimulation Env

Additional evaluation on an earlier exam sample revealed that the writings have become more complex over a 7-10-year period, while accuracy still reached 0.8 with some feature sets.
BrowseComp-$V^3$: A Visual, Vertical, and Verifiable Benchmark for Multimodal Browsing Agents
Huanyao Zhang, Jiepeng Zhou, Bo Li, Bowen Zhou, Yanzhe Shan · Feb 13, 2026 · Citations: 0

Automatic MetricsSimulation Env Web Browsing

Multimodal large language models (MLLMs), equipped with increasingly advanced planning and tool-use capabilities, are evolving into autonomous agents capable of performing multimodal web browsing and deep search in open-world environments.
TSR: Trajectory-Search Rollouts for Multi-Turn RL of LLM Agents
Aladin Djuhera, Swanand Ravindra Kadhe, Farhan Ahmed, Heiko Ludwig, Holger Boche · Feb 12, 2026 · Citations: 0

Simulation Env Long Horizon

Advances in large language models (LLMs) are driving a shift toward using reinforcement learning (RL) to train agents from iterative, multi-turn interactions across tasks.
Transforming Science Learning Materials in the Era of Artificial Intelligence
Xiaoming Zhai, Kent Crippen · Feb 8, 2026 · Citations: 0

Simulation Env

However, these innovations also raise critical ethical and pedagogical concerns, including issues of algorithmic bias, data privacy, transparency, and the need for human oversight.
How Well Can LLM Agents Simulate End-User Security and Privacy Attitudes and Behaviors?
Yuxuan Li, Leyang Li, Hao-Ping Lee, Sauvik Das · Feb 6, 2026 · Citations: 0

Simulation Env

A growing body of research assumes that large language model (LLM) agents can serve as proxies for how people form attitudes toward and behave in response to security and privacy (S&P) threats.
The Single-Multi Evolution Loop for Self-Improving Model Collaboration Systems
Shangbin Feng, Kishan Panaganti, Yulia Tsvetkov, Wenhao Yu · Feb 5, 2026 · Citations: 0

Simulation Env

Model collaboration -- systems where multiple language models (LMs) collaborate -- combines the strengths of diverse models with cost in loading multiple LMs.
DCoPilot: Generative AI-Empowered Policy Adaptation for Dynamic Data Center Operations
Minghao Li, Ruihang Wang, Rui Tan, Yonggang Wen · Feb 2, 2026 · Citations: 0

Simulation Env

However, manually designing piecewise deep reinforcement learning (DRL) agents cannot keep pace with frequent dynamics shifts and service-level agreement (SLA) changes of an evolving DC.
Toward Ultra-Long-Horizon Agentic Science: Cognitive Accumulation for Machine Learning Engineering
Xinyu Zhu, Yuzhu Cai, Zexi Liu, Bingyang Zheng, Cheng Wang · Jan 15, 2026 · Citations: 0

Simulation Env Long Horizon

The advancement of artificial intelligence toward agentic science is currently bottlenecked by the challenge of ultra-long-horizon autonomy, the ability to sustain strategic coherence and iterative correction over experimental cycles spanni
Fast-ThinkAct: Efficient Vision-Language-Action Reasoning via Verbalizable Latent Planning
Chi-Pin Huang, Yunze Man, Zhiding Yu, Min-Hung Chen, Jan Kautz · Jan 14, 2026 · Citations: 0

Pairwise Preference Simulation Env Long Horizon

Fast-ThinkAct learns to reason efficiently with latent CoTs by distilling from a teacher, driven by a preference-guided objective to align manipulation trajectories that transfers both linguistic and visual planning capabilities for embodie
DIAL: Direct Iterative Adversarial Learning for Realistic Multi-Turn Dialogue Simulation
Ziyi Zhu, Olivier Tieleman, Caitlin A. Stamatis, Luka Smyth, Thomas D. Hull · Dec 23, 2025 · Citations: 0

Automatic MetricsSimulation Env

Realistic user simulation is crucial for training and evaluating multi-turn dialogue systems, yet creating simulators that accurately replicate human behavior remains a significant challenge.
Aerial Vision-Language Navigation with a Unified Framework for Spatial, Temporal and Embodied Reasoning
Huilin Xu, Zhuoyang Liu, Yixiang Luomei, Feng Xu · Dec 9, 2025 · Citations: 0

Simulation Env Long Horizon

Extensive experiments on the AerialVLN and OpenFly benchmark validate the effectiveness of our method.
From Moderation to Mediation: Can LLMs Serve as Mediators in Online Flame Wars?
Dawei Li, Abdullah Alnaibari, Arslan Bisharat, Manny Sandoval, Deborah Hall · Dec 2, 2025 · Citations: 0

Simulation Env

To assess mediation quality, we construct a large Reddit-based dataset and propose a multi-stage evaluation pipeline combining principle-based scoring, user simulation, and human comparison.
PEFT-Factory: Unified Parameter-Efficient Fine-Tuning of Autoregressive Large Language Models
Robert Belanec, Ivan Srba, Maria Bielikova · Dec 2, 2025 · Citations: 0

Simulation Env

While its modular design supports extensibility, it natively provides a representative set of 19 PEFT methods, 27 classification and text generation datasets addressing 12 tasks, and both standard and PEFT-specific evaluation metrics.
PEFT-Bench: A Parameter-Efficient Fine-Tuning Methods Benchmark
Robert Belanec, Branislav Pecher, Ivan Srba, Maria Bielikova · Nov 26, 2025 · Citations: 0

Simulation Env

Despite the advances in PEFT methods, current evaluations remain limited (in terms of evaluated models and datasets) and difficult to reproduce.
Multimodal Peer Review Simulation with Actionable To-Do Recommendations for Community-Aware Manuscript Revisions
Mengze Hong, Di Jiang, Weiwei Zhao, Yawen Li, Yihang Wang · Nov 14, 2025 · Citations: 0

Critique Edit Simulation Env

Experimental results highlight the effectiveness of the proposed system in generating more comprehensive and useful reviews aligned with expert standards, surpassing ablated baselines and advancing transparent, human-centered scholarly assi
BEAT: Visual Backdoor Attacks on VLM-based Embodied Agents via Contrastive Trigger Learning
Qiusi Zhan, Hyeonjeong Ha, Rui Yang, Sirui Xu, Hanyang Chen · Oct 31, 2025 · Citations: 0

Pairwise Preference Automatic MetricsSimulation Env Long Horizon

Recent advances in Vision-Language Models (VLMs) have propelled embodied agents by enabling direct perception, reasoning, and planning task-oriented actions from visual inputs.
The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution
Junlong Li, Wenshuo Zhao, Jian Zhao, Weihao Zeng, Haoze Wu · Oct 29, 2025 · Citations: 0

Simulation Env Long Horizon

Real-world language agents must handle complex, multi-step workflows across diverse Apps.
MoMaGen: Generating Demonstrations under Soft and Hard Constraints for Multi-Step Bimanual Mobile Manipulation
Chengshu Li, Mengdi Xu, Arpit Bahety, Hang Yin, Yunfan Jiang · Oct 21, 2025 · Citations: 0

Demonstrations Simulation Env Long Horizon

Imitation learning from large-scale, diverse human demonstrations has been shown to be effective for training robots, but collecting such data is costly and time-consuming.
SPACeR: Self-Play Anchoring with Centralized Reference Models
Wei-Jer Chang, Akshay Rangesh, Kevin Joseph, Matthew Strong, Masayoshi Tomizuka · Oct 20, 2025 · Citations: 0

Demonstrations Simulation Env Multi Agent

Developing autonomous vehicles (AVs) requires not only safety and efficiency, but also realistic, human-like behaviors that are socially aware and predictable.
Detecting Early and Implicit Suicidal Ideation via Longitudinal and Information Environment Signals on Social Media
Soorya Ram Shimgekar, Ruining Zhao, Agam Goyal, Violeta J. Rodriguez, Paul A. Bloom · Oct 16, 2025 · Citations: 0

Simulation Env

On social media, several individuals experiencing suicidal ideation (SI) do not disclose their distress explicitly.
Assessing Web Search Credibility and Response Groundedness in Chat Assistants
Ivan Vykopal, Matúš Pikuliak, Simon Ostermann, Marián Šimko · Oct 15, 2025 · Citations: 0

Simulation Env

Chat assistants increasingly integrate web search functionality, enabling them to retrieve and cite external sources.
EconCausal: A Context-Aware Causal Reasoning Benchmark for Large Language Models in Social Science
Donggyu Lee, Hyeok Yun, Meeyoung Cha, Sungwon Park, Sangyoon Park · Oct 8, 2025 · Citations: 0

Automatic MetricsSimulation Env

To address this, we introduce EconCausal, a large-scale benchmark comprising 10,490 context-annotated causal triplets extracted from 2,595 high-quality empirical studies published in top-tier economics and finance journals.
Polychromic Objectives for Reinforcement Learning
Jubayer Ibn Hamid, Ifdita Hasan Orney, Ellen Xu, Chelsea Finn, Dorsa Sadigh · Sep 29, 2025 · Citations: 0

Simulation Env

Reinforcement learning fine-tuning (RLFT) is a dominant paradigm for improving pretrained policies for downstream tasks.
ProPerSim: Developing Proactive and Personalized AI Assistants through User-Assistant Simulation
Jiho Kim, Junseong Choi, Woosog Chay, Daeun Kyung, Yeonsu Kwon · Sep 26, 2025 · Citations: 0

Pairwise Preference Simulation Env

In our simulation environment, a user agent with a rich persona interacts with the assistant, providing ratings on how well each suggestion aligns with its preferences and context.
EpidemIQs: Prompt-to-Paper LLM Agents for Epidemic Modeling and Analysis
Mohammad Hossein Samaei, Faryad Darabi Sahneh, Lee W. Cohnstaedt, Caterina Scoglio · Sep 24, 2025 · Citations: 0

Expert Verification Llm As JudgeSimulation Env Multi Agent

We introduce EpidemIQs, a novel multi-agent LLM framework that integrates user inputs and autonomously conducts literature review, analytical derivation, network modeling, mechanistic modeling, stochastic simulations, data visualization and
Collaborative Document Editing with Multiple Users and AI Agents
Florian Lehmann, Krystsina Shauchenka, Daniel Buschek · Sep 15, 2025 · Citations: 0

Simulation Env Multi Agent

We propose integrating AI agents directly into collaborative writing environments.
Language and Experience: A Computational Model of Social Learning in Complex Tasks
Cédric Colas, Tracey Mills, Ben Prystawski, Michael Henry Tessler, Noah Goodman · Aug 26, 2025 · Citations: 0

Simulation Env

The ability to combine linguistic guidance from others with direct experience is central to human development, enabling safe and rapid learning in new environments.
Shop-R1: Rewarding LLMs to Simulate Human Behavior in Online Shopping via Reinforcement Learning
Yimeng Zhang, Tian Wang, Jiri Gesi, Ziyi Wang, Yuxuan Lu · Jul 23, 2025 · Citations: 0

Simulation Env

Large Language Models (LLMs) have recently demonstrated strong potential in generating 'believable human-like' behavior in web environments.
On the Inference (In-)Security of Vertical Federated Learning: Efficient Auditing against Inference Tampering Attack
Chung-ju Huang, Ziqi Zhang, Yinggui Wang, Binghui Wang, Tao Wei · Jul 3, 2025 · Citations: 0

Automatic MetricsSimulation Env

Vertical Federated Learning (VFL) is an emerging distributed learning paradigm for cross-silo collaboration without accessing participants' data.
Sensory-Motor Control with Large Language Models via Iterative Policy Refinement
Jônata Tyska Carvalho, Stefano Nolfi · Jun 5, 2025 · Citations: 0

Simulation Env

We propose a method that enables large language models (LLMs) to control embodied agents through the generation of control policies that directly map continuous observation vectors to continuous action vectors.
Synthesis of discrete-continuous quantum circuits with multimodal diffusion models
Florian Fürrutter, Zohim Chandani, Ikko Hamamura, Hans J. Briegel, Gorka Muñoz-Gil · Jun 2, 2025 · Citations: 0

Automatic MetricsSimulation Env

We benchmark the model over different experiments, analyzing the method's accuracy across varying qubit counts and circuit depths, showcasing the ability of the method to outperform existing approaches in gate counts and under noisy conditi
RedTeamCUA: Realistic Adversarial Testing of Computer-Use Agents in Hybrid Web-OS Environments
Zeyi Liao, Jaylen Jones, Linxi Jiang, Yuting Ning, Eric Fosler-Lussier · May 28, 2025 · Citations: 0

Red Team Simulation Env Web Browsing

Computer-use agents (CUAs) promise to automate complex tasks across operating systems (OS) and the web, but remain vulnerable to indirect prompt injection.
Can LLMs Simulate Human Behavioral Variability? A Case Study in the Phonemic Fluency Task
Mengyang Qiu, Zoe Brisebois, Siena Sun · May 22, 2025 · Citations: 0

Pairwise Preference Simulation Env

Large language models (LLMs) are increasingly explored as substitutes for human participants in cognitive tasks, but their ability to simulate human behavioral variability remains unclear.

General + Simulation Env Papers

Research Narrative

Why This Matters For Eval Research

Protocol Takeaways

Benchmark Interpretation

Metric Interpretation

Researcher Checklist

Suggested Reading Order

Known Limitations

Research Utility Links

Top Papers

Related Hubs