HFEPX Hub

CS.CL + Simulation Env Papers

Updated from the current HFEPX corpus (Feb 27, 2026). This hub page groups 71 papers. Common evaluation modes: Simulation Env, Automatic Metrics. Most common rater population: Domain Experts. Common annotation unit: Trajectory. Frequent quality control: Inter-Annotator Agreement Reported. Frequently cited benchmark: Retrieval. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. The newest paper in this set is from Feb 26, 2026.

Papers: 71 · Last published: Feb 26, 2026

Research Narrative

Grounded narrative · Model: deterministic-grounded · Source: persisted

Updated from the current HFEPX corpus (Feb 27, 2026). This page tracks 71 papers for CS.CL + Simulation Env Papers. Dominant protocol signals include simulation environments, automatic metrics, and human evaluation, with frequent benchmark focus on Retrieval and APPS and metric focus on accuracy and cost. Use the grounded sections below to prioritize reproducible protocol choices, benchmark-matched comparisons, and judge-vs-human evaluation checks.

Why This Matters For Eval Research

15.5% of papers report explicit human-feedback signals, led by pairwise preferences.

Evidence: CiteLLM: An Agentic Platform for Trustworthy Scientific Reference Discovery; TherapyProbe: Generating Design Knowledge for Relational Safety in Mental Health Chatbots Through Adversarial Simulation; Towards Simulating Social Media Users with LLMs: Evaluating the Operational Validity of Conditioned Comment Prediction; Dynamic Personality Adaptation in Large Language Models via State Machines

Simulation environments appear in 100% of papers in this hub.

Evidence: CiteLLM: An Agentic Platform for Trustworthy Scientific Reference Discovery; TherapyProbe: Generating Design Knowledge for Relational Safety in Mental Health Chatbots Through Adversarial Simulation; Towards Simulating Social Media Users with LLMs: Evaluating the Operational Validity of Conditioned Comment Prediction; Dynamic Personality Adaptation in Large Language Models via State Machines

Retrieval is a recurring benchmark anchor for cross-paper comparisons on this page.

Evidence: InterviewSim: A Scalable Framework for Interview-Grounded Personality Simulation; CiteLLM: An Agentic Platform for Trustworthy Scientific Reference Discovery; TherapyProbe: Generating Design Knowledge for Relational Safety in Mental Health Chatbots Through Adversarial Simulation; Towards Simulating Social Media Users with LLMs: Evaluating the Operational Validity of Conditioned Comment Prediction

Protocol Takeaways

The most common quality-control signal is inter-annotator agreement reporting (1.4% of papers).

Evidence: PreScience: A Benchmark for Forecasting Scientific Contributions; CiteLLM: An Agentic Platform for Trustworthy Scientific Reference Discovery; TherapyProbe: Generating Design Knowledge for Relational Safety in Mental Health Chatbots Through Adversarial Simulation; Towards Simulating Social Media Users with LLMs: Evaluating the Operational Validity of Conditioned Comment Prediction

Raters are mostly domain experts, and annotation is commonly at the trajectory level; use this to scope replication staffing.

Evidence: TherapyProbe: Generating Design Knowledge for Relational Safety in Mental Health Chatbots Through Adversarial Simulation; Assessing Risks of Large Language Models in Mental Health Support: A Framework for Automated Clinical AI Red Teaming; CiteLLM: An Agentic Platform for Trustworthy Scientific Reference Discovery; Towards Simulating Social Media Users with LLMs: Evaluating the Operational Validity of Conditioned Comment Prediction

Compare papers that report both human_eval and llm_as_judge to quantify judge-human agreement drift. Note that the overlap counts on this page show no paper in this hub reporting both.

Evidence: MixSarc: A Bangla-English Code-Mixed Corpus for Implicit Meaning Identification; PreScience: A Benchmark for Forecasting Scientific Contributions; CiteLLM: An Agentic Platform for Trustworthy Scientific Reference Discovery; TherapyProbe: Generating Design Knowledge for Relational Safety in Mental Health Chatbots Through Adversarial Simulation

Benchmark Interpretation

Retrieval appears in 9.9% of hub papers (7/71); use this cohort for benchmark-matched comparisons.
APPS appears in 2.8% of hub papers (2/71); use this cohort for benchmark-matched comparisons.

Metric Interpretation

accuracy is reported in 18.3% of hub papers (13/71); compare with a secondary metric before ranking methods.
cost is reported in 11.3% of hub papers (8/71); compare with a secondary metric before ranking methods.
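
The shares quoted in the benchmark and metric lists above can be re-derived from the raw paper counts. The following is a minimal sketch, not the hub's actual pipeline: the function name `coverage_share` and the rounding to one decimal place are assumptions chosen to match the figures shown on this page.

```python
def coverage_share(count: int, total: int) -> float:
    """Return count/total as a percentage, rounded to one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100 * count / total, 1)


# Hub size and per-signal counts as stated on this page.
HUB_SIZE = 71
counts = {"Retrieval": 7, "APPS": 2, "accuracy": 13, "cost": 8}

# Hypothetical helper output: signal name -> percentage of hub papers.
shares = {name: coverage_share(n, HUB_SIZE) for name, n in counts.items()}

for name, pct in shares.items():
    print(f"{name}: {pct}% of hub papers ({counts[name]}/{HUB_SIZE})")
```

Because every cohort here is small (2 to 13 papers), a single reclassified paper moves a share by about 1.4 percentage points, which is worth keeping in mind before ranking methods on these slices.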

Researcher Checklist

Close gap on Papers with explicit human feedback. Coverage is a replication risk (15.5% vs 45% target).
Close gap on Papers reporting quality controls. Coverage is a replication risk (1.4% vs 30% target).
Tighten coverage on Papers naming benchmarks/datasets. Coverage is usable but incomplete (31% vs 35% target).
Maintain strength on Papers naming evaluation metrics. Coverage is strong (49.3% vs 35% target).
Close gap on Papers with known rater population. Coverage is a replication risk (9.9% vs 35% target).
Close gap on Papers with known annotation unit. Coverage is a replication risk (12.7% vs 35% target).
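
The checklist labels above compare observed coverage against a target. The page does not state the cutoffs it uses, so the sketch below assumes hypothetical thresholds (coverage at or above the target is "strong", at least 80% of the target is "usable but incomplete", anything lower is a "replication risk"). These thresholds and the name `coverage_status` are illustrative assumptions that happen to reproduce the labels shown here.

```python
def coverage_status(coverage_pct: float, target_pct: float) -> str:
    """Label coverage relative to a target using assumed (hypothetical) cutoffs."""
    if target_pct <= 0:
        raise ValueError("target_pct must be positive")
    ratio = coverage_pct / target_pct
    if ratio >= 1.0:          # assumption: meeting the target counts as strong
        return "strong"
    if ratio >= 0.8:          # assumption: within 20% of the target is usable
        return "usable but incomplete"
    return "replication risk"


# (checklist item, observed coverage %, target %) as listed on this page.
checklist = [
    ("Papers with explicit human feedback", 15.5, 45.0),
    ("Papers reporting quality controls", 1.4, 30.0),
    ("Papers naming benchmarks/datasets", 31.0, 35.0),
    ("Papers naming evaluation metrics", 49.3, 35.0),
    ("Papers with known rater population", 9.9, 35.0),
    ("Papers with known annotation unit", 12.7, 35.0),
]

statuses = {name: coverage_status(cov, tgt) for name, cov, tgt in checklist}

for name, status in statuses.items():
    print(f"{name}: {status}")
```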

Known Limitations

Only 1.4% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (9.9% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Links

Judge vs Human Agreement - Compares papers that evaluate with both human raters and LLM judges.
Benchmark Slice: Retrieval - Prioritizes benchmark-specific protocol comparisons.
Metric Slice: accuracy - Finds papers where reported metrics are directly comparable.
IAA-Reported Evaluations - Highlights evaluations that explicitly report inter-annotator agreement.
Recent High-Signal Papers - Keeps the hub connected to the latest HFEPX corpus updates.

human_eval vs llm_as_judge

both=0, left_only=3 (human_eval only), right_only=1 (llm_as_judge only)

No papers use both Human Eval and LLM As Judge.

human_eval vs automatic_metrics

both=1, left_only=2 (human_eval only), right_only=14 (automatic_metrics only)

1 paper uses both Human Eval and Automatic Metrics.

llm_as_judge vs automatic_metrics

both=0, left_only=1 (llm_as_judge only), right_only=15 (automatic_metrics only)

No papers use both LLM As Judge and Automatic Metrics.
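
The both/left_only/right_only triples above are plain set overlaps between the papers tagged with each evaluation mode. A minimal sketch follows, assuming each mode is stored as a set of paper identifiers; the identifiers `p1` to `p4` and the function name `overlap_counts` are hypothetical and only illustrate how the first triple could arise.

```python
from __future__ import annotations


def overlap_counts(left: set[str], right: set[str]) -> dict[str, int]:
    """Count papers tagged with both modes, only the left mode, or only the right mode."""
    return {
        "both": len(left & right),        # intersection
        "left_only": len(left - right),   # in left but not right
        "right_only": len(right - left),  # in right but not left
    }


# Hypothetical paper IDs; the real hub tags are not reproduced here.
human_eval = {"p1", "p2", "p3"}
llm_as_judge = {"p4"}

result = overlap_counts(human_eval, llm_as_judge)
print(result)
```

With these illustrative sets the result matches the human_eval vs llm_as_judge line (both=0, left_only=3, right_only=1). A `both` count of zero means there is no within-paper judge-versus-human comparison available in that pairing.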

Benchmark Brief

Retrieval

Coverage: 7 papers (9.9%) mention Retrieval.

Examples: InterviewSim: A Scalable Framework for Interview-Grounded Personality Simulation; Calibrate-Then-Act: Cost-Aware Exploration in LLM Agents; Quecto-V1: Empirical Analysis of 8-bit Quantized Small Language Models for On-Device Legal Retrieval

Benchmark Brief

APPS

Coverage: 2 papers (2.8%) mention APPS.

Examples: UI-Venus-1.5 Technical Report; The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution

Benchmark Brief

SWE-bench

Coverage: 2 papers (2.8%) mention SWE-bench.

Examples: Hybrid-Gym: Training Coding Agents to Generalize Across Tasks; SWE-Master: Unleashing the Potential of Software Engineering Agents via Post-Training

Metric Brief

accuracy

Coverage: 13 papers (18.3%) mention accuracy.

Examples: Small Wins Big: Comparing Large Language Models and Domain Fine-Tuned Models for Sarcasm Detection in Code-Mixed Hinglish Text; MixSarc: A Bangla-English Code-Mixed Corpus for Implicit Meaning Identification; Counterfactual Simulation Training for Chain-of-Thought Faithfulness

Metric Brief

cost

Coverage: 8 papers (11.3%) mention cost.

Examples: Calibrate-Then-Act: Cost-Aware Exploration in LLM Agents; MAEB: Massive Audio Embedding Benchmark; Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active Parameters

Metric Brief

coherence

Coverage: 5 papers (7%) mention coherence.

Examples: ToolMATH: A Math Tool Benchmark for Realistic Long-Horizon Multi-Tool Reasoning; From Pixels to Policies: Reinforcing Spatial Reasoning in Language Models for Content-Aware Layout Design; How Well Can LLM Agents Simulate End-User Security and Privacy Attitudes and Behaviors?

Most Cited In This Hub

Fast path to methods with the strongest citation traction in this scope.

Papers: CiteLLM: An Agentic Platform for Trustworthy Scientific Reference Discovery; TherapyProbe: Generating Design Knowledge for Relational Safety in Mental Health Chatbots Through Adversarial Simulation; Towards Simulating Social Media Users with LLMs: Evaluating the Operational Validity of Conditioned Comment Prediction

Most Recent

Fast path to latest protocol changes and newly published evaluation setups.

Best Protocol Detail

Papers with explicit rater/unit metadata and quality-control signals for reproducibility.

Top Papers

CiteLLM: An Agentic Platform for Trustworthy Scientific Reference Discovery
Mengze Hong, Di Jiang, Chen Jason Zhang, Zichang Guo, Yawen Li · Feb 26, 2026 · Citations: 0

Simulation Env

In this work, we present CiteLLM, a specialized agentic platform designed to enable trustworthy reference discovery for grounding author-drafted claims and statements.
TherapyProbe: Generating Design Knowledge for Relational Safety in Mental Health Chatbots Through Adversarial Simulation
Joydeep Chandra, Satyam Kumar Navneet, Yong Zhang · Feb 26, 2026 · Citations: 0

Expert Verification Simulation Env Multi Agent

As mental health chatbots proliferate to address the global treatment gap, a critical question emerges: How do we design for relational safety, the quality of interaction patterns that unfold across conversations rather than the correctness
Towards Simulating Social Media Users with LLMs: Evaluating the Operational Validity of Conditioned Comment Prediction
Nils Schwager, Simon Münker, Alistair Plum, Achim Rettinger · Feb 26, 2026 · Citations: 0

Simulation Env

This framework enables a rigorous evaluation of current LLM capabilities with respect to the simulation of social media user behavior.
Dynamic Personality Adaptation in Large Language Models via State Machines
Leon Pielage, Ole Hätscher, Mitja Back, Bernhard Marschall, Benjamin Risse · Feb 25, 2026 · Citations: 0

Simulation Env

This work demonstrates the feasibility of modular, personality-adaptive architectures for education, customer support, and broader human-computer interaction.
TG-ASR: Translation-Guided Learning with Parallel Gated Cross Attention for Low-Resource Automatic Speech Recognition
Cheng-Yeh Yang, Chien-Chun Wang, Li-Wei Chen, Hung-Shin Lee, Hsin-Min Wang · Feb 25, 2026 · Citations: 0

Simulation Env

Low-resource automatic speech recognition (ASR) continues to pose significant challenges, primarily due to the limited availability of transcribed data for numerous languages.
Small Wins Big: Comparing Large Language Models and Domain Fine-Tuned Models for Sarcasm Detection in Code-Mixed Hinglish Text
Bitan Majumder, Anirban Sen · Feb 25, 2026 · Citations: 0

Automatic Metrics Simulation Env

Sarcasm detection in multilingual and code-mixed environments remains a challenging task for natural language processing models due to structural variations, informal expressions, and low-resource linguistic availability.
MixSarc: A Bangla-English Code-Mixed Corpus for Implicit Meaning Identification
Kazi Samin Yasar Alam, Md Tanbir Chowdhury, Tamim Ahmed, Ajwad Abrar, Md Rafid Haque · Feb 25, 2026 · Citations: 0

Human Eval Automatic Metrics

We benchmark transformer-based models and evaluate zero-shot large language models under structured prompting.
Evaluating Proactive Risk Awareness of Large Language Models
Xuan Luo, Yubin Chen, Zhiyu Hou, Linpu Yu, Geng Tu · Feb 24, 2026 · Citations: 0

Simulation Env

As large language models (LLMs) are increasingly embedded in everyday decision-making, their safety responsibilities extend beyond reacting to explicit harmful intent toward anticipating unintended but consequential risks.
Explicit Grammar Semantic Feature Fusion for Robust Text Classification
Azrin Sultana, Firoz Ahmed · Feb 24, 2026 · Citations: 0

Simulation Env

Natural Language Processing enables computers to understand human language by analysing and classifying text efficiently with deep-level grammatical and semantic features.
ToolMATH: A Math Tool Benchmark for Realistic Long-Horizon Multi-Tool Reasoning
Hyeonje Choi, Jeongsoo Lee, Hyojun Lee, Jay-Yoon Lee · Feb 24, 2026 · Citations: 0

Simulation Env Long Horizon

We introduce ToolMATH, a math-grounded benchmark that evaluates tool-augmented language models in realistic multi-tool environments where the output depends on calling schema-specified tools and sustaining multi-step execution.
Counterfactual Simulation Training for Chain-of-Thought Faithfulness
Peter Hase, Christopher Potts · Feb 24, 2026 · Citations: 0

Automatic Metrics Simulation Env

Inspecting Chain-of-Thought reasoning is among the most common means of understanding why an LLM produced its output.
PreScience: A Benchmark for Forecasting Scientific Contributions
Anirudh Ajith, Amanpreet Singh, Jay DeYoung, Nadav Kunievsky, Austin C. Kozlowski · Feb 24, 2026 · Citations: 0

Human Eval Simulation Env

We introduce PreScience -- a scientific forecasting benchmark that decomposes the research process into four interdependent generative tasks: collaborator prediction, prior work selection, contribution generation, and impact prediction.
InterviewSim: A Scalable Framework for Interview-Grounded Personality Simulation
Yu Li, Pranav Narayanan Venkit, Yada Pruksachatkun, Chien-Sheng Wu · Feb 23, 2026 · Citations: 0

Simulation Env

Existing evaluation approaches rely on demographic surveys, personality questionnaires, or short AI-led interviews as proxies, but lack direct assessment against what individuals actually said.
Assessing Risks of Large Language Models in Mental Health Support: A Framework for Automated Clinical AI Red Teaming
Ian Steenstra, Paola Pedrelli, Weiyan Shi, Stacy Marsella, Timothy W. Bickmore · Feb 23, 2026 · Citations: 0

Red Team Simulation Env

Large Language Models (LLMs) are increasingly utilized for mental health support; however, current safety benchmarks often fail to detect the complex, longitudinal risks inherent in therapeutic dialogue.
Reasoning Capabilities of Large Language Models. Lessons Learned from General Game Playing
Maciej Świechowski, Adam Żychowski, Jacek Mańdziuk · Feb 22, 2026 · Citations: 0

Simulation Env

The main results indicate that three of the evaluated models generally perform well across most experimental settings, with performance degradation observed as the evaluation horizon increases (i.e., with a higher number of game steps).
Benchmark Test-Time Scaling of General LLM Agents
Xiaochuan Li, Ryan Ming, Pranav Setlur, Abhijay Paladugu, Andy Tang · Feb 22, 2026 · Citations: 0

Simulation Env

LLM agents are increasingly expected to function as general-purpose systems capable of resolving open-ended user requests.
SPQ: An Ensemble Technique for Large Language Model Compression
Jiamin Yao, Eren Gultepe · Feb 20, 2026 · Citations: 0

Automatic Metrics Simulation Env

Applied to LLaMA-2-7B, SPQ achieves up to 75% memory reduction while maintaining or improving perplexity (e.g., WikiText-2 5.47 to 4.91) and preserving accuracy on downstream benchmarks such as C4, TruthfulQA, and GSM8K.
Analyzing and Improving Chain-of-Thought Monitorability Through Information Theory
Usman Anwar, Tim Bakker, Dana Kianfar, Cristina Pinneri, Christos Louizos · Feb 20, 2026 · Citations: 0

Automatic Metrics Simulation Env

Chain-of-thought (CoT) monitors are LLM-based systems that analyze reasoning traces to detect when outputs may exhibit attributes of interest, such as test-hacking behavior during code generation.
The Statistical Signature of LLMs
Ortal Hadad, Edoardo Loru, Jacopo Nudo, Niccolò Di Marco, Matteo Cinelli · Feb 20, 2026 · Citations: 0

Simulation Env

We analyze compression behavior across three progressively more complex information ecosystems: controlled human-LLM continuations, generative mediation of a knowledge infrastructure (Wikipedia vs.
NIMMGen: Learning Neural-Integrated Mechanistic Digital Twins with LLMs
Zihan Guan, Rituparna Datta, Mengxuan Hu, Shunshun Liu, Aiying Zhang · Feb 20, 2026 · Citations: 0

Simulation Env

Recent work has explored LLM-based agentic frameworks to automatically construct mechanistic models from data; however, existing problem settings substantially oversimplify real-world conditions, leaving it unclear whether LLM-generated mec
Neural Synchrony Between Socially Interacting Language Models
Zhining Zhang, Wentao Zhu, Chi Han, Yizhou Wang, Heng Ji · Feb 19, 2026 · Citations: 0

Simulation Env

Neuroscience has uncovered a fundamental mechanism of our social nature: human brain activity becomes synchronized with others in many social contexts involving interaction.
HQFS: Hybrid Quantum Classical Financial Security with VQC Forecasting, QUBO Annealing, and Audit-Ready Post-Quantum Signing
Srikumar Nayak · Feb 19, 2026 · Citations: 0

Simulation Env

Financial risk systems usually follow a two-step routine: a model predicts return or risk, and then an optimizer makes a decision such as a portfolio rebalance
ConvApparel: A Benchmark Dataset and Validation Framework for User Simulators in Conversational Recommenders
Ofer Meshi, Krisztian Balog, Sally Goldman, Avi Caciularu, Guy Tennenholtz · Feb 18, 2026 · Citations: 0

Simulation Env

We introduce ConvApparel, a new dataset of human-AI conversations designed to address this gap.
Hybrid-Gym: Training Coding Agents to Generalize Across Tasks
Yiqing Xie, Emmy Liu, Gaokai Zhang, Nachiket Kotalwar, Shubham Gandhi · Feb 18, 2026 · Citations: 0

Simulation Env

When assessing the quality of coding agents, predominant benchmarks focus on solving single issues on GitHub, such as SWE-Bench.
Calibrate-Then-Act: Cost-Aware Exploration in LLM Agents
Wenxuan Ding, Nicholas Tomlin, Greg Durrett · Feb 18, 2026 · Citations: 0

Simulation Env

Each problem has latent environment state that can be reasoned about via a prior which is passed to the LLM agent.
Quecto-V1: Empirical Analysis of 8-bit Quantized Small Language Models for On-Device Legal Retrieval
Subrit Dikshit · Feb 18, 2026 · Citations: 0

Automatic Metrics Simulation Env

The rapid proliferation of Large Language Models (LLMs) has revolutionized Natural Language Processing (NLP) but has simultaneously created a "resource divide." State-of-the-art legal intelligence systems typically rely on massive parameter
MemoryArena: Benchmarking Agent Memory in Interdependent Multi-Session Agentic Tasks
Zexue He, Yu Wang, Churan Zhi, Yuanzhe Hu, Tzu-Ping Chen · Feb 18, 2026 · Citations: 0

Pairwise Preference Simulation Env Web Browsing

Existing evaluations of agents with memory typically assess memorization and action in isolation.
MAEB: Massive Audio Embedding Benchmark
Adnan El Assadi, Isaac Chung, Chenghao Xiao, Roman Solomatin, Animesh Jha · Feb 17, 2026 · Citations: 0

Simulation Env

We introduce the Massive Audio Embedding Benchmark (MAEB), a large-scale benchmark covering 30 tasks across speech, music, environmental sounds, and cross-modal audio-text reasoning in 100+ languages.
World-Model-Augmented Web Agents with Action Correction
Zhouzhou Shen, Xueyu Hu, Xiyun Li, Tianqing Fang, Juncheng Li · Feb 17, 2026 · Citations: 0

LLM As Judge Simulation Env Multi Agent

Web agents based on large language models have demonstrated promising capability in automating web tasks.
FrameRef: A Framing Dataset and Simulation Testbed for Modeling Bounded Rational Information Health
Victor De Lima, Jiqun Liu, Grace Hui Yang · Feb 17, 2026 · Citations: 0

Human Eval Simulation Env Long Horizon

Within this framework, we construct framing-sensitive agent personas by fine-tuning language models with framing-conditioned loss attenuation, inducing targeted biases while preserving overall task competence.
Colosseum: Auditing Collusion in Cooperative Multi-Agent Systems
Mason Nakamura, Abhinav Kumar, Saswat Das, Sahar Abdelnabi, Saaduddin Mahmud · Feb 16, 2026 · Citations: 0

Simulation Env Multi Agent

Multi-agent systems, where LLM agents communicate through free-form language, enable sophisticated coordination for solving complex cooperative tasks.
OpaqueToolsBench: Learning Nuances of Tool Behavior Through Interaction
Skyler Hallinan, Thejas Venkatesh, Xiang Ren, Sai Praneeth Karimireddy, Ashwin Paranjape · Feb 16, 2026 · Citations: 0

Simulation Env Tool Use

Tool-calling is essential for Large Language Model (LLM) agents to complete real-world tasks.
Physical Commonsense Reasoning for Lower-Resourced Languages and Dialects: a Study on Basque
Jaione Bengoetxea, Itziar Gonzalez-Dios, Rodrigo Agerri · Feb 16, 2026 · Citations: 0

Automatic Metrics Simulation Env

Physical commonsense reasoning represents a fundamental capability of human intelligence, enabling individuals to understand their environment, predict future events, and navigate physical spaces.
Does Socialization Emerge in AI Agent Society? A Case Study of Moltbook
Ming Li, Xirui Li, Tianyi Zhou · Feb 15, 2026 · Citations: 0

Simulation Env Multi Agent

As large language model agents increasingly populate networked environments, a fundamental question arises: do artificial intelligence (AI) agent societies undergo convergence dynamics similar to human social systems?
From Pixels to Policies: Reinforcing Spatial Reasoning in Language Models for Content-Aware Layout Design
Sha Li, Stefano Petrangeli, Yu Shen, Xiang Chen · Feb 14, 2026 · Citations: 0

Critique Edit Simulation Env

We introduce LaySPA, a reinforcement learning framework that equips large language models (LLMs) with explicit and interpretable spatial reasoning for content-aware graphic layout design.
Towards interpretable models for language proficiency assessment: Predicting the CEFR level of Estonian learner texts
Kais Allkivi · Feb 13, 2026 · Citations: 0

Automatic Metrics Simulation Env

Additional evaluation on an earlier exam sample revealed that the writings have become more complex over a 7-10-year period, while accuracy still reached 0.8 with some feature sets.
TSR: Trajectory-Search Rollouts for Multi-Turn RL of LLM Agents
Aladin Djuhera, Swanand Ravindra Kadhe, Farhan Ahmed, Heiko Ludwig, Holger Boche · Feb 12, 2026 · Citations: 0

Simulation Env Long Horizon

Advances in large language models (LLMs) are driving a shift toward using reinforcement learning (RL) to train agents from iterative, multi-turn interactions across tasks.
Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active Parameters
Ailin Huang, Ang Li, Aobo Kong, Bin Wang, Binxing Jiao · Feb 11, 2026 · Citations: 0

Pairwise Preference Simulation Env Tool Use

We introduce Step 3.5 Flash, a sparse Mixture-of-Experts (MoE) model that bridges frontier-level agentic intelligence and computational efficiency.
TestExplora: Benchmarking LLMs for Proactive Bug Discovery via Repository-Level Test Generation
Steven Liu, Jane Luo, Xin Zhang, Aofan Liu, Hao Liu · Feb 11, 2026 · Citations: 0

Simulation Env

Current evaluations systematically overlook the third goal.
UI-Venus-1.5 Technical Report
Venus Team, Changlong Gao, Zhangxuan Gu, Yulin Liu, Xinyu Qiu · Feb 9, 2026 · Citations: 0

Simulation Env Long Horizon

GUI agents have emerged as a powerful paradigm for automating interactions in digital environments, yet achieving both broad generality and consistently strong task performance remains challenging.
How Well Can LLM Agents Simulate End-User Security and Privacy Attitudes and Behaviors?
Yuxuan Li, Leyang Li, Hao-Ping Lee, Sauvik Das · Feb 6, 2026 · Citations: 0

Simulation Env

A growing body of research assumes that large language model (LLM) agents can serve as proxies for how people form attitudes toward and behave in response to security and privacy (S&P) threats.
The Single-Multi Evolution Loop for Self-Improving Model Collaboration Systems
Shangbin Feng, Kishan Panaganti, Yulia Tsvetkov, Wenhao Yu · Feb 5, 2026 · Citations: 0

Simulation Env

Model collaboration -- systems where multiple language models (LMs) collaborate -- combines the strengths of diverse models with cost in loading multiple LMs.
SWE-Master: Unleashing the Potential of Software Engineering Agents via Post-Training
Huatong Song, Lisheng Huang, Shuang Sun, Jinhao Jiang, Ran Le · Feb 3, 2026 · Citations: 0

Simulation Env Long Horizon

In this technical report, we present SWE-Master, an open-source and fully reproducible post-training framework for building effective software engineering agents.
Embodied Task Planning via Graph-Informed Action Generation with Large Language Model
Xiang Li, Ning Yan, Masood Mortazavi · Jan 29, 2026 · Citations: 0

Simulation Env Long Horizon

While Large Language Models (LLMs) have demonstrated strong zero-shot reasoning capabilities, their deployment as embodied agents still faces fundamental challenges in long-horizon planning.
APEX-Agents
Bertie Vidgen, Austin Mann, Abby Fennelly, John Wright Stanly, Lucas Rothman · Jan 20, 2026 · Citations: 0

Rubric Rating Expert Verification Simulation Env Long Horizon

We introduce the AI Productivity Index for Agents (APEX-Agents), a benchmark for assessing whether AI agents can execute long-horizon, cross-application tasks created by investment banking analysts, management consultants, and corporate law
Multimodal Multi-Agent Empowered Legal Judgment Prediction
Zhaolu Kang, Junhao Gong, Qingxi Chen, Hao Zhang, Jiaxin Liu · Jan 19, 2026 · Citations: 0

Simulation Env Multi Agent

Furthermore, we build JurisMM, a large dataset with over 100,000 recent Chinese judicial records, including both text and multimodal video-text data, enabling comprehensive evaluation.
DIAL: Direct Iterative Adversarial Learning for Realistic Multi-Turn Dialogue Simulation
Ziyi Zhu, Olivier Tieleman, Caitlin A. Stamatis, Luka Smyth, Thomas D. Hull · Dec 23, 2025 · Citations: 0

Automatic Metrics Simulation Env

Realistic user simulation is crucial for training and evaluating multi-turn dialogue systems, yet creating simulators that accurately replicate human behavior remains a significant challenge.
PEFT-Factory: Unified Parameter-Efficient Fine-Tuning of Autoregressive Large Language Models
Robert Belanec, Ivan Srba, Maria Bielikova · Dec 2, 2025 · Citations: 0

Simulation Env

While its modular design supports extensibility, it natively provides a representative set of 19 PEFT methods, 27 classification and text generation datasets addressing 12 tasks, and both standard and PEFT-specific evaluation metrics.
PEFT-Bench: A Parameter-Efficient Fine-Tuning Methods Benchmark
Robert Belanec, Branislav Pecher, Ivan Srba, Maria Bielikova · Nov 26, 2025 · Citations: 0

Simulation Env

Despite the advances in PEFT methods, current evaluations remain limited (in terms of evaluated models and datasets) and difficult to reproduce.
Multimodal Peer Review Simulation with Actionable To-Do Recommendations for Community-Aware Manuscript Revisions
Mengze Hong, Di Jiang, Weiwei Zhao, Yawen Li, Yihang Wang · Nov 14, 2025 · Citations: 0

Critique Edit Simulation Env

Experimental results highlight the effectiveness of the proposed system in generating more comprehensive and useful reviews aligned with expert standards, surpassing ablated baselines and advancing transparent, human-centered scholarly assi
BEAT: Visual Backdoor Attacks on VLM-based Embodied Agents via Contrastive Trigger Learning
Qiusi Zhan, Hyeonjeong Ha, Rui Yang, Sirui Xu, Hanyang Chen · Oct 31, 2025 · Citations: 0

Pairwise Preference Automatic Metrics Simulation Env Long Horizon

Recent advances in Vision-Language Models (VLMs) have propelled embodied agents by enabling direct perception, reasoning, and planning task-oriented actions from visual inputs.
The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution
Junlong Li, Wenshuo Zhao, Jian Zhao, Weihao Zeng, Haoze Wu · Oct 29, 2025 · Citations: 0

Simulation Env Long Horizon

Real-world language agents must handle complex, multi-step workflows across diverse Apps.
PARL: Prompt-based Agents for Reinforcement Learning
Yarik Menchaca Resendiz, Roman Klinger · Oct 24, 2025 · Citations: 0

Simulation Env

However, limited work evaluates LLMs as agents in reinforcement learning (RL) tasks (e.g., playing games), where learning occurs through interaction with an environment and a reward system.
Detecting Early and Implicit Suicidal Ideation via Longitudinal and Information Environment Signals on Social Media
Soorya Ram Shimgekar, Ruining Zhao, Agam Goyal, Violeta J. Rodriguez, Paul A. Bloom · Oct 16, 2025 · Citations: 0

Simulation Env

On social media, several individuals experiencing suicidal ideation (SI) do not disclose their distress explicitly.
Assessing Web Search Credibility and Response Groundedness in Chat Assistants
Ivan Vykopal, Matúš Pikuliak, Simon Ostermann, Marián Šimko · Oct 15, 2025 · Citations: 0

Simulation Env

Chat assistants increasingly integrate web search functionality, enabling them to retrieve and cite external sources.
EconCausal: A Context-Aware Causal Reasoning Benchmark for Large Language Models in Social Science
Donggyu Lee, Hyeok Yun, Meeyoung Cha, Sungwon Park, Sangyoon Park · Oct 8, 2025 · Citations: 0

Automatic Metrics Simulation Env

To address this, we introduce EconCausal, a large-scale benchmark comprising 10,490 context-annotated causal triplets extracted from 2,595 high-quality empirical studies published in top-tier economics and finance journals.
ProPerSim: Developing Proactive and Personalized AI Assistants through User-Assistant Simulation
Jiho Kim, Junseong Choi, Woosog Chay, Daeun Kyung, Yeonsu Kwon · Sep 26, 2025 · Citations: 0

Pairwise Preference Simulation Env

In our simulation environment, a user agent with a rich persona interacts with the assistant, providing ratings on how well each suggestion aligns with its preferences and context.
Collaborative Document Editing with Multiple Users and AI Agents
Florian Lehmann, Krystsina Shauchenka, Daniel Buschek · Sep 15, 2025 · Citations: 0

Simulation Env Multi Agent

We propose integrating AI agents directly into collaborative writing environments.
Language and Experience: A Computational Model of Social Learning in Complex Tasks
Cédric Colas, Tracey Mills, Ben Prystawski, Michael Henry Tessler, Noah Goodman · Aug 26, 2025 · Citations: 0

Simulation Env

The ability to combine linguistic guidance from others with direct experience is central to human development, enabling safe and rapid learning in new environments.
Shop-R1: Rewarding LLMs to Simulate Human Behavior in Online Shopping via Reinforcement Learning
Yimeng Zhang, Tian Wang, Jiri Gesi, Ziyi Wang, Yuxuan Lu · Jul 23, 2025 · Citations: 0

Simulation Env

Large Language Models (LLMs) have recently demonstrated strong potential in generating 'believable human-like' behavior in web environments.
