HFEPX Hub

CS.MA Human Feedback And Eval Papers

Updated from current HFEPX corpus (Feb 27, 2026). 11 papers are grouped in this hub page. Common evaluation modes: Automatic Metrics, Simulation Env. Most common rater population: Domain Experts. Common annotation unit: Freeform. Frequent quality control: Adjudication. Frequently cited benchmark: Lawbench. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Feb 26, 2026.

Papers: 11 · Last published: Feb 26, 2026

Research Narrative

Grounded narrative · Model: deterministic-grounded · Source: persisted

Updated from current HFEPX corpus (Feb 27, 2026). This page tracks 11 papers for CS.MA Human Feedback And Eval Papers. The dominant protocol signals are automatic metrics and simulation environments. The benchmarks named are Lawbench and LiveCodeBench, and the most common metrics are accuracy and calibration. Use the grounded sections below to prioritize reproducible protocol choices, benchmark-matched comparisons, and judge-vs-human evaluation checks.

Why This Matters For Eval Research

18.2% of papers report explicit human-feedback signals, led by expert verification.

Evidence: Team of Thoughts: Efficient Test-time Scaling of Agentic Systems through Orchestrated Tool Calling; A Decision-Theoretic Formalisation of Steganography With Applications to LLM Monitoring; Hierarchical LLM-Based Multi-Agent Framework with Prompt Optimization for Multi-Robot Task Planning; Training Generalizable Collaborative Agents via Strategic Risk Aversion

Automatic metrics appear in 63.6% of papers in this hub.

Evidence: A Decision-Theoretic Formalisation of Steganography With Applications to LLM Monitoring; Hierarchical LLM-Based Multi-Agent Framework with Prompt Optimization for Multi-Robot Task Planning; Training Generalizable Collaborative Agents via Strategic Risk Aversion; A Hierarchical Multi-Agent System for Autonomous Discovery in Geoscientific Data Archives

Lawbench is the benchmark anchor for cross-paper comparisons on this page, although only 1 of 11 papers (9.1%) mentions it.

Evidence: Multimodal Multi-Agent Empowered Legal Judgment Prediction; A Decision-Theoretic Formalisation of Steganography With Applications to LLM Monitoring; Hierarchical LLM-Based Multi-Agent Framework with Prompt Optimization for Multi-Robot Task Planning; Training Generalizable Collaborative Agents via Strategic Risk Aversion

Protocol Takeaways

The most common quality-control signal is adjudication (9.1% of papers).

Evidence: From Competition to Coordination: Market Making as a Scalable Framework for Safe and Aligned Multi-Agent LLM Systems; Team of Thoughts: Efficient Test-time Scaling of Agentic Systems through Orchestrated Tool Calling; A Decision-Theoretic Formalisation of Steganography With Applications to LLM Monitoring; Hierarchical LLM-Based Multi-Agent Framework with Prompt Optimization for Multi-Robot Task Planning

Raters are mostly domain experts, and annotation is commonly freeform; use this to scope replication staffing.

Evidence: Assessing Risks of Large Language Models in Mental Health Support: A Framework for Automated Clinical AI Red Teaming; Team of Thoughts: Efficient Test-time Scaling of Agentic Systems through Orchestrated Tool Calling; A Decision-Theoretic Formalisation of Steganography With Applications to LLM Monitoring; Hierarchical LLM-Based Multi-Agent Framework with Prompt Optimization for Multi-Robot Task Planning

Stratify by benchmark (Lawbench vs LiveCodeBench) before comparing methods.

Evidence: A Decision-Theoretic Formalisation of Steganography With Applications to LLM Monitoring; Hierarchical LLM-Based Multi-Agent Framework with Prompt Optimization for Multi-Robot Task Planning; Training Generalizable Collaborative Agents via Strategic Risk Aversion; A Hierarchical Multi-Agent System for Autonomous Discovery in Geoscientific Data Archives

Benchmark Interpretation

Lawbench appears in 9.1% of hub papers (1/11); use this cohort for benchmark-matched comparisons.
LiveCodeBench appears in 9.1% of hub papers (1/11); use this cohort for benchmark-matched comparisons.

Metric Interpretation

Accuracy is reported in 18.2% of hub papers (2/11); compare with a secondary metric before ranking methods.
Calibration is reported in 9.1% of hub papers (1/11); compare with a secondary metric before ranking methods.

Researcher Checklist

Close the gap on papers with explicit human feedback: coverage is a replication risk (18.2% vs 45% target).
Tighten coverage of papers reporting quality controls: coverage is usable but incomplete (18.2% vs 30% target).
Close the gap on papers naming benchmarks/datasets: coverage is a replication risk (18.2% vs 35% target).
Tighten coverage of papers naming evaluation metrics: coverage is usable but incomplete (27.3% vs 35% target).
Close the gap on papers with a known rater population: coverage is a replication risk (18.2% vs 35% target).
Close the gap on papers with a known annotation unit: coverage is a replication risk (9.1% vs 35% target).
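
The checklist percentages above are paper counts over the 11 papers in the hub, compared against a target. The page does not say how it chooses between "replication risk" and "usable but incomplete", so the following is only a sketch under stated assumptions: the paper counts are back-computed from the percentages, and the cutoff of 0.6 (coverage as a fraction of target) and the "meets target" label are hypothetical. That cutoff happens to reproduce all six labels listed above, but other rules could as well.

```python
HUB_SIZE = 11        # papers in this hub (from the page)
RISK_CUTOFF = 0.6    # ASSUMPTION: hypothetical cutoff, not stated by the page


def coverage_label(papers: int, target_pct: float, total: int = HUB_SIZE):
    """Return (coverage %, label) for one checklist item."""
    coverage_pct = 100 * papers / total
    ratio = coverage_pct / target_pct  # share of the target that is covered
    if ratio >= 1:
        label = "meets target"  # hypothetical label, never used on the page
    elif ratio >= RISK_CUTOFF:
        label = "usable but incomplete"
    else:
        label = "replication risk"
    return round(coverage_pct, 1), label


# (paper count, target %) per item; counts inferred from the listed percentages.
checklist = {
    "explicit human feedback": (2, 45),
    "quality controls": (2, 30),
    "benchmarks/datasets": (2, 35),
    "evaluation metrics": (3, 35),
    "known rater population": (2, 35),
    "known annotation unit": (1, 35),
}

results = {name: coverage_label(p, t) for name, (p, t) in checklist.items()}
for name, (pct, label) in results.items():
    print(f"{name}: {pct}% -> {label}")
```

Note that the same 18.2% coverage gets different labels depending on the target (30% vs 35% or 45%), which is why the target should be read alongside the label.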

Known Limitations

Only 18.2% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (18.2% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Links

Benchmark Slice: Lawbench - Prioritizes benchmark-specific protocol comparisons.
Metric Slice: accuracy - Finds papers where reported metrics are directly comparable.
Recent High-Signal Papers - Keeps the hub connected to the latest HFEPX corpus updates.

Automatic Metrics vs Simulation Env

7 papers use only Automatic Metrics and 4 use only Simulation Env.

0 papers use both Automatic Metrics and Simulation Env.
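
The overlap counts above can be reproduced with plain set operations on the evaluation-mode tags shown under Top Papers. This is a minimal sketch: the set members are shorthand labels invented here for the eleven paper titles, and the grouping follows the tags as displayed on this page.

```python
# Shorthand labels (hypothetical) for papers tagged "Automatic Metrics".
automatic_metrics = {
    "steganography-monitoring",
    "multi-robot-planning",
    "strategic-risk-aversion",
    "pangaea-gpt",
    "persuasion-vigilance",
    "team-of-thoughts",
    "market-making",
}

# Shorthand labels (hypothetical) for papers tagged "Simulation Env".
simulation_env = {
    "mental-health-red-teaming",
    "colosseum",
    "legal-judgment-prediction",
    "ddmac-ctde",
}

overlap = {
    "both": len(automatic_metrics & simulation_env),        # tagged with both modes
    "left_only": len(automatic_metrics - simulation_env),   # Automatic Metrics only
    "right_only": len(simulation_env - automatic_metrics),  # Simulation Env only
}
print(overlap)  # {'both': 0, 'left_only': 7, 'right_only': 4}
```

Because the two sets are disjoint and together cover all 11 papers, method comparisons across the two modes never rest on a shared paper.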

Benchmark Brief

Lawbench

Coverage: 1 paper (9.1%) mentions Lawbench.

Examples: Multimodal Multi-Agent Empowered Legal Judgment Prediction

Benchmark Brief

LiveCodeBench

Coverage: 1 paper (9.1%) mentions LiveCodeBench.

Examples: Team of Thoughts: Efficient Test-time Scaling of Agentic Systems through Orchestrated Tool Calling

Metric Brief

accuracy

Coverage: 2 papers (18.2%) mention accuracy.

Examples: Hierarchical LLM-Based Multi-Agent Framework with Prompt Optimization for Multi-Robot Task Planning; From Competition to Coordination: Market Making as a Scalable Framework for Safe and Aligned Multi-Agent LLM Systems

Metric Brief

calibration

Coverage: 1 paper (9.1%) mentions calibration.

Examples: Team of Thoughts: Efficient Test-time Scaling of Agentic Systems through Orchestrated Tool Calling

Metric Brief

success rate

Coverage: 1 paper (9.1%) mentions success rate.

Examples: Hierarchical LLM-Based Multi-Agent Framework with Prompt Optimization for Multi-Robot Task Planning
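
The per-metric coverage figures in the Metric Briefs count mentions, so they do not simply add up to the share of papers that name any metric. A small sketch of the arithmetic, assuming the example papers listed in the briefs are the complete set for each metric (the dictionary keys are shorthand labels invented here for the titles):

```python
from collections import Counter

HUB_SIZE = 11  # papers in this hub (from the page)

# Metrics per paper, taken from the Metric Brief examples.
# Keys are hypothetical shorthand for the paper titles.
metrics_by_paper = {
    "multi-robot-planning": {"accuracy", "success rate"},
    "market-making": {"accuracy"},
    "team-of-thoughts": {"calibration"},
}


def pct(count: int, total: int = HUB_SIZE) -> float:
    """Coverage as a percentage of the hub, rounded to one decimal."""
    return round(100 * count / total, 1)


# Count how many papers mention each metric.
per_metric = Counter(m for metrics in metrics_by_paper.values() for m in metrics)
coverage = {metric: (n, pct(n)) for metric, n in per_metric.items()}

# Papers naming at least one metric: count papers, not mentions.
papers_naming_any = len(metrics_by_paper)

for metric, (n, share) in sorted(coverage.items()):
    print(f"{metric}: {n} paper(s), {share}%")
print(f"any metric: {papers_naming_any} paper(s), {pct(papers_naming_any)}%")
```

Under that assumption the four mentions come from three distinct papers, which gives 27.3% for papers naming any evaluation metric.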

Most Cited In This Hub

Fast path to methods with the strongest citation traction in this scope.

Papers: A Decision-Theoretic Formalisation of Steganography With Applications to LLM Monitoring; Hierarchical LLM-Based Multi-Agent Framework with Prompt Optimization for Multi-Robot Task Planning; Training Generalizable Collaborative Agents via Strategic Risk Aversion

Top Papers

A Decision-Theoretic Formalisation of Steganography With Applications to LLM Monitoring
Usman Anwar, Julianna Piskorz, David D. Baek, David Africa, Jim Weatherall · Feb 26, 2026 · Citations: 0
Tags: Automatic Metrics
Our central insight is that steganography creates an asymmetry in usable information between agents who can and cannot decode the hidden content (present within a steganographic signal), and this otherwise latent asymmetry can be inferred f

Hierarchical LLM-Based Multi-Agent Framework with Prompt Optimization for Multi-Robot Task Planning
Tomoya Kawabe, Rin Takano · Feb 25, 2026 · Citations: 0
Tags: Automatic Metrics · Long Horizon
We present a hierarchical multi-agent LLM-based planner with prompt optimization: an upper layer decomposes tasks and assigns them to lower-layer agents, which generate PDDL problems solved by a classical planner.

Training Generalizable Collaborative Agents via Strategic Risk Aversion
Chengrui Qu, Yizhou Zhang, Nicholas Lanzetti, Eric Mazumdar · Feb 25, 2026 · Citations: 0
Tags: Automatic Metrics · Multi Agent
Many emerging agentic paradigms require agents to collaborate with one another (or people) to achieve shared goals.

A Hierarchical Multi-Agent System for Autonomous Discovery in Geoscientific Data Archives
Dmitrii Pantiukhin, Ivan Kuznetsov, Boris Shapkin, Antonia Anna Jost, Thomas Jung · Feb 24, 2026 · Citations: 0
Tags: Automatic Metrics · Long Horizon
Here we present PANGAEA-GPT, a hierarchical multi-agent framework designed for autonomous data discovery and analysis.

Under the Influence: Quantifying Persuasion and Vigilance in Large Language Models
Sasha Robinson, Kerem Oktar, Katherine M. Collins, Ilia Sucholutsky, Kelsey R. Allen · Feb 24, 2026 · Citations: 0
Tags: Automatic Metrics
With increasing integration of Large Language Models (LLMs) into areas of high-stakes human decision-making, it is important to understand the risks they introduce as advisors.

Assessing Risks of Large Language Models in Mental Health Support: A Framework for Automated Clinical AI Red Teaming
Ian Steenstra, Paola Pedrelli, Weiyan Shi, Stacy Marsella, Timothy W. Bickmore · Feb 23, 2026 · Citations: 0
Tags: Red Team · Simulation Env
Large Language Models (LLMs) are increasingly utilized for mental health support; however, current safety benchmarks often fail to detect the complex, longitudinal risks inherent in therapeutic dialogue.

Team of Thoughts: Efficient Test-time Scaling of Agentic Systems through Orchestrated Tool Calling
Jeffrey T. H. Wong, Zixi Zhang, Junyi Liu, Yiren Zhao · Feb 18, 2026 · Citations: 0
Tags: Expert Verification · Automatic Metrics · Multi Agent
Existing Multi-Agent Systems (MAS) typically rely on static, homogeneous model configurations, limiting their ability to exploit the distinct strengths of differently post-trained models.

Colosseum: Auditing Collusion in Cooperative Multi-Agent Systems
Mason Nakamura, Abhinav Kumar, Saswat Das, Sahar Abdelnabi, Saaduddin Mahmud · Feb 16, 2026 · Citations: 0
Tags: Simulation Env · Multi Agent
Multi-agent systems, where LLM agents communicate through free-form language, enable sophisticated coordination for solving complex cooperative tasks.

Multimodal Multi-Agent Empowered Legal Judgment Prediction
Zhaolu Kang, Junhao Gong, Qingxi Chen, Hao Zhang, Jiaxin Liu · Jan 19, 2026 · Citations: 0
Tags: Simulation Env · Multi Agent
Furthermore, we build JurisMM, a large dataset with over 100,000 recent Chinese judicial records, including both text and multimodal video-text data, enabling comprehensive evaluation.

From Competition to Coordination: Market Making as a Scalable Framework for Safe and Aligned Multi-Agent LLM Systems
Brendan Gho, Suman Muppavarapu, Afnan Shaik, Tyson Tsay, Atharva Mohan · Nov 18, 2025 · Citations: 0
Tags: Automatic Metrics · Multi Agent
As foundation models are increasingly deployed as interacting agents in multi-agent systems, their collective behavior raises new challenges for trustworthiness, transparency, and accountability.

Multi-agent deep reinforcement learning with centralized training and decentralized execution for transportation infrastructure management
M. Saifullah, K. G. Papakonstantinou, A. Bhattacharya, S. M. Stoffels, C. P. Andriotis · Jan 23, 2024 · Citations: 0
Tags: Simulation Env · Multi Agent
To tackle the high dimensionality of state and action spaces, we propose DDMAC-CTDE, a Deep Decentralized Multi-Agent Actor-Critic (DDMAC) reinforcement learning architecture with Centralized Training and Decentralized Execution (CTDE).
