HFEPX Hub

CS.MA + Automatic Metrics Papers

Updated from the current HFEPX corpus (Feb 27, 2026). This hub page groups 7 papers. Common evaluation mode: Automatic Metrics. Most common rater population: Domain Experts. Most frequent quality control: Adjudication. Most frequently cited benchmark: LiveCodeBench. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. The newest paper in this set is from Feb 26, 2026.

Papers: 7 · Last published: Feb 26, 2026

CS.MA · Automatic Metrics

Research Narrative

Grounded narrative · Model: deterministic-grounded · Source: persisted

Updated from the current HFEPX corpus (Feb 27, 2026). This page tracks 7 papers tagged CS.MA + Automatic Metrics. The dominant protocol signal is automatic metrics, with benchmark focus on LiveCodeBench and metric focus on accuracy and calibration. Use the grounded sections below to prioritize reproducible protocol choices, benchmark-matched comparisons, and judge-vs-human evaluation checks.

Why This Matters For Eval Research

14.3% of papers report explicit human-feedback signals, led by expert verification.

Evidence: Team of Thoughts: Efficient Test-time Scaling of Agentic Systems through Orchestrated Tool Calling; A Decision-Theoretic Formalisation of Steganography With Applications to LLM Monitoring; Hierarchical LLM-Based Multi-Agent Framework with Prompt Optimization for Multi-Robot Task Planning; Training Generalizable Collaborative Agents via Strategic Risk Aversion

Automatic metrics appear in 100% of papers in this hub.

Evidence: A Decision-Theoretic Formalisation of Steganography With Applications to LLM Monitoring; Hierarchical LLM-Based Multi-Agent Framework with Prompt Optimization for Multi-Robot Task Planning; Training Generalizable Collaborative Agents via Strategic Risk Aversion; A Hierarchical Multi-Agent System for Autonomous Discovery in Geoscientific Data Archives

LiveCodeBench is the benchmark anchor for cross-paper comparisons on this page, although only 1 of the 7 papers mentions it.

Evidence: Team of Thoughts: Efficient Test-time Scaling of Agentic Systems through Orchestrated Tool Calling; A Decision-Theoretic Formalisation of Steganography With Applications to LLM Monitoring; Hierarchical LLM-Based Multi-Agent Framework with Prompt Optimization for Multi-Robot Task Planning; Training Generalizable Collaborative Agents via Strategic Risk Aversion

Protocol Takeaways

The most common quality-control signal is adjudication (14.3% of papers).

Evidence: From Competition to Coordination: Market Making as a Scalable Framework for Safe and Aligned Multi-Agent LLM Systems; Team of Thoughts: Efficient Test-time Scaling of Agentic Systems through Orchestrated Tool Calling; A Decision-Theoretic Formalisation of Steganography With Applications to LLM Monitoring; Hierarchical LLM-Based Multi-Agent Framework with Prompt Optimization for Multi-Robot Task Planning

Rater context is mostly domain experts, and annotation commonly uses mixed annotation units; use this to scope replication staffing.

Evidence: Team of Thoughts: Efficient Test-time Scaling of Agentic Systems through Orchestrated Tool Calling; A Decision-Theoretic Formalisation of Steganography With Applications to LLM Monitoring; Hierarchical LLM-Based Multi-Agent Framework with Prompt Optimization for Multi-Robot Task Planning; Training Generalizable Collaborative Agents via Strategic Risk Aversion

Track metric sensitivity by reporting both accuracy and calibration.

Evidence: A Decision-Theoretic Formalisation of Steganography With Applications to LLM Monitoring; Hierarchical LLM-Based Multi-Agent Framework with Prompt Optimization for Multi-Robot Task Planning; Training Generalizable Collaborative Agents via Strategic Risk Aversion; A Hierarchical Multi-Agent System for Autonomous Discovery in Geoscientific Data Archives

Benchmark Interpretation

LiveCodeBench appears in 14.3% of hub papers (1/7); use this cohort for benchmark-matched comparisons.

Metric Interpretation

Accuracy is reported in 28.6% of hub papers (2/7); compare with a secondary metric before ranking methods.
Calibration is reported in 14.3% of hub papers (1/7); compare with a secondary metric before ranking methods.
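To illustrate why a secondary metric matters before ranking, the sketch below computes accuracy alongside expected calibration error (ECE) for two hypothetical systems. The equal-width binning, the bin count, and the toy confidences are assumptions made for illustration; none of the hub papers is claimed to compute calibration this way.

```python
from typing import Sequence


def accuracy(correct: Sequence[bool]) -> float:
    """Fraction of predictions that were correct."""
    return sum(correct) / len(correct)


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,  # assumed bin count, not taken from any paper
) -> float:
    """ECE with equal-width confidence bins over [0, 1]."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 lands in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        avg_acc = sum(1 for _, ok in bucket if ok) / len(bucket)
        # Weight each bin's confidence/accuracy gap by its share of samples.
        ece += (len(bucket) / total) * abs(avg_conf - avg_acc)
    return ece


# Two hypothetical systems with identical outcomes but different confidence.
OUTCOMES = [True, True, True, False]
SYSTEM_A_CONF = [0.9, 0.9, 0.9, 0.9]      # overconfident
SYSTEM_B_CONF = [0.75, 0.75, 0.75, 0.75]  # confidence matches accuracy

if __name__ == "__main__":
    for name, conf in (("A", SYSTEM_A_CONF), ("B", SYSTEM_B_CONF)):
        ece = expected_calibration_error(conf, OUTCOMES)
        print(f"system {name}: accuracy={accuracy(OUTCOMES):.2f} ece={ece:.2f}")
```

Both systems tie on accuracy, so an accuracy-only ranking cannot separate them; the calibration gap is what distinguishes the two.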

Researcher Checklist

Close gap on Papers with explicit human feedback. Coverage is a replication risk (14.3% vs 45% target).
Tighten coverage on Papers reporting quality controls. Coverage is usable but incomplete (28.6% vs 30% target).
Close gap on Papers naming benchmarks/datasets. Coverage is a replication risk (14.3% vs 35% target).
Maintain strength on Papers naming evaluation metrics. Coverage is strong (42.9% vs 35% target).
Close gap on Papers with known rater population. Coverage is a replication risk (14.3% vs 35% target).
Close gap on Papers with known annotation unit. Coverage is a replication risk (0% vs 35% target).
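The checklist arithmetic can be sketched as follows. The paper counts are back-computed from the percentages above (for example, 14.3% of 7 papers is 1 paper), and the `usable_ratio` cutoff of 0.8 is a hypothetical threshold chosen only because it reproduces the labels shown; the page does not state the rule it actually uses.

```python
HUB_SIZE = 7  # papers in this hub

# (checklist item, papers covered, target %). Counts are inferred from the
# percentages on the page, not read from a published table.
CHECKLIST = [
    ("explicit human feedback", 1, 45),
    ("quality controls", 2, 30),
    ("benchmarks/datasets", 1, 35),
    ("evaluation metrics", 3, 35),
    ("known rater population", 1, 35),
    ("known annotation unit", 0, 35),
]


def coverage_pct(count: int, total: int = HUB_SIZE) -> float:
    """Share of hub papers, as a percentage rounded to one decimal."""
    return round(100 * count / total, 1)


def coverage_status(coverage: float, target: float, usable_ratio: float = 0.8) -> str:
    """Label coverage against a target. usable_ratio is an assumed cutoff."""
    if coverage >= target:
        return "strong"
    if coverage >= usable_ratio * target:
        return "usable but incomplete"
    return "replication risk"


if __name__ == "__main__":
    for item, count, target in CHECKLIST:
        pct = coverage_pct(count)
        print(f"{item}: {pct}% vs {target}% target -> {coverage_status(pct, target)}")
```

With a hub of only 7 papers, one additional paper moves any coverage figure by about 14 percentage points, so these labels can flip on a single new entry.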

Known Limitations

Rater population is under-specified (14.3% coverage).
Annotation unit is under-specified (0% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Links

Benchmark Slice: LiveCodeBench - Prioritizes benchmark-specific protocol comparisons.
Metric Slice: accuracy - Finds papers where reported metrics are directly comparable.
Recent High-Signal Papers - Keeps the hub connected to the latest HFEPX corpus updates.

Benchmark Brief

LiveCodeBench

Coverage: 1 paper (14.3%) mentions LiveCodeBench.

Examples: Team of Thoughts: Efficient Test-time Scaling of Agentic Systems through Orchestrated Tool Calling

Metric Brief

accuracy

Coverage: 2 papers (28.6%) mention accuracy.

Examples: Hierarchical LLM-Based Multi-Agent Framework with Prompt Optimization for Multi-Robot Task Planning; From Competition to Coordination: Market Making as a Scalable Framework for Safe and Aligned Multi-Agent LLM Systems

Metric Brief

calibration

Coverage: 1 paper (14.3%) mentions calibration.

Examples: Team of Thoughts: Efficient Test-time Scaling of Agentic Systems through Orchestrated Tool Calling

Metric Brief

success rate

Coverage: 1 paper (14.3%) mentions success rate.

Examples: Hierarchical LLM-Based Multi-Agent Framework with Prompt Optimization for Multi-Robot Task Planning
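A benchmark-matched or metric-matched cohort can be selected with a simple tag index. The sketch below is illustrative only: the paper-to-tag mapping is transcribed from the "Examples" lines of the briefs above, the titles are shortened for readability, and the helper names (`build_index`, `cohort`, `coverage`) are hypothetical rather than part of any HFEPX tooling.

```python
from collections import defaultdict

HUB_SIZE = 7  # total papers in this hub

# Tags per paper, taken from the "Examples" lines of the briefs.
# Titles are abbreviated here; the briefs give the full titles.
MENTIONS = {
    "Team of Thoughts": {"LiveCodeBench", "calibration"},
    "Hierarchical LLM-Based Multi-Agent Framework": {"accuracy", "success rate"},
    "From Competition to Coordination": {"accuracy"},
}


def build_index(mentions):
    """Invert paper -> tags into tag -> set of papers."""
    index = defaultdict(set)
    for paper, tags in mentions.items():
        for tag in tags:
            index[tag].add(paper)
    return dict(index)


def cohort(index, tag):
    """Papers that mention the tag, sorted for stable output."""
    return sorted(index.get(tag, set()))


def coverage(index, tag, total=HUB_SIZE):
    """Percentage of hub papers that mention the tag, to one decimal."""
    return round(100 * len(index.get(tag, set())) / total, 1)


INDEX = build_index(MENTIONS)

if __name__ == "__main__":
    for tag in ("LiveCodeBench", "accuracy", "calibration", "success rate"):
        print(f"{tag}: {coverage(INDEX, tag)}% -> {cohort(INDEX, tag)}")
```

Papers that share a tag are candidates for direct comparison; a cohort of one, as with LiveCodeBench here, offers no within-hub comparison at all.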

Most Cited In This Hub

Fast path to methods with the strongest citation traction in this scope.

Papers: A Decision-Theoretic Formalisation of Steganography With Applications to LLM Monitoring; Hierarchical LLM-Based Multi-Agent Framework with Prompt Optimization for Multi-Robot Task Planning; Training Generalizable Collaborative Agents via Strategic Risk Aversion

Most Recent

Fast path to latest protocol changes and newly published evaluation setups.

Best Protocol Detail

Papers with explicit rater/unit metadata and quality-control signals for reproducibility.

Top Papers

A Decision-Theoretic Formalisation of Steganography With Applications to LLM Monitoring
Usman Anwar, Julianna Piskorz, David D. Baek, David Africa, Jim Weatherall · Feb 26, 2026 · Citations: 0

Automatic Metrics

Our central insight is that steganography creates an asymmetry in usable information between agents who can and cannot decode the hidden content (present within a steganographic signal), and this otherwise latent asymmetry can be inferred f

Hierarchical LLM-Based Multi-Agent Framework with Prompt Optimization for Multi-Robot Task Planning
Tomoya Kawabe, Rin Takano · Feb 25, 2026 · Citations: 0

Automatic Metrics · Long Horizon

We present a hierarchical multi-agent LLM-based planner with prompt optimization: an upper layer decomposes tasks and assigns them to lower-layer agents, which generate PDDL problems solved by a classical planner.

Training Generalizable Collaborative Agents via Strategic Risk Aversion
Chengrui Qu, Yizhou Zhang, Nicholas Lanzetti, Eric Mazumdar · Feb 25, 2026 · Citations: 0

Automatic Metrics · Multi Agent

Many emerging agentic paradigms require agents to collaborate with one another (or people) to achieve shared goals.

A Hierarchical Multi-Agent System for Autonomous Discovery in Geoscientific Data Archives
Dmitrii Pantiukhin, Ivan Kuznetsov, Boris Shapkin, Antonia Anna Jost, Thomas Jung · Feb 24, 2026 · Citations: 0

Automatic Metrics · Long Horizon

Here we present PANGAEA-GPT, a hierarchical multi-agent framework designed for autonomous data discovery and analysis.

Under the Influence: Quantifying Persuasion and Vigilance in Large Language Models
Sasha Robinson, Kerem Oktar, Katherine M. Collins, Ilia Sucholutsky, Kelsey R. Allen · Feb 24, 2026 · Citations: 0

Automatic Metrics

With increasing integration of Large Language Models (LLMs) into areas of high-stakes human decision-making, it is important to understand the risks they introduce as advisors.

Team of Thoughts: Efficient Test-time Scaling of Agentic Systems through Orchestrated Tool Calling
Jeffrey T. H. Wong, Zixi Zhang, Junyi Liu, Yiren Zhao · Feb 18, 2026 · Citations: 0

Expert Verification · Automatic Metrics · Multi Agent

Existing Multi-Agent Systems (MAS) typically rely on static, homogeneous model configurations, limiting their ability to exploit the distinct strengths of differently post-trained models.

From Competition to Coordination: Market Making as a Scalable Framework for Safe and Aligned Multi-Agent LLM Systems
Brendan Gho, Suman Muppavarapu, Afnan Shaik, Tyson Tsay, Atharva Mohan · Nov 18, 2025 · Citations: 0

Automatic Metrics · Multi Agent

As foundation models are increasingly deployed as interacting agents in multi-agent systems, their collective behavior raises new challenges for trustworthiness, transparency, and accountability.
