HFEPX Hub

CS.MA + General Papers

Updated from current HFEPX corpus (Apr 12, 2026). 21 papers are grouped in this hub page.

Read Full Context

Updated from current HFEPX corpus (Apr 12, 2026). 21 papers are grouped in this hub page. Common evaluation modes: Automatic Metrics, Simulation Env. Most common rater population: Domain Experts. Common annotation unit: Trajectory. Frequent quality control: Adjudication. Frequently cited benchmark: ALFWorld. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Mar 19, 2026.

Papers: 21 Last published: Mar 19, 2026 Global RSS Tag RSS

Cs.MAGeneral

Researcher Quick Triage

This hub is best used for protocol triage and replication planning from abstract-level evidence. Quality band: Developing .

All Sampled Papers (21) Replication-Ready Only (2)

High-Signal Coverage

100.0%

21 / 21 sampled papers are not low-signal flagged.

Replication-Ready Set

Benchmark + metric + eval mode explicitly present.

Judge/Human Comparability

Papers containing both `human_eval` and `llm_as_judge`.

2 papers are replication-ready (benchmark + metric + explicit evaluation mode).
0 papers support judge-vs-human agreement analysis.
2 papers report explicit quality controls (calibration/adjudication/IAA).

Primary action: Use this page for scouting only; collect additional papers before attempting replication-critical comparisons.

Need evaluators for this research workflow?

Post a Job →

Why This Matters For Eval Research

23.8% of papers report explicit human-feedback signals, led by pairwise preferences.
automatic metrics appears in 52.4% of papers in this hub.
ALFWorld is a recurring benchmark anchor for cross-paper comparisons in this page.

Protocol Takeaways

Most common quality-control signal is adjudication (9.5% of papers).
Rater context is mostly domain experts, and annotation is commonly trajectory-level annotation; use this to scope replication staffing.
Pair this hub with llm_as_judge pages to benchmark automated-vs-human evaluation tradeoffs.

Benchmark Interpretation

ALFWorld appears in 4.8% of hub papers (1/21); use this cohort for benchmark-matched comparisons.
Furina-Bench appears in 4.8% of hub papers (1/21); use this cohort for benchmark-matched comparisons.

Metric Interpretation

accuracy is reported in 28.6% of hub papers (6/21); compare with a secondary metric before ranking methods.
precision is reported in 14.3% of hub papers (3/21); compare with a secondary metric before ranking methods.

Researcher Checklist (Expanded)

Researcher Checklist

Gap: Papers with explicit human feedback

Coverage is a replication risk (23.8% vs 45% target).
Gap: Papers reporting quality controls

Coverage is a replication risk (9.5% vs 30% target).
Gap: Papers naming benchmarks/datasets

Coverage is a replication risk (14.3% vs 35% target).
Strong: Papers naming evaluation metrics

Coverage is strong (57.1% vs 35% target).
Gap: Papers with known rater population

Coverage is a replication risk (9.5% vs 35% target).
Gap: Papers with known annotation unit

Coverage is a replication risk (19% vs 35% target).

Strengths

Agentic evaluation appears in 95.2% of papers.

Known Gaps

Only 9.5% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (9.5% coverage).
Annotation unit is under-specified (19% coverage).

Suggested Next Analyses

Pair this hub with llm_as_judge pages to benchmark automated-vs-human evaluation tradeoffs.
Stratify by benchmark (ALFWorld vs Furina-Bench) before comparing methods.
Track metric sensitivity by reporting both accuracy and precision.
Add inter-annotator agreement checks when reproducing these protocols.

Recommended Queries (Expanded)

Recommended Queries

Human Eval Protocols Benchmark Slice: ALFWorld Metric Slice: accuracy Recent High-Signal Papers

Start with These 3

Use these when you need one protocol anchor, one benchmark anchor, and one recent comparison point before reading the wider hub.

Strongest protocol reference

ReDAct: Uncertainty-Aware Deferral for LLM Agents

Highest protocol score with explicit human/eval signal plus ALFWorld.

Strongest benchmark reference

PASK: Toward Intent-Aware Proactive Agents with Long-Term Memory

Latentneeds-Bench with precision gives a fast comparison anchor.

Strongest recent paper

The Geometry of Dialogue: Graphing Language Models to Reveal Synergis…

Useful for current practice scanning; published Oct 30, 2025.

Start Here (Best First 6)

Ranked for protocol completeness (human signal, benchmark + metric anchors, quality controls, and judge/human overlap).

ReDAct: Uncertainty-Aware Deferral for LLM Agents
Apr 8, 2026 · Citations: 0 · Score: 6.0

HF: Not reported · Eval: Simulation Env · Benchmark: ALFWorld · Metric: Cost
PASK: Toward Intent-Aware Proactive Agents with Long-Term Memory
Apr 9, 2026 · Citations: 0 · Score: 6.0

HF: Not reported · Eval: Automatic Metrics · Benchmark: Latentneeds Bench · Metric: Precision
The Geometry of Dialogue: Graphing Language Models to Reveal Synergistic Teams for Multi-Agent Collaboration
Oct 30, 2025 · Citations: 0 · Score: 5.0

HF: Pairwise Preference · Eval: Automatic Metrics · Benchmark: Not Reported · Metric: Accuracy
I Can't Believe It's Corrupt: Evaluating Corruption in Multi-Agent Governance Systems
Mar 19, 2026 · Citations: 0 · Score: 4.5

HF: Rubric Rating · Eval: Simulation Env · Benchmark: Not Reported · Metric: Not Reported
From Competition to Coordination: Market Making as a Scalable Framework for Safe and Aligned Multi-Agent LLM Systems
Nov 18, 2025 · Citations: 0 · Score: 4.5

HF: Not reported · Eval: Automatic Metrics · Benchmark: Not Reported · Metric: Accuracy
Social Dynamics as Critical Vulnerabilities that Undermine Objective Decision-Making in LLM Collectives
Apr 7, 2026 · Citations: 0 · Score: 4.0

HF: Not reported · Eval: Automatic Metrics, Simulation Env · Benchmark: Not Reported · Metric: Accuracy

Protocol Matrix (Top 12)

Use this to quickly compare protocol ingredients instead of scanning long prose.

Paper	HF Signal	Eval Modes	Benchmarks	Metrics	QC
ReDAct: Uncertainty-Aware Deferral for LLM Agents Apr 8, 2026	No Not Reported	Simulation Env	ALFWorld	Cost , Token cost	Not Reported
PASK: Toward Intent-Aware Proactive Agents with Long-Term Memory Apr 9, 2026	No Not Reported	Automatic Metrics	Latentneeds Bench	Precision , Latency	Not Reported
The Geometry of Dialogue: Graphing Language Models to Reveal Synergistic Teams for Multi-Agent Collaboration Oct 30, 2025	Yes Pairwise Preference	Automatic Metrics	Not Reported	Accuracy , Coherence	Not Reported
I Can't Believe It's Corrupt: Evaluating Corruption in Multi-Agent Governance Systems Mar 19, 2026	Yes Rubric Rating	Simulation Env	Not Reported	Not Reported	Not Reported
From Competition to Coordination: Market Making as a Scalable Framework for Safe and Aligned Multi-Agent LLM Systems Nov 18, 2025	No Not Reported	Automatic Metrics	Not Reported	Accuracy	Adjudication
Social Dynamics as Critical Vulnerabilities that Undermine Objective Decision-Making in LLM Collectives Apr 7, 2026	No Not Reported	Automatic Metrics , Simulation Env	Not Reported	Accuracy	Not Reported
Exploring Plan Space through Conversation: An Agentic Framework for LLM-Mediated Explanations in Planning Mar 2, 2026	Yes Pairwise Preference	Not Reported	Not Reported	Not Reported	Not Reported
SEVADE: Self-Evolving Multi-Agent Analysis with Decoupled Evaluation for Hallucination-Resistant Irony Detection Aug 9, 2025	No Not Reported	Automatic Metrics	Not Reported	Accuracy , F1	Adjudication
AgentSwing: Adaptive Parallel Context Management Routing for Long-Horizon Web Agents Mar 29, 2026	No Not Reported	Automatic Metrics	Not Reported	Precision	Not Reported
Governed Memory: A Production Architecture for Multi-Agent Workflows Mar 18, 2026	No Not Reported	Automatic Metrics	Not Reported	Accuracy , Precision	Not Reported
COMIC: Agentic Sketch Comedy Generation Mar 11, 2026	Yes Pairwise Preference	Not Reported	Not Reported	Not Reported	Not Reported
MA-SAPO: Multi-Agent Reasoning for Score-Aware Prompt Optimization Oct 18, 2025	Yes Critique Edit	Not Reported	Not Reported	Not Reported	Not Reported

Protocol Diff (Top Papers)

Fast side-by-side comparison for the highest-ranked papers in this hub.

Signal	ReDAct: Uncertainty-Aware Deferral for LLM Agents	PASK: Toward Intent-Aware Proactive Agents with Lon…	The Geometry of Dialogue: Graphing Language Models…
Human Feedback	Not reported	Not reported	Pairwise Preference
Evaluation Modes	Simulation Env	Automatic Metrics	Automatic Metrics
Benchmarks	ALFWorld	Latentneeds Bench	Not reported
Metrics	Cost, Token cost	Precision, Latency	Accuracy, Coherence
Quality Controls	Not reported	Not reported	Not reported
Rater Population	Unknown	Unknown	Unknown
Annotation Unit	Trajectory	Unknown	Pairwise

Research Utility Snapshot

Human Feedback Mix

Pairwise Preference (3)
Critique Edit (1)
Rubric Rating (1)

Evaluation Modes

Automatic Metrics (11)
Simulation Env (7)
Human Eval (1)

Top Benchmarks

ALFWorld (1)
Furina Bench (1)
Latentneeds Bench (1)

Top Metrics

Accuracy (6)
Precision (3)
Coherence (2)
Cost (2)

Rater Population Mix

Domain Experts (2)

Quality Controls

Adjudication (2)

Coverage diagnostics (sample-based): human-feedback 23.8% · benchmarks 14.3% · metrics 57.1% · quality controls 9.5%.

Top Papers

I Can't Believe It's Corrupt: Evaluating Corruption in Multi-Agent Governance Systems
Vedanta S P, Ponnurangam Kumaraguru · Mar 19, 2026 · Citations: 0

Rubric Rating Simulation Env Multi Agent

Large language models are increasingly proposed as autonomous agents for high-stakes public workflows, yet we lack systematic evidence about whether they would follow institutional rules when granted authority.
The Geometry of Dialogue: Graphing Language Models to Reveal Synergistic Teams for Multi-Agent Collaboration
Kotaro Furuya, Yuichi Kitagawa · Oct 30, 2025 · Citations: 0

Pairwise Preference Automatic Metrics Multi Agent

While a multi-agent approach based on large language models (LLMs) represents a promising strategy to surpass the capabilities of single models, its success is critically dependent on synergistic team composition.
ReDAct: Uncertainty-Aware Deferral for LLM Agents
Dzianis Piatrashyn, Nikita Kotelevskii, Kirill Grishchenkov, Nikita Glazkov, Ivan Nasonov · Apr 8, 2026 · Citations: 0

Simulation Env Long Horizon

Recently, LLM-based agents have become increasingly popular across many applications, including complex sequential decision-making problems.
Social Dynamics as Critical Vulnerabilities that Undermine Objective Decision-Making in LLM Collectives
Changgeon Ko, Jisu Shin, Hoyun Song, Huije Lee, Eui Jun Hwang · Apr 7, 2026 · Citations: 0

Automatic MetricsSimulation Env Multi Agent

Large language model (LLM) agents are increasingly acting as human delegates in multi-agent environments, where a representative agent integrates diverse peer perspectives to make a final decision.
Exploring Plan Space through Conversation: An Agentic Framework for LLM-Mediated Explanations in Planning
Guilhem Fouilhé, Rebecca Eifler, Antonin Poché, Sylvie Thiébaux, Nicholas Asher · Mar 2, 2026 · Citations: 0

Pairwise Preference Multi Agent

When automating plan generation for a real-world sequential decision problem, the goal is often not to replace the human planner, but to facilitate an iterative reasoning and elicitation process, where the human's role is to guide the AI…
FURINA: A Fully Customizable Role-Playing Benchmark via Scalable Multi-Agent Collaboration Pipeline
Haotian Wu, Shufan Jiang, Chios Chen, Yiyang Feng, Hehai Lin · Oct 8, 2025 · Citations: 0

Human Eval Multi Agent

As large language models (LLMs) advance in role-playing (RP) tasks, existing benchmarks quickly become obsolete due to their narrow scope, outdated interaction paradigms, and limited adaptability across diverse application scenarios.
From Competition to Coordination: Market Making as a Scalable Framework for Safe and Aligned Multi-Agent LLM Systems
Brendan Gho, Suman Muppavarapu, Afnan Shaik, Tyson Tsay, Atharva Mohan · Nov 18, 2025 · Citations: 0

Automatic Metrics Multi Agent

As foundation models are increasingly deployed as interacting agents in multi-agent systems, their collective behavior raises new challenges for trustworthiness, transparency, and accountability.
SEVADE: Self-Evolving Multi-Agent Analysis with Decoupled Evaluation for Hallucination-Resistant Irony Detection
Ziqi Liu, Ziyang Zhou, Yilin Li, Mingxuan Hu, Yushan Pan · Aug 9, 2025 · Citations: 0

Automatic Metrics Multi Agent

To address these challenges, we propose **SEVADE**, a novel **S**elf-**Ev**olving multi-agent **A**nalysis framework with **D**ecoupled **E**valuation for hallucination-resistant sarcasm detection.
StoryBox: Collaborative Multi-Agent Simulation for Hybrid Bottom-Up Long-Form Story Generation Using Large Language Models
Zehao Chen, Rong Pan, Haoran Li · Oct 13, 2025 · Citations: 0

Simulation Env Multi Agent

Human writers often begin their stories with an overarching mental scene, where they envision the interactions between characters and their environment.
MA-SAPO: Multi-Agent Reasoning for Score-Aware Prompt Optimization
Wonduk Seo, Juhyeon Lee, Junseo Koh, Wonseok Choi, Hyunjin An · Oct 18, 2025 · Citations: 0

Critique Edit Multi Agent

However, most existing frameworks treat evaluation as a black box, relying solely on outcome scores without explaining why prompts succeed or fail.
PASK: Toward Intent-Aware Proactive Agents with Long-Term Memory
Zhifei Xie, Zongzheng Hu, Fangda Ye, Xin Zhang, Haobo Chai · Apr 9, 2026 · Citations: 0

Automatic Metrics Long Horizon

Prior work remains largely confined to laboratory settings, leaving a clear gap in real-world proactive agent: depth, complexity, ambiguity, precision and real-time constraints.
AgentSwing: Adaptive Parallel Context Management Routing for Long-Horizon Web Agents
Zhaopeng Feng, Liangcai Su, Zhen Zhang, Xinyu Wang, Xiaotian Zhang · Mar 29, 2026 · Citations: 0

Automatic Metrics Long Horizon

As large language models (LLMs) evolve into autonomous agents for long-horizon information-seeking, managing finite context capacity has become a critical bottleneck.
Heterogeneous Debate Engine: Identity-Grounded Cognitive Architecture for Resilient LLM-Based Ethical Tutoring
Jakub Masłowski, Jarosław A. Chudziak · Mar 28, 2026 · Citations: 0

Simulation Env Multi Agent

Large Language Models (LLMs) are being increasingly used as autonomous agents in complex reasoning tasks, opening the niche for dialectical interactions.
Influencing LLM Multi-Agent Dialogue via Policy-Parameterized Prompts
Hongbo Bo, Jingyu Hu, Weiru Liu · Mar 10, 2026 · Citations: 0

Simulation Env Multi Agent

Large Language Models (LLMs) have emerged as a new paradigm for multi-agent systems.
Verifiable Semantics for Agent-to-Agent Communication
Philipp Schoenegger, Matt Carlson, Chris Schneider, Chris Daly · Feb 18, 2026 · Citations: 0

Simulation Env Multi Agent

Multiagent AI systems require consistent communication, but we lack methods to verify that agents share the same understanding of the terms used.
Governed Memory: A Production Architecture for Multi-Agent Workflows
Hamed Taheri · Mar 18, 2026 · Citations: 0

Automatic Metrics Long Horizon

Enterprise AI deploys dozens of autonomous agent nodes across workflows, each acting on the same entities with no shared memory and no common governance.
From Debate to Deliberation: Structured Collective Reasoning with Typed Epistemic Acts
Sunil Prakash · Mar 12, 2026 · Citations: 0

Automatic Metrics Multi Agent

Multi-agent LLM systems increasingly tackle complex reasoning, yet their interaction patterns remain limited to voting, unstructured debate, or pipeline orchestration.
Hierarchical LLM-Based Multi-Agent Framework with Prompt Optimization for Multi-Robot Task Planning
Tomoya Kawabe, Rin Takano · Feb 25, 2026 · Citations: 0

Automatic Metrics Long Horizon

We present a hierarchical multi-agent LLM-based planner with prompt optimization: an upper layer decomposes tasks and assigns them to lower-layer agents, which generate PDDL problems solved by a classical planner.
WideSeek-R1: Exploring Width Scaling for Broad Information Seeking via Multi-Agent Reinforcement Learning
Zelai Xu, Zhexuan Xu, Ruize Zhang, Chunyang Zhu, Shi Yu · Feb 4, 2026 · Citations: 0

Automatic Metrics Tool Use

To bridge this gap, we propose WideSeek-R1, a lead-agent-subagent framework trained via multi-agent reinforcement learning (MARL) to synergize scalable orchestration and parallel execution.
COMIC: Agentic Sketch Comedy Generation
Susung Hong, Brian Curless, Ira Kemelmacher-Shlizerman, Steve Seitz · Mar 11, 2026 · Citations: 0

Pairwise Preference

Starting with character references, the system employs a population of agents loosely based on real production studio roles, structured to optimize the quality and diversity of ideas and outputs through iterative competition, evaluation,…
Training Generalizable Collaborative Agents via Strategic Risk Aversion
Chengrui Qu, Yizhou Zhang, Nicholas Lanzetti, Eric Mazumdar · Feb 25, 2026 · Citations: 0

Automatic Metrics Multi Agent

Many emerging agentic paradigms require agents to collaborate with one another (or people) to achieve shared goals.

Related Hubs

Get Started

Join the #1 Platform for AI Training Talent

Where top AI builders and expert AI Trainers connect to build the future of AI.

Self-Service

Post a Job

Post your project and get a shortlist of qualified AI Trainers and Data Labelers. Hire and manage your team in the tools you already use.

Create Account & Post a Job

Managed Service

For Large Projects

Done-for-You

We recruit, onboard, and manage a dedicated team inside your tools. End-to-end operations for large or complex projects.

Learn About Managed Service

For Freelancers

Join as an AI Trainer

Find AI training and data labeling projects across platforms, all in one place. One profile, one application process, more opportunities.

Join Now