HFEPX Hub

Multi Agent + General (Last 90 Days)

Updated from current HFEPX corpus (Apr 17, 2026). 44 papers are grouped in this hub page.

Common evaluation modes: Automatic Metrics, Simulation Env. Most common rater population: Domain Experts. Common annotation unit: Trajectory. Frequent quality control: Calibration. Frequently cited benchmark: AlpacaEval. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. The newest paper in this set is from Mar 19, 2026.

Papers: 44 · Last published: Mar 19, 2026

Multi Agent · General · Last 90d

Researcher Quick Triage

This hub is best used for protocol triage and replication planning from abstract-level evidence. Quality band: Developing.

All Sampled Papers (44) · Replication-Ready Only (4)

High-Signal Coverage: 100.0% (44 / 44 sampled papers are not low-signal flagged).
Replication-Ready Set: benchmark + metric + eval mode explicitly present.
Judge/Human Comparability: papers containing both `human_eval` and `llm_as_judge`.

4 papers are replication-ready (benchmark + metric + explicit evaluation mode).
0 papers support judge-vs-human agreement analysis.
1 paper reports explicit quality controls (calibration/adjudication/IAA).

Primary action: Start with the top 2 papers in “Start Here”, then validate assumptions in the protocol matrix.

Why This Matters For Eval Research

27.3% of papers report explicit human-feedback signals, led by pairwise preferences.
Automatic metrics appear in 52.3% of papers in this hub.
AlpacaEval is a recurring benchmark anchor for cross-paper comparisons on this page.

Protocol Takeaways

Most common quality-control signal is rater calibration (2.3% of papers).
Raters are mostly domain experts, and annotation is commonly done at the trajectory level; use this to scope replication staffing.
Pair this hub with a human_eval-heavy hub to validate judge-model calibration.

Benchmark Interpretation

AlpacaEval appears in 2.3% of hub papers (1/44); use this cohort for benchmark-matched comparisons.
AlpacaEval 2.0 appears in 2.3% of hub papers (1/44); use this cohort for benchmark-matched comparisons.

Metric Interpretation

Accuracy is reported in 29.5% of hub papers (13/44); compare with a secondary metric before ranking methods.
Cost is reported in 9.1% of hub papers (4/44); compare with a secondary metric before ranking methods.
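
The percentages quoted on this page follow from paper counts over the 44-paper sample. A minimal sketch of that arithmetic, assuming plain one-decimal rounding (the hub's actual rounding rule is not stated); the helper name `share` is hypothetical:

```python
HUB_SIZE = 44  # papers grouped in this hub


def share(count: int, total: int = HUB_SIZE) -> float:
    """Return count/total as a percentage rounded to one decimal place."""
    return round(100 * count / total, 1)


# Counts taken from this page's metric and benchmark summaries.
metric_counts = {"accuracy": 13, "cost": 4}
benchmark_counts = {"AlpacaEval": 1, "AlpacaEval 2.0": 1}

metric_share = {name: share(n) for name, n in metric_counts.items()}
benchmark_share = {name: share(n) for name, n in benchmark_counts.items()}

print(metric_share)     # {'accuracy': 29.5, 'cost': 9.1}
print(benchmark_share)  # {'AlpacaEval': 2.3, 'AlpacaEval 2.0': 2.3}
```

With only one paper behind each benchmark share, a single added or removed paper changes that figure by about 2.3 points, which is worth keeping in mind before treating it as a trend.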

Researcher Checklist

Moderate: Papers with explicit human feedback. Coverage is usable but incomplete (27.3% vs 45% target).
Gap: Papers reporting quality controls. Coverage is a replication risk (2.3% vs 30% target).
Gap: Papers naming benchmarks/datasets. Coverage is a replication risk (18.2% vs 35% target).
Strong: Papers naming evaluation metrics. Coverage is strong (52.3% vs 35% target).
Gap: Papers with known rater population. Coverage is a replication risk (13.6% vs 35% target).
Gap: Papers with known annotation unit. Coverage is a replication risk (20.5% vs 35% target).
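
The checklist labels above can be reproduced by comparing each coverage figure with its target. The sketch below is an assumption, not the hub's documented rule: it treats coverage at or above target as Strong, coverage at or above 60% of target as Moderate, and anything lower as Gap. The 0.6 cutoff is a hypothetical value, chosen only because it reproduces the six labels in this checklist.

```python
def coverage_band(coverage: float, target: float, moderate_ratio: float = 0.6) -> str:
    """Classify coverage against a target. Thresholds are assumed, not documented."""
    if coverage >= target:
        return "Strong"
    if coverage >= moderate_ratio * target:
        return "Moderate"
    return "Gap"


# (coverage %, target %) pairs as reported in the checklist.
checklist = {
    "explicit human feedback": (27.3, 45),
    "quality controls": (2.3, 30),
    "benchmarks/datasets": (18.2, 35),
    "evaluation metrics": (52.3, 35),
    "known rater population": (13.6, 35),
    "known annotation unit": (20.5, 35),
}

bands = {name: coverage_band(cov, tgt) for name, (cov, tgt) in checklist.items()}
for name, band in bands.items():
    print(f"{band}: {name}")
```

Human feedback (27.3% against a 45% target) and annotation unit (20.5% against 35%) sit on either side of the assumed cutoff, so a different cutoff would relabel them.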

Strengths

Agentic evaluation appears in 100% of papers.

Known Gaps

Only 2.3% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (13.6% coverage).
Annotation unit is under-specified (20.5% coverage).

Suggested Next Analyses

Pair this hub with a human_eval-heavy hub to validate judge-model calibration.
Stratify by benchmark (AlpacaEval vs AlpacaEval 2.0) before comparing methods.
Track metric sensitivity by reporting both accuracy and cost.
Add inter-annotator agreement checks when reproducing these protocols.

Recommended Queries

LLM-as-Judge Protocols
Benchmark Slice: AlpacaEval
Metric Slice: accuracy
Recent High-Signal Papers

Start with These 3

Use these when you need one protocol anchor, one benchmark anchor, and one recent comparison point before reading the wider hub.

Strongest protocol reference

SODIUM: From Open Web Data to Queryable Databases

Highest protocol score, with an explicit human/eval signal plus SODIUM-Bench.

Strongest benchmark reference

Elo-Evolve: A Co-evolutionary Framework for Language Model Alignment

MT-Bench with Elo gives a fast comparison anchor.

Strongest recent paper

Meanings and Measurements: Multi-Agent Probabilistic Grounding for Vi…

Useful for current practice scanning; published Mar 19, 2026.

Start Here (Best First 6)

Ranked for protocol completeness (human signal, benchmark + metric anchors, quality controls, and judge/human overlap).

SODIUM: From Open Web Data to Queryable Databases
Mar 19, 2026 · Citations: 0 · Score: 8.0
HF: Expert Verification · Eval: Automatic Metrics · Benchmark: SODIUM-Bench · Metric: Accuracy

Elo-Evolve: A Co-evolutionary Framework for Language Model Alignment
Feb 14, 2026 · Citations: 0 · Score: 7.5
HF: Pairwise Preference · Eval: Automatic Metrics · Benchmark: MT-Bench · Metric: Elo

Meanings and Measurements: Multi-Agent Probabilistic Grounding for Vision-Language Navigation
Mar 19, 2026 · Citations: 0 · Score: 6.5
HF: Demonstrations · Eval: Simulation Env · Benchmark: Mapg Bench · Metric: Not Reported

WebWeaver: Breaking Topology Confidentiality in LLM Multi-Agent Systems with Stealthy Context-Based Inference
Mar 11, 2026 · Citations: 0 · Score: 5.5
HF: Red Team · Eval: Automatic Metrics · Benchmark: Not Reported · Metric: Accuracy

Diff-KD: Diffusion-based Knowledge Distillation for Collaborative Perception under Corruptions
Apr 2, 2026 · Citations: 0 · Score: 5.5
HF: Not Reported · Eval: Automatic Metrics · Benchmark: Not Reported · Metric: Accuracy

QChunker: Learning Question-Aware Text Chunking for Domain RAG via Multi-Agent Debate
Mar 12, 2026 · Citations: 0 · Score: 5.5
HF: Not Reported · Eval: Automatic Metrics · Benchmark: Understanding Retrieval · Metric: Coherence

Protocol Matrix (Top 12)

Use this to quickly compare protocol ingredients instead of scanning long prose.

Paper	Date	HF Signal	Eval Modes	Benchmarks	Metrics	QC
SODIUM: From Open Web Data to Queryable Databases	Mar 19, 2026	Yes (Expert Verification)	Automatic Metrics	SODIUM-Bench	Accuracy	Not Reported
Elo-Evolve: A Co-evolutionary Framework for Language Model Alignment	Feb 14, 2026	Yes (Pairwise Preference)	Automatic Metrics	MT-Bench, AlpacaEval	Elo	Not Reported
Meanings and Measurements: Multi-Agent Probabilistic Grounding for Vision-Language Navigation	Mar 19, 2026	Yes (Demonstrations)	Simulation Env	Mapg Bench	Not Reported	Not Reported
WebWeaver: Breaking Topology Confidentiality in LLM Multi-Agent Systems with Stealthy Context-Based Inference	Mar 11, 2026	Yes (Red Team)	Automatic Metrics	Not Reported	Accuracy	Not Reported
Diff-KD: Diffusion-based Knowledge Distillation for Collaborative Perception under Corruptions	Apr 2, 2026	No (Not Reported)	Automatic Metrics	Not Reported	Accuracy	Calibration
QChunker: Learning Question-Aware Text Chunking for Domain RAG via Multi-Agent Debate	Mar 12, 2026	No (Not Reported)	Automatic Metrics	Understanding Retrieval	Coherence	Not Reported
RLShield: Practical Multi-Agent RL for Financial Cyber Defense with Attack-Surface MDPs and Real-Time Response Orchestration	Feb 26, 2026	No (Not Reported)	Automatic Metrics	APPS	Not Reported	Not Reported
I Can't Believe It's Corrupt: Evaluating Corruption in Multi-Agent Governance Systems	Mar 19, 2026	Yes (Rubric Rating)	Simulation Env	Not Reported	Not Reported	Not Reported
OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language Environment Simulation	Apr 13, 2026	No (Not Reported)	Simulation Env	OccuBench	Not Reported	Not Reported
Build, Judge, Optimize: A Blueprint for Continuous Improvement of Multi-Agent Consumer Assistants	Mar 3, 2026	Yes (Pairwise Preference, Rubric Rating)	LLM-as-Judge, Simulation Env	Not Reported	Not Reported	Not Reported
Multi-Agent Comedy Club: Investigating Community Discussion Effects on LLM Humor Generation	Feb 16, 2026	Yes (Pairwise Preference, Rubric Rating)	Not Reported	Not Reported	Not Reported	Not Reported
World-Model-Augmented Web Agents with Action Correction	Feb 17, 2026	No (Not Reported)	LLM-as-Judge, Simulation Env	VisualWebArena, Mind2Web	Not Reported	Not Reported
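
As a rough illustration of how the matrix can be filtered, the sketch below applies this page's replication-ready criterion (benchmark + metric + explicit evaluation mode) to a few rows copied from the table. The `Row` type and its field names are hypothetical, and `None` or an empty tuple stands in for "Not Reported".

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Row:
    """One protocol-matrix row (hypothetical schema for illustration)."""
    paper: str
    eval_modes: Tuple[str, ...]
    benchmark: Optional[str]
    metric: Optional[str]


def is_replication_ready(row: Row) -> bool:
    """True when a benchmark, a metric, and at least one eval mode are reported."""
    return bool(row.eval_modes) and row.benchmark is not None and row.metric is not None


# A subset of rows from the protocol matrix.
rows = [
    Row("SODIUM", ("Automatic Metrics",), "SODIUM-Bench", "Accuracy"),
    Row("Elo-Evolve", ("Automatic Metrics",), "MT-Bench, AlpacaEval", "Elo"),
    Row("Meanings and Measurements", ("Simulation Env",), "Mapg Bench", None),
    Row("WebWeaver", ("Automatic Metrics",), None, "Accuracy"),
    Row("QChunker", ("Automatic Metrics",), "Understanding Retrieval", "Coherence"),
    Row("Multi-Agent Comedy Club", (), None, None),
]

ready = [row.paper for row in rows if is_replication_ready(row)]
print(ready)  # ['SODIUM', 'Elo-Evolve', 'QChunker']
```

The same pattern extends to other slices, for example keeping only rows whose eval modes include both a human and a judge signal.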

Protocol Diff (Top Papers)

Fast side-by-side comparison for the highest-ranked papers in this hub.

Signal	SODIUM: From Open Web Data to Queryable Databases	Elo-Evolve: A Co-evolutionary Framework for Languag…	Meanings and Measurements: Multi-Agent Probabilisti…
Human Feedback	Expert Verification	Pairwise Preference	Demonstrations
Evaluation Modes	Automatic Metrics	Automatic Metrics	Simulation Env
Benchmarks	SODIUM-Bench	MT-Bench, AlpacaEval	Mapg Bench
Metrics	Accuracy	Elo	Not reported
Quality Controls	Not reported	Not reported	Not reported
Rater Population	Domain Experts	Unknown	Unknown
Annotation Unit	Unknown	Pairwise	Unknown

Research Utility Snapshot

Human Feedback Mix

Pairwise Preference (7)
Rubric Rating (3)
Expert Verification (2)
Demonstrations (1)

Evaluation Modes

Automatic Metrics (23)
Simulation Env (17)
LLM-as-Judge (4)

Top Benchmarks

AlpacaEval (1)
AlpacaEval 2.0 (1)
APPS (1)
Mapg Bench (1)

Top Metrics

Accuracy (13)
Cost (4)
F1 (2)
Relevance (2)

Rater Population Mix

Domain Experts (6)

Quality Controls

Calibration (1)

Coverage diagnostics (sample-based): human-feedback 27.3% · benchmarks 18.2% · metrics 52.3% · quality controls 2.3%.

Top Papers

SODIUM: From Open Web Data to Queryable Databases
Chuxuan Hu, Philip Li, Maxwell Yang, Daniel Kang · Mar 19, 2026 · Citations: 0

Expert Verification Automatic Metrics Multi Agent

Existing systems struggle with SODIUM tasks: we evaluate 6 advanced AI agents on SODIUM-Bench, with the strongest baseline achieving only 46.5% accuracy.
Elo-Evolve: A Co-evolutionary Framework for Language Model Alignment
Jing Zhao, Ting Zhen, Junwei Bao, Hongfei Jiang, Yang Song · Feb 14, 2026 · Citations: 0

Pairwise Preference Automatic Metrics Multi Agent

Current alignment methods for Large Language Models (LLMs) rely on compressing vast amounts of human preference data into static, absolute reward functions, leading to data scarcity, noise sensitivity, and training instability.
Meanings and Measurements: Multi-Agent Probabilistic Grounding for Vision-Language Navigation
Swagat Padhan, Lakshya Jain, Bhavya Minesh Shah, Omkar Patil, Thao Nguyen · Mar 19, 2026 · Citations: 0

Demonstrations Simulation Env Multi Agent

To address this limitation, we propose MAPG (Multi-Agent Probabilistic Grounding), an agentic framework that decomposes language queries into structured subcomponents and queries a VLM to ground each component.
I Can't Believe It's Corrupt: Evaluating Corruption in Multi-Agent Governance Systems
Vedanta S P, Ponnurangam Kumaraguru · Mar 19, 2026 · Citations: 0

Rubric Rating Simulation Env Multi Agent

Large language models are increasingly proposed as autonomous agents for high-stakes public workflows, yet we lack systematic evidence about whether they would follow institutional rules when granted authority.
Build, Judge, Optimize: A Blueprint for Continuous Improvement of Multi-Agent Consumer Assistants
Alejandro Breen Herrera, Aayush Sheth, Steven G. Xu, Zhucheng Zhan, Charles Wright · Mar 3, 2026 · Citations: 0

Pairwise Preference Rubric Rating LLM-as-Judge Simulation Env Long Horizon

Conversational shopping assistants (CSAs) represent a compelling application of agentic AI, but moving from prototype to production reveals two underexplored challenges: how to evaluate multi-turn interactions and how to optimize tightly…
WebWeaver: Breaking Topology Confidentiality in LLM Multi-Agent Systems with Stealthy Context-Based Inference
Zixun Xiong, Gaoyi Wu, Lingfeng Yao, Miao Pan, Xiaojiang Du · Mar 11, 2026 · Citations: 0

Red Team Automatic Metrics Multi Agent

Communication topology is a critical factor in the utility and safety of LLM-based multi-agent systems (LLM-MAS), making it a high-value intellectual property (IP) whose confidentiality remains insufficiently studied.
Multi-Agent Comedy Club: Investigating Community Discussion Effects on LLM Humor Generation
Shiwei Hong, Lingyao Li, Ethan Z. Rong, Chenxinran Shen, Zhicong Lu · Feb 16, 2026 · Citations: 0

Pairwise Preference Rubric Rating Multi Agent

Prior work has explored multi-turn interaction and feedback for LLM writing, but evaluations still largely center on prompts and localized feedback, leaving persistent public reception in online communities underexamined.
World-Model-Augmented Web Agents with Action Correction
Zhouzhou Shen, Xueyu Hu, Xiyun Li, Tianqing Fang, Juncheng Li · Feb 17, 2026 · Citations: 0

LLM-as-Judge Simulation Env Multi Agent

To address these challenges, we propose WAC, a web agent that integrates model collaboration, consequence simulation, and feedback-driven action refinement.
Social Dynamics as Critical Vulnerabilities that Undermine Objective Decision-Making in LLM Collectives
Changgeon Ko, Jisu Shin, Hoyun Song, Huije Lee, Eui Jun Hwang · Apr 7, 2026 · Citations: 0

Automatic Metrics Simulation Env Multi Agent

Large language model (LLM) agents are increasingly acting as human delegates in multi-agent environments, where a representative agent integrates diverse peer perspectives to make a final decision.
Mobile-Agent-v3.5: Multi-platform Fundamental GUI Agents
Haiyang Xu, Xi Zhang, Haowei Liu, Junyang Wang, Zhaozai Zhu · Feb 15, 2026 · Citations: 0

Simulation Env Long Horizon

The paper introduces GUI-Owl-1.5, the latest native GUI agent model that features instruct/thinking variants in multiple sizes (2B/4B/8B/32B/235B) and supports a range of platforms (desktop, mobile, browser, and more) to enable cloud-edge…
Spatio-Temporal Attention Enhanced Multi-Agent DRL for UAV-Assisted Wireless Networks with Limited Communications
Che Chen, Lanhua Li, Shimin Gong, Yu Zhao, Yuming Fang · Mar 23, 2026 · Citations: 0

Simulation Env Long Horizon

To maximize the overall throughput, we first propose a delay-tolerant multi-agent deep reinforcement learning (MADRL) algorithm that integrates a delay-penalized reward to encourage information sharing among UAVs, while jointly optimizing…
Exploring Plan Space through Conversation: An Agentic Framework for LLM-Mediated Explanations in Planning
Guilhem Fouilhé, Rebecca Eifler, Antonin Poché, Sylvie Thiébaux, Nicholas Asher · Mar 2, 2026 · Citations: 0

Pairwise Preference Multi Agent

When automating plan generation for a real-world sequential decision problem, the goal is often not to replace the human planner, but to facilitate an iterative reasoning and elicitation process, where the human's role is to guide the AI…
Toward Expert Investment Teams: A Multi-Agent LLM System with Fine-Grained Trading Tasks
Kunihiro Miyazaki, Takanobu Kawahara, Stephen Roberts, Stefan Zohren · Feb 26, 2026 · Citations: 0

Pairwise Preference Multi Agent

While mainstream approaches deploy multi-agent systems mimicking analyst and manager roles, they often rely on abstract instructions that overlook the intricacies of real-world workflows, which can lead to degraded inference performance and…
Decentralized Ranking Aggregation: Gossip Algorithms for Borda and Copeland Consensus
Anna Van Elst, Kerrian Le Caillec, Igor Colin, Stephan Clémençon · Feb 26, 2026 · Citations: 0

Pairwise Preference Multi Agent

The concept of ranking aggregation plays a central role in preference analysis, and numerous algorithms for calculating median rankings, often originating in social choice theory, have been documented in the literature, offering theoretical…
OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language Environment Simulation
Xiaomeng Hu, Yinger Zhang, Fei Huang, Jianhong Tu, Yang Su · Apr 13, 2026 · Citations: 0

Simulation Env Multi Agent

We introduce OccuBench, a benchmark covering 100 real-world professional task scenarios across 10 industry categories and 65 specialized domains, enabled by Language Environment Simulators (LESs) that simulate domain-specific environments…
ActionParty: Multi-Subject Action Binding in Generative Video Games
Alexander Pondaven, Ziyi Wu, Igor Gilitschenski, Philip Torr, Sergey Tulyakov · Apr 2, 2026 · Citations: 0

Automatic Metrics Simulation Env Multi Agent

However, these models are largely restricted to single-agent settings, failing to control multiple agents simultaneously in a scene.
Multi-Agent Dialectical Refinement for Enhanced Argument Classification
Jakub Bąba, Jarosław A. Chudziak · Mar 29, 2026 · Citations: 0

LLM-as-Judge Automatic Metrics Multi Agent

We introduce MAD-ACC (Multi-Agent Debate for Argument Component Classification), a framework that leverages dialectical refinement to resolve classification uncertainty.
The Emergence of Lab-Driven Alignment Signatures: A Psychometric Framework for Auditing Latent Bias and Compounding Risk in Generative AI
Dusan Bosnjakovic · Feb 19, 2026 · Citations: 0

LLM-as-Judge Automatic Metrics Multi Agent

As Large Language Models (LLMs) transition from standalone chat interfaces to foundational reasoning layers in multi-agent systems and recursive evaluation loops (LLM-as-a-judge), the detection of durable, provider-level behavioral…
Diff-KD: Diffusion-based Knowledge Distillation for Collaborative Perception under Corruptions
Pengcheng Lyu, Chaokun Zhang, Gong Chen, Tao Tang, Zhaoxiang Luo · Apr 2, 2026 · Citations: 0

Automatic Metrics Multi Agent

Multi-agent collaborative perception enables autonomous systems to overcome individual sensing limits through collective intelligence.
Learning to Negotiate: Multi-Agent Deliberation for Collective Value Alignment in LLMs
Panatchakorn Anantaprayoon, Nataliia Babina, Nima Asgharbeygi, Jad Tarifi · Mar 11, 2026 · Citations: 0

Rlaif Or Synthetic Feedback Multi Agent

The alignment of large language models (LLMs) has progressed substantially in single-agent settings through paradigms such as RLHF and Constitutional AI, with recent work exploring scalable alternatives such as RLAIF and evolving alignment…
S5-SHB Agent: Society 5.0 enabled Multi-model Agentic Blockchain Framework for Smart Home
Janani Rangila, Akila Siriweera, Incheon Paik, Keitaro Naruse, Isuru Jayanada · Mar 5, 2026 · Citations: 0

Pairwise Preference Multi Agent

The smart home is a key application domain within the Society 5.0 vision for a human-centered society.
QChunker: Learning Question-Aware Text Chunking for Domain RAG via Multi-Agent Debate
Jihao Zhao, Daixuan Li, Pengfei Li, Shuaishuai Zu, Biao Qin · Mar 12, 2026 · Citations: 0

Automatic Metrics Multi Agent

Drawing inspiration from Hal Gregersen's "Questions Are the Answer" theory, we design a multi-agent debate framework comprising four specialized components: a question outline generator, text segmenter, integrity reviewer, and knowledge…
RLShield: Practical Multi-Agent RL for Financial Cyber Defense with Attack-Surface MDPs and Real-Time Response Orchestration
Srikumar Nayak · Feb 26, 2026 · Citations: 0

Automatic Metrics Multi Agent

This paper proposes RLShield, a practical multi-agent RL pipeline for financial cyber defense.
From Self-Evolving Synthetic Data to Verifiable-Reward RL: Post-Training Multi-turn Interactive Tool-Using Agents
Jiaxuan Gao, Jiaao Chen, Chuyi He, Shusheng Xu, Di Jin · Jan 30, 2026 · Citations: 0

Simulation Env Long Horizon

Interactive tool-using agents must solve real-world tasks via multi-turn interaction with both humans and external environments, requiring dialogue state tracking, multi-step tool execution, while following complex instructions.
Evaluating Chain-of-Thought Reasoning through Reusability and Verifiability
Shashank Aggarwal, Ram Vikas Mishra, Amit Awekar · Feb 19, 2026 · Citations: 0

Automatic Metrics Multi Agent

In multi-agent IR pipelines for tasks such as search and ranking, LLM-based agents exchange intermediate reasoning in terms of Chain-of-Thought (CoT) with each other.
Heterogeneous Debate Engine: Identity-Grounded Cognitive Architecture for Resilient LLM-Based Ethical Tutoring
Jakub Masłowski, Jarosław A. Chudziak · Mar 28, 2026 · Citations: 0

Simulation Env Multi Agent

Large Language Models (LLMs) are being increasingly used as autonomous agents in complex reasoning tasks, opening the niche for dialectical interactions.
GameplayQA: A Benchmarking Framework for Decision-Dense POV-Synced Multi-Video Understanding of 3D Virtual Agents
Yunzhe Wang, Runhui Xu, Kexin Zheng, Tianyi Zhang, Jayavibhav Niranjan Kogundi · Mar 25, 2026 · Citations: 0

Simulation Env Multi Agent

Multimodal LLMs are increasingly deployed as perceptual backbones for autonomous agents in 3D environments, from robotics to virtual worlds.
Influencing LLM Multi-Agent Dialogue via Policy-Parameterized Prompts
Hongbo Bo, Jingyu Hu, Weiru Liu · Mar 10, 2026 · Citations: 0

Simulation Env Multi Agent

Large Language Models (LLMs) have emerged as a new paradigm for multi-agent systems.
Alignment Backfire: Language-Dependent Reversal of Safety Interventions Across 16 Languages in LLM Multi-Agent Systems
Hiroki Fukui · Mar 5, 2026 · Citations: 0

Simulation Env Multi Agent

We report four preregistered studies (1,584 multi-agent simulations across 16 languages and three model families) demonstrating that alignment interventions in large language models produce a structurally analogous phenomenon: surface…
Cooperative-Competitive Team Play of Real-World Craft Robots
Rui Zhao, Xihui Li, Yizheng Zhang, Yuzhen Liu, Zhong Zhang · Feb 24, 2026 · Citations: 0

Simulation Env Multi Agent

Multi-agent deep Reinforcement Learning (RL) has made significant progress in developing intelligent game-playing agents in recent years.
Architecting AgentOS: From Token-Level Context to Emergent System-Level Intelligence
ChengYou Li, XiaoDong Liu, XiangBao Meng, XinYu Zhao · Feb 24, 2026 · Citations: 0

Simulation Env Multi Agent

The paradigm of Large Language Models is undergoing a fundamental transition from static inference engines to dynamic autonomous cognitive systems. While current research primarily focuses on scaling context windows or optimizing prompt engi…
Verifiable Semantics for Agent-to-Agent Communication
Philipp Schoenegger, Matt Carlson, Chris Schneider, Chris Daly · Feb 18, 2026 · Citations: 0

Simulation Env Multi Agent

Multiagent AI systems require consistent communication, but we lack methods to verify that agents share the same understanding of the terms used.
Learning to Interrupt in Language-based Multi-agent Communication
Danqing Wang, Da Yin, Ruta Desai, Lei Li, Asli Celikyilmaz · Apr 7, 2026 · Citations: 0

Automatic Metrics Multi Agent

Motivated by this, we propose an interruptible communication framework that allows the agent who is listening to interrupt the current speaker.
Towards Automated Community Notes Generation with Large Vision Language Models for Combating Contextual Deception
Jin Ma, Jingwen Yan, Mohammed Aldeen, Ethan Anderson, Taran Kavuru · Mar 23, 2026 · Citations: 0

Automatic Metrics Multi Agent

However, its reliance on human contributors limits both the timeliness and scalability.
Governed Memory: A Production Architecture for Multi-Agent Workflows
Hamed Taheri · Mar 18, 2026 · Citations: 0

Automatic Metrics Long Horizon

Enterprise AI deploys dozens of autonomous agent nodes across workflows, each acting on the same entities with no shared memory and no common governance.
Semantic Invariance in Agentic AI
I. de Zarzà, J. de Curtò, Jordi Cabot, Pietro Manzoni, Carlos T. Calafate · Mar 13, 2026 · Citations: 0

Automatic Metrics Long Horizon

Standard benchmark evaluations, which assess accuracy on fixed, canonical problem formulations, fail to capture this critical reliability dimension.
From Debate to Deliberation: Structured Collective Reasoning with Typed Epistemic Acts
Sunil Prakash · Mar 12, 2026 · Citations: 0

Automatic Metrics Multi Agent

Multi-agent LLM systems increasingly tackle complex reasoning, yet their interaction patterns remain limited to voting, unstructured debate, or pipeline orchestration.
Chow-Liu Ordering for Long-Context Reasoning in Chain-of-Agents
Naman Gupta, Vaibhav Singh, Arun Iyer, Kirankumar Shiragur, Pratham Grover · Mar 10, 2026 · Citations: 0

Automatic Metrics Multi Agent

Sequential multi-agent reasoning frameworks such as Chain-of-Agents (CoA) handle long-context queries by decomposing inputs into chunks and processing them sequentially using LLM-based worker agents that read from and update a bounded…
LieCraft: A Multi-Agent Framework for Evaluating Deceptive Capabilities in Language Models
Matthew Lyle Olson, Neale Ratzlaff, Musashi Hinck, Tri Nguyen, Vasudev Lal · Mar 6, 2026 · Citations: 0

Automatic Metrics Multi Agent

Large Language Models (LLMs) exhibit impressive general-purpose capabilities but also introduce serious safety risks, particularly the potential for deception as models acquire increased agency and human oversight diminishes.
CiteAudit: You Cited It, But Did You Read It? A Benchmark for Verifying Scientific References in the LLM Era
Zhengqing Yuan, Kaiwen Shi, Zheyuan Zhang, Lichao Sun, Nitesh V. Chawla · Feb 26, 2026 · Citations: 0

Automatic Metrics Multi Agent

Meanwhile, rapidly growing reference lists make manual verification impractical, and existing automated tools remain fragile to noisy and heterogeneous citation formats and lack standardized evaluation.
Hierarchical LLM-Based Multi-Agent Framework with Prompt Optimization for Multi-Robot Task Planning
Tomoya Kawabe, Rin Takano · Feb 25, 2026 · Citations: 0

Automatic Metrics Long Horizon

We present a hierarchical multi-agent LLM-based planner with prompt optimization: an upper layer decomposes tasks and assigns them to lower-layer agents, which generate PDDL problems solved by a classical planner.
The Headless Firm: How AI Reshapes Enterprise Boundaries
Tassilo Klein, Sebastian Wieczorek · Feb 24, 2026 · Citations: 0

Automatic Metrics Multi Agent

We argue that agentic AI induces a structural change in how coordination costs scale: in prior modular systems, integration cost grew with interaction topology (O(n^2) in the number of components); in protocol-mediated agentic systems, inte…
WideSeek-R1: Exploring Width Scaling for Broad Information Seeking via Multi-Agent Reinforcement Learning
Zelai Xu, Zhexuan Xu, Ruize Zhang, Chunyang Zhu, Shi Yu · Feb 4, 2026 · Citations: 0

Automatic Metrics Tool Use

To bridge this gap, we propose WideSeek-R1, a lead-agent-subagent framework trained via multi-agent reinforcement learning (MARL) to synergize scalable orchestration and parallel execution.
Training Generalizable Collaborative Agents via Strategic Risk Aversion
Chengrui Qu, Yizhou Zhang, Nicholas Lanzetti, Eric Mazumdar · Feb 25, 2026 · Citations: 0

Automatic Metrics Multi Agent

Many emerging agentic paradigms require agents to collaborate with one another (or people) to achieve shared goals.
