Mastering Multi-Drone Volleyball through Hierarchical Co-Self-Play Reinforcement Learning

Ruize Zhang, Sirui Xiang, Zelai Xu, Feng Gao, Shilong Ji, Wenhao Tang, Wenbo Ding, Chao Yu, Yu Wang · May 7, 2025 · Citations: 0

Automatic Metrics · Demonstrations · General · Long Horizon · Multi Agent

Abstract

In this paper, we tackle the problem of learning to play 3v3 multi-drone volleyball, a new embodied competitive task that requires both high-level strategic coordination and low-level agile control. The task is turn-based, multi-agent, and physically grounded, posing significant challenges due to its long-horizon dependencies, tight inter-agent coupling, and the underactuated dynamics of quadrotors. To address this, we propose Hierarchical Co-Self-Play (HCSP), a hierarchical reinforcement learning framework that separates centralized high-level strategic decision-making from decentralized low-level motion control. We design a three-stage population-based training pipeline to enable both strategy and skill to emerge from scratch without expert demonstrations: (I) training diverse low-level skills, (II) learning high-level strategy via self-play with fixed low-level skills, and (III) joint fine-tuning through co-self-play. Experiments show that HCSP achieves superior performance, outperforming non-hierarchical self-play and rule-based hierarchical baselines with an average 82.9% win rate and a 71.5% win rate against the two-stage variant. Moreover, co-self-play leads to emergent team behaviors such as role switching and coordinated formations, demonstrating the effectiveness of our hierarchical design and training scheme. The project page is at https://hi-co-self-play.github.io.
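
The split described in the abstract, a centralized high-level strategy choosing among decentralized low-level skills, can be illustrated with a toy control step. This is a minimal sketch under stated assumptions and not the authors' implementation: the skill names, the random high-level policy, the four-value action, and every function name below are hypothetical placeholders.

```python
import random
from typing import Callable, Dict, List, Sequence

Observation = Sequence[float]
Action = List[float]

# Hypothetical skill names; the abstract does not list the actual skill set.
SKILLS = ("serve", "pass", "attack", "hover")


def make_low_level_skill(name: str) -> Callable[[Observation], Action]:
    """Return a stand-in low-level controller for one skill.

    A trained controller would map a drone's own observation to motor
    commands; here we only scale the first four observation entries.
    """
    gain = (SKILLS.index(name) + 1) / len(SKILLS)

    def skill(local_obs: Observation) -> Action:
        return [gain * x for x in local_obs[:4]]

    return skill


def high_level_policy(global_obs: Dict[int, Observation],
                      rng: random.Random) -> Dict[int, str]:
    """Centralized stand-in: sees all drones, assigns one skill to each.

    A learned strategy would condition on global_obs; this placeholder
    picks skills at random.
    """
    return {drone_id: rng.choice(SKILLS) for drone_id in global_obs}


def hierarchical_step(global_obs: Dict[int, Observation],
                      skills: Dict[str, Callable[[Observation], Action]],
                      rng: random.Random) -> Dict[int, Action]:
    """One control step: centralized skill choice, decentralized execution."""
    assignment = high_level_policy(global_obs, rng)
    # Each low-level controller receives only its own drone's observation.
    return {d: skills[assignment[d]](global_obs[d]) for d in global_obs}


if __name__ == "__main__":
    rng = random.Random(0)
    team_obs = {i: [0.1 * i, 0.2, 0.3, 0.4, 0.5] for i in range(3)}
    skill_table = {name: make_low_level_skill(name) for name in SKILLS}
    print(hierarchical_step(team_obs, skill_table, rng))
```

In the three-stage pipeline the abstract outlines, the skill table would be trained first, the high-level policy would then be learned via self-play with the skills fixed, and both levels would finally be fine-tuned together.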

HFEPX Relevance Assessment

This paper has direct human-feedback and/or evaluation protocol signal and is likely useful for eval pipeline design.

Eval-Fit Score

65/100 • Medium

Useful as a secondary reference; validate protocol details against neighboring papers.

Human Feedback Signal

Detected

Evaluation Signal

Detected

HFEPX Fit

High-confidence candidate

Human Data Lens

Uses human feedback: Yes
Feedback types: Demonstrations
Rater population: Domain Experts
Unit of annotation: Unknown
Expertise required: General
Extraction source: Persisted extraction

Evaluation Lens

Evaluation modes: Automatic Metrics
Agentic eval: Long Horizon, Multi Agent
Quality controls: Not reported
Confidence: 0.70
Flags: None

Protocol And Measurement Signals

Benchmarks / Datasets

No benchmark or dataset names were extracted from the available abstract.

Reported Metrics

win rate

Research Brief

Deterministic synthesis

The task is turn-based, multi-agent, and physically grounded, posing significant challenges due to its long-horizon dependencies, tight inter-agent coupling, and the underactuated dynamics of quadrotors. HFEPX signals include Demonstrations, Automatic Metrics, Long Horizon with confidence 0.70. Updated from current HFEPX corpus.

Generated Mar 2, 2026, 10:18 PM · Grounded in abstract + metadata only

Key Takeaways

The task is turn-based, multi-agent, and physically grounded, posing significant challenges due to its long-horizon dependencies, tight inter-agent coupling, and the underactuated…
To address this, we propose Hierarchical Co-Self-Play (HCSP), a hierarchical reinforcement learning framework that separates centralized high-level strategic decision-making from…

Researcher Actions

Compare its human-feedback setup against pairwise and rubric hubs.
Identify benchmark choices from full text before operationalizing conclusions.
Validate metric comparability (win rate).
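
One way to check whether two reported win rates are comparable is to attach an uncertainty interval to each, since a win rate from a few dozen matches is much less certain than one from thousands. The sketch below uses the Wilson score interval for a binomial proportion. The match counts in the usage lines are hypothetical, because the abstract reports only percentages, and the function names are illustrative.

```python
import math
from typing import Tuple


def win_rate(wins: int, games: int) -> float:
    """Fraction of games won."""
    if games <= 0:
        raise ValueError("games must be positive")
    if not 0 <= wins <= games:
        raise ValueError("wins must be between 0 and games")
    return wins / games


def wilson_interval(wins: int, games: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a win rate (z = 1.96 gives roughly 95%)."""
    p = win_rate(wins, games)
    z2 = z * z
    denom = 1.0 + z2 / games
    center = (p + z2 / (2.0 * games)) / denom
    half = z * math.sqrt(p * (1.0 - p) / games + z2 / (4.0 * games * games)) / denom
    return center - half, center + half


if __name__ == "__main__":
    # Hypothetical counts chosen to match the reported percentages;
    # the number of evaluation matches is not given in the abstract.
    for wins, games in [(829, 1000), (715, 1000)]:
        low, high = wilson_interval(wins, games)
        print(f"{win_rate(wins, games):.3f} in [{low:.3f}, {high:.3f}]")
```

If the intervals of two systems overlap heavily, the difference in win rate may not be meaningful without more matches.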

Caveats

Generated from title, abstract, and extracted metadata only; full-paper implementation details are not parsed.
Extraction confidence is probabilistic and should be validated for critical decisions.

Recommended Queries

human-eval protocol design
agent eval benchmark comparison
inter-rater agreement adjudication

Research Summary

Contribution Summary

The task is turn-based, multi-agent, and physically grounded, posing significant challenges due to its long-horizon dependencies, tight inter-agent coupling, and the underactuated dynamics of quadrotors.
To address this, we propose Hierarchical Co-Self-Play (HCSP), a hierarchical reinforcement learning framework that separates centralized high-level strategic decision-making from decentralized low-level motion control.
Experiments show that HCSP achieves superior performance, outperforming non-hierarchical self-play and rule-based hierarchical baselines with an average 82.9% win rate and a 71.5% win rate against the two-stage variant.

Why It Matters For Eval

The task is turn-based, multi-agent, and physically grounded, posing significant challenges due to its long-horizon dependencies, tight inter-agent coupling, and the underactuated dynamics of quadrotors.

Researcher Checklist

Pass: Human feedback protocol is explicit
Detected: Demonstrations

Pass: Evaluation mode is explicit
Detected: Automatic Metrics

Gap: Quality control reporting appears
No calibration/adjudication/IAA control explicitly detected.

Gap: Benchmark or dataset anchors are present
No benchmark/dataset anchor extracted from abstract.

Pass: Metric reporting is present
Detected: win rate

Related Papers

Papers are ranked by protocol overlap, extraction signal alignment, and semantic proximity.

VolleyBots: A Testbed for Multi-Drone Volleyball Game Combining Motion Control and Strategic Play · Protocol Overlap

Citations: 0 · Relevance: 8.70 · Shared tags: Demonstrations, Multi Agent
- Shared 2 HFEPX protocol tags
- Aligned human feedback protocol
- Aligned agent-evaluation setup
- Shared metric mentions

MoMaGen: Generating Demonstrations under Soft and Hard Constraints for Multi-Step Bimanual Mobile Manipulation · Protocol Overlap

Citations: 0 · Relevance: 7.80 · Shared tags: Demonstrations, Long Horizon
- Shared 2 HFEPX protocol tags
- Aligned human feedback protocol
- Aligned agent-evaluation setup

SPACeR: Self-Play Anchoring with Centralized Reference Models · Protocol Overlap

Citations: 0 · Relevance: 7.80 · Shared tags: Demonstrations, Multi Agent
- Shared 2 HFEPX protocol tags
- Aligned human feedback protocol
- Aligned agent-evaluation setup

Supervised Reinforcement Learning: From Expert Trajectories to Step-wise Reasoning · Protocol Overlap

Citations: 0 · Relevance: 7.80 · Shared tags: Demonstrations, Long Horizon
- Shared 2 HFEPX protocol tags
- Aligned human feedback protocol
- Aligned agent-evaluation setup

A Hierarchical Multi-Agent System for Autonomous Discovery in Geoscientific Data Archives · Protocol Overlap

Citations: 0 · Relevance: 7.40 · Shared tags: Long Horizon, Multi Agent
- Shared 2 HFEPX protocol tags
- Aligned agent-evaluation setup

CoAct-1: Computer-using Multi-Agent System with Coding Actions · Protocol Overlap

Citations: 0 · Relevance: 7.40 · Shared tags: Long Horizon, Multi Agent
- Shared 2 HFEPX protocol tags
- Aligned agent-evaluation setup

Hierarchical LLM-Based Multi-Agent Framework with Prompt Optimization for Multi-Robot Task Planning · Protocol Overlap

Citations: 0 · Relevance: 7.40 · Shared tags: Long Horizon, Multi Agent
- Shared 2 HFEPX protocol tags
- Aligned agent-evaluation setup

Mobile-Agent-v3.5: Multi-platform Fundamental GUI Agents · Protocol Overlap

Citations: 0 · Relevance: 7.40 · Shared tags: Long Horizon, Multi Agent
- Shared 2 HFEPX protocol tags
- Aligned agent-evaluation setup

AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors · Protocol Overlap

Citations: 0 · Relevance: 4.10 · Shared tag: Demonstrations
- Shared HFEPX protocol tags
- Aligned human feedback protocol

Explore-on-Graph: Incentivizing Autonomous Exploration of Large Language Models on Knowledge Graphs with Path-refined Reward Modeling · Protocol Overlap

Citations: 0 · Relevance: 4.10 · Shared tag: Demonstrations
- Shared HFEPX protocol tags
- Aligned human feedback protocol

FewMMBench: A Benchmark for Multimodal Few-Shot Learning · Protocol Overlap

Citations: 0 · Relevance: 4.10 · Shared tag: Demonstrations
- Shared HFEPX protocol tags
- Aligned human feedback protocol

Fine-Tuning Without Forgetting In-Context Learning: A Theoretical Analysis of Linear Attention Models · Protocol Overlap

Citations: 0 · Relevance: 4.10 · Shared tag: Demonstrations
- Shared HFEPX protocol tags
- Aligned human feedback protocol