Skip to content
← Back to explorer

Mastering Multi-Drone Volleyball through Hierarchical Co-Self-Play Reinforcement Learning

Ruize Zhang, Sirui Xiang, Zelai Xu, Feng Gao, Shilong Ji, Wenhao Tang, Wenbo Ding, Chao Yu, Yu Wang · May 7, 2025 · Citations: 0

Abstract

In this paper, we tackle the problem of learning to play 3v3 multi-drone volleyball, a new embodied competitive task that requires both high-level strategic coordination and low-level agile control. The task is turn-based, multi-agent, and physically grounded, posing significant challenges due to its long-horizon dependencies, tight inter-agent coupling, and the underactuated dynamics of quadrotors. To address this, we propose Hierarchical Co-Self-Play (HCSP), a hierarchical reinforcement learning framework that separates centralized high-level strategic decision-making from decentralized low-level motion control. We design a three-stage population-based training pipeline to enable both strategy and skill to emerge from scratch without expert demonstrations: (I) training diverse low-level skills, (II) learning high-level strategy via self-play with fixed low-level skills, and (III) joint fine-tuning through co-self-play. Experiments show that HCSP achieves superior performance, outperforming non-hierarchical self-play and rule-based hierarchical baselines with an average 82.9% win rate and a 71.5% win rate against the two-stage variant. Moreover, co-self-play leads to emergent team behaviors such as role switching and coordinated formations, demonstrating the effectiveness of our hierarchical design and training scheme. The project page is at https://hi-co-self-play.github.io.

HFEPX Relevance Assessment

This paper has direct human-feedback and/or evaluation protocol signal and is likely useful for eval pipeline design.

Eval-Fit Score

65/100 • Medium

Useful as a secondary reference; validate protocol details against neighboring papers.

Human Feedback Signal

Detected

Evaluation Signal

Detected

HFEPX Fit

High-confidence candidate

Human Data Lens

  • Uses human feedback: Yes
  • Feedback types: Demonstrations
  • Rater population: Domain Experts
  • Unit of annotation: Unknown
  • Expertise required: General
  • Extraction source: Persisted extraction

Evaluation Lens

  • Evaluation modes: Automatic Metrics
  • Agentic eval: Long Horizon, Multi Agent
  • Quality controls: Not reported
  • Confidence: 0.70
  • Flags: None

Protocol And Measurement Signals

Benchmarks / Datasets

No benchmark or dataset names were extracted from the available abstract.

Reported Metrics

win rate

Research Brief

Deterministic synthesis

The task is turn-based, multi-agent, and physically grounded, posing significant challenges due to its long-horizon dependencies, tight inter-agent coupling, and the underactuated dynamics of quadrotors. HFEPX signals include Demonstrations, Automatic Metrics, Long Horizon with confidence 0.70. Updated from current HFEPX corpus.

Generated Mar 2, 2026, 10:18 PM · Grounded in abstract + metadata only

Key Takeaways

  • The task is turn-based, multi-agent, and physically grounded, posing significant challenges due to its long-horizon dependencies, tight inter-agent coupling, and the underactuated…
  • To address this, we propose Hierarchical Co-Self-Play (HCSP), a hierarchical reinforcement learning framework that separates centralized high-level strategic decision-making from…

Researcher Actions

  • Compare its human-feedback setup against pairwise and rubric hubs.
  • Identify benchmark choices from full text before operationalizing conclusions.
  • Validate metric comparability (win rate).

Caveats

  • Generated from title, abstract, and extracted metadata only; full-paper implementation details are not parsed.
  • Extraction confidence is probabilistic and should be validated for critical decisions.

Research Summary

Contribution Summary

  • The task is turn-based, multi-agent, and physically grounded, posing significant challenges due to its long-horizon dependencies, tight inter-agent coupling, and the underactuated dynamics of quadrotors.
  • To address this, we propose Hierarchical Co-Self-Play (HCSP), a hierarchical reinforcement learning framework that separates centralized high-level strategic decision-making from decentralized low-level motion control.
  • Experiments show that HCSP achieves superior performance, outperforming non-hierarchical self-play and rule-based hierarchical baselines with an average 82.9% win rate and a 71.5% win rate against the two-stage variant.

Why It Matters For Eval

  • The task is turn-based, multi-agent, and physically grounded, posing significant challenges due to its long-horizon dependencies, tight inter-agent coupling, and the underactuated dynamics of quadrotors.

Researcher Checklist

  • Pass: Human feedback protocol is explicit

    Detected: Demonstrations

  • Pass: Evaluation mode is explicit

    Detected: Automatic Metrics

  • Gap: Quality control reporting appears

    No calibration/adjudication/IAA control explicitly detected.

  • Gap: Benchmark or dataset anchors are present

    No benchmark/dataset anchor extracted from abstract.

  • Pass: Metric reporting is present

    Detected: win rate

Related Papers

Papers are ranked by protocol overlap, extraction signal alignment, and semantic proximity.

Need human evaluators for your AI research? Scale annotation with expert AI Trainers.