HFEPX Hub

CS.LG + Multi Agent Papers

Updated from current HFEPX corpus (Mar 8, 2026). 13 papers are grouped in this hub page.

Read Full Context

Updated from current HFEPX corpus (Mar 8, 2026). 13 papers are grouped in this hub page. Common evaluation modes: Automatic Metrics, Simulation Env. Most common rater population: Domain Experts. Common annotation unit: Pairwise. Frequently cited benchmark: AdvBench. Common metric signal: cost. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Feb 27, 2026.

Papers: 13 Last published: Feb 27, 2026 Global RSS Tag RSS

Cs.LGMulti Agent

Researcher Quick Triage

This hub is best used for protocol triage and replication planning from abstract-level evidence. Quality band: Developing .

All Sampled Papers (13) Replication-Ready Only (2)

High-Signal Coverage

100.0%

13 / 13 sampled papers are not low-signal flagged.

Replication-Ready Set

Benchmark + metric + eval mode explicitly present.

Judge/Human Comparability

Papers containing both `human_eval` and `llm_as_judge`.

2 papers are replication-ready (benchmark + metric + explicit evaluation mode).
0 papers support judge-vs-human agreement analysis.
0 papers report explicit quality controls (calibration/adjudication/IAA).

Primary action: Use this page for scouting only; collect additional papers before attempting replication-critical comparisons.

Why This Matters For Eval Research

50% of papers report explicit human-feedback signals, led by pairwise preferences.
automatic metrics appears in 38.5% of papers in this hub.
AdvBench is a recurring benchmark anchor for cross-paper comparisons in this page.

Protocol Takeaways

Quality-control reporting is sparse in this slice; prioritize papers with explicit calibration or adjudication steps.
Rater context is mostly domain experts, and annotation is commonly pairwise annotation; use this to scope replication staffing.
Pair this hub with a human_eval-heavy hub to validate judge-model calibration.

Benchmark Interpretation

AdvBench appears in 8.3% of hub papers (1/13); use this cohort for benchmark-matched comparisons.
APPS appears in 8.3% of hub papers (1/13); use this cohort for benchmark-matched comparisons.

Metric Interpretation

cost is reported in 16.7% of hub papers (2/13); compare with a secondary metric before ranking methods.
jailbreak success rate is reported in 8.3% of hub papers (1/13); compare with a secondary metric before ranking methods.

Researcher Checklist (Expanded)

Researcher Checklist

Strong: Papers with explicit human feedback

Coverage is strong (50% vs 45% target).
Gap: Papers reporting quality controls

Coverage is a replication risk (0% vs 30% target).
Gap: Papers naming benchmarks/datasets

Coverage is a replication risk (16.7% vs 35% target).
Moderate: Papers naming evaluation metrics

Coverage is usable but incomplete (33.3% vs 35% target).
Gap: Papers with known rater population

Coverage is a replication risk (16.7% vs 35% target).
Moderate: Papers with known annotation unit

Coverage is usable but incomplete (25% vs 35% target).

Strengths

Strong human-feedback signal (50% of papers).
Agentic evaluation appears in 100% of papers.

Known Gaps

Only 0% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (16.7% coverage).
Benchmark coverage is thin (16.7% of papers mention benchmarks/datasets).

Suggested Next Analyses

Pair this hub with a human_eval-heavy hub to validate judge-model calibration.
Stratify by benchmark (AdvBench vs APPS) before comparing methods.
Track metric sensitivity by reporting both cost and jailbreak success rate.

Recommended Queries (Expanded)

Recommended Queries

LLM-as-Judge Protocols Benchmark Slice: AdvBench Metric Slice: cost Recent High-Signal Papers

Start with These 3

Use these when you need one protocol anchor, one benchmark anchor, and one recent comparison point before reading the wider hub.

Strongest protocol reference

Jailbreak Foundry: From Papers to Runnable Attacks for Reproducible B…

Highest protocol score with explicit human/eval signal plus AdvBench.

Strongest benchmark reference

RLShield: Practical Multi-Agent RL for Financial Cyber Defense with A…

APPS with cost gives a fast comparison anchor.

Strongest recent paper

Decentralized Ranking Aggregation: Gossip Algorithms for Borda and Co…

Useful for current practice scanning; published Feb 26, 2026.

Start Here (Best First 6)

Ranked for protocol completeness (human signal, benchmark + metric anchors, quality controls, and judge/human overlap).

Jailbreak Foundry: From Papers to Runnable Attacks for Reproducible Benchmarking
Feb 27, 2026 · Citations: 0 · Score: 8.0

HF: Red Team · Eval: Llm As Judge · Benchmark: AdvBench · Metric: Success rate
RLShield: Practical Multi-Agent RL for Financial Cyber Defense with Attack-Surface MDPs and Real-Time Response Orchestration
Feb 26, 2026 · Citations: 0 · Score: 6.0

HF: Not reported · Eval: Automatic Metrics · Benchmark: APPS · Metric: Cost
Decentralized Ranking Aggregation: Gossip Algorithms for Borda and Copeland Consensus
Feb 26, 2026 · Citations: 0 · Score: 4.5

HF: Pairwise Preference · Eval: Not reported · Benchmark: Not Reported · Metric: Not Reported
The Vision Wormhole: Latent-Space Communication in Heterogeneous Multi-Agent Systems
Feb 17, 2026 · Citations: 0 · Score: 4.5

HF: Pairwise Preference · Eval: Not reported · Benchmark: Not Reported · Metric: Not Reported
GenDB: The Next Generation of Query Processing -- Synthesized, Not Engineered
Mar 2, 2026 · Citations: 0 · Score: 4.0

HF: Not reported · Eval: Automatic Metrics · Benchmark: Not Reported · Metric: Cost
VolleyBots: A Testbed for Multi-Drone Volleyball Game Combining Motion Control and Strategic Play
Feb 4, 2025 · Citations: 0 · Score: 3.9

HF: Demonstrations · Eval: Automatic Metrics, Simulation Env · Benchmark: Not Reported · Metric: Win rate

Protocol Matrix (Top 12)

Use this to quickly compare protocol ingredients instead of scanning long prose.

Paper	HF Signal	Eval Modes	Benchmarks	Metrics	QC
Jailbreak Foundry: From Papers to Runnable Attacks for Reproducible Benchmarking Feb 27, 2026	Yes Red Team	Llm As Judge	AdvBench , Jbf Eval	Success rate , Jailbreak success rate	Not Reported
RLShield: Practical Multi-Agent RL for Financial Cyber Defense with Attack-Surface MDPs and Real-Time Response Orchestration Feb 26, 2026	No Not Reported	Automatic Metrics	APPS	Cost	Not Reported
Decentralized Ranking Aggregation: Gossip Algorithms for Borda and Copeland Consensus Feb 26, 2026	Yes Pairwise Preference	Not Reported	Not Reported	Not Reported	Not Reported
The Vision Wormhole: Latent-Space Communication in Heterogeneous Multi-Agent Systems Feb 17, 2026	Yes Pairwise Preference	Not Reported	Not Reported	Not Reported	Not Reported
GenDB: The Next Generation of Query Processing -- Synthesized, Not Engineered Mar 2, 2026	No Not Reported	Automatic Metrics	Not Reported	Cost	Not Reported
VolleyBots: A Testbed for Multi-Drone Volleyball Game Combining Motion Control and Strategic Play Feb 4, 2025	Yes Demonstrations	Automatic Metrics , Simulation Env	Not Reported	Win rate	Not Reported
SPACeR: Self-Play Anchoring with Centralized Reference Models Oct 20, 2025	Yes Demonstrations	Simulation Env	Not Reported	Not Reported	Not Reported
CORE: Measuring Multi-Agent LLM Interaction Quality under Game-Theoretic Pressures Aug 16, 2025	Yes Pairwise Preference	Not Reported	Not Reported	Not Reported	Not Reported
MALLVI: A Multi-Agent Framework for Integrated Generalized Robotics Manipulation Feb 18, 2026	No Not Reported	Simulation Env	Not Reported	Not Reported	Not Reported
Training Generalizable Collaborative Agents via Strategic Risk Aversion Feb 25, 2026	No Not Reported	Automatic Metrics	Not Reported	Not Reported	Not Reported
Can Multimodal LLMs Perform Time Series Anomaly Detection? Feb 25, 2025	No Not Reported	Automatic Metrics	Not Reported	Not Reported	Not Reported
Multi-agent deep reinforcement learning with centralized training and decentralized execution for transportation infrastructure management Jan 23, 2024	No Not Reported	Simulation Env	Not Reported	Not Reported	Not Reported

Protocol Diff (Top Papers)

Fast side-by-side comparison for the highest-ranked papers in this hub.

Signal	Jailbreak Foundry: From Papers to Runnable Attacks…	RLShield: Practical Multi-Agent RL for Financial Cy…	Decentralized Ranking Aggregation: Gossip Algorithm…
Human Feedback	Red Team	Not reported	Pairwise Preference
Evaluation Modes	Llm As Judge	Automatic Metrics	Not reported
Benchmarks	AdvBench, Jbf Eval	APPS	Not reported
Metrics	Success rate, Jailbreak success rate	Cost	Not reported
Quality Controls	Not reported	Not reported	Not reported
Rater Population	Unknown	Unknown	Unknown
Annotation Unit	Unknown	Unknown	Ranking

Research Utility Snapshot

Human Feedback Mix

Pairwise Preference (3)
Demonstrations (2)
Red Team (1)

Evaluation Modes

Automatic Metrics (5)
Simulation Env (4)
Llm As Judge (1)

Top Benchmarks

AdvBench (1)
APPS (1)
Jbf Eval (1)

Top Metrics

Cost (2)
Jailbreak success rate (1)
Success rate (1)
Win rate (1)

Rater Population Mix

Domain Experts (2)

Quality Controls

Coverage diagnostics (sample-based): human-feedback 46.2% · benchmarks 15.4% · metrics 30.8% · quality controls 0.0%.

Top Papers

Jailbreak Foundry: From Papers to Runnable Attacks for Reproducible Benchmarking
Zhicheng Fang, Jingjie Zheng, Chenxu Fu, Wei Xu · Feb 27, 2026 · Citations: 0

Red Team Llm As Judge Multi Agent

Jailbreak techniques for large language models (LLMs) evolve faster than benchmarks, making robustness estimates stale and difficult to compare across papers due to drift in datasets, harnesses, and judging protocols.
VolleyBots: A Testbed for Multi-Drone Volleyball Game Combining Motion Control and Strategic Play
Zelai Xu, Ruize Zhang, Chao Yu, Huining Yuan, Xiangmin Yi · Feb 4, 2025 · Citations: 0

Demonstrations Automatic MetricsSimulation Env Multi Agent

We provide a comprehensive suite of tasks ranging from single-drone drills to multi-drone cooperative and competitive tasks, accompanied by baseline evaluations of representative reinforcement learning (RL), multi-agent reinforcement…
SPACeR: Self-Play Anchoring with Centralized Reference Models
Wei-Jer Chang, Akshay Rangesh, Kevin Joseph, Matthew Strong, Masayoshi Tomizuka · Oct 20, 2025 · Citations: 0

Demonstrations Simulation Env Multi Agent

Developing autonomous vehicles (AVs) requires not only safety and efficiency, but also realistic, human-like behaviors that are socially aware and predictable.
Decentralized Ranking Aggregation: Gossip Algorithms for Borda and Copeland Consensus
Anna Van Elst, Kerrian Le Caillec, Igor Colin, Stephan Clémençon · Feb 26, 2026 · Citations: 0

Pairwise Preference Multi Agent

The concept of ranking aggregation plays a central role in preference analysis, and numerous algorithms for calculating median rankings, often originating in social choice theory, have been documented in the literature, offering theoretical…
The Vision Wormhole: Latent-Space Communication in Heterogeneous Multi-Agent Systems
Xiaoze Liu, Ruowang Zhang, Weichen Yu, Siheng Xiong, Liu He · Feb 17, 2026 · Citations: 0

Pairwise Preference Multi Agent

Multi-Agent Systems (MAS) powered by Large Language Models have unlocked advanced collaborative reasoning, yet they remain shackled by the inefficiency of discrete text communication, which imposes significant runtime overhead and…
CORE: Measuring Multi-Agent LLM Interaction Quality under Game-Theoretic Pressures
Punya Syon Pandey, Yongjin Yang, Jiarui Liu, Zhijing Jin · Aug 16, 2025 · Citations: 0

Pairwise Preference Multi Agent

Game-theoretic interactions between agents with Large Language Models (LLMs) have revealed many emergent capabilities, yet the linguistic diversity of these interactions has not been sufficiently quantified.
RLShield: Practical Multi-Agent RL for Financial Cyber Defense with Attack-Surface MDPs and Real-Time Response Orchestration
Srikumar Nayak · Feb 26, 2026 · Citations: 0

Automatic Metrics Multi Agent

This paper proposes RLShield, a practical multi-agent RL pipeline for financial cyber defense.
GenDB: The Next Generation of Query Processing -- Synthesized, Not Engineered
Jiale Lao, Immanuel Trummer · Mar 2, 2026 · Citations: 0

Automatic Metrics Multi Agent

As a proof of concept, we present GenDB, an LLM-powered agentic system that generates instance-optimized and customized query execution code tailored to specific data, workloads, and hardware resources.
MALLVI: A Multi-Agent Framework for Integrated Generalized Robotics Manipulation
Iman Ahmadi, Mehrshad Taji, Arad Mahdinezhad Kashani, AmirHossein Jadidi, Saina Kashani · Feb 18, 2026 · Citations: 0

Simulation Env Multi Agent

MALLVI presents a Multi Agent Large Language and Vision framework that enables closed-loop feedback driven robotic manipulation.
Multi-agent deep reinforcement learning with centralized training and decentralized execution for transportation infrastructure management
M. Saifullah, K. G. Papakonstantinou, A. Bhattacharya, S. M. Stoffels, C. P. Andriotis · Jan 23, 2024 · Citations: 0

Simulation Env Multi Agent

To tackle the high dimensionality of state and action spaces, we propose DDMAC-CTDE, a Deep Decentralized Multi-Agent Actor-Critic (DDMAC) reinforcement learning architecture with Centralized Training and Decentralized Execution (CTDE).
Training Generalizable Collaborative Agents via Strategic Risk Aversion
Chengrui Qu, Yizhou Zhang, Nicholas Lanzetti, Eric Mazumdar · Feb 25, 2026 · Citations: 0

Automatic Metrics Multi Agent

Many emerging agentic paradigms require agents to collaborate with one another (or people) to achieve shared goals.
Can Multimodal LLMs Perform Time Series Anomaly Detection?
Xiongxiao Xu, Haoran Wang, Yueqing Liang, Philip S. Yu, Yue Zhao · Feb 25, 2025 · Citations: 0

Automatic Metrics Multi Agent

One natural way for humans to detect time series anomalies is through visualization and textual description.
SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning
Bo Liu, Leon Guertler, Simon Yu, Zichen Liu, Penghui Qi · Jun 30, 2025 · Citations: 0

Multi Agent

We introduce SPIRAL, a self-play framework where models learn by playing multi-turn, zero-sum games against continuously improving versions of themselves, generating an automatic curriculum of stronger opponents, and eliminating the need…

Related Hubs

Need human evaluators for your AI research? Scale annotation with expert AI Trainers.

Post a Job Get a Quote