HFEPX Hub

Multi Agent + Coding (Last 120 Days)

Updated from current HFEPX corpus (Mar 8, 2026). 13 papers are grouped in this hub page.

Common evaluation modes: Automatic Metrics, Simulation Env. Most common rater population: Domain Experts. Common annotation unit: Multi Dim Rubric. Frequent quality control: Calibration. Frequently cited benchmark: AdvBench. Common metric signal: cost. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Feb 27, 2026.

Papers: 13 · Last published: Feb 27, 2026

Multi Agent · Coding · Last 120d

Researcher Quick Triage

This hub is best used for protocol triage and replication planning from abstract-level evidence. Quality band: Developing.

High-Signal Coverage: 100.0% (13 / 13 sampled papers are not low-signal flagged).
Replication-Ready Set: 2 papers (benchmark + metric + evaluation mode explicitly present).
Judge/Human Comparability: 0 papers contain both `human_eval` and `llm_as_judge`, so none support judge-vs-human agreement analysis.
Quality Controls: 1 paper reports explicit quality controls (calibration/adjudication/IAA).
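
As a rough sketch of how the counts above can be reproduced, the Python snippet below re-encodes the twelve Protocol Matrix rows listed further down this page and applies the three stated filters. The `Paper` record, the shortened paper names, and tag spellings such as `automatic_metrics` and `simulation_env` are assumptions made for illustration (only `human_eval` and `llm_as_judge` appear verbatim on this page); this is not the pipeline HFEPX actually runs. The thirteenth paper is left out because the matrix does not list its benchmark or metric fields.

```python
from dataclasses import dataclass, field


@dataclass
class Paper:
    """Hypothetical record mirroring one Protocol Matrix row."""
    name: str
    eval_modes: set = field(default_factory=set)
    benchmarks: set = field(default_factory=set)
    metrics: set = field(default_factory=set)
    quality_controls: set = field(default_factory=set)


# Values copied from the Protocol Matrix; "Not Reported" becomes an empty set.
PAPERS = [
    Paper("Jailbreak Foundry", {"llm_as_judge"}, {"AdvBench", "Jbf Eval"},
          {"success rate", "jailbreak success rate"}),
    Paper("StitchCUDA", {"automatic_metrics"}, {"Kernelbench"}, {"success rate"}),
    Paper("Team of Thoughts", set(), {"LiveCodeBench"}, set(), {"calibration"}),
    Paper("SparkMe", {"automatic_metrics"}, set(), {"cost"}),
    Paper("PrivAct", {"automatic_metrics"}, set(), {"helpfulness"}),
    Paper("ViviDoc"),
    Paper("The Vision Wormhole"),
    Paper("GenDB", {"automatic_metrics"}, set(), {"cost"}),
    Paper("AgentDropoutV2", {"automatic_metrics"}, set(), {"accuracy"}),
    Paper("BLUFF", {"automatic_metrics"}, set(), {"f1"}),
    Paper("MALLVI", {"simulation_env"}),
    Paper("OR-Agent", {"simulation_env"}),
]


def is_replication_ready(p: Paper) -> bool:
    # Benchmark + metric + explicit evaluation mode must all be present.
    return bool(p.benchmarks and p.metrics and p.eval_modes)


def supports_judge_human(p: Paper) -> bool:
    # Both human evaluation and LLM-as-judge must be reported by the same paper.
    return {"human_eval", "llm_as_judge"} <= p.eval_modes


replication_ready = [p.name for p in PAPERS if is_replication_ready(p)]
judge_human = [p.name for p in PAPERS if supports_judge_human(p)]
with_qc = [p.name for p in PAPERS if p.quality_controls]

print(len(replication_ready), replication_ready)
print(len(judge_human), judge_human)
print(len(with_qc), with_qc)
```

Keeping "Not Reported" as an empty set means each filter is a plain truthiness or subset check, which makes it easy to tighten a definition (for example, requiring a quality control as well) without touching the data.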

Primary action: Use this page for scouting only; collect additional papers before attempting replication-critical comparisons.

Why This Matters For Eval Research

53.8% of papers report explicit human-feedback signals, led by expert verification.
Automatic metrics appear in 53.8% of papers in this hub.
AdvBench is the listed benchmark anchor for cross-paper comparisons on this page, though it appears in only 1 of 13 papers.

Protocol Takeaways

The most common quality-control signal is rater calibration (7.7% of papers).
Rater context is mostly domain experts, and annotation is commonly multi-dimensional rubrics; use this to scope replication staffing.
Pair this hub with a human_eval-heavy hub to validate judge-model calibration.

Benchmark Interpretation

AdvBench appears in 7.7% of hub papers (1/13); use this cohort for benchmark-matched comparisons.
Jbf-Eval appears in 7.7% of hub papers (1/13); use this cohort for benchmark-matched comparisons.

Metric Interpretation

Cost is reported in 15.4% of hub papers (2/13); compare with a secondary metric before ranking methods.
Success rate is reported in 15.4% of hub papers (2/13); compare with a secondary metric before ranking methods.

Researcher Checklist

Strong: Papers with explicit human feedback. Coverage is strong (53.8% vs 45% target).
Gap: Papers reporting quality controls. Coverage is a replication risk (7.7% vs 30% target).
Moderate: Papers naming benchmarks/datasets. Coverage is usable but incomplete (23.1% vs 35% target).
Strong: Papers naming evaluation metrics. Coverage is strong (53.8% vs 35% target).
Moderate: Papers with known rater population. Coverage is usable but incomplete (30.8% vs 35% target).
Moderate: Papers with known annotation unit. Coverage is usable but incomplete (23.1% vs 35% target).
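
The checklist labels can be read as coverage compared against a target. The sketch below is one possible way to reproduce them, not the rule this page necessarily uses: the per-item counts are inferred from the stated percentages over 13 papers, and the `gap_ratio` cutoff that separates Moderate from Gap is a hypothetical parameter chosen so that the output matches the labels shown here.

```python
TOTAL_PAPERS = 13  # hub size stated on this page

# item -> (papers covered, target %). Counts are inferred from the
# percentages on this page (e.g. 7/13 = 53.8%), not reported directly.
CHECKS = {
    "explicit human feedback": (7, 45),
    "quality controls": (1, 30),
    "benchmarks/datasets": (3, 35),
    "evaluation metrics": (7, 35),
    "known rater population": (4, 35),
    "known annotation unit": (3, 35),
}


def coverage_band(count, total, target_pct, gap_ratio=0.5):
    """Return (coverage %, band). gap_ratio is an assumed cutoff."""
    pct = round(100.0 * count / total, 1)
    if pct >= target_pct:
        band = "Strong"
    elif pct >= gap_ratio * target_pct:
        band = "Moderate"
    else:
        band = "Gap"
    return pct, band


bands = {
    item: coverage_band(count, TOTAL_PAPERS, target)
    for item, (count, target) in CHECKS.items()
}

for item, (pct, band) in bands.items():
    print(f"{band}: {item} ({pct}% vs {CHECKS[item][1]}% target)")
```

With a hub this small, one additional paper moves coverage by about 7.7 points, so a band can flip on a single paper; treat the labels as triage hints rather than stable measurements.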

Strengths

Strong human-feedback signal (53.8% of papers).
Agentic evaluation appears in 100% of papers.

Known Gaps

Only 7.7% of papers report quality controls; prioritize calibration/adjudication evidence.
Annotation unit is under-specified (23.1% coverage).
LLM-as-judge appears without enough inter-annotator agreement reporting.

Suggested Next Analyses

Stratify by benchmark (AdvBench vs Jbf-Eval) before comparing methods.
Track metric sensitivity by reporting both cost and success rate.
Add inter-annotator agreement checks when reproducing these protocols.
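
For the inter-annotator agreement check, a minimal sketch using Cohen's kappa for two raters is shown below. None of the papers in this hub is stated to use this statistic; it is offered as one common choice, and the rater labels are made-up toy data rather than results from any listed paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(labels_a)

    # Observed agreement: fraction of items where the raters match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement from each rater's marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    all_labels = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[l] * counts_b[l] for l in all_labels) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical label; agreement is trivial.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical pass/fail judgments on eight generated programs.
rater_a = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "pass"]
rater_b = ["pass", "fail", "fail", "fail", "pass", "fail", "pass", "pass"]

kappa = cohens_kappa(rater_a, rater_b)
print(f"kappa = {kappa:.2f}")  # kappa = 0.75
```

The same function can compare an LLM judge against a human rater by passing the judge's labels as one of the two lists, which is the comparison this hub currently has no papers for.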

Recommended Queries

LLM-as-Judge Protocols · Benchmark Slice: AdvBench · Metric Slice: cost · Recent High-Signal Papers

Start with These 3

Use these when you need one protocol anchor, one benchmark anchor, and one recent comparison point before reading the wider hub.

Strongest protocol reference

Jailbreak Foundry: From Papers to Runnable Attacks for Reproducible B…

Highest protocol score with explicit human/eval signal plus AdvBench.

Strongest benchmark reference

StitchCUDA: An Automated Multi-Agents End-to-End GPU Programing Frame…

Kernelbench with success rate gives a fast comparison anchor.

Strongest recent paper

Team of Thoughts: Efficient Test-time Scaling of Agentic Systems thro…

Useful for current practice scanning; published Feb 18, 2026.

Start Here (Best First 6)

Ranked for protocol completeness (human signal, benchmark + metric anchors, quality controls, and judge/human overlap).

Jailbreak Foundry: From Papers to Runnable Attacks for Reproducible Benchmarking
Feb 27, 2026 · Citations: 0 · Score: 8.0
HF: Red Team · Eval: Llm As Judge · Benchmark: AdvBench · Metric: Success rate

StitchCUDA: An Automated Multi-Agents End-to-End GPU Programing Framework with Rubric-based Agentic Reinforcement Learning
Mar 3, 2026 · Citations: 0 · Score: 8.0
HF: Rubric Rating · Eval: Automatic Metrics · Benchmark: Kernelbench · Metric: Success rate

Team of Thoughts: Efficient Test-time Scaling of Agentic Systems through Orchestrated Tool Calling
Feb 18, 2026 · Citations: 0 · Score: 8.0
HF: Expert Verification · Eval: Not Reported · Benchmark: LiveCodeBench · Metric: Not Reported

SparkMe: Adaptive Semi-Structured Interviewing for Qualitative Insight Discovery
Feb 24, 2026 · Citations: 0 · Score: 6.0
HF: Expert Verification · Eval: Automatic Metrics · Benchmark: Not Reported · Metric: Cost

PrivAct: Internalizing Contextual Privacy Preservation via Multi-Agent Preference Training
Feb 14, 2026 · Citations: 0 · Score: 6.0
HF: Pairwise Preference · Eval: Automatic Metrics · Benchmark: Not Reported · Metric: Helpfulness

Demonstrating ViviDoc: Generating Interactive Documents through Human-Agent Collaboration
Mar 2, 2026 · Citations: 0 · Score: 4.5
HF: Expert Verification · Eval: Not Reported · Benchmark: Not Reported · Metric: Not Reported

Protocol Matrix (Top 12)

Use this to quickly compare protocol ingredients instead of scanning long prose.

Paper	HF Signal	Eval Modes	Benchmarks	Metrics	QC
Jailbreak Foundry: From Papers to Runnable Attacks for Reproducible Benchmarking · Feb 27, 2026	Yes Red Team	Llm As Judge	AdvBench, Jbf Eval	Success rate, Jailbreak success rate	Not Reported
StitchCUDA: An Automated Multi-Agents End-to-End GPU Programing Framework with Rubric-based Agentic Reinforcement Learning · Mar 3, 2026	Yes Rubric Rating	Automatic Metrics	Kernelbench	Success rate	Not Reported
Team of Thoughts: Efficient Test-time Scaling of Agentic Systems through Orchestrated Tool Calling · Feb 18, 2026	Yes Expert Verification	Not Reported	LiveCodeBench	Not Reported	Calibration
SparkMe: Adaptive Semi-Structured Interviewing for Qualitative Insight Discovery · Feb 24, 2026	Yes Expert Verification	Automatic Metrics	Not Reported	Cost	Not Reported
PrivAct: Internalizing Contextual Privacy Preservation via Multi-Agent Preference Training · Feb 14, 2026	Yes Pairwise Preference	Automatic Metrics	Not Reported	Helpfulness	Not Reported
Demonstrating ViviDoc: Generating Interactive Documents through Human-Agent Collaboration · Mar 2, 2026	Yes Expert Verification	Not Reported	Not Reported	Not Reported	Not Reported
The Vision Wormhole: Latent-Space Communication in Heterogeneous Multi-Agent Systems · Feb 17, 2026	Yes Pairwise Preference	Not Reported	Not Reported	Not Reported	Not Reported
GenDB: The Next Generation of Query Processing -- Synthesized, Not Engineered · Mar 2, 2026	No Not Reported	Automatic Metrics	Not Reported	Cost	Not Reported
AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning · Feb 26, 2026	No Not Reported	Automatic Metrics	Not Reported	Accuracy	Not Reported
BLUFF: Benchmarking the Detection of False and Synthetic Content across 58 Low-Resource Languages · Feb 28, 2026	No Not Reported	Automatic Metrics	Not Reported	F1	Not Reported
MALLVI: A Multi-Agent Framework for Integrated Generalized Robotics Manipulation · Feb 18, 2026	No Not Reported	Simulation Env	Not Reported	Not Reported	Not Reported
OR-Agent: Bridging Evolutionary Search and Structured Research for Automated Algorithm Discovery · Feb 14, 2026	No Not Reported	Simulation Env	Not Reported	Not Reported	Not Reported

Protocol Diff (Top Papers)

Fast side-by-side comparison for the highest-ranked papers in this hub.

Signal	Jailbreak Foundry: From Papers to Runnable Attacks…	StitchCUDA: An Automated Multi-Agents End-to-End GP…	Team of Thoughts: Efficient Test-time Scaling of Ag…
Human Feedback	Red Team	Rubric Rating	Expert Verification
Evaluation Modes	Llm As Judge	Automatic Metrics	Not reported
Benchmarks	AdvBench, Jbf Eval	Kernelbench	LiveCodeBench
Metrics	Success rate, Jailbreak success rate	Success rate	Not reported
Quality Controls	Not reported	Not reported	Calibration
Rater Population	Unknown	Unknown	Domain Experts
Annotation Unit	Unknown	Multi Dim Rubric	Unknown

Research Utility Snapshot

Human Feedback Mix

Expert Verification (3)
Pairwise Preference (2)
Red Team (1)
Rubric Rating (1)

Evaluation Modes

Automatic Metrics (7)
Simulation Env (2)
Llm As Judge (1)

Top Benchmarks

AdvBench (1)
Jbf Eval (1)
Kernelbench (1)
LiveCodeBench (1)

Top Metrics

Cost (2)
Success rate (2)
Accuracy (1)
F1 (1)

Rater Population Mix

Domain Experts (4)

Quality Controls

Calibration (1)

Coverage diagnostics (sample-based): human-feedback 53.8% · benchmarks 23.1% · metrics 53.8% · quality controls 7.7%.

Top Papers

Jailbreak Foundry: From Papers to Runnable Attacks for Reproducible Benchmarking
Zhicheng Fang, Jingjie Zheng, Chenxu Fu, Wei Xu · Feb 27, 2026 · Citations: 0
Red Team · Llm As Judge · Multi Agent
Jailbreak techniques for large language models (LLMs) evolve faster than benchmarks, making robustness estimates stale and difficult to compare across papers due to drift in datasets, harnesses, and judging protocols.

StitchCUDA: An Automated Multi-Agents End-to-End GPU Programing Framework with Rubric-based Agentic Reinforcement Learning
Shiyang Li, Zijian Zhang, Winson Chen, Yuebo Luo, Mingyi Hong · Mar 3, 2026 · Citations: 0
Rubric Rating · Automatic Metrics · Multi Agent
To address the challenge, in this work, we propose StitchCUDA, a multi-agent framework for end-to-end GPU program generation, with three specialized agents: a Planner to orchestrate whole system design, a Coder dedicated to implementing it…

Team of Thoughts: Efficient Test-time Scaling of Agentic Systems through Orchestrated Tool Calling
Jeffrey T. H. Wong, Zixi Zhang, Junyi Liu, Yiren Zhao · Feb 18, 2026 · Citations: 0
Expert Verification · Multi Agent
Existing Multi-Agent Systems (MAS) typically rely on static, homogeneous model configurations, limiting their ability to exploit the distinct strengths of differently post-trained models.

SparkMe: Adaptive Semi-Structured Interviewing for Qualitative Insight Discovery
David Anugraha, Vishakh Padmakumar, Diyi Yang · Feb 24, 2026 · Citations: 0
Expert Verification · Automatic Metrics · Multi Agent
Based on this formulation, we introduce SparkMe, a multi-agent LLM interviewer that performs deliberative planning via simulated conversation rollouts to select questions with high expected utility.

PrivAct: Internalizing Contextual Privacy Preservation via Multi-Agent Preference Training
Yuhan Cheng, Hancheng Ye, Hai Helen Li, Jingwei Sun, Yiran Chen · Feb 14, 2026 · Citations: 0
Pairwise Preference · Automatic Metrics · Multi Agent
We propose PrivAct, a contextual privacy-aware multi-agent learning framework that internalizes contextual privacy preservation directly into models' generation behavior for privacy-compliant agentic actions.

Demonstrating ViviDoc: Generating Interactive Documents through Human-Agent Collaboration
Yinghao Tang, Yupeng Xie, Yingchaojie Feng, Tingfeng Lan, Wei Chen · Mar 2, 2026 · Citations: 0
Expert Verification · Multi Agent
Recent LLM-based agents can automate content creation, but naively applying them yields uncontrollable and unverifiable outputs.

The Vision Wormhole: Latent-Space Communication in Heterogeneous Multi-Agent Systems
Xiaoze Liu, Ruowang Zhang, Weichen Yu, Siheng Xiong, Liu He · Feb 17, 2026 · Citations: 0
Pairwise Preference · Multi Agent
Multi-Agent Systems (MAS) powered by Large Language Models have unlocked advanced collaborative reasoning, yet they remain shackled by the inefficiency of discrete text communication, which imposes significant runtime overhead and…

GenDB: The Next Generation of Query Processing -- Synthesized, Not Engineered
Jiale Lao, Immanuel Trummer · Mar 2, 2026 · Citations: 0
Automatic Metrics · Multi Agent
As a proof of concept, we present GenDB, an LLM-powered agentic system that generates instance-optimized and customized query execution code tailored to specific data, workloads, and hardware resources.

MALLVI: A Multi-Agent Framework for Integrated Generalized Robotics Manipulation
Iman Ahmadi, Mehrshad Taji, Arad Mahdinezhad Kashani, AmirHossein Jadidi, Saina Kashani · Feb 18, 2026 · Citations: 0
Simulation Env · Multi Agent
MALLVI presents a Multi Agent Large Language and Vision framework that enables closed-loop feedback driven robotic manipulation.

OR-Agent: Bridging Evolutionary Search and Structured Research for Automated Algorithm Discovery
Qi Liu, Ruochen Hao, Can Li, Wanjing Ma · Feb 14, 2026 · Citations: 0
Simulation Env · Multi Agent
We present OR-Agent, a configurable multi-agent research framework designed for automated exploration in rich experimental environments.

AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning
Yutong Wang, Siyuan Xiong, Xuebo Liu, Wenkang Zhou, Liang Ding · Feb 26, 2026 · Citations: 0
Automatic Metrics · Multi Agent
We propose AgentDropoutV2, a test-time rectify-or-reject pruning framework designed to dynamically optimize MAS information flow without retraining.

BLUFF: Benchmarking the Detection of False and Synthetic Content across 58 Low-Resource Languages
Jason Lucas, Matt Murtagh-White, Adaku Uchendu, Ali Al-Lawati, Michiharu Yamashita · Feb 28, 2026 · Citations: 0
Automatic Metrics · Multi Agent
We introduce BLUFF, a comprehensive benchmark for detecting false and synthetic content, spanning 79 languages with over 202K samples, combining human-written fact-checked content (122K+ samples across 57 languages) and LLM-generated…

A Hierarchical Multi-Agent System for Autonomous Discovery in Geoscientific Data Archives
Dmitrii Pantiukhin, Ivan Kuznetsov, Boris Shapkin, Antonia Anna Jost, Thomas Jung · Feb 24, 2026 · Citations: 0
Automatic Metrics · Long Horizon
Here we present PANGAEA-GPT, a hierarchical multi-agent framework designed for autonomous data discovery and analysis.
