HFEPX Hub

Multi Agent + Automatic Metrics (Last 60 Days)

Updated from current HFEPX corpus (Mar 1, 2026). 12 papers are grouped in this hub page.

Read Full Context

Updated from current HFEPX corpus (Mar 1, 2026). 12 papers are grouped in this hub page. Common evaluation modes: Automatic Metrics, Llm As Judge. Most common rater population: Domain Experts. Common annotation unit: Ranking. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Feb 24, 2026.

Papers: 12 Last published: Feb 24, 2026 Global RSS Tag RSS

Multi AgentAutomatic MetricsLast 60d

Researcher Quick Triage

This hub is best used for protocol triage and replication planning from abstract-level evidence. Quality band: Developing .

All Sampled Papers (12) Replication-Ready Only (0)

High-Signal Coverage

100.0%

12 / 12 sampled papers are not low-signal flagged.

Replication-Ready Set

Benchmark + metric + eval mode explicitly present.

Judge/Human Comparability

Papers containing both `human_eval` and `llm_as_judge`.

0 papers are replication-ready (benchmark + metric + explicit evaluation mode).
0 papers support judge-vs-human agreement analysis.
0 papers report explicit quality controls (calibration/adjudication/IAA).

Primary action: Use this page for scouting only; collect additional papers before attempting replication-critical comparisons.

Why This Matters (Expanded)

Why This Matters For Eval Research

16.7% of papers report explicit human-feedback signals, led by expert verification.
automatic metrics appears in 100% of papers in this hub.
multi-agent setups appears in 100% of papers, indicating agentic evaluation demand.

Protocol Notes (Expanded)

Protocol Takeaways

Quality-control reporting is sparse in this slice; prioritize papers with explicit calibration or adjudication steps.
Rater context is mostly domain experts, and annotation is commonly ranking annotation; use this to scope replication staffing.
Pair this hub with a human_eval-heavy hub to validate judge-model calibration.

Metric Interpretation

accuracy is reported in 50% of hub papers (6/12); compare with a secondary metric before ranking methods.
cost is reported in 16.7% of hub papers (2/12); compare with a secondary metric before ranking methods.

Researcher Checklist (Expanded)

Researcher Checklist

Gap: Papers with explicit human feedback

Coverage is a replication risk (16.7% vs 45% target).
Gap: Papers reporting quality controls

Coverage is a replication risk (0% vs 30% target).
Gap: Papers naming benchmarks/datasets

Coverage is a replication risk (0% vs 35% target).
Strong: Papers naming evaluation metrics

Coverage is strong (83.3% vs 35% target).
Gap: Papers with known rater population

Coverage is a replication risk (8.3% vs 35% target).
Gap: Papers with known annotation unit

Coverage is a replication risk (16.7% vs 35% target).

Strengths

Agentic evaluation appears in 100% of papers.

Known Gaps

Only 0% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (8.3% coverage).
Annotation unit is under-specified (16.7% coverage).

Suggested Next Analyses

Pair this hub with a human_eval-heavy hub to validate judge-model calibration.
Track metric sensitivity by reporting both accuracy and cost.

Recommended Queries (Expanded)

Recommended Queries

LLM-as-Judge Protocols Metric Slice: accuracy Recent High-Signal Papers

Start Here (Best First 6)

Ranked for protocol completeness (human signal, benchmark + metric anchors, quality controls, and judge/human overlap).

SparkMe: Adaptive Semi-Structured Interviewing for Qualitative Insight Discovery
Feb 24, 2026 · Citations: 0 · Score: 6.0

HF: Expert Verification · Eval: Automatic Metrics · Benchmark: Not Reported · Metric: Cost
PrivAct: Internalizing Contextual Privacy Preservation via Multi-Agent Preference Training
Feb 14, 2026 · Citations: 0 · Score: 6.0

HF: Pairwise Preference · Eval: Automatic Metrics · Benchmark: Not Reported · Metric: Helpfulness
The Emergence of Lab-Driven Alignment Signatures: A Psychometric Framework for Auditing Latent Bias and Compounding Risk in Generative AI
Feb 19, 2026 · Citations: 0 · Score: 4.0

HF: Not reported · Eval: Llm As Judge, Automatic Metrics · Benchmark: Not Reported · Metric: Accuracy
Evaluating Chain-of-Thought Reasoning through Reusability and Verifiability
Feb 19, 2026 · Citations: 0 · Score: 4.0

HF: Not reported · Eval: Automatic Metrics · Benchmark: Not Reported · Metric: Accuracy
AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning
Feb 26, 2026 · Citations: 0 · Score: 4.0

HF: Not reported · Eval: Automatic Metrics · Benchmark: Not Reported · Metric: Accuracy
A Multi-Agent Framework for Medical AI: Leveraging Fine-Tuned GPT, LLaMA, and DeepSeek R1 for Evidence-Based and Bias-Aware Clinical Query Processing
Feb 15, 2026 · Citations: 0 · Score: 4.0

HF: Not reported · Eval: Automatic Metrics · Benchmark: Not Reported · Metric: Accuracy

Protocol Matrix (Top 12)

Use this to quickly compare protocol ingredients instead of scanning long prose.

Paper	HF Signal	Eval Modes	Benchmarks	Metrics	QC
SparkMe: Adaptive Semi-Structured Interviewing for Qualitative Insight Discovery Feb 24, 2026	Yes Expert Verification	Automatic Metrics	Not Reported	Cost	Not Reported
PrivAct: Internalizing Contextual Privacy Preservation via Multi-Agent Preference Training Feb 14, 2026	Yes Pairwise Preference	Automatic Metrics	Not Reported	Helpfulness	Not Reported
The Emergence of Lab-Driven Alignment Signatures: A Psychometric Framework for Auditing Latent Bias and Compounding Risk in Generative AI Feb 19, 2026	No Not Reported	Llm As Judge , Automatic Metrics	Not Reported	Accuracy	Not Reported
Evaluating Chain-of-Thought Reasoning through Reusability and Verifiability Feb 19, 2026	No Not Reported	Automatic Metrics	Not Reported	Accuracy	Not Reported
AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning Feb 26, 2026	No Not Reported	Automatic Metrics	Not Reported	Accuracy	Not Reported
A Multi-Agent Framework for Medical AI: Leveraging Fine-Tuned GPT, LLaMA, and DeepSeek R1 for Evidence-Based and Bias-Aware Clinical Query Processing Feb 15, 2026	No Not Reported	Automatic Metrics	Not Reported	Accuracy , Bleu	Not Reported
Hierarchical LLM-Based Multi-Agent Framework with Prompt Optimization for Multi-Robot Task Planning Feb 25, 2026	No Not Reported	Automatic Metrics	Not Reported	Accuracy , Success rate	Not Reported
The Headless Firm: How AI Reshapes Enterprise Boundaries Feb 24, 2026	No Not Reported	Automatic Metrics	Not Reported	Throughput , Cost	Not Reported
SAMAS: A Spectrum-Guided Multi-Agent System for Achieving Style Fidelity in Literary Translation Feb 23, 2026	No Not Reported	Automatic Metrics	Not Reported	Accuracy	Not Reported
Whisper: Courtside Edition Enhancing ASR Performance Through LLM-Driven Context Generation Feb 21, 2026	No Not Reported	Automatic Metrics	Not Reported	Error rate , Wer	Not Reported
Training Generalizable Collaborative Agents via Strategic Risk Aversion Feb 25, 2026	No Not Reported	Automatic Metrics	Not Reported	Not Reported	Not Reported
A Hierarchical Multi-Agent System for Autonomous Discovery in Geoscientific Data Archives Feb 24, 2026	No Not Reported	Automatic Metrics	Not Reported	Not Reported	Not Reported

Protocol Diff (Top Papers)

Fast side-by-side comparison for the highest-ranked papers in this hub.

Signal	SparkMe: Adaptive Semi-Structured Interviewing for…	PrivAct: Internalizing Contextual Privacy Preservat…	The Emergence of Lab-Driven Alignment Signatures: A…
Human Feedback	Expert Verification	Pairwise Preference	Not reported
Evaluation Modes	Automatic Metrics	Automatic Metrics	Llm As Judge, Automatic Metrics
Benchmarks	Not reported	Not reported	Not reported
Metrics	Cost	Helpfulness	Accuracy
Quality Controls	Not reported	Not reported	Not reported
Rater Population	Domain Experts	Unknown	Unknown
Annotation Unit	Trajectory	Unknown	Unknown

Research Utility Snapshot

Human Feedback Mix

Expert Verification (1)
Pairwise Preference (1)

Evaluation Modes

Automatic Metrics (12)
Llm As Judge (1)

Top Benchmarks

Top Metrics

Accuracy (6)
Cost (2)
Bleu (1)
Error rate (1)

Rater Population Mix

Domain Experts (1)

Quality Controls

Coverage diagnostics (sample-based): human-feedback 16.7% · benchmarks 0.0% · metrics 83.3% · quality controls 0.0%.

Top Papers

SparkMe: Adaptive Semi-Structured Interviewing for Qualitative Insight Discovery
David Anugraha, Vishakh Padmakumar, Diyi Yang · Feb 24, 2026 · Citations: 0

Expert Verification Automatic Metrics Multi Agent

Based on this formulation, we introduce SparkMe, a multi-agent LLM interviewer that performs deliberative planning via simulated conversation rollouts to select questions with high expected utility.
PrivAct: Internalizing Contextual Privacy Preservation via Multi-Agent Preference Training
Yuhan Cheng, Hancheng Ye, Hai Helen Li, Jingwei Sun, Yiran Chen · Feb 14, 2026 · Citations: 0

Pairwise Preference Automatic Metrics Multi Agent

We propose PrivAct, a contextual privacy-aware multi-agent learning framework that internalizes contextual privacy preservation directly into models' generation behavior for privacy-compliant agentic actions.
The Emergence of Lab-Driven Alignment Signatures: A Psychometric Framework for Auditing Latent Bias and Compounding Risk in Generative AI
Dusan Bosnjakovic · Feb 19, 2026 · Citations: 0

Llm As JudgeAutomatic Metrics Multi Agent

As Large Language Models (LLMs) transition from standalone chat interfaces to foundational reasoning layers in multi-agent systems and recursive evaluation loops (LLM-as-a-judge), the detection of durable, provider-level behavioral…
Evaluating Chain-of-Thought Reasoning through Reusability and Verifiability
Shashank Aggarwal, Ram Vikas Mishra, Amit Awekar · Feb 19, 2026 · Citations: 0

Automatic Metrics Multi Agent

In multi-agent IR pipelines for tasks such as search and ranking, LLM-based agents exchange intermediate reasoning in terms of Chain-of-Thought (CoT) with each other.
AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning
Yutong Wang, Siyuan Xiong, Xuebo Liu, Wenkang Zhou, Liang Ding · Feb 26, 2026 · Citations: 0

Automatic Metrics Multi Agent

We propose AgentDropoutV2, a test-time rectify-or-reject pruning framework designed to dynamically optimize MAS information flow without retraining.
A Multi-Agent Framework for Medical AI: Leveraging Fine-Tuned GPT, LLaMA, and DeepSeek R1 for Evidence-Based and Bias-Aware Clinical Query Processing
Naeimeh Nourmohammadi, Md Meem Hossain, The Anh Han, Safina Showkat Ara, Zia Ush Shamszaman · Feb 15, 2026 · Citations: 0

Automatic Metrics Multi Agent

We propose a multi-agent medical QA framework that combines complementary LLMs with evidence retrieval, uncertainty estimation, and bias checks to improve answer reliability.
Hierarchical LLM-Based Multi-Agent Framework with Prompt Optimization for Multi-Robot Task Planning
Tomoya Kawabe, Rin Takano · Feb 25, 2026 · Citations: 0

Automatic Metrics Long Horizon

We present a hierarchical multi-agent LLM-based planner with prompt optimization: an upper layer decomposes tasks and assigns them to lower-layer agents, which generate PDDL problems solved by a classical planner.
The Headless Firm: How AI Reshapes Enterprise Boundaries
Tassilo Klein, Sebastian Wieczorek · Feb 24, 2026 · Citations: 0

Automatic Metrics Multi Agent

We argue that agentic AI induces a structural change in how coordination costs scale: in prior modular systems, integration cost grew with interaction topology (O(n^2) in the number of components); in protocol-mediated agentic systems, inte
SAMAS: A Spectrum-Guided Multi-Agent System for Achieving Style Fidelity in Literary Translation
Jingzhuo Wu, Jiajun Zhang, Keyan Jin, Dehua Ma, Junbo Wang · Feb 23, 2026 · Citations: 0

Automatic Metrics Multi Agent

To address this, we introduce the Style-Adaptive Multi-Agent System (SAMAS), a novel framework that treats style preservation as a signal processing task.
Whisper: Courtside Edition Enhancing ASR Performance Through LLM-Driven Context Generation
Yonathan Ron, Shiri Gilboa, Tammuz Dubnov · Feb 21, 2026 · Citations: 0

Automatic Metrics Multi Agent

We introduce Whisper: Courtside Edition, a novel multi-agent large language model (LLM) pipeline that enhances Whisper transcriptions without retraining.
Training Generalizable Collaborative Agents via Strategic Risk Aversion
Chengrui Qu, Yizhou Zhang, Nicholas Lanzetti, Eric Mazumdar · Feb 25, 2026 · Citations: 0

Automatic Metrics Multi Agent

Many emerging agentic paradigms require agents to collaborate with one another (or people) to achieve shared goals.
A Hierarchical Multi-Agent System for Autonomous Discovery in Geoscientific Data Archives
Dmitrii Pantiukhin, Ivan Kuznetsov, Boris Shapkin, Antonia Anna Jost, Thomas Jung · Feb 24, 2026 · Citations: 0

Automatic Metrics Long Horizon

Here we present PANGAEA-GPT, a hierarchical multi-agent framework designed for autonomous data discovery and analysis.

Related Hubs

Need human evaluators for your AI research? Scale annotation with expert AI Trainers.

Post a Job Get a Quote