
Metric Hub

Success Rate + Automatic Metrics Metric Papers

Updated from the current HFEPX corpus (Feb 27, 2026). This metric page groups 12 papers. Common evaluation mode: Automatic Metrics. Most common rater population: Domain Experts. Frequently cited benchmark: AIME. Common metric signal: success rate. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. The newest paper in this set is from Feb 26, 2026.

Papers: 12 · Last published: Feb 26, 2026

Research Narrative

Grounded narrative · Model: deterministic-grounded · Source: persisted

Updated from the current HFEPX corpus (Feb 27, 2026). This page tracks 12 papers for Success Rate + Automatic Metrics Metric Papers. Dominant protocol signals include automatic metrics, with frequent benchmark focus on AIME and Re-Bench and metric focus on success rate and jailbreak success rate. Use the grounded sections below to prioritize reproducible protocol choices, benchmark-matched comparisons, and judge-vs-human evaluation checks (a minimal agreement-check sketch follows).
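
Where a paper reports both automatic judge scores and human labels, a judge-vs-human check can be as simple as measuring agreement on a shared sample. The sketch below is a minimal illustration under assumed data, not any hub paper's protocol; the binary pass/fail labels are hypothetical.

```python
from collections import Counter

# Hypothetical binary labels (1 = pass, 0 = fail) on the same items,
# once from an automatic LLM judge and once from human raters.
# All values are illustrative assumptions.
judge = [1, 1, 0, 1, 0, 1, 1, 0, 1, 0]
human = [1, 0, 0, 1, 0, 1, 1, 1, 1, 0]

n = len(judge)
agreement = sum(j == h for j, h in zip(judge, human)) / n

# Cohen's kappa: raw agreement corrected for chance agreement,
# given each rater's marginal label rates.
judge_counts, human_counts = Counter(judge), Counter(human)
p_chance = sum(
    (judge_counts[c] / n) * (human_counts[c] / n)
    for c in set(judge) | set(human)
)
kappa = (agreement - p_chance) / (1 - p_chance)

print(f"raw agreement = {agreement:.2f}, Cohen's kappa = {kappa:.2f}")
```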

Why This Matters For Eval Research

Protocol Takeaways

Benchmark Interpretation

  • AIME appears in 8.3% of hub papers (1/12); use this cohort for benchmark-matched comparisons.
  • Re-Bench appears in 8.3% of hub papers (1/12); use this cohort for benchmark-matched comparisons.

Metric Interpretation

  • success rate is reported in 100% of hub papers (12/12); compare with a secondary metric before ranking methods.
  • jailbreak success rate is reported in 50% of hub papers (6/12); compare with a secondary metric before ranking methods (a minimal ranking sketch follows this list).
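
To make the secondary-metric guidance concrete, the sketch below ranks candidate methods by success rate while keeping jailbreak success rate visible as a safety check rather than ranking on one number alone. Method names and values are illustrative assumptions, not figures drawn from the hub papers.

```python
# Hypothetical results table: success rate (higher is better) and
# jailbreak success rate (lower is better). Values are illustrative only.
results = {
    "method_a": {"success_rate": 0.82, "jailbreak_success_rate": 0.12},
    "method_b": {"success_rate": 0.81, "jailbreak_success_rate": 0.03},
    "method_c": {"success_rate": 0.74, "jailbreak_success_rate": 0.02},
}

# Rank on the primary metric, using the secondary metric as a tiebreaker,
# so a near-tie on success rate is not resolved without a safety comparison.
ranked = sorted(
    results.items(),
    key=lambda kv: (-kv[1]["success_rate"], kv[1]["jailbreak_success_rate"]),
)

for name, metrics in ranked:
    print(
        f"{name}: success_rate={metrics['success_rate']:.2f}, "
        f"jailbreak_success_rate={metrics['jailbreak_success_rate']:.2f}"
    )
```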

Abstract Evidence Highlights

Direct snippets from paper abstracts to ground protocol and benchmark interpretation.

Human-eval abstract signal: Current approaches to developing persuasive dialogue agents often rely on a limited set of predefined persuasive strategies that fail to capture the complexity of real-world interactions.

Human-eval abstract signal: Multi-robot task planning requires decomposing natural-language instructions into executable actions for heterogeneous robot teams.

AIME benchmark signal: We validated our proposed framework through experiments on two distinct datasets: the Persuasion for Good dataset, which represents a specific in-domain scenario, and the DailyPersuasion dataset, which encompasses a wide range of scenarios.

success rate metric signal: When plans fail, the system applies TextGrad-inspired textual-gradient updates to optimize each agent's prompt and thereby improve planning accuracy.

Protocol abstract signal: Current stateless defences for multimodal agentic RAG fail to detect adversarial strategies that distribute malicious semantics across retrieval, planning, and generation components.

Protocol abstract signal: Editing Large language models (LLMs) with real-world, unstructured knowledge is essential for correcting and updating their internal parametric knowledge.

Protocol abstract signal: Defending LLMs against adversarial jailbreak attacks remains an open challenge.

Protocol abstract signal: To estimate safety risk at training time, we explore two distinct approaches: a judge-based Safety Critic that assigns high-level harm scores to training batches, and an activation-based risk predictor built with a lightweight classifier...

Researcher Checklist

  • Tighten coverage on Papers with explicit human feedback. Coverage is usable but incomplete (41.7% vs 45% target).
  • Close gap on Papers reporting quality controls. Coverage is a replication risk (0% vs 30% target).
  • Tighten coverage on Papers naming benchmarks/datasets. Coverage is usable but incomplete (25% vs 35% target).
  • Maintain strength on Papers naming evaluation metrics. Coverage is strong (100% vs 35% target).
  • Close gap on Papers with known rater population. Coverage is a replication risk (8.3% vs 35% target).
  • Close gap on Papers with known annotation unit. Coverage is a replication risk (0% vs 35% target; see the triage sketch after this list).
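
The checklist verdicts follow a coverage-vs-target comparison. The sketch below reproduces that triage under an assumed 15-point margin between "usable but incomplete" and "replication risk"; the margin is a guess chosen to match the verdicts on this page, not a documented rule of the hub.

```python
# Coverage and target figures copied from the Researcher Checklist above.
CHECKLIST = [
    ("Papers with explicit human feedback", 41.7, 45.0),
    ("Papers reporting quality controls", 0.0, 30.0),
    ("Papers naming benchmarks/datasets", 25.0, 35.0),
    ("Papers naming evaluation metrics", 100.0, 35.0),
    ("Papers with known rater population", 8.3, 35.0),
    ("Papers with known annotation unit", 0.0, 35.0),
]

def triage(coverage: float, target: float, margin: float = 15.0) -> str:
    """Classify coverage relative to its target (margin is an assumption)."""
    if coverage >= target:
        return "strong"
    if target - coverage <= margin:
        return "usable but incomplete"
    return "replication risk"

for item, coverage, target in CHECKLIST:
    verdict = triage(coverage, target)
    print(f"{item}: {coverage:.1f}% vs {target:.0f}% target -> {verdict}")
```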

Suggested Reading Order

  1. Enhancing Persuasive Dialogue Agents by Synthesizing Cross-Disciplinary Communication Strategies

    Start here for detailed protocol reporting, including rater and quality-control evidence.

  2. Hierarchical LLM-Based Multi-Agent Framework with Prompt Optimization for Multi-Robot Task Planning

    Also provides detailed protocol reporting, including rater and quality-control evidence.

  3. Adversarial Intent is a Latent Variable: Stateful Trust Inference for Securing Multimodal Agentic RAG

    Also provides detailed protocol reporting, including rater and quality-control evidence.

  4. Uncovering Context Reliance in Unstructured Knowledge Editing

    Adds automatic metrics for broader coverage within this hub.

  5. MANATEE: Inference-Time Lightweight Diffusion Based Safety Defense for LLMs

    Adds automatic metrics with red-team protocols for broader coverage within this hub.

  6. Learning to Stay Safe: Adaptive Regularization Against Safety Degradation during Fine-Tuning

    Adds automatic metrics for broader coverage within this hub.

  7. Evolutionary System Prompt Learning for Reinforcement Learning in LLMs

    Adds automatic metrics for broader coverage within this hub.

  8. What Matters For Safety Alignment?

    Adds automatic metrics with red-team protocols for broader coverage within this hub.

Known Limitations

  • No papers (0%) report quality controls; prioritize calibration/adjudication evidence.
  • Rater population is under-specified (8.3% coverage).
  • Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Links

Benchmark Brief

AIME

Coverage: 1 paper (8.3%)

1 paper (8.3%) mentions AIME.

Examples: Evolutionary System Prompt Learning for Reinforcement Learning in LLMs

Benchmark Brief

Re-Bench

Coverage: 1 paper (8.3%)

1 paper (8.3%) mentions Re-Bench.

Examples: Measuring AI Ability to Complete Long Software Tasks

Benchmark Brief

Retrieval

Coverage: 1 paper (8.3%)

1 paper (8.3%) mentions Retrieval.

Examples: Adversarial Intent is a Latent Variable: Stateful Trust Inference for Securing Multimodal Agentic RAG

Top Papers Reporting This Metric
