Metric Hub

Success Rate In CS.AI Papers

Updated from current HFEPX corpus (Feb 27, 2026). 13 papers are grouped on this metric page. Common evaluation modes: Automatic Metrics, Simulation Env. Most common rater population: Domain Experts. Common annotation unit: Trajectory. Named benchmarks include AIME (1 of 13 papers). Common metric signal: success rate. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. The newest paper in this set is from Feb 25, 2026.

Papers: 13 · Last published: Feb 25, 2026

Research Narrative

Updated from current HFEPX corpus (Feb 27, 2026). This page tracks 13 papers for Success Rate In CS.AI Papers. Dominant protocol signals include automatic metrics, simulation environments, and LLM-as-judge. Named benchmarks include AIME and APPS (one paper each), and the metric focus is on success rate and jailbreak success rate. Use the grounded sections below to prioritize reproducible protocol choices, benchmark-matched comparisons, and judge-vs-human evaluation checks.

Why This Matters For Eval Research

46.2% of papers report explicit human-feedback signals, led by red-team protocols.

Evidence: MANATEE: Inference-Time Lightweight Diffusion Based Safety Defense for LLMs; What Matters For Safety Alignment?; Reasoning Up the Instruction Ladder for Controllable Language Models; When Style Breaks Safety: Defending LLMs Against Superficial Style Alignment

Automatic metrics appear in 61.5% of papers in this hub.

Evidence: Hierarchical LLM-Based Multi-Agent Framework with Prompt Optimization for Multi-Robot Task Planning; Adversarial Intent is a Latent Variable: Stateful Trust Inference for Securing Multimodal Agentic RAG; MANATEE: Inference-Time Lightweight Diffusion Based Safety Defense for LLMs; Evolutionary System Prompt Learning for Reinforcement Learning in LLMs

AIME is the benchmark anchor for cross-paper comparisons on this page, although only one hub paper (7.7%) reports it.

Evidence: Evolutionary System Prompt Learning for Reinforcement Learning in LLMs; Hierarchical LLM-Based Multi-Agent Framework with Prompt Optimization for Multi-Robot Task Planning; Self-Correcting VLA: Online Action Refinement via Sparse World Imagination; LiLo-VLA: Compositional Long-Horizon Manipulation via Linked Object-Centric Policies

Protocol Takeaways

Quality-control reporting is sparse in this slice; prioritize papers with explicit calibration or adjudication steps.

Evidence: Hierarchical LLM-Based Multi-Agent Framework with Prompt Optimization for Multi-Robot Task Planning; Self-Correcting VLA: Online Action Refinement via Sparse World Imagination; LiLo-VLA: Compositional Long-Horizon Manipulation via Linked Object-Centric Policies; Adversarial Intent is a Latent Variable: Stateful Trust Inference for Securing Multimodal Agentic RAG

Where rater context is reported, it is mostly domain experts, and annotation is commonly trajectory-level; use this to scope replication staffing. Rater population is known for only 15.4% of papers and annotation unit for 7.7%.

Evidence: EpidemIQs: Prompt-to-Paper LLM Agents for Epidemic Modeling and Analysis; Measuring AI Ability to Complete Long Software Tasks; Hierarchical LLM-Based Multi-Agent Framework with Prompt Optimization for Multi-Robot Task Planning; Self-Correcting VLA: Online Action Refinement via Sparse World Imagination

Pair this hub with a human_eval-heavy hub to validate judge-model calibration.

Evidence: Hierarchical LLM-Based Multi-Agent Framework with Prompt Optimization for Multi-Robot Task Planning; Self-Correcting VLA: Online Action Refinement via Sparse World Imagination; LiLo-VLA: Compositional Long-Horizon Manipulation via Linked Object-Centric Policies; Adversarial Intent is a Latent Variable: Stateful Trust Inference for Securing Multimodal Agentic RAG

Benchmark Interpretation

AIME appears in 7.7% of hub papers (1/13); use this cohort for benchmark-matched comparisons.
APPS appears in 7.7% of hub papers (1/13); use this cohort for benchmark-matched comparisons.

Metric Interpretation

Success rate is reported in 100% of hub papers (13/13); compare with a secondary metric before ranking methods.
Jailbreak success rate is reported in 46.2% of hub papers (6/13); compare with a secondary metric before ranking methods.
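
The percentages quoted in the benchmark and metric lines follow directly from paper counts over the 13-paper hub. A minimal sketch of that arithmetic is below; the helper name `coverage` and the one-decimal rounding are assumptions chosen to match the figures shown on this page, not a description of how the page itself is generated.

```python
def coverage(count: int, total: int) -> str:
    """Format a paper count as 'count/total (pct%)' with one decimal place.

    Hypothetical helper: the rounding rule is an assumption that happens to
    reproduce the figures on this page (for example 1/13 -> 7.7%).
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= count <= total:
        raise ValueError("count must be between 0 and total")
    pct = round(100 * count / total, 1)
    # The 'g' format drops a trailing '.0', so 13/13 prints as 100%.
    return f"{count}/{total} ({pct:g}%)"


if __name__ == "__main__":
    for label, count in [("AIME", 1), ("jailbreak success rate", 6), ("success rate", 13)]:
        print(f"{label}: {coverage(count, 13)}")
```

With only 13 papers, each paper moves coverage by about 7.7 percentage points, so small differences between slices should be read with care.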

Researcher Checklist

Maintain strength on Papers with explicit human feedback. Coverage is strong (46.2% vs 45% target).
Close gap on Papers reporting quality controls. Coverage is a replication risk (0% vs 30% target).
Tighten coverage on Papers naming benchmarks/datasets. Coverage is usable but incomplete (30.8% vs 35% target).
Maintain strength on Papers naming evaluation metrics. Coverage is strong (100% vs 35% target).
Close gap on Papers with known rater population. Coverage is a replication risk (15.4% vs 35% target).
Close gap on Papers with known annotation unit. Coverage is a replication risk (7.7% vs 35% target).
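
To make the checklist gaps concrete, the sketch below converts each coverage/target pair into the number of additional papers that would need to report the item. It assumes the hub stays at 13 papers and that a target is met once the paper count reaches the target percentage rounded up to a whole paper; both are illustrative assumptions, and the names `CHECKLIST` and `papers_needed` are hypothetical.

```python
import math

TOTAL_PAPERS = 13  # hub size stated on this page

# (item, observed coverage %, target %) copied from the checklist above.
CHECKLIST = [
    ("explicit human feedback", 46.2, 45.0),
    ("quality controls", 0.0, 30.0),
    ("benchmarks/datasets", 30.8, 35.0),
    ("evaluation metrics", 100.0, 35.0),
    ("known rater population", 15.4, 35.0),
    ("known annotation unit", 7.7, 35.0),
]


def papers_needed(observed_pct: float, target_pct: float, total: int = TOTAL_PAPERS) -> int:
    """Extra papers required to reach the target, assuming a fixed hub size."""
    have = round(observed_pct / 100 * total)  # recover the whole-paper count
    # Small epsilon guards against float noise pushing an exact value upward.
    need = math.ceil(target_pct / 100 * total - 1e-9)
    return max(0, need - have)


if __name__ == "__main__":
    for item, observed, target in CHECKLIST:
        gap = papers_needed(observed, target)
        print(f"{item}: {observed}% vs {target}% target, {gap} more paper(s) needed")
```

Under these assumptions, the quality-control and annotation-unit items each need four more reporting papers, while the benchmark item is a single paper short of its target.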

Known Limitations

No papers (0%) report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (15.4% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Links

LLM-as-Judge Protocols - Finds judge-based evaluation setups to compare calibration and drift risks.
Benchmark Slice: AIME - Prioritizes benchmark-specific protocol comparisons.
Metric Slice: success rate - Finds papers where reported metrics are directly comparable.
Recent High-Signal Papers - Keeps the hub connected to the latest HFEPX corpus updates.

llm_as_judge vs automatic_metrics

both=0, left_only=1, right_only=8

No papers use both LLM-as-Judge and Automatic Metrics.

automatic_metrics vs simulation_env

both=0, left_only=8, right_only=5

0 papers use both Automatic Metrics and Simulation Env.

simulation_env vs llm_as_judge

both=1, left_only=4, right_only=0

1 paper uses both Simulation Env and LLM-as-Judge.
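
The both / left_only / right_only counts above are plain set operations over the papers tagged with each evaluation mode. The sketch below reproduces them with placeholder paper IDs (`p01` to `p13`); the IDs and their assignment to modes are hypothetical, and only the set sizes and overlaps mirror the counts reported here.

```python
# Hypothetical paper IDs; only the set sizes and overlaps match this page.
MODES = {
    "automatic_metrics": {f"p{i:02d}" for i in range(1, 9)},  # 8 papers
    "simulation_env": {f"p{i:02d}" for i in range(9, 14)},    # 5 papers
    "llm_as_judge": {"p13"},  # 1 paper, also tagged simulation_env
}


def overlap(left: set, right: set) -> dict:
    """Count papers in both sets, only in the left set, and only in the right set."""
    return {
        "both": len(left & right),
        "left_only": len(left - right),
        "right_only": len(right - left),
    }


if __name__ == "__main__":
    pairs = [
        ("llm_as_judge", "automatic_metrics"),
        ("automatic_metrics", "simulation_env"),
        ("simulation_env", "llm_as_judge"),
    ]
    for left, right in pairs:
        print(f"{left} vs {right}: {overlap(MODES[left], MODES[right])}")
```

Because no paper combines LLM-as-judge with automatic metrics in this hub, judge-versus-metric agreement cannot be checked within the slice itself.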

Benchmark Brief

AIME

Coverage: 1 paper (7.7%) mentions AIME.

Examples: Evolutionary System Prompt Learning for Reinforcement Learning in LLMs

Benchmark Brief

APPS

Coverage: 1 paper (7.7%) mentions APPS.

Examples: The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution

Benchmark Brief

Re-Bench

Coverage: 1 paper (7.7%) mentions Re-Bench.

Examples: Measuring AI Ability to Complete Long Software Tasks

Metric Brief

success rate

Coverage: 13 papers (100%) mention success rate.

Examples: Hierarchical LLM-Based Multi-Agent Framework with Prompt Optimization for Multi-Robot Task Planning; Self-Correcting VLA: Online Action Refinement via Sparse World Imagination; LiLo-VLA: Compositional Long-Horizon Manipulation via Linked Object-Centric Policies

Metric Brief

jailbreak success rate

Coverage: 6 papers (46.2%) mention jailbreak success rate.

Examples: Adversarial Intent is a Latent Variable: Stateful Trust Inference for Securing Multimodal Agentic RAG; AdapTools: Adaptive Tool-based Indirect Prompt Injection Attacks on Agentic LLMs; MANATEE: Inference-Time Lightweight Diffusion Based Safety Defense for LLMs

Metric Brief

cost

Coverage: 2 papers (15.4%) mention cost.

Examples: Adversarial Intent is a Latent Variable: Stateful Trust Inference for Securing Multimodal Agentic RAG; EpidemIQs: Prompt-to-Paper LLM Agents for Epidemic Modeling and Analysis

Most Cited In This Hub

Fast path to methods with the strongest citation traction in this scope.

Papers: Hierarchical LLM-Based Multi-Agent Framework with Prompt Optimization for Multi-Robot Task Planning; Self-Correcting VLA: Online Action Refinement via Sparse World Imagination; LiLo-VLA: Compositional Long-Horizon Manipulation via Linked Object-Centric Policies

Most Recent

Fast path to latest protocol changes and newly published evaluation setups.

Best Protocol Detail

Papers with explicit rater/unit metadata and quality-control signals for reproducibility.

Top Papers Reporting This Metric

Hierarchical LLM-Based Multi-Agent Framework with Prompt Optimization for Multi-Robot Task Planning
Tomoya Kawabe, Rin Takano · Feb 25, 2026

Automatic Metrics General

We present a hierarchical multi-agent LLM-based planner with prompt optimization: an upper layer decomposes tasks and assigns them to lower-layer agents, which generate PDDL problems solved by a classical planner.

Self-Correcting VLA: Online Action Refinement via Sparse World Imagination
Chenyv Liu, Wentao Tan, Lei Zhu, Fengling Li, Jingjing Li · Feb 25, 2026

Simulation Env Coding

Reinforcement learning enhances physical grounding through exploration yet typically relies on external reward signals that remain isolated from the agent's internal states.

LiLo-VLA: Compositional Long-Horizon Manipulation via Linked Object-Centric Policies
Yue Yang, Shuo Cheng, Yu Fang, Homanga Bharadhwaj, Mingyu Ding · Feb 25, 2026

Simulation Env General

We introduce a 21-task simulation benchmark consisting of two challenging suites: LIBERO-Long++ and Ultra-Long.

Adversarial Intent is a Latent Variable: Stateful Trust Inference for Securing Multimodal Agentic RAG
Inderjeet Singh, Vikas Pahuja, Aishvariya Priya Rathina Sabapathy, Chiara Picardi, Amit Giloni · Feb 24, 2026

Automatic Metrics General

Current stateless defences for multimodal agentic RAG fail to detect adversarial strategies that distribute malicious semantics across retrieval, planning, and generation components.

AdapTools: Adaptive Tool-based Indirect Prompt Injection Attacks on Agentic LLMs
Che Wang, Jiaming Zhang, Ziqi Zhang, Zijie Wang, Yinghui Wang · Feb 24, 2026

Simulation Env General

The integration of external data services (e.g., Model Context Protocol, MCP) has made large language model-based agents increasingly powerful for complex task execution.

MANATEE: Inference-Time Lightweight Diffusion Based Safety Defense for LLMs
Chun Yan Ryan Kan, Tommy Tran, Vedant Yadav, Ava Cai, Kevin Zhu · Feb 21, 2026

Automatic Metrics General

Defending LLMs against adversarial jailbreak attacks remains an open challenge.

Evolutionary System Prompt Learning for Reinforcement Learning in LLMs
Lunjun Zhang, Ryan Chen, Bradly C. Stadie · Feb 16, 2026

Automatic Metrics Coding

Building agentic systems that can autonomously self-improve from experience is a longstanding goal of AI.

What Matters For Safety Alignment?
Xing Li, Hui-Ling Zhen, Lihao Yin, Xianzhi Yu, Zhenhua Dong · Jan 7, 2026

Automatic Metrics General

This paper presents a comprehensive empirical study on the safety alignment capabilities.

Reasoning Up the Instruction Ladder for Controllable Language Models
Zishuo Zheng, Vidhisha Balachandran, Chan Young Park, Faeze Brahman, Sachin Kumar · Oct 30, 2025

Automatic Metrics General

Our finetuned models achieve consistent improvements on instruction following and instruction hierarchy benchmarks, achieving roughly a 20% improvement on the IHEval conflict setup.

The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution
Junlong Li, Wenshuo Zhao, Jian Zhao, Weihao Zeng, Haoze Wu · Oct 29, 2025

Simulation Env General

Real-world language agents must handle complex, multi-step workflows across diverse Apps.

EpidemIQs: Prompt-to-Paper LLM Agents for Epidemic Modeling and Analysis
Mohammad Hossein Samaei, Faryad Darabi Sahneh, Lee W. Cohnstaedt, Caterina Scoglio · Sep 24, 2025

LLM-as-Judge · Simulation Env · General

We introduce EpidemIQs, a novel multi-agent LLM framework that integrates user inputs and autonomously conducts literature review, analytical derivation, network modeling, mechanistic modeling, stochastic simulations, data visualization and

When Style Breaks Safety: Defending LLMs Against Superficial Style Alignment
Yuxin Xiao, Sana Tonekaboni, Walter Gerych, Vinith Suriyakumar, Marzyeh Ghassemi · Jun 9, 2025

Automatic Metrics General

In this work, we seek to understand whether style patterns compromise LLM safety, how superficial style alignment increases model vulnerability, and how best to mitigate these risks during alignment.

Measuring AI Ability to Complete Long Software Tasks
Thomas Kwa, Ben West, Joel Becker, Amy Deng, Katharyn Garcia · Mar 18, 2025

Automatic Metrics General

Despite rapid progress on AI benchmarks, the real-world meaning of benchmark performance remains unclear.
