HFEPX Metric Hub

Safety & Risk Metric Papers In CS.AI

Updated from current HFEPX corpus (Mar 8, 2026). 12 papers are grouped in this metric page.

Common evaluation modes: Automatic Metrics, LLM-as-Judge. Common annotation unit: trajectory. Frequently cited benchmark: AdvBench. Common metric signal: jailbreak success rate. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. The newest paper in this set is from Feb 27, 2026.

Papers: 12 · Last published: Feb 27, 2026

Researcher Quick Triage

Use this page to compare metric behavior across protocols and benchmarks before selecting your reporting stack. Quality band: Medium.

Metric Coverage

50.0%

6 sampled papers include metric names.

Benchmark Anchoring

8.3%

Papers with explicit dataset/benchmark anchors for fair comparison.

Quality Controls

0.0%

0 papers report calibration/adjudication/IAA controls.

None of the 12 papers in this sample is flagged as low-signal.
Use the protocol matrix below to avoid comparing metrics across incompatible eval setups.

Primary action: Treat this as directional signal only; metric reporting is present but benchmark anchoring is still thin.

Why This Matters (Expanded)

Why This Matters For Eval Research

71.4% of papers report explicit human-feedback signals, led by red-team protocols.
Automatic metrics appear in 50% of papers in this hub.
AdvBench is a recurring benchmark anchor for cross-paper comparisons in this page.

Metric Notes (Expanded)

Metric-Driven Protocol Takeaways

Quality-control reporting is sparse in this slice; prioritize papers with explicit calibration or adjudication steps.
Rater pools are mostly unspecified, and annotation is commonly trajectory-level; use this to scope replication staffing.
Pair this hub with a human_eval-heavy hub to validate judge-model calibration.
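
One way to check judge-model calibration against human labels is chance-corrected agreement. The sketch below computes Cohen's kappa between a judge model's verdicts and human verdicts on the same trajectories. The function name, the binary label scheme, and the eight sample verdicts are illustrative assumptions and do not come from any paper in this hub.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(labels_a)
    # Observed agreement: share of items where both raters gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under independence, from each rater's label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one and the same label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical verdicts on eight trajectories: 1 = jailbreak succeeded, 0 = refused.
judge_verdicts = [1, 1, 0, 0, 1, 0, 1, 0]
human_verdicts = [1, 0, 0, 0, 1, 0, 1, 1]
kappa = cohens_kappa(judge_verdicts, human_verdicts)
print(f"kappa = {kappa:.2f}")
```

A low kappa on a shared sample suggests that judge-based success rates in this hub should not be compared directly with human-rated ones.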

Metric Interpretation

Jailbreak success rate is reported in 50.0% of hub papers (6/12); compare with a secondary metric before ranking methods.
Success rate is reported in 41.7% of hub papers (5/12); compare with a secondary metric before ranking methods.
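
As a rough illustration of why a single success-rate figure should not decide a ranking, the sketch below computes a success rate together with a Wilson score interval. It assumes the common definition of jailbreak success rate as successful attacks divided by attempts; the papers in this hub may define it differently, and the 30-of-100 figures are hypothetical.

```python
import math


def success_rate(successes, attempts):
    """Fraction of attempts judged successful."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    if not 0 <= successes <= attempts:
        raise ValueError("successes must be between 0 and attempts")
    return successes / attempts


def wilson_interval(successes, attempts, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    p = success_rate(successes, attempts)
    z2 = z * z
    denom = 1.0 + z2 / attempts
    center = (p + z2 / (2.0 * attempts)) / denom
    half = z * math.sqrt(p * (1.0 - p) / attempts + z2 / (4.0 * attempts * attempts)) / denom
    return center - half, center + half


# Hypothetical run: 30 of 100 attack prompts judged successful.
rate = success_rate(30, 100)
low, high = wilson_interval(30, 100)
print(f"rate = {rate:.2f}, interval = ({low:.3f}, {high:.3f})")
```

If two methods have overlapping intervals on the same benchmark, the difference in their point estimates is weak evidence for ranking one above the other.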

Benchmark Context

AdvBench appears in 8.3% of hub papers (1/12); use this cohort for benchmark-matched comparisons.
Jbf-Eval appears in 8.3% of hub papers (1/12); use this cohort for benchmark-matched comparisons.

Start Here (Metric-Reliable First 6)

Ranked for metric reporting completeness and comparability.

Jailbreak Foundry: From Papers to Runnable Attacks for Reproducible Benchmarking
Feb 27, 2026 · Citations: 0 · Score: 9.0
Metrics: Success rate, Jailbreak success rate · Eval: LLM-as-Judge

MANATEE: Inference-Time Lightweight Diffusion Based Safety Defense for LLMs
Feb 21, 2026 · Citations: 0 · Score: 7.5
Metrics: Success rate, Jailbreak success rate · Eval: Automatic Metrics

What Matters For Safety Alignment?
Jan 7, 2026 · Citations: 0 · Score: 7.0
Metrics: Success rate, Jailbreak success rate · Eval: Automatic Metrics

Reasoning Up the Instruction Ladder for Controllable Language Models
Oct 30, 2025 · Citations: 0 · Score: 6.5
Metrics: Success rate, Jailbreak success rate · Eval: Automatic Metrics

Luna-2: Scalable Single-Token Evaluation with Small Language Models
Feb 20, 2026 · Citations: 0 · Score: 6.5
Metrics: Accuracy, Latency · Eval: LLM-as-Judge, Automatic Metrics

When Style Breaks Safety: Defending LLMs Against Superficial Style Alignment
Jun 9, 2025 · Citations: 0 · Score: 6.0
Metrics: Success rate, Jailbreak success rate · Eval: Automatic Metrics

Metric Protocol Matrix (Top 10)

Compare metric, benchmark, and evaluation context side by side.

Paper	Date	Metrics	Benchmarks	Eval Modes	Quality Controls
Jailbreak Foundry: From Papers to Runnable Attacks for Reproducible Benchmarking	Feb 27, 2026	Success rate, Jailbreak success rate	AdvBench, Jbf-Eval	LLM-as-Judge	Not reported
MANATEE: Inference-Time Lightweight Diffusion Based Safety Defense for LLMs	Feb 21, 2026	Success rate, Jailbreak success rate	Not reported	Automatic Metrics	Not reported
What Matters For Safety Alignment?	Jan 7, 2026	Success rate, Jailbreak success rate	Not reported	Automatic Metrics	Not reported
Reasoning Up the Instruction Ladder for Controllable Language Models	Oct 30, 2025	Success rate, Jailbreak success rate	Not reported	Automatic Metrics	Not reported
Luna-2: Scalable Single-Token Evaluation with Small Language Models	Feb 20, 2026	Accuracy, Latency	Not reported	LLM-as-Judge, Automatic Metrics	Not reported
When Style Breaks Safety: Defending LLMs Against Superficial Style Alignment	Jun 9, 2025	Success rate, Jailbreak success rate	Not reported	Automatic Metrics	Not reported
ICON: Indirect Prompt Injection Defense for Agents based on Inference-Time Correction	Feb 24, 2026	Not reported	Not reported	Automatic Metrics	Not reported
PersianPunc: A Large-Scale Dataset and BERT-Based Approach for Persian Punctuation Restoration	Mar 5, 2026	Not reported	Not reported	Not reported	Not reported
Measuring the Redundancy of Decoder Layers in SpeechLLMs	Mar 5, 2026	Not reported	Not reported	Not reported	Not reported
MUTEX: Leveraging Multilingual Transformers and Conditional Random Fields for Enhanced Urdu Toxic Span Detection	Mar 5, 2026	Not reported	Not reported	Not reported	Not reported

Researcher Workflow (Detailed)

Checklist

Strong: Papers with explicit human feedback
Coverage is strong (71.4% vs 45% target).

Gap: Papers reporting quality controls
Coverage is a replication risk (0% vs 30% target).

Gap: Papers naming benchmarks/datasets
Coverage is a replication risk (14.3% vs 35% target).

Strong: Papers naming evaluation metrics
Coverage is strong (100% vs 35% target).

Gap: Papers with known rater population
Coverage is a replication risk (0% vs 35% target).

Gap: Papers with known annotation unit
Coverage is a replication risk (14.3% vs 35% target).
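
The checklist labels appear to follow a simple threshold: an item is "Strong" when coverage meets its target and a "Gap" otherwise. The sketch below reproduces the six labels under that assumed rule. The rule and the function name `classify_coverage` are inferred from the figures above, not documented by the hub.

```python
def classify_coverage(coverage_pct, target_pct):
    """Assumed rule: 'Strong' when coverage meets or exceeds the target, else 'Gap'."""
    return "Strong" if coverage_pct >= target_pct else "Gap"


# (checklist item, coverage %, target %) as listed in the checklist.
checklist = [
    ("Papers with explicit human feedback", 71.4, 45.0),
    ("Papers reporting quality controls", 0.0, 30.0),
    ("Papers naming benchmarks/datasets", 14.3, 35.0),
    ("Papers naming evaluation metrics", 100.0, 35.0),
    ("Papers with known rater population", 0.0, 35.0),
    ("Papers with known annotation unit", 14.3, 35.0),
]

labels = [classify_coverage(cov, target) for _, cov, target in checklist]
for (name, cov, target), label in zip(checklist, labels):
    print(f"{label}: {name} ({cov}% vs {target}% target)")
```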

Strengths

Strong human-feedback signal (71.4% of papers).
Agentic evaluation appears in 42.9% of papers.

Known Gaps

No papers report quality controls (0%); prioritize calibration/adjudication evidence.
Rater population is under-specified (0% coverage).
Annotation unit is under-specified (14.3% coverage).

Suggested Next Analyses

Pair this hub with a human_eval-heavy hub to validate judge-model calibration.
Stratify by benchmark (AdvBench vs Jbf-Eval) before comparing methods.
Track metric sensitivity by reporting both jailbreak success rate and success rate.
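
The stratification step can be sketched as a group-by over (benchmark, method) pairs, so that success rates are only compared within the same benchmark. The record layout, the method names `attack_a` and `attack_b`, and the outcomes below are hypothetical placeholders and not results from any paper in this hub.

```python
from collections import defaultdict


def stratified_success_rates(records):
    """Success rate per (benchmark, method) from (benchmark, method, succeeded) tuples."""
    counts = defaultdict(lambda: [0, 0])  # key -> [successes, attempts]
    for benchmark, method, succeeded in records:
        cell = counts[(benchmark, method)]
        cell[0] += int(bool(succeeded))
        cell[1] += 1
    return {key: successes / attempts for key, (successes, attempts) in counts.items()}


# Hypothetical per-prompt outcomes for two methods on two benchmarks.
records = [
    ("AdvBench", "attack_a", True),
    ("AdvBench", "attack_a", False),
    ("AdvBench", "attack_b", True),
    ("AdvBench", "attack_b", True),
    ("Jbf-Eval", "attack_a", True),
    ("Jbf-Eval", "attack_b", False),
]

rates = stratified_success_rates(records)
for (benchmark, method), rate in sorted(rates.items()):
    print(f"{benchmark:10s} {method:10s} {rate:.2f}")
```

In this toy data the two methods swap order between benchmarks, which is the kind of reversal a pooled success rate would hide.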

Recommended Queries

LLM-as-Judge Protocols
Benchmark Slice: AdvBench
Metric Slice: jailbreak success rate
Recent High-Signal Papers

Known Limitations

No papers report quality controls (0%); prioritize calibration/adjudication evidence.
Rater population is under-specified (0% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Snapshot (Detailed)

Top Metrics

Jailbreak success rate (6)
Success rate (5)
Accuracy (1)
Cost (1)

Evaluation Modes

Automatic Metrics (6)
LLM-as-Judge (2)

Top Benchmarks

AdvBench (1)
Jbf-Eval (1)

Agentic Mix

Long Horizon (1)
Multi Agent (1)
Tool Use (1)

Top Papers Reporting This Metric

Jailbreak Foundry: From Papers to Runnable Attacks for Reproducible Benchmarking
Zhicheng Fang, Jingjie Zheng, Chenxu Fu, Wei Xu · Feb 27, 2026 · Citations: 0
LLM-as-Judge · Coding · Multilingual
Jailbreak techniques for large language models (LLMs) evolve faster than benchmarks, making robustness estimates stale and difficult to compare across papers due to drift in datasets, harnesses, and judging protocols.

What Matters For Safety Alignment?
Xing Li, Hui-Ling Zhen, Lihao Yin, Xianzhi Yu, Zhenhua Dong · Jan 7, 2026 · Citations: 0
Automatic Metrics · General
This paper presents a comprehensive empirical study on the safety alignment capabilities.

MANATEE: Inference-Time Lightweight Diffusion Based Safety Defense for LLMs
Chun Yan Ryan Kan, Tommy Tran, Vedant Yadav, Ava Cai, Kevin Zhu · Feb 21, 2026 · Citations: 0
Automatic Metrics · General
We propose MANATEE, an inference-time defense that uses density estimation over a benign representation manifold.

Reasoning Up the Instruction Ladder for Controllable Language Models
Zishuo Zheng, Vidhisha Balachandran, Chan Young Park, Faeze Brahman, Sachin Kumar · Oct 30, 2025 · Citations: 0
Automatic Metrics · General
Our finetuned models achieve consistent improvements on instruction following and instruction hierarchy benchmarks, achieving roughly a 20% improvement on the IHEval conflict setup.

When Style Breaks Safety: Defending LLMs Against Superficial Style Alignment
Yuxin Xiao, Sana Tonekaboni, Walter Gerych, Vinith Suriyakumar, Marzyeh Ghassemi · Jun 9, 2025 · Citations: 0
Automatic Metrics · General
In this work, we seek to understand whether style patterns compromise LLM safety, how superficial style alignment increases model vulnerability, and how best to mitigate these risks during alignment.

ICON: Indirect Prompt Injection Defense for Agents based on Inference-Time Correction
Che Wang, Fuyao Zhang, Jiaming Zhang, Ziqi Zhang, Yinghui Wang · Feb 24, 2026 · Citations: 0
Automatic Metrics · General
Large Language Model (LLM) agents are susceptible to Indirect Prompt Injection (IPI) attacks, where malicious instructions in retrieved content hijack the agent's execution.

Luna-2: Scalable Single-Token Evaluation with Small Language Models
Vatsal Goel, Rishon Dsouza, Nikhil Ega, Amey Ramesh Rambatla, Rob Friel · Feb 20, 2026 · Citations: 0
LLM-as-Judge · Automatic Metrics · General
We present Luna-2, a novel architecture that leverages decoder-only small language models (SLMs) into a deterministic evaluation model to reliably compute complex task-specific LLMAJ metrics (e.g.

PersianPunc: A Large-Scale Dataset and BERT-Based Approach for Persian Punctuation Restoration
Mohammad Javad Ranjbar Kalahroodi, Heshaam Faili, Azadeh Shakery · Mar 5, 2026 · Citations: 0

Measuring the Redundancy of Decoder Layers in SpeechLLMs
Adel Moumen, Guangzhi Sun, Philip C Woodland · Mar 5, 2026 · Citations: 0

MUTEX: Leveraging Multilingual Transformers and Conditional Random Fields for Enhanced Urdu Toxic Span Detection
Inayat Arshad, Fajar Saleem, Ijaz Hussain · Mar 5, 2026 · Citations: 0

A Holistic Framework for Robust Bangla ASR and Speaker Diarization with Optimized VAD and CTC Alignment
Zarif Ishmam, Zarif Mahir, Shafnan Wasif, Md. Ishtiak Moin · Feb 26, 2026 · Citations: 0

TSPC: A Two-Stage Phoneme-Centric Architecture for code-switching Vietnamese-English Speech Recognition
Tran Nguyen Anh, Truong Dinh Dung, Vo Van Nam, Minh N. H. Nguyen · Sep 7, 2025 · Citations: 0
