HFEPX Metric Hub

Safety & Risk Metric Papers

Updated from current HFEPX corpus (Mar 8, 2026). 22 papers are grouped in this metric page.

Read Full Context

Updated from current HFEPX corpus (Mar 8, 2026). 22 papers are grouped in this metric page. Common evaluation modes: Automatic Metrics, Llm As Judge. Common annotation unit: Trajectory. Frequently cited benchmark: AdvBench. Common metric signal: jailbreak success rate. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Feb 27, 2026.

Papers: 22 Last published: Feb 27, 2026 Global RSS

Researcher Quick Triage

Use this page to compare metric behavior across protocols and benchmarks before selecting your reporting stack. Quality band: Medium .

Metric Coverage

50.0%

11 sampled papers include metric names.

Benchmark Anchoring

13.6%

Papers with explicit dataset/benchmark anchors for fair comparison.

Quality Controls

0.0%

0 papers report calibration/adjudication/IAA controls.

22 papers are not low-signal flagged in this sample.
Use the protocol matrix below to avoid comparing metrics across incompatible eval setups.

Primary action: Treat this as directional signal only; metric reporting is present but benchmark anchoring is still thin.

Why This Matters (Expanded)

Why This Matters For Eval Research

66.7% of papers report explicit human-feedback signals, led by red-team protocols.
automatic metrics appears in 50% of papers in this hub.
AdvBench is a recurring benchmark anchor for cross-paper comparisons in this page.

Metric Notes (Expanded)

Metric-Driven Protocol Takeaways

Quality-control reporting is sparse in this slice; prioritize papers with explicit calibration or adjudication steps.
Rater context is mostly unspecified rater pools, and annotation is commonly trajectory-level annotation; use this to scope replication staffing.
Pair this hub with a human_eval-heavy hub to validate judge-model calibration.

Metric Interpretation

jailbreak success rate is reported in 83.3% of hub papers (10/22); compare with a secondary metric before ranking methods.
success rate is reported in 58.3% of hub papers (7/22); compare with a secondary metric before ranking methods.

Benchmark Context

AdvBench appears in 8.3% of hub papers (1/22); use this cohort for benchmark-matched comparisons.
Jbf-Eval appears in 8.3% of hub papers (1/22); use this cohort for benchmark-matched comparisons.

Start Here (Metric-Reliable First 6)

Ranked for metric reporting completeness and comparability.

Jailbreak Foundry: From Papers to Runnable Attacks for Reproducible Benchmarking
Feb 27, 2026 · Citations: 0 · Score: 9.0

Metrics: Success rate, Jailbreak success rate · Eval: Llm As Judge
ThaiSafetyBench: Assessing Language Model Safety in Thai Cultural Contexts
Mar 5, 2026 · Citations: 0 · Score: 8.0

Metrics: F1, F1 weighted · Eval: Llm As Judge, Automatic Metrics
RedTeamCUA: Realistic Adversarial Testing of Computer-Use Agents in Hybrid Web-OS Environments
May 28, 2025 · Citations: 0 · Score: 7.5

Metrics: Jailbreak success rate · Eval: Automatic Metrics
MUSE: A Run-Centric Platform for Multimodal Unified Safety Evaluation of Large Language Models
Mar 3, 2026 · Citations: 0 · Score: 7.5

Metrics: Success rate, Jailbreak success rate · Eval: Automatic Metrics
MANATEE: Inference-Time Lightweight Diffusion Based Safety Defense for LLMs
Feb 21, 2026 · Citations: 0 · Score: 7.5

Metrics: Success rate, Jailbreak success rate · Eval: Automatic Metrics
Bridging the Multilingual Safety Divide: Efficient, Culturally-Aware Alignment for Global South Languages
Feb 14, 2026 · Citations: 0 · Score: 7.5

Metrics: Toxicity · Eval: Automatic Metrics

Metric Protocol Matrix (Top 10)

Compare metric, benchmark, and evaluation context side by side.

Paper	Metrics	Benchmarks	Eval Modes	Quality Controls
Jailbreak Foundry: From Papers to Runnable Attacks for Reproducible Benchmarking Feb 27, 2026	Success rate, Jailbreak success rate	AdvBench, Jbf Eval	Llm As Judge	Not reported
ThaiSafetyBench: Assessing Language Model Safety in Thai Cultural Contexts Mar 5, 2026	F1, F1 weighted	Thaisafetybench	Llm As Judge, Automatic Metrics	Not reported
RedTeamCUA: Realistic Adversarial Testing of Computer-Use Agents in Hybrid Web-OS Environments May 28, 2025	Jailbreak success rate	Rtc Bench	Automatic Metrics	Not reported
MUSE: A Run-Centric Platform for Multimodal Unified Safety Evaluation of Large Language Models Mar 3, 2026	Success rate, Jailbreak success rate	Not reported	Automatic Metrics	Not reported
MANATEE: Inference-Time Lightweight Diffusion Based Safety Defense for LLMs Feb 21, 2026	Success rate, Jailbreak success rate	Not reported	Automatic Metrics	Not reported
Bridging the Multilingual Safety Divide: Efficient, Culturally-Aware Alignment for Global South Languages Feb 14, 2026	Toxicity	Not reported	Automatic Metrics	Not reported
What Matters For Safety Alignment? Jan 7, 2026	Success rate, Jailbreak success rate	Not reported	Automatic Metrics	Not reported
Reasoning Up the Instruction Ladder for Controllable Language Models Oct 30, 2025	Success rate, Jailbreak success rate	Not reported	Automatic Metrics	Not reported
Luna-2: Scalable Single-Token Evaluation with Small Language Models Feb 20, 2026	Accuracy, Latency	Not reported	Llm As Judge, Automatic Metrics	Not reported
Whisper: Courtside Edition Enhancing ASR Performance Through LLM-Driven Context Generation Feb 21, 2026	Error rate, Wer	Not reported	Automatic Metrics	Not reported

Researcher Workflow (Detailed)

Checklist

Strong: Papers with explicit human feedback

Coverage is strong (66.7% vs 45% target).
Gap: Papers reporting quality controls

Coverage is a replication risk (0% vs 30% target).
Moderate: Papers naming benchmarks/datasets

Coverage is usable but incomplete (25% vs 35% target).
Strong: Papers naming evaluation metrics

Coverage is strong (100% vs 35% target).
Gap: Papers with known rater population

Coverage is a replication risk (0% vs 35% target).
Gap: Papers with known annotation unit

Coverage is a replication risk (8.3% vs 35% target).

Strengths

Strong human-feedback signal (66.7% of papers).
Agentic evaluation appears in 50% of papers.

Known Gaps

Only 0% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (0% coverage).
Annotation unit is under-specified (8.3% coverage).

Suggested Next Analyses

Pair this hub with a human_eval-heavy hub to validate judge-model calibration.
Stratify by benchmark (AdvBench vs Jbf-Eval) before comparing methods.
Track metric sensitivity by reporting both jailbreak success rate and success rate.

Recommended Queries

LLM-as-Judge Protocols Benchmark Slice: AdvBench Metric Slice: jailbreak success rate Recent High-Signal Papers

Known Limitations

Only 0% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (0% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Snapshot (Detailed)

Top Metrics

Jailbreak success rate (10)
Success rate (7)
Cost (2)
Toxicity (2)

Evaluation Modes

Automatic Metrics (11)
Llm As Judge (3)

Top Benchmarks

AdvBench (1)
Jbf Eval (1)
Rtc Bench (1)
Thaisafetybench (1)

Agentic Mix

Multi Agent (2)
Web Browsing (2)
Long Horizon (1)
Tool Use (1)

Top Papers Reporting This Metric

Jailbreak Foundry: From Papers to Runnable Attacks for Reproducible Benchmarking
Zhicheng Fang, Jingjie Zheng, Chenxu Fu, Wei Xu · Feb 27, 2026 · Citations: 0

Llm As Judge CodingMultilingual

Jailbreak techniques for large language models (LLMs) evolve faster than benchmarks, making robustness estimates stale and difficult to compare across papers due to drift in datasets, harnesses, and judging protocols.
RedTeamCUA: Realistic Adversarial Testing of Computer-Use Agents in Hybrid Web-OS Environments
Zeyi Liao, Jaylen Jones, Linxi Jiang, Yuting Ning, Eric Fosler-Lussier · May 28, 2025 · Citations: 0

Automatic Metrics General

Using RedTeamCUA, we develop RTC-Bench, a comprehensive benchmark with 864 examples that investigate realistic, hybrid web-OS attack scenarios and fundamental security vulnerabilities.
MUSE: A Run-Centric Platform for Multimodal Unified Safety Evaluation of Large Language Models
Zhongxi Wang, Yueqian Lin, Jingyang Zhang, Hai Helen Li, Yiran Chen · Mar 3, 2026 · Citations: 0

Automatic Metrics General

Safety evaluation and red-teaming of large language models remain predominantly text-centric, and existing frameworks lack the infrastructure to systematically test whether alignment generalizes to audio, image, and video inputs.
What Matters For Safety Alignment?
Xing Li, Hui-Ling Zhen, Lihao Yin, Xianzhi Yu, Zhenhua Dong · Jan 7, 2026 · Citations: 0

Automatic Metrics General

This paper presents a comprehensive empirical study on the safety alignment capabilities.
MANATEE: Inference-Time Lightweight Diffusion Based Safety Defense for LLMs
Chun Yan Ryan Kan, Tommy Tran, Vedant Yadav, Ava Cai, Kevin Zhu · Feb 21, 2026 · Citations: 0

Automatic Metrics General

We propose MANATEE, an inference-time defense that uses density estimation over a benign representation manifold.
Bridging the Multilingual Safety Divide: Efficient, Culturally-Aware Alignment for Global South Languages
Somnath Banerjee, Rima Hazra, Animesh Mukherjee · Feb 14, 2026 · Citations: 0

Automatic Metrics CodingMultilingual

Yet safety pipelines, benchmarks, and alignment still largely target English and a handful of high-resource languages, implicitly assuming safety and factuality ''transfer'' across languages.
Reasoning Up the Instruction Ladder for Controllable Language Models
Zishuo Zheng, Vidhisha Balachandran, Chan Young Park, Faeze Brahman, Sachin Kumar · Oct 30, 2025 · Citations: 0

Automatic Metrics General

Our finetuned models achieve consistent improvements on instruction following and instruction hierarchy benchmarks, achieving roughly a 20% improvement on the IHEval conflict setup.
When Style Breaks Safety: Defending LLMs Against Superficial Style Alignment
Yuxin Xiao, Sana Tonekaboni, Walter Gerych, Vinith Suriyakumar, Marzyeh Ghassemi · Jun 9, 2025 · Citations: 0

Automatic Metrics General

In this work, we seek to understand whether style patterns compromise LLM safety, how superficial style alignment increases model vulnerability, and how best to mitigate these risks during alignment.
ThaiSafetyBench: Assessing Language Model Safety in Thai Cultural Contexts
Trapoom Ukarapol, Nut Chukamphaeng, Kunat Pipatanakul, Pakhapoom Sarapat · Mar 5, 2026 · Citations: 0

Llm As JudgeAutomatic Metrics General

Using ThaiSafetyBench, we evaluate 24 LLMs, with GPT-4.1 and Gemini-2.5-Pro serving as LLM-as-a-judge evaluators.
ICON: Indirect Prompt Injection Defense for Agents based on Inference-Time Correction
Che Wang, Fuyao Zhang, Jiaming Zhang, Ziqi Zhang, Yinghui Wang · Feb 24, 2026 · Citations: 0

Automatic Metrics General

Large Language Model (LLM) agents are susceptible to Indirect Prompt Injection (IPI) attacks, where malicious instructions in retrieved content hijack the agent's execution.
Luna-2: Scalable Single-Token Evaluation with Small Language Models
Vatsal Goel, Rishon Dsouza, Nikhil Ega, Amey Ramesh Rambatla, Rob Friel · Feb 20, 2026 · Citations: 0

Llm As JudgeAutomatic Metrics General

We present Luna-2, a novel architecture that leverages decoder-only small language models (SLMs) into a deterministic evaluation model to reliably compute complex task-specific LLMAJ metrics (e.g.
Whisper: Courtside Edition Enhancing ASR Performance Through LLM-Driven Context Generation
Yonathan Ron, Shiri Gilboa, Tammuz Dubnov · Feb 21, 2026 · Citations: 0

Automatic Metrics Law

We introduce Whisper: Courtside Edition, a novel multi-agent large language model (LLM) pipeline that enhances Whisper transcriptions without retraining.
Exploring the potential and limitations of Model Merging for Multi-Domain Adaptation in ASR
Carlos Carvalho, Francisco Teixeira, Thomas Rolland, Alberto Abad · Mar 5, 2026 · Citations: 0
PersianPunc: A Large-Scale Dataset and BERT-Based Approach for Persian Punctuation Restoration
Mohammad Javad Ranjbar Kalahroodi, Heshaam Faili, Azadeh Shakery · Mar 5, 2026 · Citations: 0
Measuring the Redundancy of Decoder Layers in SpeechLLMs
Adel Moumen, Guangzhi Sun, Philip C Woodland · Mar 5, 2026 · Citations: 0
MUTEX: Leveraging Multilingual Transformers and Conditional Random Fields for Enhanced Urdu Toxic Span Detection
Inayat Arshad, Fajar Saleem, Ijaz Hussain · Mar 5, 2026 · Citations: 0
RO-N3WS: Enhancing Generalization in Low-Resource ASR with Diverse Romanian Speech Benchmarks
Alexandra Diaconu, Mădălina Vînaga, Bogdan Alexe · Mar 2, 2026 · Citations: 0
End-to-End Simultaneous Dysarthric Speech Reconstruction with Frame-Level Adaptor and Multiple Wait-k Knowledge Distillation
Minghui Wu, Haitao Tang, Jiahuan Fan, Ruizhi Liao, Yanyong Zhang · Mar 2, 2026 · Citations: 0
DARS: Dysarthria-Aware Rhythm-Style Synthesis for ASR Enhancement
Minghui Wu, Xueling Liu, Jiahuan Fan, Haitao Tang, Yanyong Zhang · Mar 2, 2026 · Citations: 0
A Holistic Framework for Robust Bangla ASR and Speaker Diarization with Optimized VAD and CTC Alignment
Zarif Ishmam, Zarif Mahir, Shafnan Wasif, Md. Ishtiak Moin · Feb 26, 2026 · Citations: 0
TSPC: A Two-Stage Phoneme-Centric Architecture for code-switching Vietnamese-English Speech Recognition
Tran Nguyen Anh, Truong Dinh Dung, Vo Van Nam, Minh N. H. Nguyen · Sep 7, 2025 · Citations: 0
Chain of Correction for Full-text Speech Recognition with Large Language Models
Zhiyuan Tang, Dong Wang, Zhikai Zhou, Yong Liu, Shen Huang · Apr 2, 2025 · Citations: 0

Need human evaluators for your AI research? Scale annotation with expert AI Trainers.

Post a Job Get a Quote