- What Matters For Safety Alignment?
Xing Li, Hui-Ling Zhen, Lihao Yin, Xianzhi Yu, Zhenhua Dong · Jan 7, 2026 · Citations: 0
Red Team · Automatic Metrics · Tool Use
This paper presents a comprehensive empirical study of the safety alignment capabilities of LLMs.
- A Simple and Efficient Jailbreak Method Exploiting LLMs' Helpfulness
Xuan Luo, Yue Wang, Zefeng He, Geng Tu, Jing Li · Sep 17, 2025 · Citations: 0
Red Team · Automatic Metrics
This study reveals a critical safety blind spot in modern LLMs: learning-style queries, which closely resemble ordinary educational questions, can reliably elicit harmful responses.
- Obscure but Effective: Classical Chinese Jailbreak Prompt Optimization via Bio-Inspired Search
Xun Huang, Simeng Qin, Xiaoshuang Jia, Ranjie Duan, Huanqian Yan · Feb 26, 2026 · Citations: 0
Red Team · Automatic Metrics
Owing to its conciseness and obscurity, classical Chinese can partially bypass existing safety constraints, exposing notable vulnerabilities in LLMs.
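
The snippet leaves the search procedure itself undescribed; as a hedged illustration of what a bio-inspired prompt search can look like, the minimal evolutionary loop below selects, crosses over, and mutates candidate prompts against a black-box fitness function. The `score_prompt` callable is hypothetical, standing in for a judge model's rating of the target LLM's response; it is not the paper's method.

```python
import random

def evolve_prompts(seed_prompts, score_prompt, generations=20, pop_size=12):
    """Minimal evolutionary search over prompt strings.

    score_prompt is a hypothetical black-box fitness function, e.g. a judge
    model's rating of how far the target LLM's reply strays from its policy.
    """
    population = list(seed_prompts)
    for _ in range(generations):
        ranked = sorted(population, key=score_prompt, reverse=True)
        parents = ranked[: max(2, pop_size // 2)]          # selection: keep the fittest
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, min(len(a), len(b)))  # crossover point
            child = a[:cut] + b[cut:]
            if random.random() < 0.3:                       # mutation: swap in a particle
                i = random.randrange(len(child))
                child = child[:i] + random.choice("之乎者也焉哉") + child[i + 1:]
            children.append(child)
        population = parents + children
    return max(population, key=score_prompt)

# Toy fitness: longer prompts score higher (replace with a real judge model).
best = evolve_prompts(["學而時習之", "不亦說乎", "溫故而知新"], score_prompt=len)
print(best)
```
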
- MANATEE: Inference-Time Lightweight Diffusion Based Safety Defense for LLMs
Chun Yan Ryan Kan, Tommy Tran, Vedant Yadav, Ava Cai, Kevin Zhu · Feb 21, 2026 · Citations: 0
Red Team · Automatic Metrics
We propose MANATEE, an inference-time defense that uses density estimation over a benign representation manifold.
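
Read as stated, the defense scores an input's representation by its density under a model fit to benign representations and flags low-density points. A minimal sketch of that idea with a Gaussian KDE over toy embedding vectors follows; the embeddings, dimensionality, and threshold are stand-ins, not the paper's.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Stand-in embeddings: rows play the role of hidden-state vectors for
# known-benign prompts (a real defense would use actual LLM representations).
rng = np.random.default_rng(0)
benign_reps = rng.normal(size=(500, 8))      # 500 benign prompts, 8-dim toy space

# Density model over the benign representation manifold.
kde = gaussian_kde(benign_reps.T)            # gaussian_kde wants shape (dims, n)

def is_suspicious(rep):
    """Flag a representation whose density is below the benign 1st percentile."""
    threshold = np.percentile(kde(benign_reps.T), 1)
    return kde(rep.reshape(-1, 1))[0] < threshold

print(is_suspicious(rng.normal(size=8)))     # typical benign point -> False
print(is_suspicious(np.full(8, 6.0)))        # far off-manifold point -> True
```
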
- "Are You Sure?": An Empirical Study of Human Perception Vulnerability in LLM-Driven Agentic Systems
Xinfeng Li, Shenyu Dai, Kelong Zheng, Yue Xiao, Gelei Deng · Feb 24, 2026 · Citations: 0
Expert Verification · Automatic Metrics
Large language model (LLM) agents are rapidly becoming trusted copilots in high-stakes domains like software development and healthcare.
- RLShield: Practical Multi-Agent RL for Financial Cyber Defense with Attack-Surface MDPs and Real-Time Response Orchestration
Srikumar Nayak · Feb 26, 2026 · Citations: 0
Automatic Metrics · Multi Agent
This paper proposes RLShield, a practical multi-agent RL pipeline for financial cyber defense.
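
The abstract does not define the attack-surface MDPs; one plausible reading, sketched below with invented state and action names, is a per-surface MDP whose reward trades intervention cost against depth of compromise.

```python
from dataclasses import dataclass, field

@dataclass
class AttackSurfaceMDP:
    """Toy per-surface MDP: states escalate from 'safe' to 'breached'."""
    states: tuple = ("safe", "probed", "compromised", "breached")
    actions: tuple = ("monitor", "patch", "isolate")
    step_cost: dict = field(
        default_factory=lambda: {"monitor": 0.0, "patch": 1.0, "isolate": 5.0})

    def reward(self, state, action):
        # Defender pays for interventions, and more for deeper compromise.
        return -self.step_cost[action] - 10.0 * self.states.index(state)

mdp = AttackSurfaceMDP()
print(mdp.reward("probed", "patch"))   # -11.0: patch cost plus compromise penalty
```
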
- ICON: Indirect Prompt Injection Defense for Agents based on Inference-Time Correction
Che Wang, Fuyao Zhang, Jiaming Zhang, Ziqi Zhang, Yinghui Wang · Feb 24, 2026 · Citations: 0
Automatic Metrics · Long Horizon
Large Language Model (LLM) agents are susceptible to Indirect Prompt Injection (IPI) attacks, where malicious instructions in retrieved content hijack the agent's execution.
- A Systematic Review of Algorithmic Red Teaming Methodologies for Assurance and Security of AI Applications
Shruti Srivastava, Kiranmayee Janardhan, Shaurya Jauhari · Feb 24, 2026 · Citations: 0
Red Team · Automatic Metrics
These limitations have driven the evolution toward automated red teaming, which leverages artificial intelligence and automation to deliver efficient and adaptive security evaluations.
- Assessing Deanonymization Risks with Stylometry-Assisted LLM Agent
Boyang Zhang, Yang Zhang · Feb 26, 2026 · Citations: 0
Automatic Metrics
In this work, we introduce an LLM agent designed to evaluate and mitigate such risks through a structured, interpretable pipeline.
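
The pipeline itself isn't detailed here; as a baseline illustration of the stylometric signal such an agent could exploit, this sketch compares character 3-gram TF-IDF profiles with cosine similarity, a classic attribution baseline rather than the paper's method.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

known = ["I reckon the results hold up, all things considered.",
         "All things considered, I reckon this approach holds up."]
anonymous = "I reckon the method holds up, all things considered."

# Character 3-gram TF-IDF: a classic stylometric representation.
vec = TfidfVectorizer(analyzer="char", ngram_range=(3, 3))
profiles = vec.fit_transform(known + [anonymous])

# Similarity of the anonymous text to each known-author sample.
print(cosine_similarity(profiles[-1], profiles[:-1]))
```
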
- Adversarial Intent is a Latent Variable: Stateful Trust Inference for Securing Multimodal Agentic RAG
Inderjeet Singh, Vikas Pahuja, Aishvariya Priya Rathina Sabapathy, Chiara Picardi, Amit Giloni · Feb 24, 2026 · Citations: 0
Automatic Metrics
Current stateless defences for multimodal agentic RAG fail to detect adversarial strategies that distribute malicious semantics across retrieval, planning, and generation components.
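
Modeling intent as a latent variable naturally suggests a recursive Bayesian update across stages; the sketch below uses invented likelihoods to show how weak per-stage signals, each below any stateless threshold on its own, compound into strong posterior suspicion.

```python
def update_trust(prior_adv, lik_adv, lik_benign):
    """One Bayes update of P(adversarial) after a stage-level anomaly signal."""
    joint_adv = prior_adv * lik_adv
    joint_ben = (1 - prior_adv) * lik_benign
    return joint_adv / (joint_adv + joint_ben)

# Invented likelihoods for retrieval, planning, and generation signals: each is
# weak alone, but the posterior compounds across stages (0.095 -> 0.156 -> 0.329),
# which a stateless per-stage check would never see.
p = 0.05  # prior probability that the session is adversarial
for lik_adv, lik_benign in [(0.6, 0.3), (0.7, 0.4), (0.8, 0.3)]:
    p = update_trust(p, lik_adv, lik_benign)
    print(round(p, 3))
```
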
- AdapTools: Adaptive Tool-based Indirect Prompt Injection Attacks on Agentic LLMs
Che Wang, Jiaming Zhang, Ziqi Zhang, Zijie Wang, Yinghui Wang · Feb 24, 2026 · Citations: 0
Automatic Metrics
The integration of external data services (e.g., Model Context Protocol, MCP) has made large language model-based agents increasingly powerful for complex task execution.
- Weight-Space Detection of Backdoors in LoRA Adapters
David Puertolas Merenciano, Ekaterina Vasyagina, Raghav Dixit, Kevin Zhu, Ruizhe Li · Feb 16, 2026 · Citations: 0
Automatic Metrics
We evaluate the method on 500 LoRA adapters (400 clean, 100 poisoned) for Llama-3.2-3B, trained on instruction and reasoning datasets: Alpaca, Dolly, GSM8K, ARC-Challenge, SQuADv2, NaturalQuestions, HumanEval, and GLUE.
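
In the weight-space framing, each adapter's low-rank matrices become a feature vector for a supervised detector. The toy sketch below mirrors the 400/100 split with synthetic weights; the planted distribution shift is an assumption for illustration, not the paper's actual signal.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

def fake_adapter(poisoned, rank=4, dim=64):
    """Synthetic stand-in for a LoRA adapter's flattened A and B matrices."""
    w = rng.normal(size=rank * dim * 2)
    if poisoned:
        w = w + 0.3 * rng.normal(size=w.shape) + 0.2   # assumed weight-space shift
    return w

X = np.stack([fake_adapter(p) for p in [False] * 400 + [True] * 100])
y = np.array([0] * 400 + [1] * 100)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"held-out accuracy: {clf.score(X_te, y_te):.2f}")
```
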
- Randomized Masked Finetuning: An Efficient Way to Mitigate Memorization of PIIs in LLMs
Kunj Joshi, David A. Smith · Dec 2, 2025 · Citations: 0
Automatic Metrics
We present MaxTER, a Pareto-optimal evaluation framework for assessing privacy-utility tradeoffs, and compare RMFT against deduplication using the Area Under the Response Curve (AURC) metric.
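
The snippet doesn't define AURC precisely; treating it as the area under a utility-versus-intervention response curve, it reduces to a trapezoidal integral. The sweep values below are hypothetical.

```python
import numpy as np

def aurc(utility, sweep):
    """Trapezoidal area under a response curve; higher = more graceful decay."""
    return float(np.sum((sweep[1:] - sweep[:-1]) * (utility[1:] + utility[:-1]) / 2))

# Hypothetical sweep of privacy intervention strength vs. downstream task utility.
strength = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
print("RMFT  AURC:", aurc(np.array([0.82, 0.81, 0.79, 0.74, 0.65]), strength))
print("Dedup AURC:", aurc(np.array([0.82, 0.78, 0.71, 0.62, 0.50]), strength))
```
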
- DropVLA: An Action-Level Backdoor Attack on Vision--Language--Action Models
Zonghuan Xu, Xiang Zheng, Xingjun Ma, Yu-Gang Jiang · Oct 13, 2025 · Citations: 0
Automatic Metrics
The backdoor remains robust to moderate trigger variations and transfers across evaluation suites (96.27%, 99.09%), whereas a text-only trigger largely fails (0.72%).
- Multi-hop Deep Joint Source-Channel Coding with Deep Hash Distillation for Semantically Aligned Image Recovery
Didrik Bergström, Deniz Gündüz, Onur Günlü · Oct 8, 2025 · Citations: 0
Automatic Metrics
The abstract offers little detail on human feedback or evaluation protocols; treat this entry as adjacent methodological context.
- A Lightweight IDS for Early APT Detection Using a Novel Feature Selection Method
Bassam Noori Shaker, Bahaa Al-Musawi, Mohammed Falih Hassan · Jun 13, 2025 · Citations: 0
Automatic Metrics
Our proposed method reduces the SCVIC-APT-2021 feature set from 77 features to just four while maintaining consistent evaluation metrics for the system.
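
The paper's novel selection method is not reproduced here, but the workflow (scoring 77 features and keeping four) can be illustrated with a standard filter-style selector on synthetic data shaped like the task.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Synthetic stand-in for SCVIC-APT-2021: 77 flow features, few informative.
X, y = make_classification(n_samples=2000, n_features=77, n_informative=4,
                           n_redundant=8, random_state=0)

# Filter-style selection: keep the 4 features with highest mutual information.
selector = SelectKBest(mutual_info_classif, k=4).fit(X, y)
print("kept feature indices:", selector.get_support(indices=True))
```
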
- PII-Bench: Evaluating Query-Aware Privacy Protection Systems
Hao Shen, Zhouhong Gu, Haokai Hong, Weili Han · Feb 25, 2025 · Citations: 0
Automatic Metrics
To address this challenge, we propose a query-unrelated PII masking strategy and introduce PII-Bench, the first comprehensive evaluation framework for assessing privacy protection systems.
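
A query-unrelated masking strategy, as described, redacts every detected PII span regardless of what the query actually needs. A minimal regex-based sketch of that behavior follows; the patterns are illustrative only, and real systems pair them with NER models.

```python
import re

# Illustrative detectors only; production systems add NER-based recognizers.
PII_PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "SSN":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def mask_pii(text):
    """Replace every detected PII span with a typed placeholder."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(mask_pii("Reach Ana at ana.r@example.com or 555-867-5309."))
# -> Reach Ana at [EMAIL] or [PHONE].
```
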
- Topic-Based Watermarks for Large Language Models
Alexander Nemecek, Yuzhou Jiang, Erman Ayday · Apr 2, 2024 · Citations: 0
Automatic Metrics
The indistinguishability of large language model (LLM) output from human-authored content poses significant challenges, raising concerns about potential misuse of AI-generated text and its influence on future model training.