Jailbreak Foundry: From Papers to Runnable Attacks for Reproducible Benchmarking

Zhicheng Fang, Jingjie Zheng, Chenxu Fu, Wei Xu · Feb 27, 2026 · Citations: 0

Coding Llm As Judge Multi Agent Multilingual Red Team

Abstract

Jailbreak techniques for large language models (LLMs) evolve faster than benchmarks, making robustness estimates stale and difficult to compare across papers due to drift in datasets, harnesses, and judging protocols. We introduce JAILBREAK FOUNDRY (JBF), a system that addresses this gap via a multi-agent workflow to translate jailbreak papers into executable modules for immediate evaluation within a unified harness. JBF features three core components: (i) JBF-LIB for shared contracts and reusable utilities; (ii) JBF-FORGE for the multi-agent paper-to-module translation; and (iii) JBF-EVAL for standardizing evaluations. Across 30 reproduced attacks, JBF achieves high fidelity with a mean (reproduced-reported) attack success rate (ASR) deviation of +0.26 percentage points. By leveraging shared infrastructure, JBF reduces attack-specific implementation code by nearly half relative to original repositories and achieves an 82.5% mean reused-code ratio. This system enables a standardized AdvBench evaluation of all 30 attacks across 10 victim models using a consistent GPT-4o judge. By automating both attack integration and standardized evaluation, JBF offers a scalable solution for creating living benchmarks that keep pace with the rapidly shifting security landscape.

HFEPX Relevance Assessment

This paper has direct human-feedback and/or evaluation protocol signal and is likely useful for eval pipeline design.

Eval-Fit Score

67/100 • Medium

Useful as a secondary reference; validate protocol details against neighboring papers.

Human Feedback Signal

Detected

Evaluation Signal

Detected

HFEPX Fit

High-confidence candidate

If you are doing eval pipeline work, start here:

Human Eval Hub LLM-as-Judge Hub Pairwise Preference Hub Tool-Use Eval Hub

Human Data Lens

Uses human feedback: Yes
Feedback types: Red Team
Rater population: Unknown
Unit of annotation: Unknown
Expertise required: Coding, Multilingual
Extraction source: Runtime deterministic fallback

Evaluation Lens

Evaluation modes: Llm As Judge
Agentic eval: Multi Agent
Quality controls: Not reported
Confidence: 0.80
Flags: runtime_fallback_extraction

Protocol And Measurement Signals

Benchmarks / Datasets

AdvBenchJbf-Eval

Reported Metrics

success ratejailbreak success rate

Research Brief

Deterministic synthesis

Jailbreak techniques for large language models (LLMs) evolve faster than benchmarks, making robustness estimates stale and difficult to compare across papers due to drift in datasets, harnesses, and judging protocols. HFEPX signals include Red Team, Llm As Judge, Multi Agent with confidence 0.80. Updated from current HFEPX corpus.

Generated Mar 3, 2026, 8:35 PM · Grounded in abstract + metadata only

Key Takeaways

Jailbreak techniques for large language models (LLMs) evolve faster than benchmarks, making robustness estimates stale and difficult to compare across papers due to drift in…
We introduce JAILBREAK FOUNDRY (JBF), a system that addresses this gap via a multi-agent workflow to translate jailbreak papers into executable modules for immediate evaluation…

Researcher Actions

Compare its human-feedback setup against pairwise and rubric hubs.
Cross-check benchmark overlap: AdvBench, Jbf-Eval.
Validate metric comparability (success rate, jailbreak success rate).

Caveats

Generated from title, abstract, and extracted metadata only; full-paper implementation details are not parsed.
Extraction confidence is probabilistic and should be validated for critical decisions.

Recommended Queries

llm-as-judge calibration agent eval benchmark comparison inter-rater agreement adjudication

Research Summary

Contribution Summary

Jailbreak techniques for large language models (LLMs) evolve faster than benchmarks, making robustness estimates stale and difficult to compare across papers due to drift in datasets, harnesses, and judging protocols.
We introduce JAILBREAK FOUNDRY (JBF), a system that addresses this gap via a multi-agent workflow to translate jailbreak papers into executable modules for immediate evaluation within a unified harness.
JBF features three core components: (i) JBF-LIB for shared contracts and reusable utilities; (ii) JBF-FORGE for the multi-agent paper-to-module translation; and (iii) JBF-EVAL for standardizing evaluations.

Why It Matters For Eval

Jailbreak techniques for large language models (LLMs) evolve faster than benchmarks, making robustness estimates stale and difficult to compare across papers due to drift in datasets, harnesses, and judging protocols.
We introduce JAILBREAK FOUNDRY (JBF), a system that addresses this gap via a multi-agent workflow to translate jailbreak papers into executable modules for immediate evaluation within a unified harness.

Researcher Checklist

Pass: Human feedback protocol is explicit

Detected: Red Team
Pass: Evaluation mode is explicit

Detected: Llm As Judge
Gap: Quality control reporting appears

No calibration/adjudication/IAA control explicitly detected.
Pass: Benchmark or dataset anchors are present

Detected: AdvBench, Jbf-Eval
Pass: Metric reporting is present

Detected: success rate, jailbreak success rate

Related Papers

Papers are ranked by protocol overlap, extraction signal alignment, and semantic proximity.

EpidemIQs: Prompt-to-Paper LLM Agents for Epidemic Modeling and Analysis Protocol Overlap

Citations: 0 Relevance: 8.40 Shared tag: Llm As JudgeShared tag: Multi Agent
- Shared 2 HFEPX protocol tags
- Aligned evaluation mode
- Aligned agent-evaluation setup
- Shared metric mentions
Refusal Steering: Fine-grained Control over LLM Refusal Behaviour for Sensitive Topics Protocol Overlap

Citations: 0 Relevance: 7.90 Shared tag: Red TeamShared tag: Llm As Judge
- Shared 2 HFEPX protocol tags
- Aligned human feedback protocol
- Aligned evaluation mode
The Emergence of Lab-Driven Alignment Signatures: A Psychometric Framework for Auditing Latent Bias and Compounding Risk in Generative AI Protocol Overlap

Citations: 0 Relevance: 7.50 Shared tag: Llm As JudgeShared tag: Multi Agent
- Shared 2 HFEPX protocol tags
- Aligned evaluation mode
- Aligned agent-evaluation setup
World-Model-Augmented Web Agents with Action Correction Protocol Overlap

Citations: 0 Relevance: 7.50 Shared tag: Llm As JudgeShared tag: Multi Agent
- Shared 2 HFEPX protocol tags
- Aligned evaluation mode
- Aligned agent-evaluation setup
Helpful to a Fault: Measuring Illicit Assistance in Multi-Turn, Multilingual LLM Agents Protocol Overlap

Citations: 0 Relevance: 6.80 Shared tag: Red TeamShared tag: Multilingual
- Shared 2 HFEPX protocol tags
- Aligned human feedback protocol
IndicJR: A Judge-Free Benchmark of Jailbreak Robustness in South Asian Languages Protocol Overlap

Citations: 0 Relevance: 6.80 Shared tag: Red TeamShared tag: Multilingual
- Shared 2 HFEPX protocol tags
- Aligned human feedback protocol
Obscure but Effective: Classical Chinese Jailbreak Prompt Optimization via Bio-Inspired Search Protocol Overlap

Citations: 0 Relevance: 6.80 Shared tag: Red TeamShared tag: Multilingual
- Shared 2 HFEPX protocol tags
- Aligned human feedback protocol
Refusal Direction is Universal Across Safety-Aligned Languages Protocol Overlap

Citations: 0 Relevance: 6.80 Shared tag: Red TeamShared tag: Multilingual
- Shared 2 HFEPX protocol tags
- Aligned human feedback protocol
Estonian Native Large Language Model Benchmark Protocol Overlap

Citations: 0 Relevance: 6.50 Shared tag: Llm As JudgeShared tag: Multilingual
- Shared 2 HFEPX protocol tags
- Aligned evaluation mode
SAMAS: A Spectrum-Guided Multi-Agent System for Achieving Style Fidelity in Literary Translation Protocol Overlap

Citations: 0 Relevance: 6.40 Shared tag: MultilingualShared tag: Multi Agent
- Shared 2 HFEPX protocol tags
- Aligned agent-evaluation setup
MANATEE: Inference-Time Lightweight Diffusion Based Safety Defense for LLMs Protocol Overlap

Citations: 0 Relevance: 5.90 Shared tag: Red Team
- Shared HFEPX protocol tags
- Aligned human feedback protocol
- Shared metric mentions
Reasoning Up the Instruction Ladder for Controllable Language Models Protocol Overlap

Citations: 0 Relevance: 5.90 Shared tag: Red Team
- Shared HFEPX protocol tags
- Aligned human feedback protocol
- Shared metric mentions

Need human evaluators for your AI research? Scale annotation with expert AI Trainers.

Post a Job Get a Quote