What is the best open-source implementation of "Derail Yourself: Multi-turn LLM Jailbreak Attack through Self-discovered Clues"?

The best maintained implementation is renqibing/actorattack with 129 stars on GitHub. Confidence: high. Reproducibility: Limited.

Are there pretrained models available for "Derail Yourself: Multi-turn LLM Jailbreak Attack through Self-discovered Clues"?

No paper-linked pretrained models were found. One related Hugging Face model is listed as a curated match: garak-llm/roberta-large-snli_mnli_fever_anli_R1_R2_R3-nli, with 55,655 downloads.

What framework is used to implement "Derail Yourself: Multi-turn LLM Jailbreak Attack through Self-discovered Clues"?

No framework was detected for the primary implementation.

Derail Yourself: Multi-turn LLM Jailbreak Attack through Self-discovered Clues

How reproducible is "Derail Yourself: Multi-turn LLM Jailbreak Attack through Self-discovered Clues"?

Estimated time to first reproduction: a few hours. Risk flags: License metadata missing, No CI workflows detected. Start with renqibing/actorattack and validate setup instructions in README.

Published: Oct 1, 2024

Best maintained implementation now

Evidence: Direct

Domain fit: AI-core

Verified repos: 1

Top repo stars: 129

Core AI workload signals detected from paper context and implementation/artifact evidence.

Technical details

Canonical key: arxiv-2410.10700

Generated at: Apr 23, 2026, 12:48 PM

Artifact coverage: direct

PWC source used: Yes

Benchmarks: thin evidence

Results & Benchmarks

Freshness tier: cold

Direct + Inferred Evidence

Benchmark evidence drill-down

1 finding

Audit each benchmark finding before selecting an implementation path. Evidence refs map to the disclosure section below.

Task	Dataset	Metric	Value	Source	Evidence refs
Natural language processing	Crescendo	Llama-3	0.25	paper-derived	No explicit refs

The primary contribution of this paper is ActorAttack, a multi-turn LLM jailbreak attack that works through self-discovered clues.

Use This Implementation Because…

Confidence: high

renqibing/actorattack is the strongest maintained implementation based on ranking signals. Dependency/environment manifests are present.

Reproduction Risks

License metadata missing
No CI workflows detected
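
Both flags above can be re-checked against a local clone before committing to a reproduction. The following is a minimal sketch, assuming the flags reduce to file presence: a top-level LICENSE/LICENCE/COPYING file, and at least one YAML file under .github/workflows. The function name local_risk_flags and these naming conventions are assumptions for illustration, not the rules this page actually applies.

```python
from pathlib import Path

LICENSE_PREFIXES = ("LICENSE", "LICENCE", "COPYING")  # assumed naming conventions


def local_risk_flags(repo_dir):
    """Return file-presence risk flags for a local clone (heuristic only)."""
    root = Path(repo_dir)
    flags = []

    # A top-level file named LICENSE*, LICENCE*, or COPYING* counts as a license.
    has_license = any(
        p.is_file() and p.name.upper().startswith(LICENSE_PREFIXES)
        for p in root.iterdir()
    )
    if not has_license:
        flags.append("License metadata missing")

    # Any .yml/.yaml file under .github/workflows counts as a CI workflow.
    workflows = root / ".github" / "workflows"
    has_ci = workflows.is_dir() and any(
        p.suffix in {".yml", ".yaml"} for p in workflows.iterdir()
    )
    if not has_ci:
        flags.append("No CI workflows detected")

    return flags
```

After cloning, calling local_risk_flags("actorattack") would list whichever of the two flags still apply. A missing license file does not settle the licensing question, since a license may also be declared in the README or in package metadata.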

Evidence disclosure

Evidence graph: 4 refs, 4 links.

Utility signals: depth 95/100, grounding 95/100, status high.

Implementation Comparison

Top 3 paths

Compare maintenance quality, reproducibility coverage, and evidence confidence before choosing a reproduction baseline.

renqibing/actorattack

best maintained

Maintenance: Stale

Confidence: High

Reproducibility: Limited

Official implementation from Papers with Code · Repository link is mentioned in the paper metadata

Stars: 129
Last push: Feb 3, 2025 (445d ago)

Dependencies

Risk flags

No push in 12+ months
No CI pipeline detected
No tagged releases

cyberark/FuzzyAI

alternative

Maintenance: Recently updated

Confidence: Low

Reproducibility: Moderate

Matched via arXiv identifier search · Community adoption signal (1338 stars)

Stars: 1,338
Last push: Feb 6, 2026 (76d ago)

Dependencies

Risk flags

No CI pipeline detected
No tagged releases
No Docker setup

AIM-Intelligence/Automated-Multi-Turn-Jailbreaks

alternative

Maintenance: Recently updated

Confidence: Low

Reproducibility: Limited

Community adoption signal (131 stars)

Stars: 131
Last push: Dec 3, 2025 (142d ago)

Risk flags

No CI pipeline detected
No tagged releases
No Docker setup
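
The "Last push" ages and the "No push in 12+ months" flag on the cards above come down to date arithmetic. The following is a minimal sketch, assuming date-only inputs and a 365-day cutoff; the helper names days_since and is_stale and the cutoff value are assumptions, not this page's documented rule. Whole-date subtraction for a Feb 3, 2025 push checked on Apr 23, 2026 gives 444 days, one less than the 445 shown, presumably because the page counts from full timestamps.

```python
from datetime import date

STALE_AFTER_DAYS = 365  # assumed cutoff for "No push in 12+ months"


def days_since(last_push, checked):
    """Whole days between the last push date and the check date."""
    return (checked - last_push).days


def is_stale(last_push, checked, cutoff=STALE_AFTER_DAYS):
    """True when the last push is older than the cutoff."""
    return days_since(last_push, checked) > cutoff


checked = date(2026, 4, 23)  # "Last checked" date stated on this page
print(days_since(date(2025, 2, 3), checked), is_stale(date(2025, 2, 3), checked))
print(days_since(date(2026, 2, 6), checked), is_stale(date(2026, 2, 6), checked))
```

Under these assumptions only the first repository crosses the cutoff, which matches the flag appearing on renqibing/actorattack alone.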

Best implementation now

renqibing/actorattack

Confidence: High

Reproducibility: Limited

AI45Lab/ActorAttack

Stars: 129

Forks: 12

Last push: Feb 3, 2025

Official implementation from Papers with Code

Repository link is mentioned in the paper metadata

Community adoption signal (129 stars)

License: not detected

CI: not detected

Dependencies: manifest present

Docker: not detected

Selected renqibing/actorattack as the strongest maintained implementation for new work.
Includes dependency/environment manifest signals.
Repository activity is within the last 24 months.

Reproduction readiness

Setup Required

Time to first repro: a few hours

Last checked: Apr 23, 2026

Dependencies pinned, manual setup needed

· renqibing/actorattack has requirements.txt but requires manual environment setup.
· Last push was 445 days ago — expect possible dependency version conflicts.
· No Dockerfile — you will set up the environment manually.
· No CI pipeline — test coverage is unknown.
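
Because the last push is old, it helps to know which requirements are pinned to an exact version before installing. The following is a minimal sketch, assuming a plain requirements.txt with one requirement per line; split_requirements is a hypothetical helper, the package names in the sample are illustrative and not taken from the repository, and the parser skips option lines and does not handle URLs containing "#".

```python
import re

# Matches "name==version", optionally with extras such as name[extra]==version.
PIN = re.compile(r"^[A-Za-z0-9_.\-\[\],]+\s*==\s*[^\s;]+")


def split_requirements(text):
    """Split requirement lines into exactly pinned and loosely specified."""
    pinned, loose = [], []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments (simplification)
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        (pinned if PIN.match(line) else loose).append(line)
    return pinned, loose


# Illustrative input only; not the contents of the repository's file.
sample = "openai==1.3.0  # api client\nnumpy>=1.24\n-r other.txt\ntorch\n"
pinned, loose = split_requirements(sample)
print("pinned:", pinned)
print("loose:", loose)
```

Loosely specified entries are the likeliest source of version drift, so they are the ones to pin or test first in a fresh virtual environment.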

Quick start

git clone https://github.com/renqibing/actorattack.git
cd actorattack
pip install -r requirements.txt

Additional implementations

No additional verified repositories beyond the primary recommendation.

Possible but unverified matches (8)

These repositories had low-confidence matching signals and are hidden by default.

Showing top 6 by score. 2 additional low-confidence matches are hidden.

Repository	Confidence	Stars
cyberark/FuzzyAI	Low	1,338
AIM-Intelligence/Automated-Multi-Turn-Jailbreaks	Low	131
AI45Lab/ActorAttack	Low	129
ms0017/ActorAttackEval	Low	0
ThKrypt/FuzzyAI	Low	0
thomaschoi143/MCS-LVLM-Jailbreak-Project	Low	0

Hugging Face artifacts

No direct paper-linked artifacts were found. Showing strongest curated related artifacts for faster exploration.

Models

garak-llm/roberta-large-snli_mnli_fever_anli_R1_R2_R3-nli

Curated Related

Downloads: 55,655

Likes: 0

Datasets

Necent/llm-jailbreak-prompt-injection-dataset

Curated Related

Downloads: 495

Likes: 5

Updated: Apr 11, 2026

GloriaaaM/LLM-Agent-Harness-Survey

Curated Related

Downloads: 282

Likes: 4

Updated: Apr 15, 2026