NYU CTF Bench: A Scalable Open-Source Benchmark Dataset for Evaluating LLMs in Offensive Security

Q: What is the best open-source implementation of "NYU CTF Bench: A Scalable Open-Source Benchmark Dataset for Evaluating LLMs in Offensive Security"?

The best maintained implementation is nyu-llm-ctf/llm_ctf_automation with 149 stars on GitHub. Confidence: high. Reproducibility: Moderate.

Q: How reproducible is "NYU CTF Bench: A Scalable Open-Source Benchmark Dataset for Evaluating LLMs in Offensive Security"?

Estimated time to first reproduction: a few hours. Risk flags: No CI workflows detected. Start with nyu-llm-ctf/llm_ctf_automation and validate setup instructions in README.

Q: What framework is used to implement "NYU CTF Bench: A Scalable Open-Source Benchmark Dataset for Evaluating LLMs in Offensive Security"?

The primary implementation uses none.

Minghao Shao, Sofija Jancheska, Meet Udeshi, Brendan Dolan-Gavitt, Haoran Xi, Kimberly Milner, Boyuan Chen, Max Yin, Siddharth Garg, Prashanth Krishnamurthy, Farshad Khorrami, Ramesh Karri, Muhammad Shafique

Published: Jun 8, 2024

Best maintained implementation now

Evidence: Direct

Domain fit: AI-core

Verified repos: 3

Top repo stars: 149

Core AI workload signals detected from paper context and implementation/artifact evidence.

Framework: none

Time to first repro: a few hours

1 risk flag

arXiv PDF

Large Language Models (LLMs) are being deployed across various domains today. However, their capacity to solve Capture the Flag (CTF) challenges in cybersecurity has not been thoroughly evaluated. To address this, we develop a novel method to assess LLMs in solving CTF challenges by creating a scalable, open-source benchmark database specifically designed for these applications. This database includes metadata for LL ...

Read full abstract

M testing and adaptive learning, compiling a diverse range of CTF challenges from popular competitions. Utilizing the advanced function calling capabilities of LLMs, we build a fully automated system with an enhanced workflow and support for external tool calls. Our benchmark dataset and automated framework allow us to evaluate the performance of five LLMs, encompassing both black-box and open-source models. This work lays the foundation for future research into improving the efficiency of LLMs in interactive cybersecurity tasks and automated task planning. By providing a specialized benchmark, our project offers an ideal platform for developing, testing, and refining LLM-based approaches to vulnerability detection and resolution. Evaluating LLMs on these challenges and comparing with human performance yields insights into their potential for AI-driven cybersecurity solutions to perform real-world threat management. We make our benchmark dataset open source to public https://github.com/NYU-LLM-CTF/NYU_CTF_Bench along with our playground automated framework https://github.com/NYU-LLM-CTF/llm_ctf_automation.

Technical details

Canonical key: arxiv-2406.05590

Cache status: Fresh

Generated at: Jun 5, 2026, 4:20 PM

Artifact coverage: direct

HF provider: ok (token)

PWC source used: Yes

LLM status: not_generated

LLM model: n/a

LLM generated: Unknown

LLM content type: n/a

HF policy: hf-relevance-v27

implementation starting point

Benchmarks: missing

Time to repro: a few hours

1 risk flag

none

Results & Benchmarks

Freshness tier: hot

Direct + Inferred Evidence

No concrete benchmark grounding is available yet. Treat the page as context or an implementation starting point only.

Large Language Models (LLMs) are being deployed across various domains today.

Use This Implementation Because…

Confidence: high

nyu-llm-ctf/llm_ctf_automation is the strongest maintained implementation based on ranking signals. License is declared (MIT). Dependency/environment manifests are present.

Open nyu-llm-ctf/llm_ctf_automation

Reproduction Risks

No CI workflows detected

Evidence disclosure

Evidence graph: 3 refs, 3 links.

Utility signals: depth 35/100, grounding 75/100, status low.

Implementation Comparison

Top 3 paths

Compare maintenance quality, reproducibility coverage, and evidence confidence before choosing a reproduction baseline.

nyu-llm-ctf/llm_ctf_automation

best maintained

Maintenance: Stale risk

Confidence: High

Reproducibility: Moderate

Official implementation from Papers with Code · Repository link is mentioned in the paper metadata

Stars: 149
Last push: Oct 25, 2025 (223d ago)

ReleasesDependencies

Risk flags

No CI pipeline detected
No Docker setup

nyu-llm-ctf/nyu_ctf_bench

alternative

Maintenance: Stale risk

Confidence: High

Reproducibility: Limited

Official implementation from Papers with Code · Repository link is mentioned in the paper metadata

Stars: 153
Last push: Sep 22, 2025 (256d ago)

Releases

Risk flags

No CI pipeline detected
No Docker setup
Dependency manifest missing

nyu-llm-ctf/llm_ctf_database

historical official

Maintenance: Stale risk

Confidence: High

Reproducibility: Limited

Official implementation from Papers with Code · Repository link is mentioned in the paper metadata

Stars: 153
Last push: Sep 22, 2025 (256d ago)

Releases

Risk flags

No CI pipeline detected
No Docker setup
Dependency manifest missing

Best implementation now

nyu-llm-ctf/llm_ctf_automation

Confidence: High

Reproducibility: Moderate

The D-CIPHER and NYU CTF baseline LLM Agents built for NYU CTF Bench

Stars: 149

Forks: 32

Last push: Oct 25, 2025

License: MIT

Official implementation from Papers with Code

Repository link is mentioned in the paper metadata

Strong overlap with paper title keywords

Community adoption signal (149 stars)

License ✓

CI –

Deps ✓

Docker –

Selected nyu-llm-ctf/llm_ctf_automation as the strongest maintained implementation for new work.
Includes dependency/environment manifest signals.
Repository activity is within the last 24 months.
Official repository is preserved separately as historical context.

Historical official implementation

Preserved for provenance. Not recommended as the default path for new builds.

nyu-llm-ctf/llm_ctf_database

Stars: 153

Last push: Sep 22, 2025

Reproduction readiness

Setup Required

Time to first repro: hours

Last checked: Jun 5, 2026

Dependencies pinned, manual setup needed

· nyu-llm-ctf/llm_ctf_automation has pyproject.toml but requires manual environment setup.
· Last push was 223 days ago — expect possible dependency version conflicts.
· No Dockerfile — you will set up the environment manually.
· No CI pipeline — test coverage is unknown.

Open nyu-llm-ctf/llm_ctf_automation

Quick start

git clone https://github.com/nyu-llm-ctf/llm_ctf_automation.git
pip install -e .

No benchmark numbers could be verified. You will not be able to validate reproduction correctness against published numbers.

Additional implementations

Official

nyu-llm-ctf/nyu_ctf_bench
Confidence: High

NYU-LLM-CTF/NYU_CTF_Bench

Stars: 153

Forks: 28

Last push: Sep 22, 2025

License: GPL-2.0

Community

No additional community repositories detected yet.

Possible but unverified matches (4)

These repositories had low-confidence matching signals and are hidden by default.

princeton-nlp/swe-agent

Confidence: Low

Stars: 19,428
swe-agent/swe-agent

Confidence: Low

Stars: 19,428
anshug/ai-security-evals

Confidence: Low

Stars: 2
flacman/IH3A

Confidence: Low

Stars: 0

Hugging Face artifacts

No direct paper-linked artifacts were found. Showing strongest curated related artifacts for faster exploration.

Models

No trustworthy model matches right now.

Search models on Hugging Face

Datasets

nepfaff/scalable-real2sim

Curated Related

Downloads: 111

Likes: 0

Updated: Mar 14, 2026

Broaden dataset search

Transformer Agentic tool use dataset Transformer Natural Language Processing dataset Agentic tool use dataset

Spaces

No trustworthy demo spaces right now.

Search spaces on Hugging Face

Explore on Hugging Face

Search models Search datasets Search spaces

Research context

Tasks

Agentic tool use

Methods

Transformer

Domains

Natural Language Processing, Large Language Models, AI Agents

Evaluation & Human Feedback Data

Open this paper in HFEPX to review benchmark signals, evaluation modes, and human-feedback protocol context.

Open in HFEPX

Explore Similar Papers

Jump to Paper2Code search queries derived from this paper's research context.

Agentic tool use Transformer Natural Language Processing Large Language Models AI Agents

Need human evaluators for your AI research? Scale annotation with expert AI Trainers.

Post a Job Get a Quote