Official implementation from Papers with Code · Repository link is mentioned in the paper metadata
- Stars
- 149
- Last push
- Oct 25, 2025 (223d ago)
Risk flags
- No CI pipeline detected
- No Docker setup
Minghao Shao, Sofija Jancheska, Meet Udeshi, Brendan Dolan-Gavitt, Haoran Xi, Kimberly Milner, Boyuan Chen, Max Yin, Siddharth Garg, Prashanth Krishnamurthy, Farshad Khorrami, Ramesh Karri, Muhammad Shafique
Core AI workload signals detected from paper context and implementation/artifact evidence.
Large Language Models (LLMs) are being deployed across various domains today. However, their capacity to solve Capture the Flag (CTF) challenges in cybersecurity has not been thoroughly evaluated. To address this, we develop a novel method to assess LLMs in solving CTF challenges by creating a scalable, open-source benchmark database specifically designed for these applications. This database includes metadata for LL ...
M testing and adaptive learning, compiling a diverse range of CTF challenges from popular competitions. Utilizing the advanced function calling capabilities of LLMs, we build a fully automated system with an enhanced workflow and support for external tool calls. Our benchmark dataset and automated framework allow us to evaluate the performance of five LLMs, encompassing both black-box and open-source models. This work lays the foundation for future research into improving the efficiency of LLMs in interactive cybersecurity tasks and automated task planning. By providing a specialized benchmark, our project offers an ideal platform for developing, testing, and refining LLM-based approaches to vulnerability detection and resolution. Evaluating LLMs on these challenges and comparing with human performance yields insights into their potential for AI-driven cybersecurity solutions to perform real-world threat management. We make our benchmark dataset open source to public https://github.com/NYU-LLM-CTF/NYU_CTF_Bench along with our playground automated framework https://github.com/NYU-LLM-CTF/llm_ctf_automation.
No concrete benchmark grounding is available yet. Treat the page as context or an implementation starting point only.
Large Language Models (LLMs) are being deployed across various domains today.
nyu-llm-ctf/llm_ctf_automation is the strongest maintained implementation based on ranking signals. License is declared (MIT). Dependency/environment manifests are present.
Open nyu-llm-ctf/llm_ctf_automationEvidence graph: 3 refs, 3 links.
Utility signals: depth 35/100, grounding 75/100, status low.
Compare maintenance quality, reproducibility coverage, and evidence confidence before choosing a reproduction baseline.
Official implementation from Papers with Code · Repository link is mentioned in the paper metadata
Risk flags
Official implementation from Papers with Code · Repository link is mentioned in the paper metadata
Risk flags
Official implementation from Papers with Code · Repository link is mentioned in the paper metadata
Risk flags
The D-CIPHER and NYU CTF baseline LLM Agents built for NYU CTF Bench
Preserved for provenance. Not recommended as the default path for new builds.
Dependencies pinned, manual setup needed
Quick start
git clone https://github.com/nyu-llm-ctf/llm_ctf_automation.git
pip install -e . No benchmark numbers could be verified. You will not be able to validate reproduction correctness against published numbers.
NYU-LLM-CTF/NYU_CTF_Bench
No additional community repositories detected yet.
These repositories had low-confidence matching signals and are hidden by default.
No direct paper-linked artifacts were found. Showing strongest curated related artifacts for faster exploration.
No trustworthy model matches right now.
Search models on Hugging FaceNo trustworthy demo spaces right now.
Search spaces on Hugging FaceTasks
Agentic tool use
Methods
Transformer
Domains
Natural Language Processing, Large Language Models, AI Agents
Evaluation & Human Feedback Data
Open this paper in HFEPX to review benchmark signals, evaluation modes, and human-feedback protocol context.
Open in HFEPXExplore Similar Papers
Jump to Paper2Code search queries derived from this paper's research context.
Need human evaluators for your AI research? Scale annotation with expert AI Trainers.