AI Red Teaming as an Evaluation Data Problem

What is red teaming in AI, really?

If you already run LLM evaluations, the useful answer is not “trying to jailbreak a model.” The useful answer is structured adversarial discovery that produces evidence. AI red teaming matters when it helps you design better evaluation data, refusal tests, harm rubrics, and release decisions. It matters less when it becomes a gallery of viral prompts. [1]

The distinction matters because the field uses several related terms loosely. Traditional cyber red teaming is adversary emulation against an enterprise’s security posture. AI red-teaming is a structured effort to find flaws and vulnerabilities in AI systems, often with developer collaboration. An eval is narrower: give the system an input, apply grading logic, and measure whether it did the thing you care about. Safety testing and pre-deployment TEVV are broader umbrellas that can include red teaming, field testing, public feedback, and formal evaluations. [1]

How AI red teaming differs from cyber red teaming and standard evals.

Practice	Unit of analysis	Goal	Typical output	Wrong inference to avoid
Cyber red team	Organization, network, product, or security program.	Emulate adversaries against a security posture.	Attack path, exploit narrative, remediation list.	A cyber exercise is not automatically an AI evaluation.
AI red teaming	AI model, application, scaffold, or deployment surface.	Discover flaws, vulnerabilities, and harmful behaviors.	Findings, prompts, transcripts, severity notes.	A viral jailbreak is not automatically reliable measurement.
Model evaluation	Defined input, system setup, grader, and metric.	Measure whether a target behavior occurs under a stated harness.	Scores, labels, confidence intervals, error slices.	An eval score only supports the claim its harness can support.
Safety testing and TEVV	Whole lifecycle of evidence and assurance activities.	Combine tests, field evidence, risk thresholds, and review.	Risk register, acceptance evidence, monitoring plan.	A red-team round alone does not prove a system is safe.

Definitions synthesized from NIST and model-evaluation guidance.

That makes demo jailbreaks a weak unit of governance. NIST warns that jailbreak and prompt-engineering tests may not systematically assess validity or reliability. Microsoft frames red teaming as a way to expose harms and understand a risk surface, not as a replacement for systematic measurement. AISI makes the same point from the evaluator side: exploratory tests can signpost concerns, but stronger claims require better elicitation, grading, and mapping between results and risk thresholds. [2]

The practical goal is simple: turn discovery into cases you can replay, label, audit, and route.

Threat model first, jailbreak gallery last

If a red-team program starts by asking people to “go break the model,” it will usually produce colorful examples and weak measurement. Reverse the order. Start by stating the claim the evidence should support: capability elicitation, safeguard robustness, or comparison between systems under a shared setup. Then define the harness, tools, and budget that make the claim meaningful.

OpenAI’s third-party evaluation guidance is direct on this point: scores are only interpretable if the report describes what claim the setup supports, how the system was elicited, and what validity checks were performed. [3]

A threat model for AI red teaming should name the actor, target behavior, attack surface, and downstream harm. AISI argues that evaluations should align to explicit risk models and cover relevant paths to harm. Microsoft describes an internal ontology that tracks actors, tactics, weaknesses, and downstream impacts because raw findings are too messy to reason about without a shared structure. [4]

For most teams, “attack surface” should not stop at the prompt box. If the deployed system has retrieval, tools, long context, multimodal inputs, connectors, or agent scaffolding, those surfaces belong in scope. The harness changes what the system can do. OpenAI emphasizes that surrounding scaffolds can materially change performance, especially for agentic systems. Microsoft similarly recommends testing both the base model and application layer, ideally through the production UI when possible. [3][5]

This is where prompt-only theater starts: testing user text against a stripped-down chat endpoint, then treating the result as evidence about a deployed product. That can miss indirect prompt injection, unsafe tool use, stateful multi-turn escalation, retrieval-dependent failures, and environment-specific behavior. The problem is not that prompt-only tests are useless. The problem is treating them as if they measured the whole system.

Five-step flow from threat model to probes to findings to adjudicated cases to evaluation or training routing. — Red teaming becomes useful when findings are routed through an evaluation-data layer. Workflow synthesized from NIST AI RMF, OpenAI evaluation guidance, AISI methodology, and Microsoft planning guidance.

Sampling, severity, and adjudication

Sampling is where red teaming becomes evaluation design. Google recommends diverse and representative inputs across product policies, use cases, failure modes, lexical variety, and semantic variety. Microsoft recommends starting with open-ended testing to discover harms, then moving into guided rounds based on an evolving harm list. In practice, a sample should not be a public jailbreak list plus a few improvised prompts. It should be a planned slice across harm categories, user intents, surfaces, languages, and plausible elicitation strategies. [6]

Severity also needs structure before the program scales. NIST frames risk as a combination of likelihood and consequence. For red-team labeling, that is a starting point, not a finished rubric. Most teams also need dimensions such as harm type, actionability, reproducibility, realism, and whether the behavior persists after ordinary safeguards. This rubric is editorial synthesis grounded in NIST risk framing and evaluation practice, not a quoted standard. [2]

Adjudication is where teams often fail quietly. Google notes that humans may annotate troubling content differently and recommends clear guidelines or templates for raters. OpenAI recommends clear rubrics, example score levels, pass-fail thresholds, and consensus aggregation when multiple reviewers are used. For high-severity or policy-ambiguous cases, treat human disagreement as a signal about the measurement system, not just annotator noise. [6][7]

A useful operating rule follows: if a finding is high severity, policy-ambiguous, or domain-specialized, escalate it from single-review labeling to expert adjudication. If reviewers disagree on whether the model violated policy, the case should usually drive rubric revision before it drives post-training. That recommendation is synthesis, but it follows from the cited guidance on rubric clarity, calibration, and human review.

Split human and automated work on purpose

Automation helps most when the job is breadth, refresh, and weak-signal detection. Anthropic’s model-written evaluations work showed that models can help generate many relevant eval items quickly, with high crowdworker agreement on labels in that setting. AISI argues that automated capability evaluations are scalable enough to cover a large part of a concerning capability surface. Google also recommends starting with seed examples and expanding them synthetically when existing datasets are insufficient. [8][4][6]

The same sources also define the limits. AISI says automated testing does not mirror real-world usage and should not, on its own, support strong conclusions. Google warns that classifier accuracy can be low for loosely defined constructs. OpenAI says there is more to evals than scores and recommends calibrating automated scoring against human judgment. [4][6][7]

Automated probes are strong at three things. First, they improve coverage by generating more lexical and semantic variants than a small human team can produce. Second, they make regression testing cheap once a case becomes a reusable eval. Third, they let teams attack a system at scale with repeated sampling, which matters because some jailbreak classes improve with more attempts. Anthropic’s many-shot work is one example of long context creating new failure modes. [9]

Tooling now reflects this division of labor. Microsoft PyRIT is a model-agnostic framework for probing multimodal systems and reusing composable red-team building blocks. AISI Inspect supports multi-turn dialog, agent scaffolds, model grading, and detailed logs. Those tools are useful because red-team findings need to become analyzable evaluation artifacts, not loose notes. [10][11]

Humans remain necessary where the question is not only “did policy fail?” but “what actually matters here?” Microsoft is direct: medicine, cybersecurity, CBRN, cross-cultural harms, and psychosocial harms require subject-matter judgment. AISI similarly describes expert red teaming as more naturalistic and open-ended than automated tests. [12][4]

LLM judges need the same skepticism. OpenAI recommends pairwise or pass-fail formats, clear rubrics, and validation against human labels because judge models can be biased by answer order and response length. Use LLM judges like measurement instruments: calibrate them, check for drift, and audit them periodically. Do not give them final authority on specialist harms or ambiguous policy boundaries. [7]

Where automation helps, and where humans still matter.

Method	Best use	Main advantage	Main blind spot	Human review needed?
Automated probes	Broad repeated attempts across known tactics.	Scales coverage and refresh cadence.	Can miss realistic product context and domain harm.	Yes, for severe or ambiguous cases.
Model-generated variants	Expanding seed examples across wording and semantics.	Finds lexical and semantic neighbors quickly.	Can overfit to easy variations.	Yes, for label calibration and novelty checks.
Classifiers	Pre-screening large output volumes.	Cheap triage and monitoring.	Low accuracy on loose constructs.	Yes, for policy and severity decisions.
LLM judges	Pass-fail or pairwise scoring with clear rubrics.	Fast reusable scoring when calibrated.	Order, length, style, and domain bias.	Yes, with periodic human audits.
Scripted regression suites	Rerunning known cases after model or policy changes.	Stable replay and diffing.	Measures yesterday's failure modes.	Yes, when regressions are high risk.
Human domain experts	Medicine, security, CBRN, cultural, legal, or psychosocial harms.	Context and consequence judgment.	Limited throughput and disagreement without rubrics.	They are the review path.

Based on AISI tiering, Google adversarial testing, OpenAI eval practices, Anthropic model-written evals, and Microsoft red teaming lessons.

A good operational split is simple. Let automation generate candidate attacks, expand seed sets, pre-label obvious cases, cluster duplicates, and rerun regressions. Let humans define the harm list, review uncertain or severe cases, audit judge quality, and interpret what the system failure means for the product or model. That division is synthesis grounded in the sources above.

Treat findings as data objects, not anecdotes

If red-team outputs need to become reusable eval data, capture more than the prompt and final answer. Microsoft recommends recording at least the date surfaced, unique identifier, input prompt, and output details. Google recommends annotating outputs into failure modes and harms. Inspect’s log model shows what a more mature representation can preserve: input, sample metadata, scores, errors, messages, and event traces at multiple levels of granularity. [5][6][11]

For applied teams, the minimal useful schema is usually larger. In addition to prompt and response, keep model version, system prompt version or hash, tool configuration, retrieval context or references, attack method, policy category, refusal status, reviewer label, severity label, rationale, adjudication status, case cluster, and routing status. This exact field list is synthesis, but it is a practical extension of current guidance on structured capture, metadata, and scoring.

Duplicate handling matters more than most teams expect. Google recommends avoiding duplication and noisy multi-label examples when generating adversarial datasets. Without deduplication, “top failure modes” often collapse into a few viral attack templates, which then bias both mitigation work and training data. Keep a notion of canonical case versus paraphrase cluster, and report both raw count and unique attack-family count. The canonical-versus-cluster recommendation is synthesis, but the need for uniqueness and careful composition comes directly from Google’s guidance. [6]

A reusable red-team case schema for evaluation and post-training loops.

Field	Why it matters	Training-safe?	Holdout-safe?	Notes
Case ID	Keeps review, reruns, and remediation traceable.	Yes	Yes	Stable IDs should survive paraphrase clustering.
Prompt/input	Defines the elicitation artifact.	Sometimes	Yes	Exact holdout prompts should not later become training examples.
Model output	Captures the observed behavior.	Sometimes	Yes	Include refusals, partial completions, and tool outputs.
Model version	Prevents false comparisons across changing systems.	Yes	Yes	Record model, checkpoint, and relevant release channel.
System prompt hash	Links behavior to instructions without exposing secrets.	Yes	Yes	Use hashes or controlled references when prompts are sensitive.
Tool/retrieval context	Shows what the deployed harness made available.	Sometimes	Yes	Critical for RAG, tool use, long context, and agent tests.
Attack method	Supports coverage and deduplication.	Yes	Yes	Examples include direct jailbreak, indirect injection, role play, or multi-turn escalation.
Policy category	Connects the finding to the rubric.	Yes	Yes	Revise categories when adjudication exposes ambiguity.
Refusal status	Separates unsafe compliance from over-refusal.	Yes	Yes	Do not collapse all refusals into success.
Reviewer label	Defines the human or calibrated judge decision.	Yes	Yes	Track reviewer type and calibration state.
Severity	Prioritizes remediation and release gates.	Yes	Yes	Severity dimensions are synthesis, not a universal standard.
Rationale	Makes later audits possible.	Yes	Yes	Short rationales are better than opaque class labels.
Adjudication status	Separates single-review labels from resolved cases.	Yes	Yes	High-severity and disputed cases need escalation.
Cluster ID	Prevents duplicate prompts from dominating reports.	Yes	Yes	Report raw count and unique attack-family count separately.
Routing status	Prevents leakage between training and measurement.	Yes, when routed to training	Yes, when blocked as holdout	Possible states include training candidate, eval case, blocked holdout, rubric revision, or release evidence.

Case ID

Why it matters: Keeps review, reruns, and remediation traceable.
Training-safe?: Yes
Holdout-safe?: Yes
Notes: Stable IDs should survive paraphrase clustering.

Prompt/input

Why it matters: Defines the elicitation artifact.
Training-safe?: Sometimes
Holdout-safe?: Yes
Notes: Exact holdout prompts should not later become training examples.

Model output

Why it matters: Captures the observed behavior.
Training-safe?: Sometimes
Holdout-safe?: Yes
Notes: Include refusals, partial completions, and tool outputs.

Model version

Why it matters: Prevents false comparisons across changing systems.
Training-safe?: Yes
Holdout-safe?: Yes
Notes: Record model, checkpoint, and relevant release channel.

System prompt hash

Why it matters: Links behavior to instructions without exposing secrets.
Training-safe?: Yes
Holdout-safe?: Yes
Notes: Use hashes or controlled references when prompts are sensitive.

Tool/retrieval context

Why it matters: Shows what the deployed harness made available.
Training-safe?: Sometimes
Holdout-safe?: Yes
Notes: Critical for RAG, tool use, long context, and agent tests.

Attack method

Why it matters: Supports coverage and deduplication.
Training-safe?: Yes
Holdout-safe?: Yes
Notes: Examples include direct jailbreak, indirect injection, role play, or multi-turn escalation.

Policy category

Why it matters: Connects the finding to the rubric.
Training-safe?: Yes
Holdout-safe?: Yes
Notes: Revise categories when adjudication exposes ambiguity.

Refusal status

Why it matters: Separates unsafe compliance from over-refusal.
Training-safe?: Yes
Holdout-safe?: Yes
Notes: Do not collapse all refusals into success.

Reviewer label

Why it matters: Defines the human or calibrated judge decision.
Training-safe?: Yes
Holdout-safe?: Yes
Notes: Track reviewer type and calibration state.

Severity

Why it matters: Prioritizes remediation and release gates.
Training-safe?: Yes
Holdout-safe?: Yes
Notes: Severity dimensions are synthesis, not a universal standard.

Rationale

Why it matters: Makes later audits possible.
Training-safe?: Yes
Holdout-safe?: Yes
Notes: Short rationales are better than opaque class labels.

Adjudication status

Why it matters: Separates single-review labels from resolved cases.
Training-safe?: Yes
Holdout-safe?: Yes
Notes: High-severity and disputed cases need escalation.

Cluster ID

Why it matters: Prevents duplicate prompts from dominating reports.
Training-safe?: Yes
Holdout-safe?: Yes
Notes: Report raw count and unique attack-family count separately.

Routing status

Why it matters: Prevents leakage between training and measurement.
Training-safe?: Yes, when routed to training
Holdout-safe?: Yes, when blocked as holdout
Notes: Possible states include training candidate, eval case, blocked holdout, rubric revision, or release evidence.

Field list combines tool docs and operational guidance; training/holdout routing is editorial synthesis.

Decide what each finding becomes before you train on it

This is the core operational question, and the answer should be written down before the first post-training meeting.

A red-team case can become at least five different things: a training example, a reusable eval case, a blocked holdout, a policy or rubric revision, or release-gate evidence. Current primary sources support the need for these separate lanes, even though they do not package them into one standard workflow. OpenAI warns not to leak eval data into reinforcement fine-tuning, describes production-derived evaluations that are refreshed periodically, and emphasizes validity hazards such as contamination and reward hacking. AISI similarly argues for predefined thresholds and repeated testing over the model lifecycle. [13][14][3][4]

A case is a reasonable training candidate when the label is stable, the failure is representative of realistic use, the policy boundary is clear, and the example is not reserved for future measurement. Anthropic’s earlier red-teaming work found that using red-team data in safety methods reduced susceptibility to the attack corpus studied. That is why teams are tempted to train on everything. It is also why leakage control matters: if those same cases are later used as evidence of progress, the measurement is compromised. [15]

Keep a case out of training when it is needed as a holdout, when it resembles a public benchmark or likely future benchmark, when the label is disputed, or when the exploit is so case-specific that training on it would mostly teach the model to memorize a patch. Anthropic’s BrowseComp contamination note shows how public leakage can inflate results, and OpenAI’s SWE-bench Verified analysis shows the same failure pattern at benchmark scale. In both cases, the lesson is the same: the same artifact cannot be both training material and honest measurement evidence. [16][17]

Practical routing rule: if a case exposed a novel failure mode, block it into a holdout bucket first. Only after you refresh the holdout and prove the fix on new cases should you consider moving variants of that failure pattern into training. That rule is synthesis, but it respects current leakage guidance from provider and evaluator sources.

A small-team operating model that does not pretend to guarantee safety

For a small AI team, the goal is not “cover everything.” The goal is to create a loop that produces evidence you will still trust after the model changes.

Start with one concrete deployment-relevant threat model, one narrow product surface, and one manual discovery round. Use diverse testers if possible, but at minimum include someone who understands the domain harm and someone who can think adversarially about the system interface. Microsoft recommends diverse testers and explicit assignments to harms or features; Google recommends starting from seed examples and expanding them carefully; OpenAI recommends building evals early, logging everything, and calibrating automation with human labels. [5][6][7]

Then freeze the first tranche of high-confidence findings into a held-out adversarial eval set. Do not fine-tune on those exact cases. Build simple regression runs around them, even if the grader is pass-fail with human audit. Once that works, add a second lane for training candidates and keep the split clean. OpenAI’s evaluation guidance repeatedly emphasizes held-out data, non-overlap, and separate training sets for post-training loops. [13][7]

After that, automation becomes worth the effort. Use model-generated expansions, classifiers, or LLM judges to widen coverage and reduce review cost, but only in areas where you have established what a correct label means. AISI’s tiered approach is a useful template: let lighter automated tests find concerns quickly, then escalate to bespoke elicitation and expert review when the signal matters. [4]

Readiness check

Minimum viable operating loop

A practical start for small teams that need reusable evidence, not a safety guarantee.

Define one deployment-relevant threat model Name the actor, surface, target behavior, and downstream harm.
Run one manual discovery round Use testers who understand the domain harm and the system interface.
Log prompt, output, context, and version metadata Capture enough provenance to replay and audit the case.
Adjudicate severe or ambiguous findings Escalate beyond single-review labels when policy or domain judgment is uncertain.
Split holdouts before training Do not train on the exact cases you plan to use as progress evidence.
Automate only after labels are calibrated Use judges, classifiers, and synthetic expansions where human labels define the target.

None of this proves a system is safe. That is the wrong promise. A good red-team loop gives you cases you can replay, labels you can defend, holdouts you did not train on, and enough structure to tell whether a mitigation changed the model or only patched a demo prompt.

OpenTrain can help teams hire red-teamers, domain reviewers, and evaluation specialists for that work. Use the LLM judge reliability reference for evaluator calibration, the RLAIF vs RLHF guide for automation boundaries, the PRM vs ORM reference for measurement-target design, and the RLHF scoping guide for review-loop planning. When the bottleneck is staffing the review loop, post a job.