Safety Evaluations
This project focused on proactively identifying and mitigating safety risks in a large language model through adversarial testing. The scope was to challenge the model's safety filters across a wide spectrum of potential harms and verify that the model behaved responsibly. My primary responsibility was Adversarial Prompting (Red Teaming): crafting prompts designed to "jailbreak" the model, that is, to get it to bypass its safety protocols. I then performed detailed Harm Classification, evaluating and labeling each model output against a specific safety policy. Quality was measured by strict Policy Adherence, consistency in applying the safety taxonomy, and the clarity of my written justification for each evaluation.