
Tag: Red Team

Human red teaming is used to probe model failures and safety weaknesses.

Papers in tag: 21

Research Utility Snapshot

Evaluation Modes

  • Automatic Metrics (18)
  • Simulation Env (2)

Human Feedback Types

  • Red Team (20)
  • Pairwise Preference (1)
  • Rubric Rating (1)

Required Expertise

  • General (14)
  • Multilingual (3)
  • Law (2)
Alignment-Weighted DPO: A principled reasoning approach to improve safety alignment

Mengxuan Hu, Vivek V. Datla, Anoop Kumar, Zihan Guan, Sheng Li, Alfy Samuel · Feb 24, 2026 · Citations: 0

Pairwise Preference Red Team Automatic Metrics General
  • Recent advances in alignment techniques such as Supervised Fine-Tuning (SFT), Reinforcement Learning from Human Feedback (RLHF), and Direct Preference Optimization (DPO) have improved the safety of large language models (LLMs).
  • We construct and release a novel Chain-of-Thought (CoT) fine-tuning dataset that includes both utility-oriented and safety-critical prompts with step-by-step rationales.
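The summary does not spell out how Alignment-Weighted DPO weights its examples, so the following is only a rough illustration of where such a weight could enter: a minimal sketch of the standard DPO objective with a hypothetical per-example alignment weight w.

    # Minimal sketch of a DPO-style objective with a hypothetical per-example
    # alignment weight w. Standard DPO is recovered when w == 1; the paper's
    # actual weighting scheme is not given in this summary.
    import torch
    import torch.nn.functional as F

    def weighted_dpo_loss(logp_chosen, logp_rejected,
                          ref_logp_chosen, ref_logp_rejected,
                          w, beta=0.1):
        # Sequence log-probabilities under the policy and the frozen reference.
        margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
        return -(w * F.logsigmoid(beta * margin)).mean()
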
Assessing Risks of Large Language Models in Mental Health Support: A Framework for Automated Clinical AI Red Teaming

Ian Steenstra, Paola Pedrelli, Weiyan Shi, Stacy Marsella, Timothy W. Bickmore · Feb 23, 2026 · Citations: 0

Red Team Simulation Env Medicine
  • Large Language Models (LLMs) are increasingly utilized for mental health support; however, current safety benchmarks often fail to detect the complex, longitudinal risks inherent in therapeutic dialogue.
  • We introduce an evaluation framework that pairs AI psychotherapists with simulated patient agents equipped with dynamic cognitive-affective models and assesses therapy session simulations against a comprehensive quality-of-care and risk ontology.
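Structurally, the session-level evaluation described above reduces to a turn loop between therapist and simulated patient, scored after the fact. The callables below are placeholders for the paper's AI psychotherapist, cognitive-affective patient agent, and risk scorer, not its actual API.

    # Structural sketch only; the three callables are stand-ins for the
    # paper's therapist model, patient agent, and risk-ontology scorer.
    from typing import Callable, List, Tuple

    Turn = Tuple[str, str]  # (speaker, utterance)

    def run_session(therapist: Callable[[List[Turn]], str],
                    patient: Callable[[List[Turn]], str],
                    score: Callable[[List[Turn]], dict],
                    max_turns: int = 20) -> dict:
        transcript: List[Turn] = []
        for _ in range(max_turns):
            transcript.append(("therapist", therapist(transcript)))
            transcript.append(("patient", patient(transcript)))
        return score(transcript)  # e.g., flags against the risk ontology
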
FENCE: A Financial and Multimodal Jailbreak Detection Dataset

Mirae Kim, Seonghun Jeong, Youngjun Kwak · Feb 20, 2026 · Citations: 0

Red Team Automatic Metrics General
  • A baseline detector trained on FENCE achieves 99 percent in-distribution accuracy and maintains strong performance on external benchmarks, underscoring the dataset's robustness for training reliable detection models.
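The summary does not say what architecture the baseline detector uses (and FENCE is multimodal), so as a stand-in only, here is a minimal text-side detector sketch using a generic classifier.

    # Illustrative stand-in for a jailbreak detector: logistic regression
    # over TF-IDF features. FENCE's actual baseline is not described here.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    detector = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                             LogisticRegression(max_iter=1000))
    # detector.fit(train_texts, train_labels)   # labels: 1 = jailbreak, 0 = benign
    # acc = detector.score(test_texts, test_labels)
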
IndicJR: A Judge-Free Benchmark of Jailbreak Robustness in South Asian Languages

Priyaranjan Pattnayak, Sanchari Chowdhuri · Feb 18, 2026 · Citations: 0

Red Team Automatic Metrics Coding Multilingual
  • Safety alignment of large language models (LLMs) is mostly evaluated in English and in contract-bound (JSON-style) settings, leaving multilingual vulnerabilities understudied.
  • We introduce Indic Jailbreak Robustness (IJR), a judge-free benchmark for adversarial safety across 12 Indic and South Asian languages (2.1 billion speakers), covering 45,216 prompts in JSON (contract-bound) and Free (naturalistic) formats.
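"Judge-free" plausibly means compliance can be checked without an LLM judge; for the JSON (contract-bound) split, a deterministic check like the sketch below would suffice. The field names are illustrative, not taken from the benchmark.

    # Sketch of deterministic, judge-free scoring for contract-bound prompts:
    # a response counts as jailbroken only if it returns valid JSON with the
    # requested fields filled. Field names are illustrative assumptions.
    import json

    def is_jailbroken(response: str, required_fields=("steps", "materials")) -> bool:
        try:
            obj = json.loads(response)
        except json.JSONDecodeError:
            return False  # unparsable output is treated as a refusal
        return isinstance(obj, dict) and all(obj.get(f) for f in required_fields)
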
Intent Laundering: AI Safety Datasets Are Not What They Seem

Shahriar Golchin, Marc Wetter · Feb 17, 2026 · Citations: 0

Red Team Automatic Metrics General
  • We systematically evaluate the quality of widely used AI safety datasets from two perspectives: in isolation and in practice.
  • We find that these datasets overrely on "triggering cues": words or phrases with overt negative/sensitive connotations that are intended to trigger safety mechanisms explicitly, which is unrealistic compared to real-world attacks.
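The "triggering cue" finding can be checked mechanically: measure how often a dataset's prompts contain overtly negative lexicon terms. A toy sketch follows, with a five-word lexicon standing in for the paper's actual cue inventory.

    # Toy sketch: fraction of prompts containing overt "triggering cues".
    # The lexicon is illustrative; the paper's cue inventory is not reproduced.
    CUE_WORDS = {"bomb", "kill", "hack", "steal", "poison"}

    def trigger_rate(prompts: list[str]) -> float:
        hits = sum(any(w in p.lower() for w in CUE_WORDS) for p in prompts)
        return hits / max(len(prompts), 1)
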
What Matters For Safety Alignment?

Xing Li, Hui-Ling Zhen, Lihao Yin, Xianzhi Yu, Zhenhua Dong, Mingxuan Yuan · Jan 7, 2026 · Citations: 0

Red Team Automatic Metrics General
  • This paper presents a comprehensive empirical study of the safety alignment capabilities of LLMs and large reasoning models (LRMs).
  • We evaluate what matters for safety alignment in LLMs and LRMs to provide essential insights for developing more secure and reliable AI systems.
Refusal Steering: Fine-grained Control over LLM Refusal Behaviour for Sensitive Topics

Iker García-Ferrero, David Montero, Roman Orus · Dec 18, 2025 · Citations: 0

Red Team Automatic Metrics General
  • We replace fragile pattern-based refusal detection with an LLM-as-a-judge that assigns refusal confidence scores, and we propose a ridge-regularized variant to compute steering vectors that better isolate the refusal-compliance direction.
  • On Qwen3-Next-80B-A3B-Thinking, our method removes the refusal behaviour of the model around politically sensitive topics while maintaining safety on JailbreakBench and near-baseline performance on general benchmarks.
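One plausible reading of the ridge-regularized variant, assumed here rather than taken from the paper: regress hidden activations onto the judge's refusal-confidence scores and use the normalized coefficient vector as the steering direction.

    # Sketch under the assumption above: closed-form ridge regression from
    # activations H (n_prompts x d_model) onto refusal-confidence scores in
    # [0, 1]; the normalized coefficient vector is the refusal direction.
    import numpy as np

    def refusal_direction(H: np.ndarray, scores: np.ndarray, lam: float = 1.0) -> np.ndarray:
        d = H.shape[1]
        w = np.linalg.solve(H.T @ H + lam * np.eye(d), H.T @ scores)  # ridge solution
        return w / np.linalg.norm(w)

    # Steering then removes the component along w from the residual stream:
    #   h_steered = h - alpha * (h @ w) * w
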
Reasoning Up the Instruction Ladder for Controllable Language Models

Zishuo Zheng, Vidhisha Balachandran, Chan Young Park, Faeze Brahman, Sachin Kumar · Oct 30, 2025 · Citations: 0

Red Team Automatic Metrics General
  • Our finetuned models achieve consistent improvements on instruction-following and instruction-hierarchy benchmarks, including roughly a 20% gain on the IHEval conflict setup.
  • This reasoning ability also generalizes to safety-critical settings beyond the training distribution.
A Simple and Efficient Jailbreak Method Exploiting LLMs' Helpfulness

Xuan Luo, Yue Wang, Zefeng He, Geng Tu, Jing Li, Ruifeng Xu · Sep 17, 2025 · Citations: 0

Red Team Automatic Metrics Law
  • This study reveals a critical safety blind spot in modern LLMs: learning-style queries, which closely resemble ordinary educational questions, can reliably elicit harmful responses.
  • In addition, the assessment of defenses on the constructed safe prompts reveals inherent limitations of LLMs' safety mechanisms and flaws in the defense methods.
When Style Breaks Safety: Defending LLMs Against Superficial Style Alignment

Yuxin Xiao, Sana Tonekaboni, Walter Gerych, Vinith Suriyakumar, Marzyeh Ghassemi · Jun 9, 2025 · Citations: 0

Red Team Automatic Metrics General
  • In this work, we seek to understand whether style patterns compromise LLM safety, how superficial style alignment increases model vulnerability, and how best to mitigate these risks during alignment.
  • We first define ASR inflation as the increase in ASR due to style patterns in existing jailbreak benchmark queries.
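By that definition, ASR inflation is simply a difference of attack-success rates over the same queries with and without style patterns applied; a minimal sketch:

    # ASR inflation as defined above: ASR on style-patterned queries minus
    # ASR on the same queries without style patterns.
    def asr_inflation(success_styled: list[bool], success_plain: list[bool]) -> float:
        def asr(xs: list[bool]) -> float:
            return sum(xs) / max(len(xs), 1)
        return asr(success_styled) - asr(success_plain)
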
RedTeamCUA: Realistic Adversarial Testing of Computer-Use Agents in Hybrid Web-OS Environments

Zeyi Liao, Jaylen Jones, Linxi Jiang, Yuting Ning, Eric Fosler-Lussier, Yu Su · May 28, 2025 · Citations: 0

Red Team Simulation Env General
  • Computer-use agents (CUAs) promise to automate complex tasks across operating systems (OS) and the web, but remain vulnerable to indirect prompt injection.
  • Current evaluations of this threat either lack support for realistic but controlled environments or ignore hybrid web-OS attack scenarios involving both interfaces.
Refusal Direction is Universal Across Safety-Aligned Languages

Xinpeng Wang, Mingyang Wang, Yihong Liu, Hinrich Schütze, Barbara Plank · May 22, 2025 · Citations: 0

Red Team Automatic Metrics Multilingual
  • Refusal mechanisms in large language models (LLMs) are essential for ensuring safety.
  • In this paper, we investigate the refusal behavior in LLMs across 14 languages using PolyRefuse, a multilingual safety dataset created by translating malicious and benign English prompts into these languages.
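A common way to test such universality, assumed here since the summary omits the method: extract a per-language refusal direction as the difference of mean activations on malicious vs. benign prompts, then compare directions across languages by cosine similarity.

    # Sketch under the assumptions above: difference-of-means refusal
    # direction per language, compared to a reference language by cosine.
    import numpy as np

    def refusal_dir(h_malicious: np.ndarray, h_benign: np.ndarray) -> np.ndarray:
        v = h_malicious.mean(axis=0) - h_benign.mean(axis=0)
        return v / np.linalg.norm(v)

    def cross_lingual_cosine(dirs: dict[str, np.ndarray], ref: str = "en") -> dict[str, float]:
        return {lang: float(d @ dirs[ref]) for lang, d in dirs.items()}
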
Beyond Single-Turn: A Survey on Multi-Turn Interactions with Large Language Models

Yubo Li, Xiaobin Shen, Xinyu Yao, Xueying Ding, Yidi Miao, Ramayya Krishnan · Apr 7, 2025 · Citations: 0

Red Team Automatic Metrics Math
  • We organize existing benchmarks and datasets into coherent categories reflecting the evolving landscape of multi-turn dialogue evaluation, and review a broad spectrum of enhancement methodologies, including model-centric strategies (in-cont