- TraceSafe: A Systematic Assessment of LLM Guardrails on Multi-Step Tool-Calling Trajectories
Yen-Shan Chen, Sian-Yao Huang, Cheng-Lin Yang, Yun-Nung Chen · Apr 8, 2026 · Citations: 0
Red Team Automatic Metrics Long Horizon
As large language models (LLMs) evolve from static chatbots into autonomous agents, the primary vulnerability surface shifts from final outputs to intermediate execution traces.
- SemEval-2026 Task 6: CLARITY -- Unmasking Political Question Evasions
Konstantinos Thomas, Giorgos Filandrianos, Maria Lymperaiou, Chrysoula Zerva, Giorgos Stamou · Mar 14, 2026 · Citations: 0
Red Team Automatic Metrics
The benchmark is constructed from U.S.
- Prompt Attack Detection with LLM-as-a-Judge and Mixture-of-Models
Hieu Xuan Le, Benjamin Goh, Quy Anh Tang · Mar 26, 2026 · Citations: 0
Red Team Llm As Judge
In production, guardrails must mitigate these attacks under strict low-latency constraints, resulting in a deployment gap in which lightweight classifiers and rule-based systems struggle to generalize under distribution shift, while…
- Exposing Long-Tail Safety Failures in Large Language Models through Efficient Diverse Response Sampling
Suvadeep Hajra, Palash Nandi, Tanmoy Chakraborty · Mar 15, 2026 · Citations: 0
Red Team Automatic Metrics
While most red-teaming work emphasizes adversarial prompt search (input-space optimization), we show that safety failures can also be systematically exposed through diverse response generation (output-space exploration) for a fixed…
- Trojan-Speak: Bypassing Constitutional Classifiers with No Jailbreak Tax via Adversarial Finetuning
Bilgehan Sel, Xuanli He, Alwin Peng, Ming Jin, Jerry Wei · Mar 30, 2026 · Citations: 0
Red Team
Fine-tuning APIs offered by major AI providers create new attack surfaces where adversaries can bypass safety measures through targeted fine-tuning.
- AI Security in the Foundation Model Era: A Comprehensive Survey from a Unified Perspective
Zhenyi Wang, Siyu Luan · Mar 25, 2026 · Citations: 0
Red Team
To address this critical gap, we propose a unified closed-loop threat taxonomy that explicitly frames model-data interactions along four directional axes.
- SecureBreak -- A dataset towards safe and secure models
Marco Arazzi, Vignesh Kumar Kembu, Antonino Nocera · Mar 23, 2026 · Citations: 0
Red Team
To provide a contribution in this scenario, this paper introduces SecureBreak, a safety-oriented dataset designed to support the development of AI-driven solutions for detecting harmful LLM outputs caused by residual weaknesses in security…
- Contrastive Reasoning Alignment: Reinforcement Learning from Hidden Representations
Haozheng Luo, Yimin Wang, Jiahao Yu, Binghui Wang, Yan Chen · Mar 18, 2026 · Citations: 0
Red Team
Empirically, we evaluate CRAFT on multiple safety benchmarks using two strong reasoning models, Qwen3-4B-Thinking and R1-Distill-Llama-8B, where it consistently outperforms state-of-the-art defenses such as IPO and SafeKey.
- SIA: A Synthesize-Inject-Align Framework for Knowledge-Grounded and Secure E-commerce Search LLMs with Industrial Deployment
Zhouwei Zhai, Mengxiang Chen, Anmeng Zhang · Mar 17, 2026 · Citations: 0
Red Team
Our approach first synthesizes high-quality natural language corpus by combining structured knowledge graphs with unstructured behavioral logs, augmented with reasoning chains and safety-aware data.
- Two Birds, One Projection: Harmonizing Safety and Utility in LVLMs via Inference-time Feature Projection
Yewon Han, Yumin Seol, EunGyung Kong, Minsoo Jo, Taesup Kim · Mar 16, 2026 · Citations: 0
Red Team
Existing jailbreak defence frameworks for Large Vision-Language Models often suffer from a safety utility tradeoff, where strengthening safety inadvertently degrades performance on general visual-grounded reasoning tasks.