- What Matters For Safety Alignment?
Xing Li, Hui-Ling Zhen, Lihao Yin, Xianzhi Yu, Zhenhua Dong · Jan 7, 2026 · Citations: 0
Red Team · Automatic Metrics · Tool Use
This paper presents a comprehensive empirical study of the safety alignment capabilities of LLMs.
- A Simple and Efficient Jailbreak Method Exploiting LLMs' Helpfulness
Xuan Luo, Yue Wang, Zefeng He, Geng Tu, Jing Li · Sep 17, 2025 · Citations: 0
Red Team · Automatic Metrics
This study reveals a critical safety blind spot in modern LLMs: learning-style queries, which closely resemble ordinary educational questions, can reliably elicit harmful responses.
- Obscure but Effective: Classical Chinese Jailbreak Prompt Optimization via Bio-Inspired Search
Xun Huang, Simeng Qin, Xiaoshuang Jia, Ranjie Duan, Huanqian Yan · Feb 26, 2026 · Citations: 0
Red Team · Automatic Metrics
Owing to its conciseness and obscurity, classical Chinese can partially bypass existing safety constraints, exposing notable vulnerabilities in LLMs.
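
The snippet leaves the search procedure itself undescribed; as a hedged illustration of what a bio-inspired prompt search can look like, the minimal evolutionary loop below selects, crosses over, and mutates candidate prompts against a black-box fitness function. The `score_prompt` callable is hypothetical, standing in for a judge model's rating of the target LLM's response; it is not the paper's method.

```python
import random

def evolve_prompts(seed_prompts, score_prompt, generations=20, pop_size=12):
    """Minimal evolutionary search over prompt strings.

    score_prompt is a hypothetical black-box fitness function, e.g. a judge
    model's rating of how far the target LLM's reply strays from its policy.
    """
    population = list(seed_prompts)
    for _ in range(generations):
        ranked = sorted(population, key=score_prompt, reverse=True)
        parents = ranked[: max(2, pop_size // 2)]          # selection: keep the fittest
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, min(len(a), len(b)))  # crossover point
            child = a[:cut] + b[cut:]
            if random.random() < 0.3:                       # mutation: swap in a particle
                i = random.randrange(len(child))
                child = child[:i] + random.choice("之乎者也焉哉") + child[i + 1:]
            children.append(child)
        population = parents + children
    return max(population, key=score_prompt)

# Toy fitness: longer prompts score higher (replace with a real judge model).
best = evolve_prompts(["學而時習之", "不亦說乎", "溫故而知新"], score_prompt=len)
print(best)
```
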
- MANATEE: Inference-Time Lightweight Diffusion Based Safety Defense for LLMs
Chun Yan Ryan Kan, Tommy Tran, Vedant Yadav, Ava Cai, Kevin Zhu · Feb 21, 2026 · Citations: 0
Red Team · Automatic Metrics
We propose MANATEE, an inference-time defense that uses density estimation over a benign representation manifold.
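
Read as stated, the defense scores an input's representation by its density under a model fit to benign representations and flags low-density points. A minimal sketch of that idea with a Gaussian KDE over toy embedding vectors follows; the embeddings, dimensionality, and threshold are stand-ins, not the paper's.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Stand-in embeddings: rows play the role of hidden-state vectors for
# known-benign prompts (a real defense would use actual LLM representations).
rng = np.random.default_rng(0)
benign_reps = rng.normal(size=(500, 8))      # 500 benign prompts, 8-dim toy space

# Density model over the benign representation manifold.
kde = gaussian_kde(benign_reps.T)            # gaussian_kde wants shape (dims, n)

def is_suspicious(rep):
    """Flag a representation whose density is below the benign 1st percentile."""
    threshold = np.percentile(kde(benign_reps.T), 1)
    return kde(rep.reshape(-1, 1))[0] < threshold

print(is_suspicious(rng.normal(size=8)))     # typical benign point -> False
print(is_suspicious(np.full(8, 6.0)))        # far off-manifold point -> True
```
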
- "Are You Sure?": An Empirical Study of Human Perception Vulnerability in LLM-Driven Agentic Systems
Xinfeng Li, Shenyu Dai, Kelong Zheng, Yue Xiao, Gelei Deng · Feb 24, 2026 · Citations: 0
Expert Verification · Automatic Metrics
Large language model (LLM) agents are rapidly becoming trusted copilots in high-stakes domains like software development and healthcare.
- RLShield: Practical Multi-Agent RL for Financial Cyber Defense with Attack-Surface MDPs and Real-Time Response Orchestration
Srikumar Nayak · Feb 26, 2026 · Citations: 0
Automatic Metrics · Multi Agent
This paper proposes RLShield, a practical multi-agent RL pipeline for financial cyber defense.
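
The abstract does not define the attack-surface MDPs; one plausible reading, sketched below with invented state and action names, is a per-surface MDP whose reward trades intervention cost against depth of compromise.

```python
from dataclasses import dataclass, field

@dataclass
class AttackSurfaceMDP:
    """Toy per-surface MDP: states escalate from 'safe' to 'breached'."""
    states: tuple = ("safe", "probed", "compromised", "breached")
    actions: tuple = ("monitor", "patch", "isolate")
    step_cost: dict = field(
        default_factory=lambda: {"monitor": 0.0, "patch": 1.0, "isolate": 5.0})

    def reward(self, state, action):
        # Defender pays for interventions, and more for deeper compromise.
        return -self.step_cost[action] - 10.0 * self.states.index(state)

mdp = AttackSurfaceMDP()
print(mdp.reward("probed", "patch"))   # -11.0: patch cost plus compromise penalty
```
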
- ICON: Indirect Prompt Injection Defense for Agents based on Inference-Time Correction
Che Wang, Fuyao Zhang, Jiaming Zhang, Ziqi Zhang, Yinghui Wang · Feb 24, 2026 · Citations: 0
Automatic Metrics · Long Horizon
Large Language Model (LLM) agents are susceptible to Indirect Prompt Injection (IPI) attacks, where malicious instructions in retrieved content hijack the agent's execution.
- A Systematic Review of Algorithmic Red Teaming Methodologies for Assurance and Security of AI Applications
Shruti Srivastava, Kiranmayee Janardhan, Shaurya Jauhari · Feb 24, 2026 · Citations: 0
Red Team · Automatic Metrics
These limitations have driven the evolution toward automated red teaming, which leverages artificial intelligence and automation to deliver efficient and adaptive security evaluations.
- Assessing Deanonymization Risks with Stylometry-Assisted LLM Agent
Boyang Zhang, Yang Zhang · Feb 26, 2026 · Citations: 0
Automatic Metrics
In this work, we introduce an LLM agent designed to evaluate and mitigate such risks through a structured, interpretable pipeline.
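
The pipeline itself isn't detailed here; as a baseline illustration of the stylometric signal such an agent could exploit, this sketch compares character 3-gram TF-IDF profiles with cosine similarity, a classic attribution baseline rather than the paper's method.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

known = ["I reckon the results hold up, all things considered.",
         "All things considered, I reckon this approach holds up."]
anonymous = "I reckon the method holds up, all things considered."

# Character 3-gram TF-IDF: a classic stylometric representation.
vec = TfidfVectorizer(analyzer="char", ngram_range=(3, 3))
profiles = vec.fit_transform(known + [anonymous])

# Similarity of the anonymous text to each known-author sample.
print(cosine_similarity(profiles[-1], profiles[:-1]))
```
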
- Adversarial Intent is a Latent Variable: Stateful Trust Inference for Securing Multimodal Agentic RAG
Inderjeet Singh, Vikas Pahuja, Aishvariya Priya Rathina Sabapathy, Chiara Picardi, Amit Giloni · Feb 24, 2026 · Citations: 0
Automatic Metrics
Current stateless defences for multimodal agentic RAG fail to detect adversarial strategies that distribute malicious semantics across retrieval, planning, and generation components.
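
Modeling intent as a latent variable naturally suggests a recursive Bayesian update across stages; the sketch below uses invented likelihoods to show how weak per-stage signals, each below any stateless threshold on its own, compound into strong posterior suspicion.

```python
def update_trust(prior_adv, lik_adv, lik_benign):
    """One Bayes update of P(adversarial) after a stage-level anomaly signal."""
    joint_adv = prior_adv * lik_adv
    joint_ben = (1 - prior_adv) * lik_benign
    return joint_adv / (joint_adv + joint_ben)

# Invented likelihoods for retrieval, planning, and generation signals: each is
# weak alone, but the posterior compounds across stages (0.095 -> 0.156 -> 0.329),
# which a stateless per-stage check would never see.
p = 0.05  # prior probability that the session is adversarial
for lik_adv, lik_benign in [(0.6, 0.3), (0.7, 0.4), (0.8, 0.3)]:
    p = update_trust(p, lik_adv, lik_benign)
    print(round(p, 3))
```
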
- AdapTools: Adaptive Tool-based Indirect Prompt Injection Attacks on Agentic LLMs
Che Wang, Jiaming Zhang, Ziqi Zhang, Zijie Wang, Yinghui Wang · Feb 24, 2026 · Citations: 0
Automatic Metrics
The integration of external data services (e.g., Model Context Protocol, MCP) has made large language model-based agents increasingly powerful for complex task execution.
- Weight-Space Detection of Backdoors in LoRA Adapters
David Puertolas Merenciano, Ekaterina Vasyagina, Raghav Dixit, Kevin Zhu, Ruizhe Li · Feb 16, 2026 · Citations: 0
Automatic Metrics
We evaluate the method on 500 LoRA adapters (400 clean, 100 poisoned) for Llama-3.2-3B, trained on instruction and reasoning datasets: Alpaca, Dolly, GSM8K, ARC-Challenge, SQuADv2, NaturalQuestions, HumanEval, and GLUE.
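
In the weight-space framing, each adapter's low-rank matrices become a feature vector for a supervised detector. The toy sketch below mirrors the 400/100 split with synthetic weights; the planted distribution shift is an assumption for illustration, not the paper's actual signal.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

def fake_adapter(poisoned, rank=4, dim=64):
    """Synthetic stand-in for a LoRA adapter's flattened A and B matrices."""
    w = rng.normal(size=rank * dim * 2)
    if poisoned:
        w = w + 0.3 * rng.normal(size=w.shape) + 0.2   # assumed weight-space shift
    return w

X = np.stack([fake_adapter(p) for p in [False] * 400 + [True] * 100])
y = np.array([0] * 400 + [1] * 100)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"held-out accuracy: {clf.score(X_te, y_te):.2f}")
```
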
- Randomized Masked Finetuning: An Efficient Way to Mitigate Memorization of PIIs in LLMs
Kunj Joshi, David A. Smith · Dec 2, 2025 · Citations: 0
Automatic Metrics
We present MaxTER, a Pareto-optimal evaluation framework for assessing privacy-utility tradeoffs, and compare RMFT against deduplication using the Area Under the Response Curve (AURC) metric.
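
The snippet doesn't define AURC precisely; treating it as the area under a utility-versus-intervention response curve, it reduces to a trapezoidal integral. The sweep values below are hypothetical.

```python
import numpy as np

def aurc(utility, sweep):
    """Trapezoidal area under a response curve; higher = more graceful decay."""
    return float(np.sum((sweep[1:] - sweep[:-1]) * (utility[1:] + utility[:-1]) / 2))

# Hypothetical sweep of privacy intervention strength vs. downstream task utility.
strength = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
print("RMFT  AURC:", aurc(np.array([0.82, 0.81, 0.79, 0.74, 0.65]), strength))
print("Dedup AURC:", aurc(np.array([0.82, 0.78, 0.71, 0.62, 0.50]), strength))
```
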
- DropVLA: An Action-Level Backdoor Attack on Vision--Language--Action Models
Zonghuan Xu, Xiang Zheng, Xingjun Ma, Yu-Gang Jiang · Oct 13, 2025 · Citations: 0
Automatic Metrics
The backdoor remains robust to moderate trigger variations and transfers across evaluation suites (96.27%, 99.09%), whereas a text-only trigger largely fails (0.72%).
- Multi-hop Deep Joint Source-Channel Coding with Deep Hash Distillation for Semantically Aligned Image Recovery
Didrik Bergström, Deniz Gündüz, Onur Günlü · Oct 8, 2025 · Citations: 0
Automatic Metrics
The abstract offers little detail on human feedback or evaluation protocols; treat this entry as adjacent methodological context.
- A Lightweight IDS for Early APT Detection Using a Novel Feature Selection Method
Bassam Noori Shaker, Bahaa Al-Musawi, Mohammed Falih Hassan · Jun 13, 2025 · Citations: 0
Automatic Metrics
Our proposed method reduces the SCVIC-APT-2021 feature set from 77 features to just four while maintaining consistent evaluation metrics for the system.
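
The paper's novel selection method is not reproduced here, but the workflow (scoring 77 features and keeping four) can be illustrated with a standard filter-style selector on synthetic data shaped like the task.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Synthetic stand-in for SCVIC-APT-2021: 77 flow features, few informative.
X, y = make_classification(n_samples=2000, n_features=77, n_informative=4,
                           n_redundant=8, random_state=0)

# Filter-style selection: keep the 4 features with highest mutual information.
selector = SelectKBest(mutual_info_classif, k=4).fit(X, y)
print("kept feature indices:", selector.get_support(indices=True))
```
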
- PII-Bench: Evaluating Query-Aware Privacy Protection Systems
Hao Shen, Zhouhong Gu, Haokai Hong, Weili Han · Feb 25, 2025 · Citations: 0
Automatic Metrics
To address this challenge, we propose a query-unrelated PII masking strategy and introduce PII-Bench, the first comprehensive evaluation framework for assessing privacy protection systems.
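
A query-unrelated masking strategy, as described, redacts every detected PII span regardless of what the query actually needs. A minimal regex-based sketch of that behavior follows; the patterns are illustrative only, and real systems pair them with NER models.

```python
import re

# Illustrative detectors only; production systems add NER-based recognizers.
PII_PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "SSN":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def mask_pii(text):
    """Replace every detected PII span with a typed placeholder."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(mask_pii("Reach Ana at ana.r@example.com or 555-867-5309."))
# -> Reach Ana at [EMAIL] or [PHONE].
```
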
- Topic-Based Watermarks for Large Language Models
Alexander Nemecek, Yuzhou Jiang, Erman Ayday · Apr 2, 2024 · Citations: 0
Automatic Metrics
The indistinguishability of large language model (LLM) output from human-authored content poses significant challenges, raising concerns about potential misuse of AI-generated text and its influence on future model training.