
Helpful to a Fault: Measuring Illicit Assistance in Multi-Turn, Multilingual LLM Agents

Nivya Talokar, Ayush K Tarun, Murari Mandal, Maksym Andriushchenko, Antoine Bosselut · Feb 18, 2026 · Citations: 0

Abstract

LLM-based agents execute real-world workflows via tools and memory. These same affordances also let ill-intentioned adversaries direct agents toward complex misuse scenarios. Existing agent-misuse benchmarks largely test single-prompt instructions, leaving a gap in measuring how agents end up assisting with harmful or illegal tasks over multiple turns. We introduce STING (Sequential Testing of Illicit N-step Goal execution), an automated red-teaming framework that constructs a step-by-step illicit plan grounded in a benign persona and iteratively probes a target agent with adaptive follow-ups, using judge agents to track phase completion. We further introduce an analysis framework that models multi-turn red-teaming as a time-to-first-jailbreak random variable, enabling tools such as discovery curves, hazard-ratio attribution by attack language, and a new metric: Restricted Mean Jailbreak Discovery. Across AgentHarm scenarios, STING yields substantially higher illicit-task completion than single-turn prompting and chat-oriented multi-turn baselines adapted to tool-using agents. In multilingual evaluations across six non-English settings, we find that attack success and illicit-task completion do not consistently increase in lower-resource languages, diverging from common chatbot findings. Overall, STING provides a practical way to evaluate and stress-test agent misuse in realistic deployment settings, where interactions are inherently multi-turn and often multilingual.
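The time-to-first-jailbreak framing in the abstract can be illustrated with a small sketch. The snippet below is not the paper's implementation; it assumes, by analogy with restricted mean survival time, that the discovery curve is the empirical fraction of scenarios jailbroken by turn t, and that Restricted Mean Jailbreak Discovery is the area under that curve up to an evaluation horizon. The function names and the `np.inf` convention for never-jailbroken (censored) scenarios are illustrative choices.

```python
import numpy as np

def discovery_curve(first_jailbreak_turns, horizon):
    """Empirical discovery curve: fraction of scenarios whose first
    jailbreak occurs at or before turn t, for t = 1..horizon.
    Scenarios never jailbroken are encoded as np.inf (censored)."""
    turns = np.asarray(first_jailbreak_turns, dtype=float)
    return np.array([(turns <= t).mean() for t in range(1, horizon + 1)])

def restricted_mean_jailbreak_discovery(first_jailbreak_turns, horizon):
    """Area under the discovery curve up to `horizon` -- a hypothetical
    reading of the paper's Restricted Mean Jailbreak Discovery metric,
    analogous to restricted mean survival time in survival analysis."""
    return float(discovery_curve(first_jailbreak_turns, horizon).sum())
```

For example, with first-jailbreak turns `[1, 2, np.inf]` and a horizon of 3 turns, the discovery curve is `[1/3, 2/3, 2/3]` and the restricted mean is `5/3`: a higher value means jailbreaks are discovered both more often and earlier within the horizon.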

Human Data Lens

  • Uses human feedback: Yes
  • Feedback types: Red Team
  • Rater population: Unknown
  • Unit of annotation: Unknown
  • Expertise required: Law, Multilingual

Evaluation Lens

  • Evaluation modes: Automatic Metrics
  • Agentic eval: None
  • Quality controls: Not reported
  • Confidence: 0.65
  • Flags: None

Research Summary

Contribution Summary

  • Introduces STING, an automated red-teaming framework that builds a step-by-step illicit plan grounded in a benign persona and adaptively probes tool-using agents over multiple turns, with judge agents tracking phase completion.
  • Proposes an analysis framework that models multi-turn red-teaming as a time-to-first-jailbreak random variable, enabling discovery curves, hazard-ratio attribution by attack language, and the Restricted Mean Jailbreak Discovery metric.
  • Shows on AgentHarm that STING achieves substantially higher illicit-task completion than single-turn and chat-oriented multi-turn baselines, and that in six non-English settings attack success does not consistently increase in lower-resource languages.

Why It Matters For Eval

  • Existing agent-misuse benchmarks largely test single-prompt instructions, so they miss how agents come to assist with harmful tasks over multiple turns.
  • Real deployments are inherently multi-turn and often multilingual; STING offers a practical way to evaluate and stress-test agent misuse under those conditions.
