
TraceSafe: A Systematic Assessment of LLM Guardrails on Multi-Step Tool-Calling Trajectories

Yen-Shan Chen, Sian-Yao Huang, Cheng-Lin Yang, Yun-Nung Chen · Apr 8, 2026 · Citations: 0

Data freshness

  • Extraction: Fresh
  • Metadata refreshed: Apr 8, 2026, 3:46 PM (Fresh)
  • Extraction refreshed: Apr 10, 2026, 7:13 AM (Fresh)
  • Extraction source: Persisted extraction
  • Confidence: 0.80

Check recency before relying on this page for active eval decisions. Use stale pages as context and verify against current hub results.

Abstract

As large language models (LLMs) evolve from static chatbots into autonomous agents, the primary vulnerability surface shifts from final outputs to intermediate execution traces. While safety guardrails are well-benchmarked for natural language responses, their efficacy remains largely unexplored within multi-step tool-use trajectories. To address this gap, we introduce TraceSafe-Bench, the first comprehensive benchmark specifically designed to assess mid-trajectory safety. It encompasses 12 risk categories, ranging from security threats (e.g., prompt injection, privacy leaks) to operational failures (e.g., hallucinations, interface inconsistencies), featuring over 1,000 unique execution instances. Our evaluation of 13 LLM-as-a-guard models and 7 specialized guardrails yields three critical findings: 1) Structural Bottleneck: Guardrail efficacy is driven more by structural data competence (e.g., JSON parsing) than by semantic safety alignment. Performance correlates strongly with structured-to-text benchmarks ($\rho = 0.79$) but shows near-zero correlation with standard jailbreak robustness. 2) Architecture over Scale: Model architecture influences risk detection performance more significantly than model size, with general-purpose LLMs consistently outperforming specialized safety guardrails in trajectory analysis. 3) Temporal Stability: Accuracy remains resilient across extended trajectories. Increased execution steps allow models to pivot from static tool definitions to dynamic execution behaviors, actually improving risk detection performance in later stages. Our findings suggest that securing agentic workflows requires jointly optimizing for structural reasoning and safety alignment to effectively mitigate mid-trajectory risks.
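To make the setting concrete, here is a minimal Python sketch of mid-trajectory guarding. It is not the paper's actual harness: the class, the function names, the label set, and the stubbed-out guard call are all assumptions chosen to mirror what the abstract describes.

```python
# Illustrative sketch only; all names and the label set are assumptions,
# not TraceSafe-Bench's actual interface.
from dataclasses import dataclass


@dataclass
class ToolCall:
    tool_name: str
    arguments: dict  # structured JSON arguments the guard must parse
    result: str      # observed tool output


def call_guard_model(prompt: str) -> str:
    # Stand-in for a real LLM call so the sketch runs end to end.
    return "safe"


def classify_step(history: list[ToolCall], step: ToolCall) -> str:
    """Ask a guard model to label one step, given the trajectory so far."""
    prompt = (
        "You are a safety guard reviewing an agent's tool-calling trace.\n"
        f"Prior steps: {[(c.tool_name, c.arguments) for c in history]}\n"
        f"Current step: {step.tool_name}({step.arguments}) -> {step.result}\n"
        "Answer with one label: safe, prompt_injection, privacy_leak, "
        "hallucination, interface_inconsistency."
    )
    return call_guard_model(prompt)


def scan_trajectory(trajectory: list[ToolCall]) -> list[str]:
    """Mid-trajectory scan: every step is judged, not only the final answer."""
    return [classify_step(trajectory[:i], step)
            for i, step in enumerate(trajectory)]


trace = [
    ToolCall("search_web", {"query": "reset password help"}, "3 results"),
    ToolCall("send_email", {"to": "user@example.com", "body": "..."}, "sent"),
]
print(scan_trajectory(trace))  # one label per step, e.g. ['safe', 'safe']
```

The sketch makes the first finding easy to see: the guard has to parse structured arguments at every step, so structural data competence, not just safety alignment, gates performance.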

HFEPX Relevance Assessment

This paper has useful evaluation signal, but protocol completeness is partial; pair it with related papers before deciding on an implementation strategy.

  • Best use: Secondary protocol comparison source
  • Use if you need: A benchmark-and-metrics comparison anchor
  • Main weakness: No major weakness surfaced
  • Trust level: High
  • Eval-Fit Score: 65/100 (Medium); useful as a secondary reference, so validate protocol details against neighboring papers
  • Human Feedback Signal: Detected
  • Evaluation Signal: Detected
  • HFEPX Fit: Moderate-confidence candidate
  • Extraction confidence: High

Field Provenance & Confidence

Each key protocol field shows extraction state, confidence band, and data source so you can decide whether to trust it directly or validate from full text.
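As a rough mental model of what each field block below carries, a provenance record might look like the following; the class name, fields, and the trust heuristic are assumptions inferred from this page's layout, not the hub's actual schema.

```python
# Hypothetical record shape; inferred from this page's layout, not an API.
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProvenanceField:
    name: str                # e.g. "Benchmarks / Datasets"
    state: str               # "strong" or "missing"
    value: Optional[str]     # e.g. "TraceSafe-Bench"; None when missing
    confidence: str          # "High" or "Low" band shown on the page
    source: str              # e.g. "Persisted extraction"
    evidence: Optional[str]  # abstract snippet backing the extraction

    def trust_directly(self) -> bool:
        # Mirrors the page's guidance: act on evidenced, high-confidence
        # fields; validate everything else against the full text.
        return self.state == "strong" and self.confidence == "High"


# Example built from the "Benchmarks / Datasets" block below:
benchmarks = ProvenanceField(
    name="Benchmarks / Datasets", state="strong", value="TraceSafe-Bench",
    confidence="High", source="Persisted extraction",
    evidence="To address this gap, we introduce TraceSafe-Bench...")
assert benchmarks.trust_directly()
```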

Human Feedback Types: Red Team (strong)

  • Confidence: High · Source: Persisted extraction (evidenced)
  • Directly usable for protocol triage.
  • Evidence snippet: "As large language models (LLMs) evolve from static chatbots into autonomous agents, the primary vulnerability surface shifts from final outputs to intermediate execution traces."

Evaluation Modes: Automatic Metrics (strong)

  • Confidence: High · Source: Persisted extraction (evidenced)
  • Includes extracted eval setup.
  • Evidence snippet: "As large language models (LLMs) evolve from static chatbots into autonomous agents, the primary vulnerability surface shifts from final outputs to intermediate execution traces."

Quality Controls: Not reported (missing)

  • Confidence: Low · Source: Persisted extraction (missing)
  • No explicit QC controls found.
  • Evidence snippet: "As large language models (LLMs) evolve from static chatbots into autonomous agents, the primary vulnerability surface shifts from final outputs to intermediate execution traces."

Benchmarks / Datasets: TraceSafe-Bench (strong)

  • Confidence: High · Source: Persisted extraction (evidenced)
  • Useful for quick benchmark comparison.
  • Evidence snippet: "To address this gap, we introduce TraceSafe-Bench, the first comprehensive benchmark specifically designed to assess mid-trajectory safety."

Reported Metrics: Accuracy (strong)

  • Confidence: High · Source: Persisted extraction (evidenced)
  • Useful for evaluation criteria comparison.
  • Evidence snippet: "3) Temporal Stability: Accuracy remains resilient across extended trajectories."

Rater Population: Unknown (missing)

  • Confidence: Low · Source: Persisted extraction (missing)
  • Rater source not explicitly reported.
  • Evidence snippet: "As large language models (LLMs) evolve from static chatbots into autonomous agents, the primary vulnerability surface shifts from final outputs to intermediate execution traces."

Human Data Lens

  • Uses human feedback: Yes
  • Feedback types: Red Team
  • Rater population: Unknown
  • Unit of annotation: Trajectory
  • Expertise required: General
  • Extraction source: Persisted extraction

Evaluation Lens

  • Evaluation modes: Automatic Metrics
  • Agentic eval: Long Horizon
  • Quality controls: Not reported
  • Confidence: 0.80
  • Flags: None
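Read together, the two lenses suggest a simple triage rule. The sketch below encodes one such rule; the thresholds and outcome strings are illustrative assumptions, not HFEPX's actual scoring logic.

```python
# Toy triage over the lens fields above; thresholds are assumed, not HFEPX's.
def triage(extraction_confidence: float,
           quality_controls_reported: bool,
           flags: list[str]) -> str:
    if flags:
        return "blocked: resolve flags before use"
    if extraction_confidence >= 0.9 and quality_controls_reported:
        return "primary reference: use directly"
    if extraction_confidence >= 0.7:
        return "secondary reference: validate against neighboring papers"
    return "context only: re-extract from the full text"


# This page reports confidence 0.80, no QC controls, and no flags:
print(triage(0.80, quality_controls_reported=False, flags=[]))
# -> secondary reference: validate against neighboring papers
```

That outcome matches the page's own Eval-Fit guidance of treating the paper as a secondary reference.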

Protocol And Measurement Signals

  • Benchmarks / Datasets: TraceSafe-Bench
  • Reported Metrics: accuracy

Research Brief

Deterministic synthesis

As large language models (LLMs) evolve from static chatbots into autonomous agents, the primary vulnerability surface shifts from final outputs to intermediate execution traces. HFEPX signals include Red Team, Automatic Metrics, and Long Horizon, with extraction confidence 0.80. Updated from the current HFEPX corpus.

Generated Apr 10, 2026, 7:13 AM · Grounded in abstract + metadata only

Key Takeaways

  • As large language models (LLMs) evolve from static chatbots into autonomous agents, the primary vulnerability surface shifts from final outputs to intermediate execution traces.
  • While safety guardrails are well-benchmarked for natural language responses, their efficacy remains largely unexplored within multi-step tool-use trajectories.

Researcher Actions

  • Compare its human-feedback setup against pairwise and rubric hubs.
  • Cross-check benchmark overlap: TraceSafe-Bench.
  • Validate metric comparability (accuracy).

Caveats

  • Generated from title, abstract, and extracted metadata only; full-paper implementation details are not parsed.
  • Extraction confidence is probabilistic and should be validated for critical decisions.

Research Summary

Contribution Summary

  • As large language models (LLMs) evolve from static chatbots into autonomous agents, the primary vulnerability surface shifts from final outputs to intermediate execution traces.
  • While safety guardrails are well-benchmarked for natural language responses, their efficacy remains largely unexplored within multi-step tool-use trajectories.
  • To address this gap, we introduce TraceSafe-Bench, the first comprehensive benchmark specifically designed to assess mid-trajectory safety.

Why It Matters For Eval

  • As large language models (LLMs) evolve from static chatbots into autonomous agents, the primary vulnerability surface shifts from final outputs to intermediate execution traces.
  • To address this gap, we introduce TraceSafe-Bench, the first comprehensive benchmark specifically designed to assess mid-trajectory safety.

Researcher Checklist

  • Pass: Human feedback protocol is explicit

    Detected: Red Team

  • Pass: Evaluation mode is explicit

    Detected: Automatic Metrics

  • Gap: Quality control reporting is missing

    No calibration/adjudication/IAA control explicitly detected.

  • Pass: Benchmark or dataset anchors are present

    Detected: TraceSafe-Bench

  • Pass: Metric reporting is present

    Detected: accuracy

Related Papers

Papers are ranked by protocol overlap, extraction signal alignment, and semantic proximity.
