
TraceSafe: A Systematic Assessment of LLM Guardrails on Multi-Step Tool-Calling Trajectories

Yen-Shan Chen, Sian-Yao Huang, Cheng-Lin Yang, Yun-Nung Chen · Apr 8, 2026 · Citations: 0

Data freshness

  • Extraction: Fresh
  • Metadata refreshed: Apr 8, 2026, 3:46 PM (Fresh)
  • Extraction refreshed: Apr 10, 2026, 7:13 AM (Fresh)
  • Extraction source: Persisted extraction
  • Confidence: 0.80

Check recency before relying on this page for active eval decisions. Use stale pages as context and verify against current hub results.

Abstract

As large language models (LLMs) evolve from static chatbots into autonomous agents, the primary vulnerability surface shifts from final outputs to intermediate execution traces. While safety guardrails are well-benchmarked for natural language responses, their efficacy remains largely unexplored within multi-step tool-use trajectories. To address this gap, we introduce TraceSafe-Bench, the first comprehensive benchmark specifically designed to assess mid-trajectory safety. It encompasses 12 risk categories, ranging from security threats (e.g., prompt injection, privacy leaks) to operational failures (e.g., hallucinations, interface inconsistencies), featuring over 1,000 unique execution instances. Our evaluation of 13 LLM-as-a-guard models and 7 specialized guardrails yields three critical findings: 1) Structural Bottleneck: Guardrail efficacy is driven more by structural data competence (e.g., JSON parsing) than by semantic safety alignment. Performance correlates strongly with structured-to-text benchmarks ($\rho = 0.79$) but shows near-zero correlation with standard jailbreak robustness. 2) Architecture over Scale: Model architecture influences risk detection performance more significantly than model size, with general-purpose LLMs consistently outperforming specialized safety guardrails in trajectory analysis. 3) Temporal Stability: Accuracy remains resilient across extended trajectories. Increased execution steps allow models to pivot from static tool definitions to dynamic execution behaviors, actually improving risk detection performance in later stages. Our findings suggest that securing agentic workflows requires jointly optimizing for structural reasoning and safety alignment to effectively mitigate mid-trajectory risks.
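To make the setting concrete, here is a minimal Python sketch of mid-trajectory guarding. It is not the paper's actual harness: the class, the function names, the label set, and the stubbed-out guard call are all assumptions chosen to mirror what the abstract describes.

```python
# Illustrative sketch only; all names and the label set are assumptions,
# not TraceSafe-Bench's actual interface.
from dataclasses import dataclass


@dataclass
class ToolCall:
    tool_name: str
    arguments: dict  # structured JSON arguments the guard must parse
    result: str      # observed tool output


def call_guard_model(prompt: str) -> str:
    # Stand-in for a real LLM call so the sketch runs end to end.
    return "safe"


def classify_step(history: list[ToolCall], step: ToolCall) -> str:
    """Ask a guard model to label one step, given the trajectory so far."""
    prompt = (
        "You are a safety guard reviewing an agent's tool-calling trace.\n"
        f"Prior steps: {[(c.tool_name, c.arguments) for c in history]}\n"
        f"Current step: {step.tool_name}({step.arguments}) -> {step.result}\n"
        "Answer with one label: safe, prompt_injection, privacy_leak, "
        "hallucination, interface_inconsistency."
    )
    return call_guard_model(prompt)


def scan_trajectory(trajectory: list[ToolCall]) -> list[str]:
    """Mid-trajectory scan: every step is judged, not only the final answer."""
    return [classify_step(trajectory[:i], step)
            for i, step in enumerate(trajectory)]


trace = [
    ToolCall("search_web", {"query": "reset password help"}, "3 results"),
    ToolCall("send_email", {"to": "user@example.com", "body": "..."}, "sent"),
]
print(scan_trajectory(trace))  # one label per step, e.g. ['safe', 'safe']
```

The sketch makes the first finding easy to see: the guard has to parse structured arguments at every step, so structural data competence, not just safety alignment, gates performance.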

HFEPX Relevance Assessment

This paper has useful evaluation signal, but protocol completeness is partial; pair it with related papers before deciding on an implementation strategy.

  • Best use: Secondary protocol comparison source
  • Use if you need: A benchmark-and-metrics comparison anchor
  • Main weakness: No major weakness surfaced
  • Trust level: High
  • Eval-Fit Score: 65/100 (Medium); useful as a secondary reference, so validate protocol details against neighboring papers
  • Human Feedback Signal: Detected
  • Evaluation Signal: Detected
  • HFEPX Fit: Moderate-confidence candidate
  • Extraction confidence: High

Field Provenance & Confidence

Each key protocol field shows extraction state, confidence band, and data source so you can decide whether to trust it directly or validate from full text.
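As a rough mental model of what each field block below carries, a provenance record might look like the following; the class name, fields, and the trust heuristic are assumptions inferred from this page's layout, not the hub's actual schema.

```python
# Hypothetical record shape; inferred from this page's layout, not an API.
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProvenanceField:
    name: str                # e.g. "Benchmarks / Datasets"
    state: str               # "strong" or "missing"
    value: Optional[str]     # e.g. "TraceSafe-Bench"; None when missing
    confidence: str          # "High" or "Low" band shown on the page
    source: str              # e.g. "Persisted extraction"
    evidence: Optional[str]  # abstract snippet backing the extraction

    def trust_directly(self) -> bool:
        # Mirrors the page's guidance: act on evidenced, high-confidence
        # fields; validate everything else against the full text.
        return self.state == "strong" and self.confidence == "High"


# Example built from the "Benchmarks / Datasets" block below:
benchmarks = ProvenanceField(
    name="Benchmarks / Datasets", state="strong", value="TraceSafe-Bench",
    confidence="High", source="Persisted extraction",
    evidence="To address this gap, we introduce TraceSafe-Bench...")
assert benchmarks.trust_directly()
```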

Human Feedback Types: Red Team (strong)

  • Confidence: High · Source: Persisted extraction (evidenced)
  • Directly usable for protocol triage.
  • Evidence snippet: "As large language models (LLMs) evolve from static chatbots into autonomous agents, the primary vulnerability surface shifts from final outputs to intermediate execution traces."

Evaluation Modes: Automatic Metrics (strong)

  • Confidence: High · Source: Persisted extraction (evidenced)
  • Includes extracted eval setup.
  • Evidence snippet: "As large language models (LLMs) evolve from static chatbots into autonomous agents, the primary vulnerability surface shifts from final outputs to intermediate execution traces."

Quality Controls: Not reported (missing)

  • Confidence: Low · Source: Persisted extraction (missing)
  • No explicit QC controls found.
  • Evidence snippet: "As large language models (LLMs) evolve from static chatbots into autonomous agents, the primary vulnerability surface shifts from final outputs to intermediate execution traces."

Benchmarks / Datasets: TraceSafe-Bench (strong)

  • Confidence: High · Source: Persisted extraction (evidenced)
  • Useful for quick benchmark comparison.
  • Evidence snippet: "To address this gap, we introduce TraceSafe-Bench, the first comprehensive benchmark specifically designed to assess mid-trajectory safety."

Reported Metrics: Accuracy (strong)

  • Confidence: High · Source: Persisted extraction (evidenced)
  • Useful for evaluation criteria comparison.
  • Evidence snippet: "3) Temporal Stability: Accuracy remains resilient across extended trajectories."

Rater Population: Unknown (missing)

  • Confidence: Low · Source: Persisted extraction (missing)
  • Rater source not explicitly reported.
  • Evidence snippet: "As large language models (LLMs) evolve from static chatbots into autonomous agents, the primary vulnerability surface shifts from final outputs to intermediate execution traces."

Human Data Lens

  • Uses human feedback: Yes
  • Feedback types: Red Team
  • Rater population: Unknown
  • Unit of annotation: Trajectory
  • Expertise required: General
  • Extraction source: Persisted extraction

Evaluation Lens

  • Evaluation modes: Automatic Metrics
  • Agentic eval: Long Horizon
  • Quality controls: Not reported
  • Confidence: 0.80
  • Flags: None
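Read together, the two lenses suggest a simple triage rule. The sketch below encodes one such rule; the thresholds and outcome strings are illustrative assumptions, not HFEPX's actual scoring logic.

```python
# Toy triage over the lens fields above; thresholds are assumed, not HFEPX's.
def triage(extraction_confidence: float,
           quality_controls_reported: bool,
           flags: list[str]) -> str:
    if flags:
        return "blocked: resolve flags before use"
    if extraction_confidence >= 0.9 and quality_controls_reported:
        return "primary reference: use directly"
    if extraction_confidence >= 0.7:
        return "secondary reference: validate against neighboring papers"
    return "context only: re-extract from the full text"


# This page reports confidence 0.80, no QC controls, and no flags:
print(triage(0.80, quality_controls_reported=False, flags=[]))
# -> secondary reference: validate against neighboring papers
```

That outcome matches the page's own Eval-Fit guidance of treating the paper as a secondary reference.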

Protocol And Measurement Signals

  • Benchmarks / Datasets: TraceSafe-Bench
  • Reported Metrics: accuracy

Research Brief

Deterministic synthesis

As large language models (LLMs) evolve from static chatbots into autonomous agents, the primary vulnerability surface shifts from final outputs to intermediate execution traces. HFEPX signals include Red Team, Automatic Metrics, and Long Horizon, with extraction confidence 0.80. Updated from the current HFEPX corpus.

Generated Apr 10, 2026, 7:13 AM · Grounded in abstract + metadata only

Key Takeaways

  • As large language models (LLMs) evolve from static chatbots into autonomous agents, the primary vulnerability surface shifts from final outputs to intermediate execution traces.
  • While safety guardrails are well-benchmarked for natural language responses, their efficacy remains largely unexplored within multi-step tool-use trajectories.

Researcher Actions

  • Compare its human-feedback setup against pairwise and rubric hubs.
  • Cross-check benchmark overlap: TraceSafe-Bench.
  • Validate metric comparability (accuracy).

Caveats

  • Generated from title, abstract, and extracted metadata only; full-paper implementation details are not parsed.
  • Extraction confidence is probabilistic and should be validated for critical decisions.

Research Summary

Contribution Summary

  • As large language models (LLMs) evolve from static chatbots into autonomous agents, the primary vulnerability surface shifts from final outputs to intermediate execution traces.
  • While safety guardrails are well-benchmarked for natural language responses, their efficacy remains largely unexplored within multi-step tool-use trajectories.
  • To address this gap, we introduce TraceSafe-Bench, the first comprehensive benchmark specifically designed to assess mid-trajectory safety.

Why It Matters For Eval

  • As large language models (LLMs) evolve from static chatbots into autonomous agents, the primary vulnerability surface shifts from final outputs to intermediate execution traces.
  • To address this gap, we introduce TraceSafe-Bench, the first comprehensive benchmark specifically designed to assess mid-trajectory safety.

Researcher Checklist

  • Pass: Human feedback protocol is explicit

    Detected: Red Team

  • Pass: Evaluation mode is explicit

    Detected: Automatic Metrics

  • Gap: Quality control reporting is missing

    No calibration/adjudication/IAA control explicitly detected.

  • Pass: Benchmark or dataset anchors are present

    Detected: TraceSafe-Bench

  • Pass: Metric reporting is present

    Detected: accuracy

Related Papers

Papers are ranked by protocol overlap, extraction signal alignment, and semantic proximity.
