OpenTrain AI
No verified implementation yet
Pretrained Models Available

ReDAct: Uncertainty-Aware Deferral for LLM Agents

Dzianis Piatrashyn, Nikita Kotelevskii, Kirill Grishchenkov, Nikita Glazkov, Ivan Nasonov +5 more

April 8, 2026
arXiv: 2604.07036
0 repos
~a few days to reproduce

Abstract

Recently, LLM-based agents have become increasingly popular across many applications, including complex sequential decision-making problems. However, they inherit the tendency of LLMs to hallucinate, leading to incorrect decisions. In sequential settings, even a single mistake can irreversibly degrade the trajectory, making hallucinations an even bigger problem. Although larger LLMs hallucinate less, they incur a sig...

Summary

Known: ReDAct equips an agent with two LLMs, using a small, cheap model by default and deferring decisions to a larger, more reliable but expensive model when the small model’s predictive uncertainty exceeds a calibrated threshold. The evaluation evidence extracted so far points to agentic tool use.
Missing: benchmark-specific evidence is still limited in the sources parsed so far.
Next step: validate setup steps, reported metrics, and task framing directly from the paper before treating this as a baseline.
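
The deferral rule described in the summary can be illustrated with a minimal Python sketch. Everything named here is hypothetical and not taken from the paper: `small_model` is assumed to return an action together with an uncertainty score, `large_model` is assumed to return only an action, and the threshold is calibrated by a simple deferral budget on held-out scores, which may differ from the calibration procedure the authors use.

```python
from typing import Callable, Sequence, Tuple


def calibrate_threshold(scores: Sequence[float], defer_fraction: float) -> float:
    """Choose a threshold so that roughly `defer_fraction` of the
    calibration scores lie strictly above it (assumed calibration scheme)."""
    if not scores:
        raise ValueError("need at least one calibration score")
    if not 0.0 <= defer_fraction <= 1.0:
        raise ValueError("defer_fraction must be in [0, 1]")
    ordered = sorted(scores)
    n_defer = round(defer_fraction * len(ordered))
    if n_defer >= len(ordered):
        return float("-inf")  # defer on every step
    # The n_defer largest scores sit above this element (ignoring ties).
    return ordered[len(ordered) - n_defer - 1]


def act_with_deferral(
    observation: str,
    small_model: Callable[[str], Tuple[str, float]],
    large_model: Callable[[str], str],
    threshold: float,
) -> Tuple[str, bool]:
    """Return (action, deferred). The small model acts by default; the
    large model is queried only when uncertainty exceeds the threshold."""
    action, uncertainty = small_model(observation)
    if uncertainty > threshold:
        return large_model(observation), True
    return action, False
```

In an agent loop, `act_with_deferral` would be called once per step, so the cost of the large model is paid only on the steps the small model flags as uncertain.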

Key Contributions

  • ReDAct equips an agent with two LLMs, using a small, cheap model by default and deferring decisions to a larger, more reliable but expensive model when the small model’s predictive uncertainty exceeds a calibrated threshold.
  • The method estimates predictive uncertainty for deferral using token-level metrics such as mean token entropy, perplexity, and sequence probability computed over both action selection and reasoning traces.
  • For uncertainty estimation over reasoning traces, the mean token entropy metric achieves a ROC-AUC of about 0.596 when distinguishing reliable from unreliable decisions in the ReDAct framework.
  • When applied to action selection, mean token entropy yields a ROC-AUC of 0.710, outperforming alternative uncertainty metrics such as perplexity and sequence probability in the reported comparison.
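
The token-level metrics named above have standard definitions, sketched below in plain Python together with a rank-based ROC-AUC. This is an illustration under stated assumptions, not the authors' code: it assumes access to the full per-token probability distributions and to the log-probabilities of the generated tokens, and it assumes the label convention that 1 marks an unreliable decision and 0 a reliable one. All function names are hypothetical.

```python
import math
from typing import Sequence


def mean_token_entropy(token_dists: Sequence[Sequence[float]]) -> float:
    """Average Shannon entropy (in nats) over per-token distributions."""
    entropies = [
        -sum(p * math.log(p) for p in dist if p > 0.0) for dist in token_dists
    ]
    return sum(entropies) / len(entropies)


def sequence_log_prob(token_logprobs: Sequence[float]) -> float:
    """Log-probability of the whole sequence (sum of token log-probs)."""
    return sum(token_logprobs)


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Exponential of the mean negative token log-probability."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a randomly chosen unreliable decision (label 1)
    gets a higher uncertainty score than a reliable one (label 0).
    Ties count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))
```

Under this convention a ROC-AUC of 0.5 means the score carries no signal, which gives some context for the reported values of about 0.596 (reasoning traces) and 0.710 (action selection). To use sequence probability as an uncertainty score, one option is to negate `sequence_log_prob` so that higher values mean more uncertainty.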

Reproducibility Notes

  • Estimate is based on paper-only reproduction flow.

Results & Benchmarks

Benchmark data is not yet available for this paper.

Hardware Requirements

  • Expect multi-day setup/compute for meaningful reproduction based on current guidance.

Best Implementation

Maintained implementation evidence is not confirmed for this paper yet.

Use the Implementation Status and Reproduction Path sections below for the current action plan.

Reproduction Path

Follow this baseline workflow to decide if this paper is worth immediate prototyping.

  1. Use the paper and benchmark evidence to scope a baseline reproduction plan.
  2. Track assumptions and missing details in an experiment log before coding.

Time to first repro: a few days (estimate is based on paper-only reproduction flow)

Additional Implementations

No additional verified repositories beyond the primary recommendation.

Hugging Face Artifacts

No direct paper-linked artifacts were found. Showing strongest curated related artifacts.