Skip to content
← Back to explorer

Dual-Pool Token-Budget Routing for Cost-Efficient and Reliable LLM Serving

Xunzhuo Liu, Bowei He, Xue Liu, Andy Luo, Haichen Zhang, Huamin Chen · Apr 9, 2026 · Citations: 0

Data freshness

Extraction: Fresh

Check recency before relying on this page for active eval decisions. Use stale pages as context and verify against current hub results.

Metadata refreshed

Apr 9, 2026, 10:47 AM

Fresh

Extraction refreshed

Apr 10, 2026, 4:47 AM

Fresh

Extraction source

Persisted extraction

Confidence 0.35

Abstract

Production vLLM fleets typically provision each instance for the worst-case context length, leading to substantial KV-cache over-allocation and under-utilized concurrency. In practice, 80-95% of requests are short, yet are served under configurations optimized for long contexts, wasting 4-8$\times$ throughput capacity and triggering reliability issues such as OOM crashes, preemption, and request rejections. We identify a common root cause for these inefficiencies: configuration-traffic mismatch. We propose dual-pool token-budget routing, a lightweight dispatch mechanism that partitions a homogeneous fleet into two specialized pools: a high-throughput short-context pool and a high-capacity long-context pool. Each request is routed based on its estimated total token budget, computed using a per-category bytes-to-token ratio that is learned online via exponential moving average from usage.prompt_tokens feedback, eliminating the need for a tokenizer. We also develop a simple analytical model that predicts fleet-level cost savings from workload characteristics and measured throughput differences, enabling practitioners to estimate benefits prior to deployment. Evaluations on real-world traces from the Azure LLM Inference Dataset and LMSYS-Chat-1M, serving Llama-3-70B on A100 GPUs, show that our approach reduces GPU-hours by 31-42%, corresponding to \$2.86M annual savings at fleet scale, while lowering preemption rates by 5.4$\times$ and improving P99 TTFT by 6%. A case study with Qwen3-235B-A22B on AMD MI300X at 10,000 req/s projects \$15.4M in annual savings. The method incurs only O(1) dispatch overhead, adapts automatically to heterogeneous workloads, and composes seamlessly with existing optimizations such as PagedAttention, continuous batching, and prefill-decode disaggregation.

Low-signal caution for protocol decisions

Use this page for context, then validate protocol choices against stronger HFEPX references before implementation decisions.

  • Extraction flags indicate low-signal or possible false-positive protocol mapping.
  • Extraction confidence is 0.35 (below strong-reference threshold).

HFEPX Relevance Assessment

This paper is adjacent to HFEPX scope and is best used for background context, not as a primary protocol reference.

Best use

Background context only

Use if you need

A secondary eval reference to pair with stronger protocol papers.

Main weakness

Extraction flags indicate low-signal or possible false-positive protocol mapping.

Trust level

Low

Eval-Fit Score

0/100 • Low

Treat as adjacent context, not a core eval-method reference.

Human Feedback Signal

Not explicit in abstract metadata

Evaluation Signal

Detected

HFEPX Fit

Adjacent candidate

Extraction confidence: Low

Field Provenance & Confidence

Each key protocol field shows extraction state, confidence band, and data source so you can decide whether to trust it directly or validate from full text.

Human Feedback Types

missing

None explicit

Confidence: Low Source: Persisted extraction missing

No explicit feedback protocol extracted.

Evidence snippet: Production vLLM fleets typically provision each instance for the worst-case context length, leading to substantial KV-cache over-allocation and under-utilized concurrency.

Evaluation Modes

partial

Automatic Metrics

Confidence: Low Source: Persisted extraction evidenced

Includes extracted eval setup.

Evidence snippet: Production vLLM fleets typically provision each instance for the worst-case context length, leading to substantial KV-cache over-allocation and under-utilized concurrency.

Quality Controls

missing

Not reported

Confidence: Low Source: Persisted extraction missing

No explicit QC controls found.

Evidence snippet: Production vLLM fleets typically provision each instance for the worst-case context length, leading to substantial KV-cache over-allocation and under-utilized concurrency.

Benchmarks / Datasets

missing

Not extracted

Confidence: Low Source: Persisted extraction missing

No benchmark anchors detected.

Evidence snippet: Production vLLM fleets typically provision each instance for the worst-case context length, leading to substantial KV-cache over-allocation and under-utilized concurrency.

Reported Metrics

partial

Throughput, Context length, Cost

Confidence: Low Source: Persisted extraction evidenced

Useful for evaluation criteria comparison.

Evidence snippet: Production vLLM fleets typically provision each instance for the worst-case context length, leading to substantial KV-cache over-allocation and under-utilized concurrency.

Rater Population

missing

Unknown

Confidence: Low Source: Persisted extraction missing

Rater source not explicitly reported.

Evidence snippet: Production vLLM fleets typically provision each instance for the worst-case context length, leading to substantial KV-cache over-allocation and under-utilized concurrency.

Human Data Lens

  • Uses human feedback: No
  • Feedback types: None
  • Rater population: Unknown
  • Unit of annotation: Unknown
  • Expertise required: General
  • Extraction source: Persisted extraction

Evaluation Lens

  • Evaluation modes: Automatic Metrics
  • Agentic eval: None
  • Quality controls: Not reported
  • Confidence: 0.35
  • Flags: low_signal, possible_false_positive

Protocol And Measurement Signals

Benchmarks / Datasets

No benchmark or dataset names were extracted from the available abstract.

Reported Metrics

throughputcontext lengthcost

Research Brief

Deterministic synthesis

In practice, 80-95% of requests are short, yet are served under configurations optimized for long contexts, wasting 4-8\times throughput capacity and triggering reliability issues such as OOM crashes, preemption, and request rejections. HFEPX signals include Automatic Metrics with confidence 0.35. Updated from current HFEPX corpus.

Generated Apr 10, 2026, 4:47 AM · Grounded in abstract + metadata only

Key Takeaways

  • In practice, 80-95% of requests are short, yet are served under configurations optimized for long contexts, wasting 4-8\times throughput capacity and triggering reliability issues…
  • We propose dual-pool token-budget routing, a lightweight dispatch mechanism that partitions a homogeneous fleet into two specialized pools: a high-throughput short-context pool…
  • Evaluations on real-world traces from the Azure LLM Inference Dataset and LMSYS-Chat-1M, serving Llama-3-70B on A100 GPUs, show that our approach reduces GPU-hours by 31-42%,…

Researcher Actions

  • Treat this as method context, then pivot to protocol-specific HFEPX hubs.
  • Identify benchmark choices from full text before operationalizing conclusions.
  • Validate metric comparability (throughput, context length, cost).

Caveats

  • Generated from title, abstract, and extracted metadata only; full-paper implementation details are not parsed.
  • Low-signal flag detected: protocol relevance may be indirect.

Research Summary

Contribution Summary

  • In practice, 80-95% of requests are short, yet are served under configurations optimized for long contexts, wasting 4-8\times throughput capacity and triggering reliability issues such as OOM crashes, preemption, and request rejections.
  • We propose dual-pool token-budget routing, a lightweight dispatch mechanism that partitions a homogeneous fleet into two specialized pools: a high-throughput short-context pool and a high-capacity long-context pool.
  • Evaluations on real-world traces from the Azure LLM Inference Dataset and LMSYS-Chat-1M, serving Llama-3-70B on A100 GPUs, show that our approach reduces GPU-hours by 31-42%, corresponding to \2.86M annual savings at fleet scale, while…

Why It Matters For Eval

  • Evaluations on real-world traces from the Azure LLM Inference Dataset and LMSYS-Chat-1M, serving Llama-3-70B on A100 GPUs, show that our approach reduces GPU-hours by 31-42%, corresponding to \2.86M annual savings at fleet scale, while…

Researcher Checklist

  • Gap: Human feedback protocol is explicit

    No explicit human feedback protocol detected.

  • Pass: Evaluation mode is explicit

    Detected: Automatic Metrics

  • Gap: Quality control reporting appears

    No calibration/adjudication/IAA control explicitly detected.

  • Gap: Benchmark or dataset anchors are present

    No benchmark/dataset anchor extracted from abstract.

  • Pass: Metric reporting is present

    Detected: throughput, context length, cost

Category-Adjacent Papers (Broader Context)

These papers are nearby in arXiv category and useful for broader context, but not necessarily protocol-matched to this paper.

Need human evaluators for your AI research? Scale annotation with expert AI Trainers.