Stochastic KV Routing: Enabling Adaptive Depth-Wise Cache Sharing
Anastasiia Filippova, David Grangier, Marco Cuturi, João Monteiro
Serving transformer language models with high throughput requires caching Key-Values (KVs) to avoid redundant computation during autoregressive generation. The memory footprint of KV caching is significant and heavily impacts serving costs. This work proposes to lessen these memory requirements. While recent work has largely addressed KV cache reduction via compression and eviction along the temporal axis, we argue t ...
hat the \emph{depth} dimension offers an orthogonal and robust avenue for optimization. Although prior research suggests that a full cache for every layer is redundant, implementing cross-layer cache sharing remains a practical challenge; existing methods typically suffer from reduced throughput or increased time-to-first-token. In this paper, we demonstrate that dropping a layer's cache offers efficient optimization without information loss. We propose a simple training approach: random cross-layer attention. During training, layers randomly choose to attend either to their own KV states or those of a preceding layer. This stochastic process adapts the model to be robust to various depth-wise cache sharing strategies, ensuring flexibility for unknown hardware constraints at deployment time. Our evaluations show that applying this scheme during pre-training or fine-tuning enables depth-wise cache sharing for various model families. Furthermore, for larger models in data-constrained settings, this approach is suggestive of a regularization-like effect, frequently preserving or improving performance while significantly reducing the cache's memory footprint.
Results & Benchmarks
Benchmark evidence drill-down
Audit each benchmark finding before selecting an implementation path. Evidence refs map to the disclosure section below.
| Task | Dataset | Metric | Value | Source | Evidence refs |
|---|---|---|---|---|---|
| Transformer | HotpotQA | F1 | 0.203 | paper-derived | No explicit refs |
Serving transformer language models with high throughput requires caching Key-Values (KVs) to avoid redundant computation during autoregressive generation.
Implementation Evidence Summary
cmhungsteve/Awesome-Transformer-Attention is the closest maintained adjacent implementation (Matches contextual method/domain keyword: transformer). It is not paper-verified; validate algorithm and evaluation setup against the paper before trusting reported metrics. Community adoption signal: 5035 GitHub stars.
Reproduction Risks
- Adjacent implementations are not paper-verified
- Recommended repository is adjacent and not paper-verified.
Hardware Notes
Expect multi-day setup/compute for meaningful reproduction based on current guidance.
Evidence disclosure
Evidence graph: 3 refs, 3 links.
Utility signals: depth 100/100, grounding 85/100, status high.
Implementation Status
There is no verified maintained implementation yet. Use this baseline plan to decide whether to prototype now or defer.
- No maintained paper-verified implementation was found; start with the closest related repositories below.
- Compare repo methods against the paper equations/algorithm before trusting metrics.
- Create a minimal baseline implementation from the paper and use adjacent repos as references.
Reproduction readiness
Hardware requirements
- Expect multi-day setup/compute for meaningful reproduction based on current guidance.
No verified implementation available
- · No maintained repository has been identified for this paper. Check adjacent implementations or HF artifacts below.
Framework baselines
- Hugging Face Transformers training guide
Modern transformer training baseline.
- PyTorch nn.Transformer docs
Reference transformer building block implementation.
Closest related implementations
These are not paper-verified. Use them as reference points when no direct implementation is available.
- cmhungsteve/Awesome-Transformer-AttentionAdjacentConfidence: MediumStars: 5,035
Matches contextual method/domain keyword: transformer
Hugging Face artifacts
No trustworthy direct or curated related Hugging Face artifacts were found yet.
Continue with targeted Hugging Face searches derived from the paper title and method context:
Tip: start with models, then check datasets/spaces if you need evaluation data or demos.
Direct artifact matches are currently sparse. Use targeted Hugging Face searches to quickly locate candidate models, datasets, and demos.
Research context
Tasks
Transformer
Methods
Transformer
Domains
Natural Language Processing
Evaluation & Human Feedback Data
Open this paper in HFEPX to review benchmark signals, evaluation modes, and human-feedback protocol context.
Open in HFEPXExplore Similar Papers
Jump to Paper2Code search queries derived from this paper's research context.
Need human evaluators for your AI research? Scale annotation with expert AI Trainers.