
TriAttention: Efficient Long Reasoning with Trigonometric KV Compression

Weian Mao, Xi Lin, Wei Huang, Yuxin Xie, Tianfu Fu +3 more

April 6, 2026 | arXiv: 2604.04921

0 repos | ~a few days to reproduce

Abstract

Extended reasoning in large language models (LLMs) creates severe KV cache memory bottlenecks. Leading KV cache compression methods estimate KV importance using attention scores from recent post-RoPE queries. However, queries rotate with position during RoPE, making representative queries very few, leading to poor top-key selection and unstable reasoning. To avoid this issue, we turn to the pre-RoPE space, where we o...

Summary

TriAttention compresses the KV cache by operating in the pre-RoPE space: it exploits the stable concentration of Q/K vectors around fixed non-zero centers to estimate key importance, instead of relying on post-RoPE attention scores. This page includes benchmark evidence for math reasoning on MATH 500. Reproduction guidance focuses on implementation viability and concrete risk controls.
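To make the pre-RoPE argument concrete, the derivation below is the standard RoPE identity, not the paper's exact scoring rule. It assumes the usual convention that each pair of head dimensions $(2j, 2j+1)$ is treated as one complex number and rotated by the angle $m\theta_j$ at position $m$. The symbols $\tilde q_j$, $\tilde k_j$, $a_j$, $b_j$, and $\Delta$ are notation introduced here for illustration.

```latex
\begin{aligned}
\tilde q_j &= q_{2j} + i\,q_{2j+1}, \qquad \tilde k_j = k_{2j} + i\,k_{2j+1},\\
\langle \mathrm{RoPE}(q, m),\ \mathrm{RoPE}(k, n) \rangle
  &= \sum_{j} \operatorname{Re}\!\left[ \tilde q_j e^{i m \theta_j}\,
       \overline{\tilde k_j e^{i n \theta_j}} \right]
   = \sum_{j} \operatorname{Re}\!\left[ \tilde q_j \overline{\tilde k_j}\,
       e^{i \Delta \theta_j} \right], \qquad \Delta = m - n,\\
  &= \sum_{j} \bigl( a_j \cos(\Delta \theta_j) - b_j \sin(\Delta \theta_j) \bigr),
  \qquad a_j + i\,b_j = \tilde q_j \overline{\tilde k_j}.
\end{aligned}
```

The post-RoPE query $\tilde q_j e^{i m \theta_j}$ changes with every position $m$, whereas the pre-RoPE coefficients $a_j, b_j$ do not. If pre-RoPE queries cluster near a fixed center, substituting that center for $q$ leaves a per-key function of the distance $\Delta$ alone, which is one plausible reading of the "trigonometric series" described in this summary.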

Key Contributions

TriAttention performs KV cache compression by operating in the pre-RoPE space, leveraging stable Q/K vector concentration around fixed non-zero centers to estimate key importance instead of relying on post-RoPE attention scores.
TriAttention uses a trigonometric series derived from Q/K centers to model distance preferences between queries and keys, scoring keys according to their positions and additionally incorporating Q/K norms.
On AIME25 with 32K-token generation, TriAttention matches Full Attention reasoning accuracy while providing either about 2.5× higher throughput or roughly 10.7× KV cache memory reduction.
On the MATH 500 benchmark, TriAttention achieves similar peak accuracy to R-KV under a higher KV budget and substantially higher accuracy than R-KV when matched on KV memory budget.
Implementation and experimental details for TriAttention must be reverse-engineered from the paper and citation graph, as no ready-to-use open-source codebase is currently identified by the metadata.
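Since no reference code is available, the following is a minimal NumPy sketch of how position-based key scoring in the pre-RoPE space could look. It is not the paper's algorithm: it only uses the standard RoPE identity that the logit between a query at position m and a key at position n is a sum of cosines and sines of (m - n) times each RoPE frequency. The function names (`rope_rotate`, `trig_scores`, `select_keys`), the random query center, and the top-k selection rule are all assumptions made for illustration. Q/K norms enter only implicitly here, through the magnitude of the complex products.

```python
import numpy as np

def rope_rotate(x, pos, theta):
    """Apply standard RoPE to a vector x of shape [d] at integer position pos.
    Adjacent dims (2j, 2j+1) are rotated by the angle pos * theta[j]."""
    xc = x[0::2] + 1j * x[1::2]
    rc = xc * np.exp(1j * pos * theta)
    out = np.empty_like(x)
    out[0::2], out[1::2] = rc.real, rc.imag
    return out

def trig_scores(q_center, keys, key_pos, query_pos, theta):
    """Score pre-RoPE keys against a pre-RoPE query center (hypothetical helper).
    score_n = sum_j Re[q_j * conj(k_nj) * exp(i * (query_pos - n) * theta_j)]."""
    qc = q_center[0::2] + 1j * q_center[1::2]        # shape [d/2]
    kc = keys[:, 0::2] + 1j * keys[:, 1::2]          # shape [n, d/2]
    delta = (query_pos - key_pos)[:, None]           # shape [n, 1]
    phase = np.exp(1j * delta * theta[None, :])      # shape [n, d/2]
    return np.real(qc[None, :] * np.conj(kc) * phase).sum(axis=1)

def select_keys(scores, budget):
    """Return indices of the `budget` highest-scoring keys, in position order."""
    keep = np.argpartition(scores, -budget)[-budget:]
    return np.sort(keep)

rng = np.random.default_rng(0)
d, n = 8, 16
theta = 10000.0 ** (-np.arange(d // 2) * 2.0 / d)    # standard RoPE frequencies
q_center = rng.normal(size=d) + 1.0                  # assumed non-zero query center
keys = rng.normal(size=(n, d))                       # stand-in pre-RoPE keys
key_pos = np.arange(n)
query_pos = n                                        # score from the next position

scores = trig_scores(q_center, keys, key_pos, query_pos, theta)

# Reference: rotate query and keys explicitly, then take dot products.
explicit = np.array([
    rope_rotate(q_center, query_pos, theta) @ rope_rotate(keys[i], key_pos[i], theta)
    for i in range(n)
])
kept = select_keys(scores, budget=4)
```

The `explicit` array recomputes the same logits by rotating vectors first, so comparing it with `scores` is a quick sanity check that the trigonometric form matches post-RoPE attention for a fixed query. A real reproduction would still need the paper's definition of the centers, the norm term, and the eviction schedule.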

Reproducibility Notes

The reproduction-time estimate (a few days) assumes a paper-only reproduction flow, since no repository is available.

Results & Benchmarks

Model	Method	Benchmark / setting	Value
Transformer	Full Attention	AIME24	57.1
Transformer	FullKV	10%	46.7

Hardware Requirements

Expect multi-day setup/compute for meaningful reproduction based on current guidance.
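As a rough sizing aid, the sketch below estimates KV cache memory for a 32K-token generation and shows what the roughly 10.7× reduction reported for AIME25 would mean in absolute terms. The model shape (32 layers, 8 KV heads, head dimension 128, 2-byte elements) is a hypothetical example, not taken from the paper, and `kv_cache_bytes` is a helper defined here for illustration.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_elem=2, batch=1):
    """Bytes needed to store keys and values for every layer.
    The leading factor 2 accounts for one K and one V tensor per layer."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * bytes_per_elem * batch)

# Hypothetical model shape; substitute the values of the model under test.
full = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128,
                      seq_len=32 * 1024)
compressed = full / 10.7   # reduction factor reported on this page

print(f"full: {full / 2**30:.2f} GiB, compressed: {compressed / 2**30:.2f} GiB")
# prints "full: 4.00 GiB, compressed: 0.37 GiB"
```

The figure scales linearly with batch size and sequence length, so the saving matters most when many long reasoning traces are decoded concurrently.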

Best Implementation

No maintained implementation has been confirmed for this paper yet.

Use the Implementation Status and Reproduction Path sections below for the current action plan.

Reproduction Path

Follow this baseline workflow to decide if this paper is worth immediate prototyping.

1. Use the paper and benchmark evidence to scope a baseline reproduction plan.
2. Track assumptions and missing details in an experiment log before coding.

Framework baselines

Hugging Face Transformers training guide: modern transformer training baseline.
PyTorch nn.Transformer docs: reference transformer building block implementation.

Time to first repro: a few days. Estimate is based on paper-only reproduction flow.

Additional Implementations

No additional verified repositories beyond the primary recommendation.

Hugging Face Artifacts

No trustworthy direct or curated related Hugging Face artifacts were found yet.

Continue with targeted Hugging Face searches:

models: arxiv:2604.04921 TriAttention Natural Language Processing
datasets: arxiv:2604.04921 TriAttention dataset
spaces: arxiv:2604.04921 TriAttention demo

Research Context