Together AI Open-Sources OSCAR: Attention-Aware 2-Bit KV Cache Quantization for Long-Context LLMs

In long-context Large Language Model (LLM) inference, the KV cache is one of the primary cost factors. During autoregressive decoding, the cache grows with context length, batch size, and model depth. At high batch sizes and long contexts, potentially spanning 100K tokens across dozens of concurrent requests, it can consume a substantial portion of GPU memory. Compressing it is a direct way to increase batch size and reduce memory traffic.

The most straightforward approach is quantization. However, pushing KV caches to 2-bit (INT2) precision has historically been impractical: prior methods typically either degrade accuracy significantly or require custom serving layouts that are incompatible with established paged KV-cache systems. Together AI’s OSCAR (Offline Spectral Covariance-Aware Rotation) is designed to address both problems.

The Challenge of INT2 KV Cache Quantization

KV activations are characterized by channel-wise outliers, where a small subset of channels exhibits extremely large values while most remain well-behaved. When applying INT2 quantization, which provides only four representable levels, these outliers disproportionately influence the scale factor. This leads to the quantizer dedicating most of its limited range to capturing these rare spikes. Consequently, normal values are compressed into just one or two effective levels, substantially degrading attention quality.
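The effect described above can be reproduced with a toy example. The sketch below is illustrative only: it assumes a simple per-vector min/max uniform quantizer (the helper `quantize_int2` is hypothetical, not OSCAR's or any library's quantizer) and injects a single synthetic outlier channel into otherwise Gaussian activations.

```python
import numpy as np

def quantize_int2(x):
    """Toy min/max uniform quantizer with 4 levels (assumed scheme, for illustration)."""
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / 3.0  # 4 levels means 3 steps between min and max
    if scale == 0:
        return np.zeros(x.shape, dtype=np.int64), x.copy()
    codes = np.clip(np.round((x - lo) / scale), 0, 3).astype(np.int64)
    return codes, codes * scale + lo  # integer codes and dequantized values

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=128)   # well-behaved channels
x_out = x.copy()
x_out[7] = 60.0                      # one synthetic outlier channel

codes_clean, deq_clean = quantize_int2(x)
codes_out, deq_out = quantize_int2(x_out)

normal = np.arange(128) != 7         # look only at the non-outlier channels
levels_clean = np.unique(codes_clean[normal]).size
levels_out = np.unique(codes_out[normal]).size
mse_clean = np.mean((x[normal] - deq_clean[normal]) ** 2)
mse_out = np.mean((x_out[normal] - deq_out[normal]) ** 2)

print("distinct codes used by normal channels:", levels_clean, "vs", levels_out)
print("MSE on normal channels:", mse_clean, "vs", mse_out)
```

Because the outlier stretches the quantization range, the non-outlier channels land on fewer distinct codes and their reconstruction error rises, which is the failure mode the paragraph describes.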

Rotation-based quantization attempts to mitigate this by applying a fixed orthogonal transform, often a Hadamard transform, to redistribute outlier energy across all channels. This yields reasonable results at INT4, but a deeper problem persists at INT2: the rotation is data-oblivious. It can smooth activation ranges, but it has no knowledge of the directions the attention mechanism actually reads. Spreading quantization error uniformly is not the same as steering it into low-importance directions, and at INT2, with only four levels, that difference becomes critical.
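As a rough illustration of the data-oblivious rotation described above, the following sketch builds a normalized Hadamard matrix with the Sylvester construction and applies it to a vector containing one synthetic outlier. The helper name `hadamard` and the test vector are assumptions made for this example; it does not reproduce any specific prior method.

```python
import numpy as np

def hadamard(n):
    """Orthonormal Hadamard matrix via the Sylvester construction (n a power of two)."""
    assert n > 0 and (n & (n - 1)) == 0, "n must be a power of two"
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)  # scaling makes H @ H.T the identity

rng = np.random.default_rng(0)
d = 128
k = rng.normal(0.0, 1.0, size=d)
k[7] = 60.0                      # synthetic outlier channel

H = hadamard(d)
k_rot = H @ k                    # fixed rotation, independent of the data

peak_before = np.abs(k).max()
peak_after = np.abs(k_rot).max()
print("peak magnitude before/after rotation:", peak_before, peak_after)
```

The rotation preserves the vector's norm while lowering its peak magnitude, so the quantization range shrinks. Nothing in `H` depends on queries or attention scores, which is the limitation the text points out.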

How OSCAR Innovates

OSCAR's core insight is that the rotation applied before quantization should be derived from the attention statistics themselves, rather than merely from the raw distribution of KV activations.

For keys, the critical downstream error is not the Euclidean reconstruction error of K, but rather the error in attention logits. The research team demonstrated this error to be: ‖QK⊤ − QK̂⊤‖²_F = tr((K − K̂) Q⊤Q (K − K̂)⊤). Crucially, the weighting matrix here is the query covariance Q⊤Q, not K⊤K. Directions where queries possess high energy amplify quantization errors within the logits. OSCAR addresses this by estimating the empirical query covariance C_Q = (1/N) Σ_n q_n⊤ q_n from a calibration set, performing an eigen-decomposition, and utilizing the eigenvectors U_Q as the key rotation basis.
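The trace identity and the role of the query covariance can be checked numerically. The sketch below is not OSCAR's implementation: it uses random stand-in data, treats queries and keys as row vectors, and models the quantized keys as `K` plus small noise. It verifies that the logit error equals the trace form, that the error splits into per-eigendirection terms weighted by the eigenvalues of the query covariance, and that rotating queries and keys by the same orthogonal basis leaves the logits unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, d = 512, 64, 16  # calibration queries, cached keys, head dimension (all assumed)

# Hypothetical calibration queries with uneven energy across channels.
Q = rng.normal(size=(N, d)) * np.linspace(0.1, 3.0, d)
K = rng.normal(size=(T, d))
K_hat = K + 0.05 * rng.normal(size=(T, d))  # stand-in for dequantized keys

# Empirical query covariance and its eigenbasis (eigh: symmetric input, ascending eigenvalues).
C_Q = Q.T @ Q / N
eigvals, U_Q = np.linalg.eigh(C_Q)

dK = K - K_hat
logit_err = np.linalg.norm(Q @ K.T - Q @ K_hat.T, "fro") ** 2
trace_form = np.trace(dK @ (Q.T @ Q) @ dK.T)

# In the eigenbasis the weighting is diagonal: each direction's key error
# is scaled by N times the matching eigenvalue of C_Q.
dK_rot = dK @ U_Q
per_direction = N * eigvals * np.sum(dK_rot ** 2, axis=0)

# Rotating Q and K by the same orthogonal matrix does not change the logits.
logits_rot = (Q @ U_Q) @ (K @ U_Q).T

print("logit error:", logit_err, "trace form:", trace_form)
print("share of error from top eigendirection:", per_direction[-1] / per_direction.sum())
```

The per-direction split makes the argument concrete: error placed along high-eigenvalue query directions costs more in the logits than the same amount of error placed along low-eigenvalue directions.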

For values, the relevant error resides in the attention output SV. This error's magnitude depends on how the attention score matrix S weights each value row. The research team defined the score-weighted value covariance C_S = (1/N) V⊤S⊤SV. Directions that remain significant after aggregation by S are those through which quantization error propagates. OSCAR employs the eigenvectors U_S of C_S as the value rotation basis.
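A similar sketch applies on the value side. It assumes that N counts the query rows of S, builds S as a row-wise softmax over random logits, and uses random values, so it illustrates the definitions rather than OSCAR's actual calibration procedure. It checks that the score-weighted value covariance is the second-moment matrix of the attention outputs SV, and that rotating values into its eigenbasis and back leaves the output unchanged.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
N, T, d = 256, 64, 16  # query rows, cached tokens, head dimension (all assumed)

S = softmax(rng.normal(size=(N, T)) * 2.0)               # stand-in attention scores
V = rng.normal(size=(T, d)) * np.linspace(0.2, 2.0, d)   # stand-in values

O = S @ V                                   # attention output
C_S = V.T @ S.T @ S @ V / N                 # score-weighted value covariance
eigvals_S, U_S = np.linalg.eigh(C_S)        # symmetric eigen-decomposition

# Rotate values into the eigenbasis, aggregate, then rotate the output back.
V_rot = V @ U_S
O_back = (S @ V_rot) @ U_S.T

print("largest / smallest eigenvalue of C_S:", eigvals_S[-1], eigvals_S[0])
```

Since U_S is orthogonal, the rotation itself is lossless. Any benefit would come from how quantization error is distributed across the rotated directions.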

The final composed rotation...
