Long-context inference makes the KV cache one of the primary cost bottlenecks when serving Large Language Models (LLMs). During autoregressive decoding, the cache size scales linearly with context length, batch size, and model depth. With contexts exceeding 100K tokens and many concurrent requests, the KV cache can consume a large share of GPU memory. Compressing it is a direct way to fit larger batches and reduce memory traffic.
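To make the scaling concrete, here is a rough back-of-the-envelope sketch in Python. The model shape (32 layers, 8 KV heads of dimension 128), the batch size, and the helper name kv_cache_bytes are illustrative assumptions, not figures from the OSCAR release, and the estimate ignores the per-group scale and zero-point metadata that a real quantized cache also stores.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   context_len, batch_size, bits_per_value):
    """Bytes needed to store keys and values for every layer, head, and token."""
    # Factor of 2: one key vector and one value vector per token, head, and layer.
    values_per_token = 2 * num_layers * num_kv_heads * head_dim
    total_bits = values_per_token * context_len * batch_size * bits_per_value
    return total_bits // 8


# Hypothetical configuration, chosen only for illustration.
cfg = dict(num_layers=32, num_kv_heads=8, head_dim=128,
           context_len=100_000, batch_size=8)

fp16_gib = kv_cache_bytes(**cfg, bits_per_value=16) / 2**30
int2_gib = kv_cache_bytes(**cfg, bits_per_value=2) / 2**30
print(f"FP16: {fp16_gib:.1f} GiB, INT2: {int2_gib:.1f} GiB")
```

Under these assumptions the FP16 cache needs roughly 98 GiB, while the 2-bit payload is one eighth of that, before quantization metadata is counted.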
However, pushing KV caches to ultra-low INT2 (2-bit) precision has historically been impractical. Prior approaches either caused catastrophic accuracy loss or required custom serving layouts that broke compatibility with existing paged KV-cache serving systems. Together AI’s newly open-sourced OSCAR (Offline Spectral Covariance-Aware Rotation) is designed to address both problems.
Why INT2 KV Cache Quantization Is Hard
KV activations naturally exhibit prominent channel-wise outliers, where a small fraction of channels contain exceptionally large values. When applying INT2 quantization—which offers only four representable levels—these outliers dominate the scaling factors. As a result, the quantizer wastes its limited dynamic range on rare spikes, compressing normal values into just one or two effective levels, which severely degrades attention calculations.
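The effect of a single outlier channel on a four-level quantizer can be reproduced with a small NumPy sketch. The min-max quantizer below (quantize_int2) and the synthetic data are assumptions chosen for illustration; they are not the quantizer used by OSCAR or by any particular serving engine.

```python
import numpy as np


def quantize_int2(x):
    """Asymmetric min-max quantization to 4 levels (2 bits), then dequantize."""
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / 3.0  # 4 levels span 3 intervals
    codes = np.clip(np.round((x - lo) / scale), 0, 3)
    return codes.astype(np.int64), codes * scale + lo


rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=128)  # "normal" channel values
x_outlier = x.copy()
x_outlier[5] = 50.0                 # one synthetic outlier channel

codes_clean, deq_clean = quantize_int2(x)
codes_out, deq_out = quantize_int2(x_outlier)

normal = np.arange(x.size) != 5     # mask that excludes the outlier channel
levels_clean = len(np.unique(codes_clean[normal]))
levels_out = len(np.unique(codes_out[normal]))
mse_clean = np.mean((x[normal] - deq_clean[normal]) ** 2)
mse_out = np.mean((x_outlier[normal] - deq_out[normal]) ** 2)

print("levels used by normal channels:", levels_clean, "vs", levels_out)
print("MSE on normal channels:", mse_clean, "vs", mse_out)
```

With the outlier present, the step size is set by the spike, so every normal channel collapses onto the same code and the reconstruction error on those channels grows sharply.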
While rotation-based methods (like Hadamard transforms) mitigate this by applying a fixed orthogonal transformation to spread outlier energy across channels, they remain data-oblivious. They can smooth activation ranges but cannot identify which directions the attention mechanism actually reads. Distributing quantization error uniformly is not the same as driving it into low-importance directions. At INT2, this distinction is critical to whether the model remains functional.
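A minimal sketch of the rotation idea, assuming a normalized Hadamard matrix built with Sylvester's construction and the same kind of synthetic outlier vector as before. The matrix is fixed in advance and never looks at the data, which is what "data-oblivious" means here.

```python
import numpy as np


def hadamard(n):
    """Orthonormal Hadamard matrix via Sylvester's construction (n a power of two)."""
    H = np.array([[1.0]])
    base = np.array([[1.0, 1.0], [1.0, -1.0]])
    while H.shape[0] < n:
        H = np.kron(H, base)
    return H / np.sqrt(n)


rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=128)
x[5] = 50.0  # one synthetic outlier channel

H = hadamard(128)
x_rot = H @ x  # fixed rotation, independent of the data

# Fraction of the vector's norm carried by its largest single entry.
peak_before = np.abs(x).max() / np.linalg.norm(x)
peak_after = np.abs(x_rot).max() / np.linalg.norm(x_rot)
print(f"peak share before: {peak_before:.3f}, after: {peak_after:.3f}")
```

The rotation preserves the vector's norm but spreads the spike across all 128 coordinates, so the quantizer's range is no longer dictated by one channel. Nothing in H reflects which directions the queries actually emphasize.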
What OSCAR Does Differently
OSCAR’s key insight is that the rotation applied before quantization should be derived directly from attention statistics, rather than raw KV activation distributions.
For keys, the downstream error that matters is not the Euclidean reconstruction error of K, but the error in attention logits: ‖QK⊤ − QK̂⊤‖_F² = tr((K − K̂)Q⊤Q(K − K̂)⊤). Here, the weighting matrix is Q⊤Q, the unnormalized query covariance. Directions where queries have large energy amplify quantization errors. OSCAR estimates the empirical query covariance C_Q = (1/N) Σ_n q_n⊤q_n (with each query q_n a row vector) from a calibration set, performs an eigendecomposition, and uses the eigenvectors U_Q as the key rotation basis.
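The identity and the covariance estimate can be checked numerically. The sketch below uses synthetic queries and keys, with additive noise standing in for a dequantized key cache; the shapes and variable names are assumptions, and it does not reproduce how OSCAR composes or applies its final rotation.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, d = 512, 64, 16  # calibration queries, cached keys, head dim (hypothetical)

# Synthetic queries with anisotropic energy: a few channels dominate.
Q = rng.normal(size=(N, d)) * np.linspace(3.0, 0.1, d)
K = rng.normal(size=(T, d))
K_hat = K + 0.05 * rng.normal(size=(T, d))  # stand-in for a dequantized key cache

# Logit error equals the key error weighted by Q^T Q.
E = K - K_hat
lhs = np.linalg.norm(Q @ K.T - Q @ K_hat.T, "fro") ** 2
rhs = np.trace(E @ (Q.T @ Q) @ E.T)

# Empirical query covariance and its eigenbasis.
C_Q = (Q.T @ Q) / N
eigvals, U_Q = np.linalg.eigh(C_Q)  # ascending eigenvalues, orthonormal columns

# In the eigenbasis, the error splits into per-direction terms scaled by eigenvalues.
per_direction = N * eigvals * np.sum((E @ U_Q) ** 2, axis=0)

# Rotating queries and keys by the same orthogonal basis leaves logits unchanged.
logits = Q @ K.T
logits_rot = (Q @ U_Q) @ (K @ U_Q).T

print(np.isclose(lhs, rhs), np.isclose(per_direction.sum(), rhs))
print(np.allclose(logits, logits_rot))
```

The per_direction terms show why the basis matters: the same amount of key error costs more along eigenvectors with large eigenvalues, so a quantizer that knows the basis can keep error out of those directions.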
For values, the error propagates through the attention output SV, which depends on how the attention score matrix S weights each value row. The research team defines the score-weighted value covariance as C_S = (1/N) V⊤S⊤SV. Directions that remain large after aggregation by S are those where quantization error propagates. OSCAR uses the eigenvectors U_S of C_S as the value rotation basis.
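A matching sketch for the value side, again on synthetic data. The softmax scores are generated from random logits, and the shapes are assumptions; the code only illustrates the definition of C_S and its eigenbasis, not OSCAR's calibration pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, d = 256, 64, 16  # query rows, cached tokens, head dim (hypothetical)

# Row-wise softmax over random logits as a stand-in for attention scores.
raw = rng.normal(size=(N, T))
S = np.exp(raw - raw.max(axis=1, keepdims=True))
S /= S.sum(axis=1, keepdims=True)

# Synthetic values with uneven per-channel energy.
V = rng.normal(size=(T, d)) * np.linspace(2.0, 0.2, d)

SV = S @ V                          # attention output, shape (N, d)
C_S = (V.T @ S.T @ S @ V) / N       # score-weighted value covariance, shape (d, d)
eigvals, U_S = np.linalg.eigh(C_S)  # orthonormal eigenvectors as columns

# Rotating values by U_S and undoing it on the output recovers S V exactly.
SV_roundtrip = (S @ (V @ U_S)) @ U_S.T

print(np.allclose(C_S, SV.T @ SV / N), np.allclose(SV, SV_roundtrip))
```

Because C_S equals (SV)⊤(SV)/N, its eigenvectors describe the directions that carry the most energy in the aggregated attention output rather than in the raw values.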
The final composed rotation matrices are computed entirely offline during a calibration phase, so online serving engines such as vLLM can apply them with minimal runtime overhead. The result is near-lossless 2-bit quantization that remains fully compatible with paged KV-cache layouts.
[AgentUpdate Depth Analysis] Long-context processing is a fundamental enabler for AI agents that perform multi-step planning, long-horizon workflows, and extensive document retrieval. Large KV cache footprints, however, have been a chief bottleneck for concurrent multi-agent execution. Together AI’s OSCAR targets this bottleneck by compressing the KV cache to 2 bits without severe quality degradation. By steering quantization error away from the directions attention relies on, covariance-aware rotation raises serving density: more concurrent requests fit in the same GPU memory. Compared to sparse pruning or standard INT4 quantization, OSCAR is presented as preserving the model's core reasoning behavior. This is a significant milestone for the agent ecosystem: it lowers enterprise serving costs and opens a viable path to running complex, long-context agent workloads on memory-constrained edge hardware.