Open-TQ-Metal is the first implementation of fused compressed-domain attention on Apple Silicon, enabling 128K-context inference for Llama 3.1 70B on a single consumer Mac with 64 GB of memory, a configuration previously out of reach for existing inference frameworks.
Open-TQ-Metal's core innovation is its KV cache management: it quantizes the KV cache to int4 on the fly and computes attention directly on the compressed representation using custom Metal compute shaders, never materializing dequantized intermediate matrices. Across 330 experiments spanning the Gemma 4 31B and Llama 3.1 70B model families, the fused sdpa_int4 kernel achieved a 48x attention speedup at 128K context over a dequantize-then-attend baseline.
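The kernel itself is Metal, but the arithmetic it fuses can be sketched in NumPy. Everything below is illustrative rather than Open-TQ-Metal's actual format: the group size, the asymmetric scale/zero-point layout, and the per-row loop are assumptions. The point is that dequantization happens inside the attention computation, so no full-precision K or V tensor is ever produced ahead of time.

```python
import numpy as np

GROUP = 32  # assumed quantization group size; hypothetical, not from the source

def quantize_int4(x, group=GROUP):
    """Asymmetric int4 quantization: each group of `group` values shares an
    fp16 scale and zero-point; codes occupy 4 bits each (0..15)."""
    xg = x.reshape(-1, group)
    lo = xg.min(axis=-1, keepdims=True)
    hi = xg.max(axis=-1, keepdims=True)
    scale = np.where(hi > lo, (hi - lo) / 15.0, 1.0)
    codes = np.clip(np.round((xg - lo) / scale), 0, 15).astype(np.uint8)
    return codes, scale.astype(np.float16), lo.astype(np.float16)

def dequant(codes, scale, zero):
    """Decode one compressed row back to float32."""
    return codes.astype(np.float32) * scale.astype(np.float32) + zero.astype(np.float32)

def fused_sdpa_int4(q, k_cache, v_cache):
    """Attention over the compressed cache: each cached key/value row is
    decoded on the fly inside the score and output loops, so no
    full-precision K or V matrix is materialized (a real Metal kernel
    would do this per-threadgroup in registers)."""
    d = q.shape[0]
    scores = np.array([q @ dequant(*kt).ravel() for kt in k_cache]) / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return sum(wi * dequant(*vt).ravel() for wi, vt in zip(w, v_cache))

# Demo: compress a small random KV cache and attend over it.
rng = np.random.default_rng(0)
d, T = 64, 16
q = rng.standard_normal(d).astype(np.float32)
K = rng.standard_normal((T, d)).astype(np.float32)
V = rng.standard_normal((T, d)).astype(np.float32)
out = fused_sdpa_int4(q,
                      [quantize_int4(K[t]) for t in range(T)],
                      [quantize_int4(V[t]) for t in range(T)])
```

Because int4 groups track each row's local range, the compressed-domain output stays close to exact FP32 attention on the same data, which is the property that lets the real kernel skip the dequantize-then-attend round trip.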
The technology also reduced KV cache memory from 40 GB to 12.5 GB (a 3.2x compression) while preserving top-1 token predictions identical to FP16 inference. In addition, the work presents the first cross-architecture analysis of KV cache quantization methods, showing that the attention scale factor, not model size, determines whether angular quantization schemes such as PolarQuant succeed: Gemma 4's attn_scale=1.0 amplifies directional error 25-100x more than Llama's standard 1/sqrt(d) scaling.
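The 40 GB and 12.5 GB figures are internally consistent with a 4-bit payload plus per-group metadata. Assuming, purely for illustration, that each group of 32 elements shares an fp16 scale and an fp16 zero-point (an effective 5 bits per element; the group size is a guess that happens to reproduce the reported numbers), Llama 3.1 70B's published cache geometry gives exactly these sizes:

```python
# Llama 3.1 70B KV-cache geometry (public config): 80 layers, 8 KV heads
# (GQA), head_dim 128; 128K-token context, with both K and V cached.
layers, kv_heads, head_dim, ctx = 80, 8, 128, 128 * 1024
elems = 2 * layers * kv_heads * head_dim * ctx      # K and V elements

fp16_gb = elems * 2 / 2**30                         # 2 bytes per element
# Assumed int4 layout: 32 * 4 bits of codes + 2 * 16 bits of metadata
# per group = 160 bits / 32 elements = 5 bits per element.
int4_gb = elems * 5 / 8 / 2**30

print(fp16_gb, int4_gb, round(fp16_gb / int4_gb, 1))  # 40.0 12.5 3.2
```

Any metadata layout totaling one extra bit per element (e.g. smaller groups with narrower scales) yields the same 3.2x ratio, so the sketch pins down the overhead budget rather than the exact format.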
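A minimal way to see why the scale factor matters: attention logits are attn_scale * (q . k), so a fixed directional error in a quantized key produces a logit perturbation proportional to attn_scale. At head_dim 128, attn_scale=1.0 injects a sqrt(128) ~ 11.3x larger perturbation into softmax than 1/sqrt(d) scaling, and softmax's exponential then compounds it; this linear sketch is not the paper's 25-100x measurement, only the mechanism behind its direction.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 128
q = rng.standard_normal(d)
k = rng.standard_normal(d)
dk = rng.standard_normal(d) * 0.05  # stand-in for angular quantization error

def logit_error(attn_scale):
    """Absolute change in one attention logit caused by the key error dk."""
    return abs(attn_scale * (q @ (k + dk)) - attn_scale * (q @ k))

# Identical key error, two scale conventions:
amplification = logit_error(1.0) / logit_error(1 / np.sqrt(d))
print(round(amplification, 2))  # 11.31, i.e. sqrt(128)
```

The ratio is exactly sqrt(d) regardless of the error vector, which is why the cross-architecture analysis points at the scale convention rather than parameter count.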