In long-context decoding for Large Language Models (LLMs) and Large Multimodal Models (LMMs), the attention mechanism runs into a growing memory bottleneck: each decoding step must load the entire KV-cache from GPU memory, and that cache grows linearly with context length.
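To see why decoding becomes memory-bound, a back-of-the-envelope sketch of the KV-cache size helps. The function below uses the publicly documented Llama-3.1-8B configuration (32 layers, 8 grouped-query KV heads, head dimension 128, fp16); the 512K sequence length matches the context size quoted later for the speedup result.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, dtype_bytes, seq_len):
    """Total KV-cache size: keys and values (factor 2) are stored for
    every layer, KV head, and cached position."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
    return per_token * seq_len

# Llama-3.1-8B public config: 32 layers, 8 KV heads (GQA), head_dim 128, fp16.
total = kv_cache_bytes(32, 8, 128, 2, 512 * 1024)
print(f"{total / 2**30:.0f} GiB")  # 64 GiB at a 512K-token context
```

At 64 GiB for a single 512K-token sequence, every decode step streams far more KV data than model weights, which is exactly the bottleneck the paper targets.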
Existing acceleration strategies frequently trade accuracy for efficiency through heuristic pruning, which often discards useful information. Moreover, they tend to indiscriminately preserve all high-scoring tokens, treat early tokens as indispensable anchors, or rely on heuristic head routing; these choices reflect an insufficient mechanistic understanding of the attention-sink phenomenon.
To address these limitations, the authors introduce SinkRouter. The paper's key insight is that the attention sink corresponds to a stable, reachable, and error-controllable fixed point constructed during training.
On this basis, SinkRouter operates as a training-free selective routing framework: it detects the attention-sink signal and skips the computations that would otherwise produce near-zero output, improving decoding efficiency without any retraining.
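The paper's exact detection criterion is not reproduced here; as a hypothetical illustration only, a router for a single head at one decode step could check whether the softmax mass collapses onto the sink token and, if so, short-circuit the expensive weighted reduction over the value cache. The threshold and the "token 0 is the sink" convention below are assumptions for the sketch:

```python
import numpy as np

def sink_routed_attention(q, K, V, sink_thresh=0.95):
    """Hypothetical sketch of sink-aware routing (not the paper's rule).
    q: (d,) query; K, V: (T, d) cached keys/values; token 0 is the sink."""
    scores = K @ q / np.sqrt(q.shape[0])
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    if probs[0] >= sink_thresh:
        # Sink dominates: the output is ~probs[0] * V[0], which is
        # near-zero when the sink's value vector is near-zero, so the
        # full reduction over V can be skipped.
        return probs[0] * V[0]
    return probs @ V
```

In a real kernel the saving comes from never loading the skipped value blocks from GPU memory at all; here the branch merely short-circuits the final reduction.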
To translate this mechanism into real-world hardware acceleration, SinkRouter integrates a hardware-aware Triton kernel that combines block-level branching with Split-K parallelism, improving GPU resource utilization.
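Split-K itself is a standard decode-time reduction pattern: partition the key/value sequence across parallel workers, let each compute a partial softmax reduction over its chunk, and merge the partials with a log-sum-exp rescaling. The paper's kernel does this in Triton across GPU blocks; the serial sketch below shows only the merge numerics, with chunk sizes chosen arbitrarily:

```python
import numpy as np

def attention_split_k(q, K, V, num_splits=4):
    """Attention for one query, reduced over KV chunks and merged with
    a numerically stable running-max (log-sum-exp) rescaling."""
    d = q.shape[0]
    m, s, acc = -np.inf, 0.0, np.zeros(d)  # running max, sum, output
    for idx in np.array_split(np.arange(K.shape[0]), num_splits):
        scores = K[idx] @ q / np.sqrt(d)
        m_c = scores.max()
        p = np.exp(scores - m_c)            # chunk-local softmax numerators
        m_new = max(m, m_c)
        # Rescale the previous accumulator and the chunk's partials onto
        # a common max before combining.
        s = s * np.exp(m - m_new) + p.sum() * np.exp(m_c - m_new)
        acc = acc * np.exp(m - m_new) + (p @ V[idx]) * np.exp(m_c - m_new)
        m = m_new
    return acc / s
```

Because each chunk's partial state is just (max, sum, accumulator), the chunks can be reduced in any order or in parallel, which is what lets Split-K keep more GPU blocks busy when only one query is being decoded.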
Extensive evaluations were conducted on a diverse suite of long-context benchmarks, including LongBench, InfiniteBench, CVBench, MileBench, and MMVP, using both text-only and multimodal backbones such as Llama-3.1-8B, Llama-3.1-70B, Yi-9B-200K, LLaVA-1.5-7B, and LLaVA-1.5-13B. Across these varied settings, SinkRouter consistently improves decoding efficiency while maintaining competitive accuracy, reaching a 2.03x speedup at a 512K-token context.