SPIN Framework Unifies Sparse Attention with Hierarchical Memory for Scalable Long-Context LLM Serving

Long-context Large Language Model (LLM) serving is bottlenecked by the cost of attending over ever-growing Key-Value (KV) caches. Dynamic sparse attention offers a promising remedy: each decoding step accesses only a small, query-dependent subset of the KV state, and KV storage can be extended to CPU memory. In practice, however, these algorithmic savings seldom translate into end-to-end system-level gains. One reason is that sparse methods operate at different granularities, which forces ad hoc, algorithm-specific implementations. Another is that hierarchical KV storage spanning GPU and CPU memory introduces a new systems bottleneck: retrieving fine-grained, irregular KV subsets across the GPU-CPU boundary can easily negate the benefits of sparsity.
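
To give a sense of scale, the following back-of-the-envelope sketch estimates the KV cache footprint of a single request. The model shape (32 layers, 8 KV heads, head dimension 128, 16-bit values) and the 2,048-token sparse budget are illustrative assumptions, not figures from the SPIN evaluation, and the function name is hypothetical.

```python
def kv_cache_bytes(num_tokens, num_layers=32, num_kv_heads=8,
                   head_dim=128, bytes_per_elem=2):
    """Bytes of KV state held for one request.

    Each token stores one key and one value vector (the factor 2)
    per layer and per KV head. All default shapes are assumptions.
    """
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return num_tokens * per_token


GIB = 2 ** 30
MIB = 2 ** 20

context_tokens = 128 * 1024   # assumed 128K-token context
sparse_budget = 2048          # assumed tokens attended per decoding step

full = kv_cache_bytes(context_tokens)
touched = kv_cache_bytes(sparse_budget)

print(f"full KV cache:      {full / GIB:.1f} GiB")
print(f"touched per step:   {touched / MIB:.1f} MiB")
print(f"fraction attended:  {touched / full:.4f}")
```

Under these assumptions the full cache is far larger than what a sparse method reads per step, which is why offloading most of it to CPU memory is attractive, and why the cost of fetching the selected subset across PCIe decides whether the savings survive.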

To overcome these challenges, researchers have introduced SPIN, a sparse-attention-aware inference framework that co-designs the execution pipeline with hierarchical KV storage through three key techniques:

  1. A unified partition abstraction: This mechanism maps different sparsity granularities onto a shared page-based KV substrate, providing a standardized approach for various sparse attention methods.
  2. A locality-aware KV cache manager: This manager dynamically sizes per-request High Bandwidth Memory (HBM) budgets and employs a GPU-friendly bucketed Least Recently Used (LRU) policy to significantly reduce PCIe round-trip latencies.
  3. A two-level hierarchical metadata layout: Crucially, this layout is sized to accommodate the active working set rather than relying on the worst-case address space, optimizing memory usage and access efficiency.
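
The first two techniques can be illustrated with a minimal Python sketch. This is not SPIN's implementation: the summary above does not describe its data structures, so every name here (`pages_for_partitions`, `BucketedLRU`, `bucket_width`) and the eviction details are assumptions chosen for illustration. The sketch maps selections made at any granularity onto fixed-size pages, then tracks page residency with an approximate LRU that records only a coarse time bucket per page instead of an exact recency order.

```python
def pages_for_partitions(partition_ids, partition_size, page_size):
    """Map selected partitions to the sorted page ids that cover them.

    A partition is assumed to be `partition_size` consecutive tokens;
    partition_size=1 models token-granular selection.
    """
    pages = set()
    for pid in partition_ids:
        first_token = pid * partition_size
        last_token = first_token + partition_size - 1
        pages.update(range(first_token // page_size,
                           last_token // page_size + 1))
    return sorted(pages)


class BucketedLRU:
    """Approximate LRU over pages resident in GPU memory (hypothetical).

    Each page stores only the bucket index of its last access, so
    eviction needs no linked list, just a comparison of small integers.
    """

    def __init__(self, capacity, bucket_width=4):
        self.capacity = capacity          # max resident pages
        self.bucket_width = bucket_width  # decoding steps per bucket
        self.bucket_of = {}               # page id -> last-access bucket
        self.step = 0

    def access(self, page_ids):
        """Touch pages for one decoding step; return (misses, evicted)."""
        bucket = self.step // self.bucket_width
        self.step += 1
        needed = set(page_ids)
        # Misses are pages that would have to be fetched from CPU memory.
        misses = sorted(p for p in needed if p not in self.bucket_of)
        for p in needed:
            self.bucket_of[p] = bucket
        evicted = []
        overflow = len(self.bucket_of) - self.capacity
        if overflow > 0:
            # Oldest bucket first; pages needed this step are protected.
            candidates = sorted((b, p) for p, b in self.bucket_of.items()
                                if p not in needed)
            for _, p in candidates[:overflow]:
                del self.bucket_of[p]
                evicted.append(p)
        return misses, evicted


# Two selection granularities land on the same page substrate.
token_level = pages_for_partitions([3, 17, 40], partition_size=1, page_size=16)
block_level = pages_for_partitions([1], partition_size=32, page_size=16)

cache = BucketedLRU(capacity=2, bucket_width=1)
first = cache.access([0, 1])
second = cache.access([1, 2])
```

Because both selections resolve to page ids, one cache manager can serve different sparse attention methods. In this run the second step misses only page 2 and evicts page 0, the page in the oldest bucket.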

SPIN was built upon vLLM and validated using three representative sparse attention algorithms. It delivers 1.66 to 5.66 times higher end-to-end throughput than vLLM. It also achieves a 7 to 9 times reduction in Time To First Token (TTFT). Compared with the original sparse-attention implementations, SPIN reduces Time Per Output Token (TPOT) by up to 58%, underscoring its potential for scalable and efficient long-context LLM inference.