Long-context Large Language Model (LLM) serving is bottlenecked by the cost of attending over ever-growing Key-Value (KV) caches. Dynamic sparse attention offers a promising remedy: it accesses only a small, query-dependent subset of the KV state per decoding step and extends KV storage to CPU memory. Yet these algorithmic savings seldom translate into tangible end-to-end system-level gains. The discrepancy arises because sparse methods operate at varying granularities, necessitating ad hoc, algorithm-specific implementations. Moreover, hierarchical KV storage spanning GPU and CPU memory creates a new systems bottleneck: retrieving fine-grained, irregular KV subsets across the GPU-CPU boundary can easily negate the benefits of sparsity.
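To make the algorithmic idea concrete, here is a minimal sketch (not SPIN's implementation) of query-dependent page selection: at each decoding step, every KV page is scored against the current query, here via a per-page key centroid, which is one common approximation, and only the top-k pages participate in attention. The function name `select_pages` and the centroid scoring are illustrative assumptions.

```python
import torch

def select_pages(query, key_centroids, k):
    """Score each KV page by the dot product between the current query and a
    per-page key summary (here, the mean key of the page), then keep the
    top-k pages. Illustrative only; real methods differ in how pages are scored.

    query:          [num_heads, head_dim]            current decoding-step query
    key_centroids:  [num_pages, num_heads, head_dim] per-page key means
    returns:        [k] indices of the pages to attend over this step
    """
    # One relevance score per page, summed over heads.
    scores = torch.einsum("hd,phd->p", query, key_centroids)
    return torch.topk(scores, k=min(k, scores.numel())).indices

# Toy usage: 64 pages, 8 heads, 128-dim heads; attend over only 4 pages.
q = torch.randn(8, 128)
centroids = torch.randn(64, 8, 128)
active_pages = select_pages(q, centroids, k=4)
```

The point of the sketch is the shape of the computation: the per-step cost becomes proportional to the selected pages, not the full context, which is exactly the saving that the rest of the system must preserve.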
To overcome these challenges, researchers have introduced SPIN, a sparse-attention-aware inference framework that co-designs the execution pipeline with hierarchical KV storage through three key techniques:
- A unified partition abstraction: maps different sparsity granularities onto a shared page-based KV substrate, giving diverse sparse attention methods a common execution path (see the first sketch after this list).
- A locality-aware KV cache manager: dynamically sizes per-request High Bandwidth Memory (HBM) budgets and employs a GPU-friendly bucketed Least Recently Used (LRU) policy to cut PCIe round trips (second sketch below).
- A two-level hierarchical metadata layout: sized to the active working set rather than the worst-case address space, keeping metadata compact and cheap to access (third sketch below).
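A rough sketch of how a unified partition abstraction might look: different sparsity granularities (individual tokens, fixed-size blocks, whole pages) are normalized to a single set of page indices on the shared paged KV substrate. The `Partition` class, its constructors, and the page size are hypothetical, not SPIN's API.

```python
from dataclasses import dataclass

PAGE_SIZE = 16  # tokens per KV page; the actual page size is a tuning choice

@dataclass(frozen=True)
class Partition:
    """A selection of KV pages, regardless of the granularity it came from."""
    page_ids: frozenset

    @classmethod
    def from_tokens(cls, token_ids):
        # Token-granular methods: map each selected token to its owning page.
        return cls(frozenset(t // PAGE_SIZE for t in token_ids))

    @classmethod
    def from_blocks(cls, block_ids, block_size):
        # Block-granular methods: expand each block into the pages it spans.
        pages = set()
        for b in block_ids:
            start, end = b * block_size, (b + 1) * block_size
            pages.update(range(start // PAGE_SIZE, (end - 1) // PAGE_SIZE + 1))
        return cls(frozenset(pages))

    @classmethod
    def from_pages(cls, page_ids):
        # Page-granular methods map through directly.
        return cls(frozenset(page_ids))

# All three granularities end up as the same page-level request, so a single
# attention kernel and a single cache manager can serve every method.
p1 = Partition.from_tokens([3, 100, 101])
p2 = Partition.from_blocks([0, 2], block_size=32)
```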
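The bucketed LRU idea can be sketched as follows: instead of maintaining an exact recency order (which needs per-access list updates), pages are stamped with a coarse epoch bucket, and eviction drains the oldest non-empty bucket. The per-request HBM budget is modeled here as a simple cap on resident pages; all names are assumptions for illustration, not SPIN's implementation.

```python
from collections import defaultdict

class BucketedLRU:
    """Coarse-grained LRU over KV pages: a hit only restamps a page's bucket,
    and eviction scans buckets from oldest to newest. Illustrative sketch."""

    def __init__(self, hbm_budget_pages):
        self.hbm_budget = hbm_budget_pages   # per-request cap on resident pages
        self.epoch = 0                       # advanced once per decoding step
        self.page_bucket = {}                # page_id -> epoch of last use
        self.buckets = defaultdict(set)      # epoch -> resident page_ids

    def step(self):
        self.epoch += 1

    def touch(self, page_id):
        """Mark a page as used this step, evicting first if it must be fetched."""
        old = self.page_bucket.get(page_id)
        if old is not None:
            self.buckets[old].discard(page_id)
        elif len(self.page_bucket) >= self.hbm_budget:
            self._evict_one()                # make room before the PCIe fetch
        self.page_bucket[page_id] = self.epoch
        self.buckets[self.epoch].add(page_id)

    def _evict_one(self):
        for epoch in sorted(self.buckets):
            if self.buckets[epoch]:
                victim = self.buckets[epoch].pop()
                del self.page_bucket[victim]
                return victim
        return None

# Toy usage: a budget of 2 resident pages per request.
cache = BucketedLRU(hbm_budget_pages=2)
for pages in [[0, 1], [1, 2], [2, 3]]:
    cache.step()
    for p in pages:
        cache.touch(p)
```

The design choice this illustrates is that coarse buckets trade a little eviction precision for much cheaper bookkeeping, which is what makes the policy GPU-friendly.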
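One way to read the working-set-sized metadata idea: a dense first level covers only the pages currently resident in HBM, while a sparse second level resolves the full logical page space on demand, so metadata grows with the active working set rather than the maximum context length. The layout below is a hypothetical sketch under that reading, not SPIN's data structure.

```python
class TwoLevelPageTable:
    """Level 1: a small dense array of HBM slots for the resident working set.
    Level 2: a sparse map from logical page id to its slot (absent = on CPU).
    Sized by the working set, not by the worst-case number of pages."""

    def __init__(self, max_resident_pages):
        self.slots = [None] * max_resident_pages   # dense, GPU-friendly level
        self.page_to_slot = {}                     # sparse logical -> slot map
        self.free_slots = list(range(max_resident_pages))

    def admit(self, page_id):
        """Place a logical page into a free HBM slot and record the mapping."""
        slot = self.free_slots.pop()
        self.slots[slot] = page_id
        self.page_to_slot[page_id] = slot
        return slot

    def lookup(self, page_id):
        """Return the HBM slot of a page, or None if it lives on the CPU side."""
        return self.page_to_slot.get(page_id)

# A 128K-token context at 16 tokens/page has 8192 logical pages, but with a
# working set of, say, 256 resident pages the dense level stays at 256 entries.
table = TwoLevelPageTable(max_resident_pages=256)
table.admit(4096)
assert table.lookup(4096) is not None
```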
SPIN was built on vLLM and validated with three representative sparse attention algorithms. It delivers 1.66 to 5.66 times higher end-to-end throughput than vLLM and reduces Time To First Token (TTFT) by 7 to 9 times. Compared against the original sparse-attention implementations, SPIN also cuts Time Per Output Token (TPOT) by up to 58%, underscoring its potential for scalable, efficient long-context LLM inference.