Zhipu AI has published a technical blog post detailing the "Scaling Pain" it encountered with its GLM-5 series models, particularly when handling high-concurrency Coding Agent tasks. Unlike the company's usual hard-core technical write-ups, the post offers a candid account of the problems encountered and the fixes behind them.
Zhipu's inference infrastructure handles hundreds of millions of Coding Agent calls daily. Recently, users of GLM-5 models reported anomalies such as garbled output, repetition, and rare character generation during complex Coding Agent tasks. Crucially, these issues were not reproducible in standard inference environments.
After weeks of investigation, the team identified the root causes. Initial attempts to reproduce the anomalies locally failed, indicating that the model itself wasn't the primary cause. By simulating the online environment, adjusting PD separation ratios, and increasing system load, the team finally reproduced the anomalies (3-5 per 10,000 requests), pointing to inference state management under high load in the underlying inference pipeline. The team then refined anomaly detection using Speculative Decoding metrics: a low spec_accept_length indicated a KV cache mismatch producing garbled or rare characters, while an abnormally high spec_accept_length pointed to degraded attention patterns leading to repetition. Based on this, Zhipu built an online monitoring strategy that aborts generation if spec_accept_length falls below 1.4 (for generation lengths over 128 tokens) or spec_accept_rate exceeds 0.96.
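The post does not include implementation code, but the rule reads as a simple per-request guard. Below is a minimal sketch of such a check in Python, applying the reported thresholds; the RequestStats container and should_abort helper are illustrative names, not Zhipu's actual monitoring code.

```python
from dataclasses import dataclass

# Thresholds reported in the post: abort when spec_accept_length drops below 1.4
# once more than 128 tokens have been generated, or when spec_accept_rate
# climbs above 0.96.
MIN_ACCEPT_LENGTH = 1.4
MAX_ACCEPT_RATE = 0.96
MIN_GENERATED_TOKENS = 128


@dataclass
class RequestStats:
    """Running speculative-decoding statistics for one request (illustrative)."""
    generated_tokens: int       # tokens emitted so far
    spec_accept_length: float   # mean accepted draft length per step
    spec_accept_rate: float     # fraction of draft tokens accepted


def should_abort(stats: RequestStats) -> bool:
    """Return True if the request looks anomalous and should be aborted."""
    if stats.generated_tokens <= MIN_GENERATED_TOKENS:
        return False  # too early to judge
    if stats.spec_accept_length < MIN_ACCEPT_LENGTH:
        return True   # likely KV cache mismatch -> garbled / rare characters
    if stats.spec_accept_rate > MAX_ACCEPT_RATE:
        return True   # likely degraded attention -> repetition loops
    return False


if __name__ == "__main__":
    print(should_abort(RequestStats(256, 1.2, 0.80)))  # True: accept length too low
    print(should_abort(RequestStats(256, 2.1, 0.97)))  # True: accept rate too high
    print(should_abort(RequestStats(256, 2.1, 0.80)))  # False: healthy request
```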
The primary issue was traced to a KV Cache reuse conflict within the PD separation architecture, arising from inconsistencies between the request lifecycle and the timing of KV Cache recycling and reuse. To mitigate this, Zhipu introduced stricter timing constraints and explicit synchronization in the inference engine. Specifically, after a termination command, the decoding stage notifies the prefill stage; the prefill stage returns a safe-recycling signal only if no RDMA writes have started or all previously issued writes are complete; and the decoding stage reclaims the KV Cache slot only after receiving this confirmation, ensuring KV Cache writes never cross memory reuse boundaries. This fix reduced the anomaly occurrence rate from over 0.001% to below 0.0003%.
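The following Python sketch illustrates the ordering of that handshake under stated assumptions: PrefillWorker, DecodeWorker, the RDMA write counters, and the free_slot callback are hypothetical stand-ins for the engine's internals; only the sequencing (notify, confirm safe recycling, then reclaim) follows the described fix.

```python
import threading


class PrefillWorker:
    """Hypothetical prefill-side bookkeeping for in-flight RDMA writes per KV slot."""

    def __init__(self):
        self._lock = threading.Lock()
        self._started = {}    # slot_id -> RDMA writes issued
        self._finished = {}   # slot_id -> RDMA writes completed

    def rdma_write_begin(self, slot_id):
        with self._lock:
            self._started[slot_id] = self._started.get(slot_id, 0) + 1

    def rdma_write_end(self, slot_id):
        with self._lock:
            self._finished[slot_id] = self._finished.get(slot_id, 0) + 1

    def confirm_safe_to_recycle(self, slot_id) -> bool:
        """Safe-recycling signal: no RDMA write has started for this slot,
        or every write that started has already completed."""
        with self._lock:
            started = self._started.get(slot_id, 0)
            return started == 0 or started == self._finished.get(slot_id, 0)


class DecodeWorker:
    """Hypothetical decode-side logic: reclaim a KV slot only after confirmation."""

    def __init__(self, prefill: PrefillWorker):
        self.prefill = prefill

    def on_request_terminated(self, slot_id, free_slot):
        # 1. Notify the prefill stage that the request has terminated.
        # 2. Reclaim the slot only once prefill confirms no write can land in it
        #    afterwards, so KV writes never cross a memory reuse boundary.
        if self.prefill.confirm_safe_to_recycle(slot_id):
            free_slot(slot_id)
        # Otherwise, recycling is deferred until the outstanding writes drain.
```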
Another issue was that, when HiCache swap-in overlapped with computation, the engine could access KV Cache entries before they were ready: the existing implementation did not guarantee that data transfers had completed before the data was used. Zhipu resolved this by refactoring the HiCache read path and introducing explicit synchronization. A Load Stream synchronization point is inserted before the Indexer operator to ensure the corresponding Indexer cache is fully loaded, and the Forward Stream proceeds with computation only after the data is ready, eliminating read-before-ready problems and stabilizing the system under similar workloads.
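The fix reads like standard stream/event synchronization. The sketch below, assuming PyTorch with a CUDA device, pinned host buffers, and hypothetical run_indexer/run_attention callables, shows how an event recorded on the load stream and waited on before the Indexer operator enforces read-after-ready; it illustrates the pattern rather than Zhipu's HiCache code.

```python
import torch


def hicache_swap_in(kv_host, indexer_host, run_indexer, run_attention):
    """Sketch: overlap HiCache swap-in with compute, but gate the Indexer and
    Attention on load-stream completion (assumes a CUDA device and pinned
    host tensors so the copies are truly asynchronous)."""
    load_stream = torch.cuda.Stream()
    forward_stream = torch.cuda.current_stream()
    indexer_ready = torch.cuda.Event()

    # Issue host->device copies on a dedicated load stream.
    with torch.cuda.stream(load_stream):
        indexer_dev = indexer_host.to("cuda", non_blocking=True)
        indexer_ready.record(load_stream)  # Indexer cache copy enqueued first
        kv_dev = kv_host.to("cuda", non_blocking=True)

    # Synchronization point before the Indexer operator: the forward stream
    # will not launch the Indexer until its cache has fully landed.
    forward_stream.wait_event(indexer_ready)
    topk = run_indexer(indexer_dev)

    # Attention also waits for the KV copy, guaranteeing read-after-ready.
    forward_stream.wait_stream(load_stream)
    return run_attention(kv_dev, topk)
```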
Recognizing that both bugs pointed to prefill as a significant bottleneck for long-context Coding Agent serving under high concurrency, Zhipu designed a hierarchical KV Cache storage scheme called LayerSplit. In this scheme, each GPU stores only a subset of the KV Cache layers, significantly reducing per-GPU memory footprint; before Attention computation, the relevant KV Cache layers are broadcast to the other ranks. To minimize communication overhead, the KV Cache broadcast is overlapped with indexer computation, effectively hiding the communication latency. The only additional overhead comes from broadcasting the Indexer Cache, which is one-eighth the size of the KV Cache, making the overall communication cost negligible. Combined with GLM-5.1, LayerSplit boosted system throughput by 10% to 132% for request lengths between 40k and 120k at a 90% cache hit rate, with gains increasing with context length.
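A simplified sketch of that idea, assuming torch.distributed is already initialized (e.g. via torchrun with the NCCL backend) and a hypothetical round-robin layer-to-rank assignment: asynchronous broadcasts of each rank's owned layers are issued first, the indexer computation runs while they are in flight, and Attention begins only after the broadcasts complete. The function and argument names are illustrative, not taken from the post.

```python
import torch
import torch.distributed as dist


def layersplit_gather_kv(local_layers, num_layers, layer_shape, run_indexer):
    """Sketch of LayerSplit: each rank stores only the KV cache layers it owns
    (here layer i lives on rank i % world_size); before Attention, every owner
    broadcasts its layers asynchronously while the indexer runs, hiding the
    communication behind compute."""
    world_size = dist.get_world_size()
    rank = dist.get_rank()

    full_kv = [None] * num_layers
    handles = []
    for layer in range(num_layers):
        owner = layer % world_size
        buf = local_layers[layer] if owner == rank else torch.empty(layer_shape, device="cuda")
        handles.append(dist.broadcast(buf, src=owner, async_op=True))
        full_kv[layer] = buf

    # Overlap: indexer (top-k token selection) runs while broadcasts are in flight.
    topk = run_indexer()

    for handle in handles:
        handle.wait()  # all KV layers must be resident before Attention starts
    return full_kv, topk
```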
These optimizations significantly enhance the system's processing capabilities in Coding Agent scenarios. Zhipu emphasizes that as AI scales into high-concurrency, long-context applications, maintaining inference output quality becomes paramount: future large-scale AI requires not only capability growth driven by scaling laws but also commensurate systems engineering support.