
DeepSeek V4 Unveiled: Million-Token Context, Domestic Chip Support, and Architectural Innovations Detailed in Comprehensive Report

Despite arriving half a year behind schedule, DeepSeek V4 has garnered widespread acclaim since its release. A comprehensive technical report details its 484-day development journey and unveils significant breakthroughs.

The report highlights two primary advancements:

First, an open-sourced million-token context window with a substantially reduced KV cache. Both DeepSeek V4-Pro (1.6 trillion parameters) and V4-Flash (284 billion parameters) support a 1M-token context. In 1M-context scenarios, V4-Pro's per-token FLOPs are just 27% of V3.2's, and its KV cache shrinks dramatically to 10%. Industry experts suggest this innovation could help alleviate the current HBM (High Bandwidth Memory) shortage.

Second, adaptation for domestic Chinese AI chips. DeepSeek V4 now runs on Huawei's Ascend hardware, with bulk shipments of Ascend 950 supernodes anticipated in the second half of the year, providing robust support for China's AI ecosystem.

Prior to V4's release, DeepSeek had hinted at several potential technologies through published papers. The newly open-sourced technical report allows for a direct comparison:

  • mHC (Manifold-Constrained Hyper-Connections): Uploaded to arXiv on December 31, 2025, this technology has been successfully integrated into V4.
  • Engram (Conditional Memory Module): Jointly released by DeepSeek and Peking University in January, it was not included in V4 but is explicitly designated for future V5 development.
  • DualPipe: A core component from V3, it continues to be utilized in V4 with adjustments made to accommodate mHC.
  • Muon optimizer: In V4, it has replaced AdamW as the primary optimizer for training the vast majority of model parameters.

Overall, DeepSeek V4 represents the most architecturally modified version in the series. Compared to V3, V4 introduces upgrades in three key areas:

  1. Integration of mHC (Manifold-Constrained Hyper-Connections): To strengthen residual connections and enhance model training stability.
  2. Design of a hybrid attention architecture: Alternating CSA (Compressed Sparse Attention) and HCA (Heavily Compressed Attention) to efficiently manage long-context processing.
  3. Adoption of Muon as the primary optimizer.

The Mixture of Experts (MoE) component continues to use DeepSeekMoE, and the Multi-Token Prediction (MTP) module remains consistent with V3. Minor detailed adjustments include: changing the affinity score's activation function from Sigmoid to Sqrt(Softplus(·)); removing the quantity constraint on routing target nodes; and replacing dense FFNs in early layers with Hash routing MoE layers.
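The activation swap on the affinity score can be illustrated with a minimal numpy sketch. The function names are ours, and the report specifies only the Sqrt(Softplus(·)) form; everything else here is an illustrative assumption:

```python
import numpy as np

def softplus(x):
    # numerically stable softplus: log(1 + exp(x))
    return np.logaddexp(0.0, x)

def affinity_v3(logits):
    # V3-style gating: Sigmoid keeps scores bounded in (0, 1)
    return 1.0 / (1.0 + np.exp(-logits))

def affinity_v4(logits):
    # V4-style gating per the report: sqrt(softplus(x)),
    # non-negative but unbounded above, growing like sqrt(x) for large x
    return np.sqrt(softplus(logits))

logits = np.array([-2.0, 0.0, 2.0])
scores = affinity_v4(logits)
```

Unlike Sigmoid, the new activation does not saturate at 1, so strongly matched experts can receive proportionally larger gate values.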

Let's delve deeper into these crucial technologies.

mHC: Fortifying Residual Connections

Residual connections, introduced by Kaiming He et al. in the 2015 ResNet paper, have been fundamental to training deep models by ensuring effective gradient propagation. However, as models grow deeper and parameter counts increase, traditional residual connections struggle to keep signal transmission stable, often leading to unstable training.

mHC evolves from Hyper-Connections (HC), an idea initially proposed by ByteDance researchers. HC widens the residual stream from a single path to n_hc parallel channels and introduces a mixing matrix B between layers, adding a scaling dimension to the residual flow. However, DeepSeek's practical experience revealed numerical instability when many HC layers are stacked.

V4's mHC approach constrains matrix B to a "doubly stochastic matrix" manifold (mathematically known as the Birkhoff polytope), ensuring that both rows and columns of the matrix normalize to 1. This constraint offers significant advantages:

  • The spectral norm of the matrix is inherently capped at 1, effectively setting a hard upper limit for residual propagation and preventing gradient explosion.
  • Such matrices remain closed under multiplication, maintaining stability even when stacked across many layers.

Input mappings A and output mappings C utilize Sigmoid functions to ensure non-negativity and boundedness, preventing signal cancellation.

In terms of implementation, mHC employs the Sinkhorn-Knopp iteration, alternately performing row and column normalization, converging in approximately 20 iterations. This process runs independently for each layer. Despite the seemingly high computational cost, DeepSeek optimized it with fused kernels and selective recomputation, limiting the additional wall-time overhead from mHC to 6.7% within an overlapped pipeline.
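The alternating row/column normalization can be sketched in a few lines of numpy. The exp-based parameterization and toy sizes below are our assumptions for illustration; the report does not specify them:

```python
import numpy as np

def sinkhorn_knopp(logits, iters=20):
    """Project an unconstrained parameter matrix onto (approximately)
    the Birkhoff polytope by alternating row/column normalization."""
    M = np.exp(logits)  # ensure strict positivity before normalizing
    for _ in range(iters):
        M = M / M.sum(axis=1, keepdims=True)  # rows sum to 1
        M = M / M.sum(axis=0, keepdims=True)  # columns sum to 1
    return M

rng = np.random.default_rng(0)
B = sinkhorn_knopp(rng.normal(size=(4, 4)))
# B's rows and columns now both sum (approximately) to 1, so its
# spectral norm is pinned at 1, and products of such matrices stay
# doubly stochastic -- the two stability properties described above
```

Twenty iterations are typically more than enough for a small mixing matrix, which is why the fused-kernel overhead stays modest.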

Hybrid Attention Mechanism: The Secret to Million-Token Efficiency

This is the core technology enabling DeepSeek V4's "million-token efficiency." V4's attention layer isn't a single structure but an alternating use of two mechanisms: CSA (Compressed Sparse Attention) and HCA (Heavily Compressed Attention).

The CSA (Compressed Sparse Attention) mechanism operates in four steps:

  1. KV Compression: The KV entries of every m consecutive tokens are compressed into a single block via an attention-like mechanism with learnable weights.
  2. Lightning Indexer + Top-k Selection: Inherited from V3.2's DSA. For each query token, a lightweight indexer computes relevance scores against each compressed KV block, selecting the top-k blocks.
  3. Core Attention Calculation: Multi-Query Attention is performed on the selected top-k compressed KV blocks to generate the attention output.
  4. Grouped Output Projection: Given V4's head dimension c is set to 512 (significantly larger than V3.2's 128), directly projecting all head outputs back to d dimensions would be costly. Thus, V4 groups n_h heads into g groups, each first projected to an intermediate dimension d_g, and finally merged and projected back to d.
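A back-of-envelope parameter count makes the motivation for step 4 concrete. Only c = 512 comes from the report; the head count, model dimension, group count, and intermediate dimension below are illustrative guesses, not reported values:

```python
# Rough output-projection parameter counts. Only c = 512 is from the
# report; n_h, d, g, and d_g are illustrative assumptions.
c, n_h, d = 512, 64, 7168
g, d_g = 8, 1024

# direct: concatenate all heads (n_h * c dims) and project straight to d
direct = n_h * c * d

# grouped: g groups of n_h/g heads, each projected down to d_g first,
# then the g * d_g concatenation is merged and projected back to d
grouped = g * ((n_h // g) * c * d_g) + (g * d_g) * d
```

Under these toy numbers the grouped scheme needs well under half the parameters of a direct projection, which is the cost saving the large head dimension makes necessary.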

CSA essentially performs two layers of compression: first, sequence length compression (n becomes n/m), and second, sparse selection (n/m becomes top-k). For a 1M token sequence, instead of attending to 1M tokens, it now only needs to attend to 1024 compressed blocks.
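The four CSA steps can be sketched as a toy numpy pipeline. Mean pooling stands in for the report's learned attention-like compression, a plain dot product stands in for the Lightning Indexer, and all names and sizes are illustrative:

```python
import numpy as np

def csa_sketch(q, K, V, m=32, top_k=4):
    """Toy CSA for a single query: (1) pool every m KV rows into one
    block, (2) score blocks with a lightweight indexer and keep top_k,
    (3) attend densely over only the selected blocks."""
    n, d = K.shape
    nb = n // m
    Kc = K[: nb * m].reshape(nb, m, d).mean(axis=1)  # compressed keys
    Vc = V[: nb * m].reshape(nb, m, d).mean(axis=1)  # compressed values
    scores = Kc @ q                                  # indexer stand-in
    keep = np.argsort(scores)[-top_k:]               # top-k block ids
    att = Kc[keep] @ q / np.sqrt(d)
    w = np.exp(att - att.max()); w /= w.sum()        # softmax weights
    return w @ Vc[keep]

rng = np.random.default_rng(0)
n, d = 1024, 16
out = csa_sketch(rng.normal(size=d), rng.normal(size=(n, d)),
                 rng.normal(size=(n, d)))
```

The two compression stages are visible in the shapes: n rows shrink to n/m blocks, and attention then touches only top_k of them.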

The HCA (Heavily Compressed Attention) approach is more direct and aggressive in compression, without sparse selection.

It uses a compression ratio m′ = 128: every 128 tokens are compressed into a single block, with no overlap between blocks. Dense attention is then applied across all compressed KV blocks.
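A minimal numpy sketch of HCA, again using mean pooling as a stand-in for the learned compression (all sizes here are illustrative):

```python
import numpy as np

def hca_sketch(q, K, V, m=128):
    """Toy HCA for a single query: non-overlapping blocks of m tokens
    are pooled into single KV entries, then dense attention runs over
    ALL blocks -- no indexer, no top-k selection."""
    n, d = K.shape
    nb = n // m                                # e.g. 1M tokens -> 8192 blocks
    Kc = K[: nb * m].reshape(nb, m, d).mean(axis=1)
    Vc = V[: nb * m].reshape(nb, m, d).mean(axis=1)
    att = Kc @ q / np.sqrt(d)
    w = np.exp(att - att.max()); w /= w.sum()  # softmax over every block
    return w @ Vc

rng = np.random.default_rng(1)
hca_out = hca_sketch(rng.normal(size=8), rng.normal(size=(512, 8)),
                     rng.normal(size=(512, 8)), m=128)
```

Compared with CSA, nothing is discarded after compression: the aggressive 128x pooling alone keeps the dense attention affordable.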

The alternating use of CSA and HCA reflects their distinct roles: CSA provides fine-grained, token-level retrieval through gentle compression and sparse selection; HCA delivers aggressive compression and maintains density for long-range global signal summarization. V4 models (61 layers in Pro, 43 layers in Flash) stack these mechanisms alternately, ensuring both detail retention and efficiency in long-context processing.

Additionally, Q/KV normalization is applied as a further optimization detail.
