NVIDIA researchers have released Nemotron-Labs-Diffusion, a language model family that unifies three decoding modes within a single architecture: autoregressive (AR) decoding, diffusion-based parallel decoding, and self-speculation decoding. The release comes in 3B, 8B, and 14B parameter sizes and includes base, instruct, and vision-language variants.
The release addresses the throughput limitations of standard AR models, which generate text one token at a time. This sequential dependency often leaves GPUs underutilized at small batch sizes, which are typical of edge deployments. Diffusion models can generate tokens in parallel, but they have historically struggled to match the accuracy of AR models, a gap attributed to training inefficiencies. Nemotron-Labs-Diffusion tackles this with a joint AR-diffusion training objective, which allows the same set of weights to serve all three inference modes without architectural modifications.
Depending on the deployment context, the model operates in one of three modes. AR mode targets standard high-concurrency cloud serving. Diffusion mode partitions the sequence into contiguous blocks and denoises the tokens of each block in parallel, using bidirectional attention within the block. In self-speculation mode, the model uses its diffusion pathway to draft candidate tokens and its AR pathway to verify them in a single cycle, which removes the need for the auxiliary draft model that methods like Eagle3 require. According to technical reports, this design lets the model output 6x more tokens per forward pass than baseline models such as Qwen3-8B.
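The draft-then-verify cycle of self-speculation can be illustrated with a toy sketch. This is not NVIDIA's implementation: the function names (`self_speculative_step`, `generate`, `toy_draft_block`, `toy_ar_next`) are hypothetical, the "model" is a simple counter, and greedy exact-match acceptance is an assumption made for illustration. A real system would verify the whole drafted block in one batched forward pass; the loop here only emulates that check.

```python
from typing import Callable, List, Sequence, Tuple

Token = int
DraftFn = Callable[[Sequence[Token], int], List[Token]]
NextFn = Callable[[Sequence[Token]], Token]


def self_speculative_step(prefix: Sequence[Token], draft_block: DraftFn,
                          ar_next: NextFn, block_size: int) -> List[Token]:
    """One draft-then-verify cycle; returns the tokens accepted in this cycle."""
    draft = draft_block(prefix, block_size)  # stand-in for the parallel diffusion draft
    accepted: List[Token] = []
    for tok in draft:
        # Stand-in for AR verification of this position (assumed greedy).
        target = ar_next(list(prefix) + accepted)
        if tok != target:
            accepted.append(target)  # keep the AR token at the first mismatch
            break
        accepted.append(tok)
    return accepted


def generate(prefix: Sequence[Token], n_new: int, block_size: int,
             draft_block: DraftFn, ar_next: NextFn) -> Tuple[List[Token], int]:
    """Run cycles until n_new tokens exist; returns (tokens, cycles used)."""
    out = list(prefix)
    cycles = 0
    while len(out) - len(prefix) < n_new:
        out.extend(self_speculative_step(out, draft_block, ar_next, block_size))
        cycles += 1
    return out[: len(prefix) + n_new], cycles


def toy_ar_next(ctx: Sequence[Token]) -> Token:
    """Hypothetical AR pathway: the next token is the last token plus one."""
    return ctx[-1] + 1


def toy_draft_block(ctx: Sequence[Token], k: int) -> List[Token]:
    """Hypothetical drafter: correct except that it errs on multiples of 5."""
    out, last = [], ctx[-1]
    for _ in range(k):
        last += 1
        out.append(-1 if last % 5 == 0 else last)
    return out


tokens, cycles = generate([0], 12, 4, toy_draft_block, toy_ar_next)
print(tokens, cycles)
```

In this toy run, the 12 new tokens match what token-by-token AR decoding would produce but take 5 cycles instead of 12, because every fully accepted draft block yields several tokens at once. The speedup in practice depends on how often the drafted tokens survive verification.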