Large language models (LLMs) have become the standard interface for workflows ranging from code generation to document understanding. However, most LLMs still rely on autoregressive (AR) generation, where tokens are produced one at a time, each depending on its predecessors. While AR training is stable and AR models are simple to serve, this creates a significant hardware bottleneck: every new token requires a full forward pass, which means reading the complete set of model weights from memory again.
For developers of latency-sensitive applications, this token-by-token approach often leaves performance on the table. At small batch sizes, modern GPUs spend most of each decoding step moving data to and from memory rather than computing. Furthermore, AR models have no built-in way to revise previously generated tokens, so an early mistake propagates through the rest of the output sequence.
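To make the memory bottleneck concrete, here is a rough back-of-the-envelope estimate for a single decoding step at batch size 1. The hardware figures (2 TB/s memory bandwidth, 400 TFLOP/s peak compute) are illustrative assumptions and not measurements of any specific GPU, and the estimate ignores the KV cache, activations, and kernel overheads. It only counts the time to read the weights once versus the time to do roughly 2 FLOPs per parameter per token.

```python
def decode_step_times(params, bytes_per_param, mem_bandwidth, peak_flops, batch_size=1):
    """Rough lower bounds (in seconds) for one AR decoding step.

    memory_time:  time to stream every weight from memory once.
    compute_time: time for ~2 FLOPs per parameter per generated token.
    """
    weight_bytes = params * bytes_per_param
    memory_time = weight_bytes / mem_bandwidth
    compute_time = (2 * params * batch_size) / peak_flops
    return memory_time, compute_time


# Assumed, illustrative numbers: 8B parameters stored in 16-bit precision,
# 2 TB/s memory bandwidth, 400 TFLOP/s peak compute.
mem_t, comp_t = decode_step_times(
    params=8e9, bytes_per_param=2, mem_bandwidth=2e12, peak_flops=400e12
)
print(f"weight streaming: {mem_t * 1e3:.2f} ms, compute: {comp_t * 1e3:.3f} ms")
```

Under these assumptions, streaming the weights takes far longer than the arithmetic itself, so the step is bound by memory traffic. Producing several tokens per forward pass amortizes that weight read across more useful work.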
Nemotron-Labs Diffusion takes a different approach with diffusion language models (DLMs). Instead of generating sequentially, DLMs produce multiple tokens in parallel and iteratively refine them. This approach makes better use of the computational power of modern GPUs and offers capabilities that AR decoding lacks, such as revising existing text and handling fill-in-the-middle tasks. The generate-and-refine mechanism also provides a built-in way to manage the inference budget: users can reduce computation at runtime by decreasing the number of refinement steps.
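The relationship between refinement steps and compute can be illustrated with a toy loop. This is a minimal sketch of a generic confidence-based unmasking scheme, not the Nemotron-Labs Diffusion algorithm: `toy_denoiser` is a hypothetical stand-in that returns random proposals instead of model predictions, and the loop only fills masked positions without revising tokens it has already committed. The point is simply that one "model call" proposes tokens for every position at once, and the step count sets how many calls are made.

```python
import math
import random

MASK = None  # placeholder for a position that has not been decided yet


def toy_denoiser(tokens, rng):
    """Hypothetical stand-in for a model: one (token, confidence) per position."""
    vocab = ["a", "b", "c", "d"]
    return [(rng.choice(vocab), rng.random()) for _ in tokens]


def generate(length, num_steps, seed=0):
    """Fill `length` masked positions using at most `num_steps` parallel passes."""
    rng = random.Random(seed)
    tokens = [MASK] * length
    model_calls = 0
    for step in range(num_steps):
        masked = [i for i, t in enumerate(tokens) if t is MASK]
        if not masked:
            break
        proposals = toy_denoiser(tokens, rng)  # one pass covers all positions
        model_calls += 1
        # Spread the remaining positions evenly over the remaining steps.
        k = math.ceil(len(masked) / (num_steps - step))
        # Commit the k masked positions with the highest confidence.
        best = sorted(masked, key=lambda i: proposals[i][1], reverse=True)[:k]
        for i in best:
            tokens[i] = proposals[i][0]
    return tokens, model_calls


tokens, calls = generate(length=16, num_steps=4)
print(calls, "model calls for", len(tokens), "tokens")
```

With `num_steps=4`, the 16 positions are filled in 4 passes, whereas token-by-token decoding would need 16. Raising `num_steps` spends more passes and commits fewer tokens per pass, which is the runtime budget knob described above.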
The Nemotron-Labs Diffusion family includes text models at 3B, 8B, and 14B scales, released under the commercially friendly NVIDIA Nemotron Open Model License. NVIDIA is also releasing an 8B vision-language model (VLM) for research purposes. To facilitate adoption, NVIDIA has published the training recipes via the Megatron Bridge framework and enabled deployment through the SGLang inference engine, so developers can experiment with these architectures today.