NVIDIA has released CUDA 13.3, bringing new features and performance improvements to CUDA developers. The headline addition is CUDA Tile programming in C++, which supports high-level, tile-based kernel development. The abstraction handles low-level GPU details on the developer's behalf, with the goal of keeping kernels both fast and portable across architectures. CUDA Tile programming now also supports Compute Capability 9.0 (NVIDIA Hopper) GPUs, in addition to all previously supported architectures.
The release is accompanied by CUDA Python 1.0, the first stable version of the CUDA Python software ecosystem. It adds functionality such as green contexts and process checkpointing.
Performance is another focus. The new NVIDIA CompileIQ compiler auto-tuning framework has demonstrated speedups of up to 15% on key kernels such as GEMM and attention. CUDA 13.3 also brings official C++23 support to NVCC, improves tensor interoperability through DLPack/mdspan in CCCL 3.3, and includes substantial updates to the core math libraries cuBLAS, cuSPARSE, and cuSOLVER, as well as to the profiling tools Nsight Compute and Nsight Systems.
Bringing CUDA Tile to C++ is a major highlight because it lets the large existing C++ developer community write highly optimized GPU tile kernels. The programming model automates low-level concerns such as parallelism, memory movement, and asynchrony, so the resulting C++ code is portable across NVIDIA GPU architectures.
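To make the idea of tile-based decomposition concrete, here is a minimal conceptual sketch in plain Python. It is not the CUDA Tile C++ API, whose syntax this article does not show; the function name `tiled_matmul` and the `tile` parameter are hypothetical. It only illustrates how a matrix multiply can be expressed as work on small tiles, which is the kind of structure a tile programming model would map onto GPU parallelism and memory movement.

```python
def tiled_matmul(a, b, tile=2):
    """Multiply matrices a (m x k) and b (k x n) tile by tile.

    Conceptual CPU illustration only: each (i0, j0) pair identifies one
    output tile, and the k0 loop accumulates partial products into it.
    """
    m, k = len(a), len(a[0])
    n = len(b[0])
    c = [[0.0] * n for _ in range(m)]

    for i0 in range(0, m, tile):          # row offset of the output tile
        for j0 in range(0, n, tile):      # column offset of the output tile
            for k0 in range(0, k, tile):  # walk along the shared dimension
                # min(...) clamps the tile at the matrix edge, so sizes
                # that are not multiples of `tile` are handled correctly.
                for i in range(i0, min(i0 + tile, m)):
                    for j in range(j0, min(j0 + tile, n)):
                        acc = c[i][j]
                        for kk in range(k0, min(k0 + tile, k)):
                            acc += a[i][kk] * b[kk][j]
                        c[i][j] = acc
    return c
```

In this sketch the programmer writes every loop by hand. The point of a tile abstraction, as described above, is that the developer states the per-tile computation while scheduling and data movement are left to the toolchain.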
CUDA Python, a collection of libraries bridging CUDA with Python, reaches its 1.0 stable release. This milestone signifies a commitment to semantic versioning, guaranteeing that breaking API changes will be reserved strictly for major-version releases. Minor releases will focus on feature additions, while patch releases address bug fixes. Public APIs slated for deprecation will first be marked in a minor release, providing developers with clear migration paths.
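The versioning policy above can be sketched as a small check that classifies an upgrade by which version component changes. This is a minimal illustration using only the Python standard library; the helper names (`parse_version`, `classify_upgrade`, `is_safe_upgrade`) are hypothetical and not part of CUDA Python, and the parser assumes plain `MAJOR.MINOR.PATCH` strings.

```python
from typing import NamedTuple


class Version(NamedTuple):
    major: int
    minor: int
    patch: int


def parse_version(text: str) -> Version:
    """Parse 'MAJOR.MINOR.PATCH'; missing minor or patch default to 0.

    Anything after '+' or '-' (build or pre-release tags) is ignored.
    """
    core = text.split("+", 1)[0].split("-", 1)[0]
    parts = [int(p) for p in core.split(".")]
    parts += [0] * (3 - len(parts))
    return Version(*parts[:3])


def classify_upgrade(current: str, candidate: str) -> str:
    """Return 'major', 'minor', 'patch', or 'not an upgrade'."""
    old, new = parse_version(current), parse_version(candidate)
    if new <= old:
        return "not an upgrade"
    if new.major != old.major:
        return "major"   # breaking API changes are allowed here
    if new.minor != old.minor:
        return "minor"   # new features, deprecations marked
    return "patch"       # bug fixes only


def is_safe_upgrade(current: str, candidate: str) -> bool:
    """Under semantic versioning, minor and patch upgrades keep the API."""
    return classify_upgrade(current, candidate) in ("minor", "patch")
```

For example, `is_safe_upgrade("1.0.0", "1.4.2")` is true, while moving from `1.4.2` to `2.0.0` is classified as a major upgrade that may require code changes.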
The CUDA Python 1.0 ecosystem includes several key software components: cuda.bindings offers low-level Python bindings for the CUDA C APIs; cuda.core provides Pythonic access to the CUDA Runtime and other core functionality; cuda-cccl gives Pythonic access to CCCL's efficient and customizable parallel algorithms; and cuda-pathfinder helps locate installed CUDA components within Python environments.
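A quick way to see which of these components are present in a given environment is to probe for their import names. The sketch below uses only the standard library's `importlib`; the helper names are hypothetical, and the module names `cuda.bindings`, `cuda.core`, and `cuda.pathfinder` are assumed to be the import names of the packages listed above. It does not use cuda-pathfinder's own API.

```python
from importlib.util import find_spec

# Assumed import names for the components described above.
ASSUMED_MODULES = ("cuda.bindings", "cuda.core", "cuda.pathfinder")


def is_importable(module_name: str) -> bool:
    """Return True if the module can be located, without using its API.

    For dotted names, find_spec imports the parent package first and
    raises ModuleNotFoundError when that parent is missing, so the
    error is treated as "not installed".
    """
    try:
        return find_spec(module_name) is not None
    except (ImportError, ValueError):
        return False


def report_components(module_names=ASSUMED_MODULES) -> dict:
    """Map each module name to whether it was found."""
    return {name: is_importable(name) for name in module_names}


if __name__ == "__main__":
    for name, present in report_components().items():
        print(f"{name}: {'found' if present else 'missing'}")
```

This only confirms that a Python package is importable. Locating the underlying CUDA libraries themselves is the job the article attributes to cuda-pathfinder.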
In addition, cuda.coop ships within the cuda-cccl package under an _experimental namespace. It offers reusable block-wide and warp-wide device primitives for Numba CUDA kernels, though its API may still change. Finally, cuda.core reaches stability with version 1.0. It provides a comprehensive Pythonic interface to the CUDA runtime, covering devices, streams, programs, linkers, memory resources, and graphs. This version consolidates APIs that matured over prior release cycles into a single supported surface and adds support for green contexts.