DeepGEMM
by deepseek-ai
About
DeepGEMM is a unified, high-performance Tensor Core kernel library that brings the key computation primitives of modern large language models, such as FP8, FP4, and BF16 GEMMs, fused MoE with overlapped communication, MQA scoring, and HyperConnection, into a single, cohesive CUDA codebase. All kernels are compiled at runtime by a lightweight Just-In-Time (JIT) module, so installation requires no CUDA compilation. Despite its lightweight design, DeepGEMM matches or exceeds the performance of expert-tuned libraries across various matrix shapes.
Features
- Unified Tensor Core Kernel Library
- Lightweight JIT Runtime Compilation
- Multi-precision GEMM Support (FP8, FP4, BF16)
- Fused MoE with Overlapped Communication
- High Performance, Matching/Exceeding Expert-tuned Libraries
Supported Platforms
desktop