DeepGEMM

by deepseek-ai

🔓 Open Source · CUDA · 🌍 Global · Free

About

DeepGEMM is a unified, high-performance Tensor Core kernel library that brings the key computation primitives of modern large language models into a single, cohesive CUDA codebase: FP8, FP4, and BF16 GEMMs, fused MoE with overlapped communication, MQA scoring, and HyperConnection. All kernels are compiled at runtime by a lightweight Just-In-Time (JIT) module, so installation requires no CUDA compilation. Despite its lightweight design, DeepGEMM's performance matches or exceeds that of expert-tuned libraries across various matrix shapes.

Features

Unified Tensor Core Kernel Library
Lightweight JIT Runtime Compilation
Multi-precision GEMM Support (FP8, FP4, BF16)
Fused MoE with Overlapped Communication
High Performance, Matching/Exceeding Expert-tuned Libraries

Supported Platforms

Desktop

Links

📦 GitHub Repository