llama.cpp
by ggml-org
About
Developed by ggml-org, llama.cpp is a powerful open-source C/C++ inference engine designed to run large language and multimodal models with minimal setup. Operating without external dependencies, it leverages the ggml tensor library for state-of-the-art performance locally and in the cloud. Key features include comprehensive integer quantization (1.5-bit to 8-bit), multi-platform hardware acceleration (Metal, CUDA, Vulkan), and hybrid CPU+GPU inference. It natively supports the GGUF format and includes a built-in REST API server and WebUI.
Features
- Plain C/C++ architecture without dependencies
- Comprehensive 1.5-bit to 8-bit quantization
- Multi-backend hardware acceleration
- CPU+GPU hybrid inference
- Built-in REST API server and WebUI
Supported Platforms
webmobiledesktopiot