Gemma 4 is Google's latest family of open models. That means Google trained the models and released the weights publicly, so developers can download them, run them on their own hardware, build applications without per-token costs, and keep data private by never sending it to a cloud server.
The Gemma 4 family isn't a single model but comprises three distinct offerings, each optimized for different use cases:
- Gemma 4 2B / 4B: These tiny models are ideal for edge deployments such as smartphones, Raspberry Pis, and in-browser applications.
- Gemma 4 31B Dense: A medium-sized model suitable for local machines equipped with a decent GPU, targeting serious development projects.
- Gemma 4 26B MoE: An efficient Mixture of Experts (MoE) model designed for high-throughput applications, advanced reasoning tasks, and server deployments. An MoE model works like a team of specialists: for each input, a router activates only the relevant "expert" components, so most of the model's parameters sit idle on any given forward pass, which is where the efficiency comes from (see the routing sketch after this list).
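To make that concrete, here is a minimal, illustrative routing sketch in PyTorch. The `TinyMoE` class, the dimensions, and the expert count are all invented for the example; this shows the general MoE pattern, not Gemma 4's actual internals:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    """Toy MoE layer: a router picks the top-k experts for each token."""
    def __init__(self, dim=64, num_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)  # scores every expert per token
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )
        self.top_k = top_k

    def forward(self, x):                               # x: (tokens, dim)
        weights, idx = self.router(x).topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)            # normalize the chosen experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):                     # only selected experts ever run
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k:k+1] * expert(x[mask])
        return out

moe = TinyMoE()
print(moe(torch.randn(5, 64)).shape)  # torch.Size([5, 64])
```

Note that each token pays the compute cost of only `top_k` experts, even though all eight carry trainable parameters; that gap between total and active parameters is the efficiency the MoE design buys.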
Developers often ask when to choose Gemma 4 over proprietary hosted models like GPT-4 or Claude. The main reasons to pick Gemma 4:
- Privacy: Your data never leaves your device.
- Cost: No API bills; once you own the hardware, inference itself costs nothing per token.
- Customization: Enables fine-tuning on proprietary datasets.
- Offline Use: Fully functional in environments without internet access, such as flights, rural areas, or air-gapped servers.
- Speed: No network round-trip; with the right hardware, local inference can be very fast.
Conversely, hosted models win when the priority is zero setup time, access to the absolute frontier of capability, or when local hardware is the constraint. They are different tools, and the right choice depends on the project.
A notable feature is that the 2B model can run on a Raspberry Pi. This is possible thanks to modern quantization, which compresses a model's weights from 32-bit floats down to 4-bit integers, shrinking it to roughly 10–15% of its original size with surprisingly little quality loss. A 2B-parameter model quantized to 4-bit occupies approximately 1.2 GB; on a Raspberry Pi 5 with 8 GB of RAM, it fits and runs, albeit slowly. The back-of-the-envelope math below shows where those numbers come from.
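The sizing arithmetic is easy to verify in a few lines of Python. (Real 4-bit files land slightly above the raw number, near the 1.2 GB figure, because quantized formats also store per-block scale factors and usually keep a few layers at higher precision.)

```python
params = 2_000_000_000  # a 2B-parameter model

fp32_gb = params * 32 / 8 / 1e9  # 32 bits per weight -> 8.0 GB
int4_gb = params * 4 / 8 / 1e9   # 4 bits per weight  -> 1.0 GB

print(f"fp32: {fp32_gb:.1f} GB, 4-bit: {int4_gb:.1f} GB")
print(f"ratio: {int4_gb / fp32_gb:.1%}")  # 12.5%, inside the 10-15% range
```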
To run Gemma 4 2B on a Raspberry Pi using Ollama:
```bash
curl -fsSL https://ollama.com/install.sh | sh
ollama run gemma4:2b
```
It won't be fast, but running a multimodal AI model on an $80 computer with no internet connection is a significant advancement.
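Once the model is up, Ollama also exposes a local REST API on port 11434, so you can script against it with nothing beyond the Python standard library. A minimal sketch, assuming the `gemma4:2b` tag from the command above:

```python
import json
import urllib.request

# Ask the local Ollama server for a single (non-streamed) completion.
payload = {
    "model": "gemma4:2b",  # the tag pulled by `ollama run gemma4:2b` above
    "prompt": "Explain quantization in one sentence.",
    "stream": False,
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```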
Regarding "native multimodal" capability, it signifies that Gemma 4 can directly process and comprehend input comprising images, text, or a combination of both. This contrasts with older methods, which typically involved piping an image through a vision encoder, combining its embeddings with text, and then passing this aggregated input to a large language model.