Google Unleashes Gemma 4: Fully Open-Source Models Bring Advanced AI to Edge Devices, Outperforming Larger Counterparts

Google recently announced the release of its Gemma 4 series of models, marking a significant transition from 'open' to truly 'open-source'. Unlike previous Gemma models, which operated under restrictive terms, the Gemma 4 series now fully adopts the Apache 2.0 license. This crucial shift empowers developers to freely use, redistribute, and modify the models for personal, commercial, and enterprise projects without limitations, opening vast opportunities for AI Agents and edge computing applications.

The Gemma 4 series comprises four models, sharing underlying technology with Gemini 3, designed to cater to a wide range of hardware platforms from edge devices to high-performance workstations:

  • E2B / E4B: Optimized specifically for mobile phones and IoT devices, developed in close collaboration with the Google Pixel team, Qualcomm, and MediaTek. These models activate only 2 billion and 4 billion parameters, respectively, during inference to maximize memory and power efficiency. They support a 128K context window, feature image, video, and native audio input capabilities, and can run completely offline on devices like Pixel phones, Raspberry Pi, and Jetson Orin Nano with near-zero latency. Android developers can preview Agent Mode via the AICore developer preview.
  • 26B MoE: Utilizing a Mixture-of-Experts architecture, this model activates only 3.8 billion of its 26 billion total parameters during inference. This design achieves high quality while maintaining extremely fast inference speeds, scoring 1441 in Arena AI text ratings, placing it sixth among open-source models.
  • 31B Dense: Engineered for ultimate raw performance, achieving an Arena AI text rating of 1452, ranking third among open-source models. Its unquantized bfloat16 weights can run on a single 80GB NVIDIA H100 GPU, while quantized versions support consumer-grade GPUs, providing a robust foundation for local fine-tuning.
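The hardware claims for the 31B Dense model follow directly from the weight footprint. A quick back-of-the-envelope check (activations and KV cache add overhead on top of these figures):

```python
# Memory footprint of unquantized bfloat16 weights: 2 bytes per parameter.
params = 31e9
bf16_bytes = params * 2
print(f"bf16 weights: {bf16_bytes / 1e9:.0f} GB")  # 62 GB -> fits one 80 GB H100

# A 4-bit quantization needs roughly 0.5 bytes per parameter:
q4_bytes = params * 0.5
print(f"~4-bit weights: {q4_bytes / 1e9:.1f} GB")  # ~15.5 GB -> consumer-GPU range
```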

Functionally, all four Gemma 4 models offer consistent capabilities: they support multi-step reasoning and complex logic, natively facilitate function calling, JSON structured output, and system instructions, enabling developers to build autonomous AI Agents that interact with external tools and APIs. Furthermore, they support image and video input, excelling in visual tasks like OCR and chart understanding, and are pre-trained on over 140 languages. The 26B and 31B models further extend their context window to 256K, capable of processing entire codebases or lengthy documents in a single prompt.
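The function-calling loop described above typically works by having the model emit a JSON tool call, which the host application parses and dispatches. A minimal sketch of that pattern; the tool name, schema, and output format here are illustrative, not Gemma 4's actual API:

```python
import json

# Hypothetical tool the agent can call; name and return shape are illustrative.
def get_weather(city: str) -> dict:
    # Stand-in for a real weather API call.
    return {"city": city, "temp_c": 21, "condition": "clear"}

TOOLS = {"get_weather": get_weather}

# Tool schema that would be passed to the model via the system instructions.
TOOL_SCHEMA = {
    "name": "get_weather",
    "description": "Look up current weather for a city",
    "parameters": {"city": {"type": "string"}},
}

def dispatch(model_output: str) -> dict:
    """Parse the model's JSON tool call and invoke the matching function."""
    call = json.loads(model_output)
    fn = TOOLS[call["name"]]
    return fn(**call["arguments"])

# In practice the model emits this string; here we hard-code a plausible reply.
model_output = '{"name": "get_weather", "arguments": {"city": "Zurich"}}'
result = dispatch(model_output)
print(result)  # the host feeds this back to the model as the tool result
```

Structured JSON output is what makes this loop reliable: the host never has to scrape free-form text for the tool name and arguments.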

Benchmark results vividly illustrate the significant upgrade of Gemma 4 over its predecessor, Gemma 3. For instance, Gemma 4 31B's score on the AIME 2026 mathematical reasoning benchmark jumped from 20.8% to 89.2%, its LiveCodeBench v6 coding capability benchmark increased from 29.1% to 80.0%, and its τ2-bench score, which measures Agent tool-calling ability, soared from 6.6% to 86.4%. These improvements directly address key current application scenarios in reasoning, programming, and AI Agents.

Parameter efficiency is another highlight of Gemma 4. Scatter plots comparing model performance against parameter count demonstrate that Gemma 4, with its 26B and 31B models, achieves Elo scores typically seen in models with hundreds of billions or even trillions of parameters. Specifically, the 26B MoE's Arena AI score is comparable to Qwen3.5-397B-A17B (approximately 15 times its parameter count), while the 31B Dense model's score aligns with GLM-5 (over 600B parameters). Google describes this as "unprecedented intelligence density per parameter."
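A quick sanity check on the "approximately 15 times" figure, using the parameter counts quoted above. The A17B suffix denoting 17 billion active parameters follows the usual Qwen naming convention; the active-parameter comparison is an inference of ours, not a claim from the announcement:

```python
# Total-parameter ratio: Gemma 4 26B MoE vs. Qwen3.5-397B-A17B.
gemma_total, qwen_total = 26e9, 397e9
print(f"total-parameter ratio: {qwen_total / gemma_total:.1f}x")  # ~15.3x

# Active parameters per token matter more for inference cost:
gemma_active, qwen_active = 3.8e9, 17e9
print(f"active-parameter ratio: {qwen_active / gemma_active:.1f}x")  # ~4.5x
```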

The performance of edge models is equally impressive. The E2B model achieved 60.0% on the MMMLU multilingual question-answering benchmark and 43.4% on the GPQA Diamond scientific knowledge benchmark. Remarkably, this is a model that activates only 2 billion parameters and can run on mobile phones. In comparison, the previous-generation Gemma 3 27B scored 42.4% on GPQA Diamond, indicating that a 2B model on a phone has now matched the performance of a 27 billion-parameter desktop model from the previous generation.

In terms of hardware ecosystem collaboration, NVIDIA has partnered closely with Google to optimize Gemma 4's inference performance on RTX GPUs, DGX Spark personal AI supercomputers, and Jetson Orin Nano. NVIDIA's Tensor Core and CUDA software stack provide out-of-the-box high-throughput, low-latency support for Gemma 4. Furthermore, the local Agent application, OpenClaw, has already been adapted for the latest models, enabling automated task execution by leveraging local user files and application contexts.

Gemma 4's adoption of the Apache 2.0 license means developers can now legally bundle these models into products, services, and hardware devices for delivery. For industries with strict data sovereignty and compliance requirements, such as healthcare and finance, complete on-device operation means data never leaves the device while still benefiting from cutting-edge AI capabilities. The patent grant embedded in the Apache 2.0 license also offers additional legal safeguards for enterprise users.

Clément Delangue, co-founder and CEO of Hugging Face, hailed this license switch as "an important milestone." Since its initial release in February 2024, the Gemma series has amassed over 400 million downloads, with more than 100,000 community-derived variants. Model weights are now available on Hugging Face, Kaggle, and Ollama, with mainstream frameworks like Transformers, TRL, vLLM, llama.cpp, MLX, Unsloth, SGLang, and Keras offering day-one support.

For local deployment, users can get started quickly with Ollama or llama.cpp using GGUF-formatted weights. Unsloth Studio additionally provides fine-tuning and deployment support for quantized models. For cloud-based scaling, Google Vertex AI, Cloud Run, and GKE are also readily available.
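The quick-start path above can be sketched as follows; the model tags and the GGUF file name are illustrative placeholders, not confirmed release names:

```shell
# Pull and run a GGUF build locally with Ollama; check the Ollama library
# for the actual Gemma 4 tag before running.
ollama pull gemma4:26b
ollama run gemma4:26b "Summarize this repo's README in three bullets."

# Equivalent with llama.cpp, assuming a downloaded GGUF weights file:
./llama-cli -m gemma-4-26b-Q4_K_M.gguf -p "Hello" -n 128
```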

Small models like Gemma 4 are fundamentally redefining where AI should run. Historically, AI models predominantly ran in data centers, relying on cloud calls. Gemma 4, however, offers the possibility of complete model inference locally on mobile phones, Raspberry Pis, or even factory terminals without external network access. Data remains on the device, and decisions are made locally. Combined with the freedom of the Apache 2.0 license, this significantly expands the application scope of AI in sensitive industries and edge scenarios.
