Google Unveils Gemma 4 Open-Weights Models for Agentic AI and Coding, Targeting Enterprise Sector

Google recently launched its fourth generation of open-weights Gemma models, specifically optimized for agentic AI and coding tasks. These models are released under a more permissive Apache 2.0 license, aiming to attract enterprise adoption.

This release comes amid a surge of open-weights Chinese Large Language Models (LLMs) from companies like Moonshot AI, Alibaba, and Z.AI, many of which are now rivaling the performance of OpenAI's GPT-5 or Anthropic's Claude. With Gemma 4, Google offers enterprise customers a domestic alternative that promises not to collect sensitive corporate data for future model training.

Developed by Google's DeepMind team, Gemma 4 brings several key improvements, including "advanced reasoning" to enhance performance in mathematics and instruction-following, support for over 140 languages, native function calling capabilities, and compatibility with video and audio inputs.

Consistent with previous Gemma iterations, Google provides these models in various sizes, catering to applications ranging from single board computers and smartphones to laptops and large-scale enterprise data centers.

Topping the lineup is a 31-billion-parameter LLM, which Google states has been meticulously tuned to maximize output quality. The model's size is strategically balanced: large enough for robust performance, yet small enough that enterprises won't incur hundreds of thousands of dollars in GPU server costs for deployment or fine-tuning. At the same time, it stops short of cannibalizing Google's larger proprietary offerings.

According to Google, the 31B model can run unquantized at 16-bit precision on a single 80 GB H100 GPU. Furthermore, at 4-bit precision, it's compact enough to fit on a 24 GB GPU, such as an Nvidia RTX 4090 or AMD RX 7900 XTX, using frameworks like llama.cpp or Ollama.
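The memory figures follow from simple arithmetic on the weight footprint. A rough sketch (this counts weights only and ignores KV cache, activations, and framework overhead, so real deployments need extra headroom):

```python
# Back-of-envelope VRAM estimate for model weights alone.
# Ignores KV cache, activations, and runtime overhead.

def weight_vram_gb(params_billion: float, bits_per_param: float) -> float:
    """Approximate decimal GB needed to hold the weights at a given precision."""
    bytes_total = params_billion * 1e9 * (bits_per_param / 8)
    return bytes_total / 1e9

# 31B model at 16-bit (bf16/fp16) precision
print(weight_vram_gb(31, 16))  # 62.0 GB -> fits an 80 GB H100
# The same model quantized to 4-bit
print(weight_vram_gb(31, 4))   # 15.5 GB -> fits a 24 GB RTX 4090
```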

For applications demanding lower latency and faster responses, the Gemma 4 series also includes a 26-billion-parameter model built with a Mixture of Experts (MoE) architecture.

During inference, only a subset of the model's 128 experts, totaling 3.8 billion active parameters, is utilized to process and generate each token. This design allows for significantly faster token generation compared to a dense model of equivalent total size, provided the model fits within the available VRAM.

However, this increased speed comes at the expense of slightly lower output quality, as only a fraction of the total parameters are engaged in the output generation process. This trade-off can be highly beneficial for deployments on devices with slower memory, such as notebooks or consumer-grade graphics cards.
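The routing mechanism described above can be sketched in a few lines: a learned router scores every expert for each token, and only the top-k experts actually run. The expert dimensions and the value of k here are illustrative assumptions, not Gemma 4's actual router configuration:

```python
import numpy as np

# Toy Mixture-of-Experts forward pass for a single token.
# NUM_EXPERTS matches the 128 experts stated for the 26B model;
# TOP_K and HIDDEN are illustrative placeholders.
rng = np.random.default_rng(0)

NUM_EXPERTS = 128
TOP_K = 2
HIDDEN = 16

router_w = rng.normal(size=(HIDDEN, NUM_EXPERTS))  # router projection
experts = [rng.normal(size=(HIDDEN, HIDDEN)) for _ in range(NUM_EXPERTS)]

def moe_forward(token: np.ndarray) -> np.ndarray:
    """Route one token through its top-k experts, weighted by softmax scores."""
    logits = token @ router_w
    top = np.argsort(logits)[-TOP_K:]          # indices of the k highest-scoring experts
    weights = np.exp(logits[top] - logits[top].max())
    weights /= weights.sum()                   # softmax over the selected experts only
    # Only k of the 128 expert matrices are touched for this token,
    # which is where the inference speedup comes from.
    return sum(w * (token @ experts[i]) for w, i in zip(weights, top))

out = moe_forward(rng.normal(size=HIDDEN))
print(out.shape)
```

Because each token only multiplies through k expert matrices instead of one large dense block, token generation is faster even though all experts must still be resident in memory — which is why the total size must fit in VRAM.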

Both the 31B and 26B models feature an extensive 256,000-token context window, making them particularly well-suited for applications like local code assistants—a use case Google prominently highlighted in its launch announcement.

Accompanying these larger models are a pair of LLMs specifically optimized for low-end edge hardware, including smartphones and single-board computers like the Raspberry Pi. These smaller models are offered in two effective sizes: two billion and four billion parameters. The term "effective" is crucial here, as their actual parameter counts are 5.1 billion and 8 billion, respectively. Google achieves this reduction in effective size through the use of per-layer embeddings (PLE).
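The gap between actual and effective size is straightforward arithmetic: PLE lets the per-layer embedding parameters live in ordinary storage or CPU RAM and be streamed in on demand, so only the remainder must fit in accelerator memory. A minimal sketch (the model labels are placeholders, not official names):

```python
# Effective size = total parameters minus the per-layer embeddings
# that PLE offloads from accelerator memory. Figures are those
# stated for the two edge-oriented Gemma 4 models.
models = {
    "gemma-4-2b-effective": {"total_b": 5.1, "effective_b": 2.0},
    "gemma-4-4b-effective": {"total_b": 8.0, "effective_b": 4.0},
}

for name, m in models.items():
    offloaded = m["total_b"] - m["effective_b"]
    print(f"{name}: {offloaded:.1f}B embedding params offloaded, "
          f"{m['effective_b']:.1f}B resident on the accelerator")
```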
