Gemma 4 Post-Launch: Community Findings Reveal Performance Gaps Against Google's Benchmarks
Google released Gemma 4 yesterday under the Apache 2.0 license, and the initial benchmark numbers looked remarkable on paper. The community quickly put the model to the test. After spending the past 24 hours compiling forum discussions, fine-tuning experiments, and reports from dozens of early adopters, we can now summarize the real-world findings and open questions surrounding this new model family.

The Good News First

Apache 2.0 is a Major Advantage: Previous Gemma releases utilized a custom Google license that technically allowed usage restrictions. Apache 2.0 entirely removes this uncertainty, which is critical for anyone building commercial products on open models, often outweighing raw benchmark numbers.

Multilingual Quality is Genuinely Strong: Users testing Gemma 4 in German, Arabic, Vietnamese, and French are reporting that it outperforms Qwen 3.5 in non-English tasks. One user described its translation capabilities as "in a tier of its own," while another noted it "makes translate-gemma feel outdated instantly." This is a significant differentiator for global enterprise deployments.

Elo Score Tells a Different Story: The 31B model achieved an Elo score of 2150 on LMArena, placing it above GPT-OSS-120B and comparable to GPT-5-mini. However, side-by-side benchmark tables show it roughly tying with Qwen 3.5 27B. The gap between Elo (a measure of human preference) and automated benchmarks suggests that Gemma 4 produces responses humans prefer, even when raw accuracy is similar.
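For intuition on what an Elo gap actually means, the standard Elo model maps a rating difference to an expected preference rate. Here is a minimal sketch; the 2150 figure is from the leaderboard report above, while the comparison rating of 2100 is purely illustrative:

```python
def elo_win_prob(r_a: float, r_b: float) -> float:
    """Expected probability that model A is preferred over model B
    under the standard Elo model used by arena-style leaderboards."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

# A 50-point gap (2150 vs. a hypothetical 2100) is only a ~57%
# preference rate -- a real but modest edge in head-to-head votes.
print(round(elo_win_prob(2150, 2100), 3))  # -> 0.571
```

In other words, even a leaderboard lead of tens of Elo points translates to winning barely more than half of pairwise comparisons, which is consistent with the near-tie seen in automated benchmark tables.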

The E2B Model is Surprisingly Potent: Multiple users have confirmed that the 2.3B effective parameter model beats Gemma 3 27B on most benchmarks. One user, running it on a basic i7 laptop with 32GB RAM, reported it was "not only faster, it gives significantly better answers" than Qwen 3.5 4B for finance analysis.

The Problems Nobody Warned About

Inference Speed

This is the elephant in the room. Multiple users are reporting that Gemma 4's MoE model (26B-A4B) runs significantly slower than Qwen 3.5's equivalent:

  • One user observed 11 tokens/sec on Gemma 4 26B-A4B compared to 60+ tokens/sec on Qwen 3.5 35B-A3B using the same 5060 Ti 16GB GPU.
  • Another user confirmed higher VRAM usage for context at the same quantization level.
  • Even someone running it on a DGX Spark asked, "why is it super slow?" with no clear answer yet.

For the dense 31B model, users are reporting speeds of 18-25 tokens/sec on dual NVIDIA GPUs (5070 Ti + 5060 Ti). While reasonable, this is not fast. This speed gap against Qwen 3.5 is concerning for production deployments where latency is a critical factor.
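When comparing tokens/sec figures like the ones above, it helps to measure them the same way on your own hardware. Below is a crude, backend-agnostic throughput probe; `generate_fn` is a placeholder for whatever inference backend you use (llama.cpp, vLLM, Ollama, etc.), and the dummy backend exists only so the sketch runs standalone:

```python
import time

def tokens_per_sec(generate_fn, prompt: str, n_tokens: int = 256) -> float:
    """Time a fixed-length generation and return decode throughput.
    generate_fn(prompt, max_tokens=...) is a stand-in for a real backend."""
    start = time.perf_counter()
    generate_fn(prompt, max_tokens=n_tokens)
    elapsed = time.perf_counter() - start
    return n_tokens / elapsed

# Dummy backend: pretends to decode at ~1 ms per token.
def fake_generate(prompt, max_tokens):
    time.sleep(max_tokens * 0.001)

rate = tokens_per_sec(fake_generate, "Summarize this report.", n_tokens=100)
print(f"{rate:.0f} tokens/sec")
```

Note that a single timed run like this mixes prompt processing and decode; for an apples-to-apples comparison between Gemma 4 and Qwen 3.5, keep the prompt, context length, and quantization identical across runs.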

VRAM Consumption

Gemma models have historically been VRAM-hungry for context, and Gemma 4 appears to continue this pattern. One user noted they could only fit Gemma 3 27B Q4 with 20K context on a 5090 GPU.
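Context VRAM usage is dominated by the KV cache, which you can estimate from the model's architecture. The sketch below uses the standard KV-cache size formula with entirely illustrative architecture numbers, since Gemma 4's layer/head configuration is not specified in the reports above:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   ctx_len: int, bytes_per_elem: int = 2) -> int:
    """Rough KV-cache footprint: one K and one V tensor per layer,
    per token, at the given element width (2 bytes = fp16)."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

# Hypothetical 48-layer model, 16 KV heads of dim 128, fp16 cache,
# 20K context -- the kind of budget that crowds out weights on a 5090:
gib = kv_cache_bytes(48, 16, 128, 20_000) / 2**30
print(f"{gib:.1f} GiB")  # -> 7.3 GiB
```

Models with fewer KV heads (grouped-query attention) or a quantized KV cache shrink this figure substantially, which is one likely explanation for why Gemma's per-context VRAM cost differs from Qwen's at the same weight quantization.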