Google DeepMind rolled out Quantization-Aware Training (QAT) checkpoints for its Gemma 4 lineup of models. These releases are built for edge hardware and everyday consumer graphics cards. The move followed April’s Gemma 4 debut and came just two days after the company unveiled a 12B variant.
We put the various Gemma 4 edge-model formats side by side using publicly available specs. The mission was straightforward: spell out how much memory each precision tier consumes and then illustrate what QAT brings to the table.
How QAT works under the hood
Quantization compresses a model by reducing the precision of its weights. Standard Post-Training Quantization (PTQ) squashes a fully trained model after the fact, which frequently chips away at output quality. QAT takes a different route: it mimics quantization throughout the training process itself, allowing the model to learn strategies that offset the precision reduction.
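To make the idea concrete, here is a minimal Python sketch of the quantize-then-dequantize step (often called "fake quantization") that QAT-style training commonly places in the forward pass. It uses a generic symmetric scheme chosen for illustration, not Google's published recipe or the exact Q4_0 block layout. The `fake_quantize` helper and the sample weights are hypothetical.

```python
def fake_quantize(weights, bits=4):
    """Round weights onto a signed integer grid, then map them back to floats.

    Generic symmetric scheme for illustration only (an assumption, not the
    Gemma 4 recipe): one scale for the whole list, derived from the largest
    absolute value.
    """
    qmax = 2 ** (bits - 1) - 1                      # 7 for signed 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    simulated = []
    for w in weights:
        q = round(w / scale)                        # snap to the integer grid
        q = max(-qmax, min(qmax, q))                # clamp to the 4-bit range
        simulated.append(q * scale)                 # back to float for the forward pass
    return simulated


weights = [0.82, -0.31, 0.05, -0.77, 0.44]          # hypothetical sample values
simulated = fake_quantize(weights, bits=4)
errors = [abs(w - s) for w, s in zip(weights, simulated)]

for w, s, e in zip(weights, simulated, errors):
    print(f"original={w:+.3f}  simulated={s:+.3f}  error={e:.3f}")
```

In a QAT loop, the loss would be computed with `simulated` in place of `weights`, so training sees the rounding error and can adjust for it. Gradients are typically passed through the rounding step unchanged (the straight-through estimator). PTQ applies the same rounding only once, after training has finished.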
According to Google’s AI team, its QAT approach delivers noticeably better quality than conventional PTQ. That said, Google did not include any Gemma 4 QAT benchmark results in the announcement itself. For background, Gemma 3 QAT models reduced the Q4_0 perplexity gap by 54 percent in llama.cpp tests. We reference that here purely as a point of comparison from the previous generation.
The comparison task
We compared Gemma 4’s E2B and E4B versions across three formats: BF16, Q4_0 QAT, and a brand-new QAT schema tailored for mobile devices. Each format was assessed on memory demand, accuracy retention, and practicality for on-device scenarios, using figures drawn solely from official sources.
Memory requirements
| Format | E2B | E4B | Basis |
|---|---|---|---|
| BF16 (16-bit) | 9.6 GB | 15 GB | Official Gemma 4 docs |
| Q4_0 (4-bit, QAT) | 3.2 GB | 5 GB | Official Gemma 4 docs |
| Mobile (QAT, E2B) | ~1 GB | — | QAT announcement |
The Q4_0 memory figures are identical to standard PTQ Q4_0 footprints. QAT does not shrink the model at a given format level. Its benefit is improved accuracy at that same file size. The additional compression comes from the new mobile schema instead.
With that mobile schema, Google trimmed Gemma 4 E2B down to roughly 1 GB. There is room to go even leaner: a text-only build without Per-Layer Embeddings comes in under 1 GB, since it strips out the audio and vision encoders entirely.
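The reductions implied by the published figures can be checked with a few lines of Python. The numbers below are copied from the memory table; the mobile value is the approximate ~1 GB figure, and the `reduction` helper is our own illustration.

```python
# Footprints in GB, taken from the memory table in this article.
FOOTPRINT_GB = {
    "E2B": {"BF16": 9.6, "Q4_0 QAT": 3.2, "Mobile QAT": 1.0},  # mobile is approximate (~1 GB)
    "E4B": {"BF16": 15.0, "Q4_0 QAT": 5.0},
}


def reduction(model, fmt):
    """Return (compression ratio, percent saved) relative to the BF16 footprint."""
    base = FOOTPRINT_GB[model]["BF16"]
    size = FOOTPRINT_GB[model][fmt]
    return base / size, 100 * (1 - size / base)


for model, formats in FOOTPRINT_GB.items():
    for fmt in formats:
        if fmt == "BF16":
            continue  # BF16 is the baseline itself
        ratio, saved = reduction(model, fmt)
        print(f"{model} {fmt}: {ratio:.1f}x smaller, {saved:.0f}% saved")
```

Both models shrink by a factor of three at Q4_0, while the mobile schema takes E2B to roughly a tenth of its BF16 footprint.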
Format-by-format overview
BF16 serves as the quality gold standard. E2B clocks in at 9.6 GB and E4B at 15 GB. Think of it as the baseline reference rather than a realistic option for phone or handheld deployment.
Q4_0 QAT is the go-to format for general local use. It brings E2B down to 3.2 GB and E4B to 5 GB. QAT delivers stronger accuracy at this size than PTQ would at the same footprint. At these sizes, the weights fit in the memory of common consumer GPUs. Earlier E2B experiments even demonstrated INT4 execution on a Raspberry Pi 5.
The mobile format is the schema tuned specifically for edge hardware. It squeezes E2B to approximately 1 GB through static activations, per-channel quantization, and selective 2-bit compression on certain layers.
Inside the mobile schema
Google’s AI engineers crafted four key techniques targeting mobile silicon. Static activations lock in scaling factors during training, slashing real-time computation on-device. Per-channel quantization is designed to map cleanly onto mobile accelerator architectures. Targeted 2-bit compression is applied only to the token-generation layers. KV cache tuning and embedding optimization trim what stays active in memory at any given moment.
The core reasoning layers continue to operate at higher precision, preserving model intelligence while shrinking what gets stored. Developers can also go text-only by omitting the audio and vision encoders, which carves out even more memory in cases where multimodal input is irrelevant.
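Two of these techniques, per-channel quantization and static activations, are easy to show in miniature. The Python sketch below is a generic illustration under our own assumptions, not the Gemma 4 implementation: the helper names, the sample matrix, the 8-bit activation width, and the calibrated range of [-4, 4] are all hypothetical.

```python
def scale_for(values, bits):
    """Symmetric scale that maps the largest absolute value onto the integer grid."""
    qmax = 2 ** (bits - 1) - 1
    peak = max(abs(v) for v in values)
    return peak / qmax if peak > 0 else 1.0


def quantize(values, scale, bits):
    """Round to integers and clamp to the signed range for the given bit width."""
    qmax = 2 ** (bits - 1) - 1
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]


# Hypothetical weight matrix: one row per output channel, very different magnitudes.
weights = [
    [0.02, -0.01, 0.03],
    [1.50, -2.00, 0.75],
]

# Per-tensor: a single scale shared by every row, dominated by the large row.
tensor_scale = scale_for([w for row in weights for w in row], bits=4)
per_tensor = [quantize(row, tensor_scale, bits=4) for row in weights]

# Per-channel: each row gets its own scale, so the small row keeps its detail.
channel_scales = [scale_for(row, bits=4) for row in weights]
per_channel = [quantize(row, s, bits=4) for row, s in zip(weights, channel_scales)]

# Static activations: the scale is fixed ahead of time (assumed range [-4, 4] at
# 8 bits), so inference skips the per-input max scan a dynamic scheme needs.
STATIC_ACT_SCALE = 4.0 / 127


def quantize_activations_static(acts):
    return quantize(acts, STATIC_ACT_SCALE, bits=8)


def quantize_activations_dynamic(acts):
    return quantize(acts, scale_for(acts, bits=8), bits=8)


print("per-tensor :", per_tensor)
print("per-channel:", per_channel)
```

With a shared scale, the small-magnitude row collapses entirely to zeros, while per-channel scales preserve its structure. The static variant trades some flexibility for less work at inference time, which matches the motivation given for mobile silicon.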
Multi-axis evaluation
Scores run from 1 to 5 and reflect a qualitative ranking of each format for on-device scenarios. Memory scores rest on Google's published figures rather than our own measurements. Accuracy scores reflect what Google has publicly described, not independently measured Gemma 4 metrics. Each score is accompanied by a one-line justification.
| Dimension | BF16 | Q4_0 QAT | Mobile QAT |
|---|---|---|---|
| Memory footprint | 1 — heaviest, 9.6 GB E2B | 4 — 3.2 GB E2B | 5 — ~1 GB E2B text-only |
| Quality preservation | 5 — full-precision baseline | 4 — QAT-preserved, near baseline | 3 — 2-bit token layers, core kept higher |
| Decode speed | 2 — no quantization speedup | 4 — 4-bit accelerates decode | 5 — mobile-optimized static activations |
| Deployment breadth | 4 — loadable but heavy | 5 — llama.cpp, Ollama, LM Studio, vLLM, MLX | 3 — LiteRT-LM, Transformers.js, edge-focused |
| On-device accessibility | 1 — needs large GPU | 4 — consumer GPU, Raspberry Pi 5 | 5 — runs on phones |
| Total (/25) | 13 | 21 | 21 |
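The totals can be reproduced by summing each column. The short Python check below copies the scores from the table; the variable names are our own.

```python
FORMATS = ("BF16", "Q4_0 QAT", "Mobile QAT")

# Scores copied from the table above, in the same column order as FORMATS.
SCORES = {
    "Memory footprint":        (1, 4, 5),
    "Quality preservation":    (5, 4, 3),
    "Decode speed":            (2, 4, 5),
    "Deployment breadth":      (4, 5, 3),
    "On-device accessibility": (1, 4, 5),
}

totals = {
    fmt: sum(row[i] for row in SCORES.values())
    for i, fmt in enumerate(FORMATS)
}

# Dimensions on which the two QAT formats differ, as (Q4_0 score, mobile score).
qat_differences = {
    name: (row[1], row[2]) for name, row in SCORES.items() if row[1] != row[2]
}

print(totals)
print(qat_differences)
```

The two QAT formats reach the same total by different routes: Q4_0 scores higher on quality preservation and deployment breadth, while the mobile format scores higher on memory, decode speed, and on-device accessibility.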
Verdict
By design, the top two formats are evenly matched. Q4_0 QAT and mobile QAT both land at 21 points, but each excels on different hardware. For smartphones, the mobile format takes the lead: it fits in about 1 GB on E2B and targets mobile accelerators directly. For laptop-class machines and consumer GPUs, Q4_0 QAT remains the default recommendation. BF16 continues to serve as the accuracy reference point rather than a viable local option.
Methodology and caveats
All memory figures are sourced from Google’s official Gemma 4 documentation. The ~1 GB E2B estimate comes from the QAT announcement itself. Accuracy assessments reflect Google’s stated claims, as no independent Gemma 4 QAT benchmarks were available at launch. We did not personally run any of the models for this analysis. Developers are encouraged to test with their own quantization settings and on their own workloads before committing to production builds.
Main points
- Q4_0 QAT reduces Gemma 4 E2B to 3.2 GB and E4B to 5 GB, down from 9.6 GB and 15 GB at BF16.
- An all-new mobile QAT schema compresses E2B to around 1 GB; a text-only build without Per-Layer Embeddings drops below 1 GB.
- QAT boosts accuracy at a given model size rather than changing the size itself; the mobile format accounts for the extra memory savings.
- Google asserts superior quality compared to PTQ but released no Gemma 4 QAT benchmark results at launch.
- Model weights are already available on Hugging Face, with support across llama.cpp, Ollama, LM Studio, vLLM, MLX, and LiteRT-LM.
Explore the model weights (Q4_0 QAT collection, Mobile QAT collection) and the Google blog post (QAT release).



