Google Unveils Gemma 4 QAT Checkpoints: Q4_0 and a Revolutionary Mobile Format Slash On-Device Memory

Google DeepMind has released Quantization-Aware Training (QAT) checkpoints for its Gemma 4 model family. The checkpoints target edge hardware and consumer GPUs. The release follows April’s Gemma 4 debut and comes two days after the company unveiled a 12B variant.

We put the Gemma 4 edge-model formats side by side using publicly available specs. The goal was to show how much memory each precision tier consumes and what QAT adds on top.

How QAT works under the hood

Quantization compresses a model by reducing the precision of its weights. Standard Post-Training Quantization (PTQ) compresses a fully trained model after the fact, which frequently degrades output quality. QAT takes a different route: it simulates quantization during training itself, so the model can learn weights that compensate for the reduced precision.
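To make the idea concrete, here is a minimal Python sketch of the quantize-then-dequantize step (often called fake quantization) that QAT-style training typically applies to weights in the forward pass. It is an illustration only: the function name `fake_quantize`, the single shared scale, and the sample weights are assumptions made for this example, not Google's training recipe and not the block layout llama.cpp uses for Q4_0.

```python
def fake_quantize(weights, bits=4):
    """Quantize-then-dequantize a list of floats using one shared scale.

    Illustrative only: real formats usually keep a scale per block or channel.
    """
    qmax = 2 ** (bits - 1) - 1                 # 7 for signed 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    simulated = []
    for w in weights:
        q = round(w / scale)                   # snap to the integer grid
        q = max(-qmax - 1, min(qmax, q))       # clamp to the 4-bit range
        simulated.append(q * scale)            # map back to floating point
    return simulated


weights = [0.7, -0.35, 0.1, 0.0]
simulated = fake_quantize(weights)
error = [abs(a - b) for a, b in zip(weights, simulated)]
print(simulated)
print(error)
```

During training, the forward pass would use `simulated` in place of `weights`, so the loss already reflects the rounding error the model has to tolerate after export.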

According to Google, its QAT approach delivers noticeably better quality than conventional PTQ. However, the announcement did not include any Gemma 4 QAT benchmark results. For background, Gemma 3 QAT models reduced the Q4_0 perplexity gap by 54 percent in llama.cpp tests; we cite that figure purely as a point of comparison from the previous generation.

The comparison task

We compared Gemma 4’s E2B and E4B versions across three formats: BF16, Q4_0 QAT, and a new QAT schema tailored for mobile devices. Each format was assessed on memory demand, accuracy retention, and practicality for on-device use, drawing figures solely from official sources.

Memory requirements

Format	E2B	E4B	Basis
BF16 (16-bit)	9.6 GB	15 GB	Official Gemma 4 docs
Q4_0 (4-bit, QAT)	3.2 GB	5 GB	Official Gemma 4 docs
Mobile (QAT, E2B)	~1 GB	—	QAT announcement

The Q4_0 memory figures are identical to standard PTQ Q4_0 footprints. QAT does not shrink the model at a given format level. Its benefit is improved accuracy at that same file size. The additional compression comes from the new mobile schema instead.
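The size reductions implied by the table can be checked with a few lines of arithmetic. The sketch below uses only the published footprints; the dictionary name, the helper `reduction_vs_bf16`, and the choice to treat "~1 GB" as exactly 1.0 are assumptions made for illustration.

```python
# Published footprints in GB, copied from the memory table.
# The mobile entry is listed as "~1 GB", so 1.0 is an approximation.
FOOTPRINT_GB = {
    "E2B": {"BF16": 9.6, "Q4_0 QAT": 3.2, "Mobile QAT": 1.0},
    "E4B": {"BF16": 15.0, "Q4_0 QAT": 5.0},
}


def reduction_vs_bf16(model, fmt):
    """Return how many times smaller `fmt` is than BF16 for `model`."""
    sizes = FOOTPRINT_GB[model]
    return sizes["BF16"] / sizes[fmt]


for model, sizes in FOOTPRINT_GB.items():
    for fmt in sizes:
        if fmt != "BF16":
            ratio = reduction_vs_bf16(model, fmt)
            print(f"{model} {fmt}: {ratio:.1f}x smaller than BF16")
```

Both Q4_0 ratios come out at about 3x rather than the 4x that a plain 16-bit to 4-bit conversion would suggest; the published figures do not break down where the remaining memory goes.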

With that mobile schema, Google trimmed Gemma 4 E2B down to roughly 1 GB. There is room to go leaner still: a text-only build without Per-Layer Embeddings comes in under 1 GB, since it also strips out the audio and vision encoders.

Format-by-format overview

BF16 serves as the quality gold standard. E2B clocks in at 9.6 GB and E4B at 15 GB. Think of it as the baseline reference rather than a realistic option for phone or handheld deployment.

Q4_0 QAT is the go-to format for general local use. It brings E2B down to 3.2 GB and E4B to 5 GB. QAT delivers stronger accuracy at this size than PTQ would at the same footprint. Footprints of this size fit within the memory of common consumer GPUs. Earlier E2B experiments even demonstrated INT4 execution on a Raspberry Pi 5.

The mobile format is the schema tuned specifically for edge hardware. It squeezes E2B to approximately 1 GB through static activations, quantization applied per channel, and selective 2-bit compression on certain layers.

Inside the mobile schema

Google’s engineers describe four techniques aimed at mobile silicon. Static activations fix the scaling factors during training, which reduces runtime computation on the device. Per-channel quantization is designed to map cleanly onto mobile accelerator architectures. Targeted 2-bit compression is applied only to token-generation layers. KV cache tuning and embedding optimization trim what stays resident in memory at any given moment.

The core reasoning layers continue to operate at higher precision, preserving model intelligence while shrinking what gets stored. Developers can also go text-only by omitting the audio and vision encoders, which carves out even more memory in cases where multimodal input is irrelevant.
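As a rough illustration of why per-channel scaling helps, the Python sketch below compares one scale for a whole weight matrix against one scale per row. Treating each row as a channel, the 4-bit symmetric scheme, and all function names are assumptions for this example; the announcement does not publish the actual mobile schema at this level of detail.

```python
def scale_for(values, bits=4):
    """Symmetric scale mapping the largest magnitude onto the top integer level."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    return max_abs / qmax if max_abs > 0 else 1.0


def quantize_dequantize(values, scale, bits=4):
    """Round each value onto the integer grid, clamp it, and map it back."""
    qmax = 2 ** (bits - 1) - 1
    return [max(-qmax - 1, min(qmax, round(v / scale))) * scale for v in values]


def per_tensor(matrix, bits=4):
    """One shared scale for every value in the matrix."""
    scale = scale_for([v for row in matrix for v in row], bits)
    return [quantize_dequantize(row, scale, bits) for row in matrix]


def per_channel(matrix, bits=4):
    """A separate scale for each row (treated here as one channel)."""
    return [quantize_dequantize(row, scale_for(row, bits), bits) for row in matrix]


def total_abs_error(original, approx):
    return sum(abs(a - b) for ro, ra in zip(original, approx) for a, b in zip(ro, ra))


# Two channels whose magnitudes differ by a factor of 100.
weights = [[7.0, -3.0, 1.0], [0.07, -0.03, 0.01]]
tensor_error = total_abs_error(weights, per_tensor(weights))
channel_error = total_abs_error(weights, per_channel(weights))
print(f"per-tensor error:  {tensor_error:.4f}")
print(f"per-channel error: {channel_error:.4f}")
```

With a shared scale, the small-magnitude row rounds to zero entirely; with its own scale, it is reproduced almost exactly. Static activations follow a similar idea on the activation side: the scale is fixed ahead of time instead of being recomputed for each input.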

Multi-axis evaluation

Scores are a qualitative ranking of each format for on-device scenarios, on a scale of 1 to 5 where higher is better. Memory scores are backed by published figures. Accuracy scores reflect what Google has publicly described, not independently measured Gemma 4 metrics. Each score is accompanied by a one-line justification.

Dimension	BF16	Q4_0 QAT	Mobile QAT
Memory footprint	1 (heaviest, 9.6 GB E2B)	4 (3.2 GB E2B)	5 (~1 GB E2B)
Quality preservation	5 (full-precision baseline)	4 (QAT-preserved, near baseline)	3 (2-bit token layers, core kept higher)
Decode speed	2 (no quantization speedup)	4 (4-bit accelerates decode)	5 (mobile-optimized static activations)
Deployment breadth	4 (loadable but heavy)	5 (llama.cpp, Ollama, LM Studio, vLLM, MLX)	3 (LiteRT-LM, Transformers.js, edge-focused)
On-device accessibility	1 (needs large GPU)	4 (consumer GPU, Raspberry Pi 5)	5 (runs on phones)
Total (/25)	13	21	21
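The totals in the table are plain sums of the five dimension scores. A short Python sketch reproduces them and shows which format leads on each dimension; the names `SCORES`, `totals`, and `best_for` are introduced here for illustration only.

```python
# Scores copied from the evaluation table (1 to 5, higher is better).
SCORES = {
    "BF16": {
        "Memory footprint": 1,
        "Quality preservation": 5,
        "Decode speed": 2,
        "Deployment breadth": 4,
        "On-device accessibility": 1,
    },
    "Q4_0 QAT": {
        "Memory footprint": 4,
        "Quality preservation": 4,
        "Decode speed": 4,
        "Deployment breadth": 5,
        "On-device accessibility": 4,
    },
    "Mobile QAT": {
        "Memory footprint": 5,
        "Quality preservation": 3,
        "Decode speed": 5,
        "Deployment breadth": 3,
        "On-device accessibility": 5,
    },
}

totals = {fmt: sum(dims.values()) for fmt, dims in SCORES.items()}


def best_for(dimension):
    """Return every format that holds the top score on one dimension."""
    top = max(dims[dimension] for dims in SCORES.values())
    return [fmt for fmt, dims in SCORES.items() if dims[dimension] == top]


print(totals)
print(best_for("Deployment breadth"))
print(best_for("On-device accessibility"))
```

Because the two quantized formats tie on the total, the per-dimension view is what separates them: Q4_0 QAT leads on deployment breadth, while the mobile format leads on memory, decode speed, and on-device accessibility.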

Verdict

The tie between the two quantized formats is by design. Q4_0 QAT and mobile QAT both land at 21 points, but each excels on different hardware. For smartphones, the mobile format takes the lead: it fits E2B into about 1 GB and targets mobile accelerators directly. For laptop-class machines and consumer GPUs, Q4_0 QAT remains the default recommendation. BF16 serves as the accuracy reference point rather than a practical on-device option.
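One way to turn this verdict into a rule of thumb is to pick the highest-precision E2B format whose published footprint fits in the memory a device can spare. The sketch below is hypothetical: the `choose_format` helper, the 25 percent headroom for KV cache and runtime overhead, and the preference for higher precision whenever it fits are assumptions, not guidance from Google.

```python
# Published E2B footprints in GB, ordered from highest to lowest precision.
# "~1 GB" for the mobile format is approximated as 1.0.
E2B_FOOTPRINT_GB = [("BF16", 9.6), ("Q4_0 QAT", 3.2), ("Mobile QAT", 1.0)]


def choose_format(available_gb, headroom=0.25):
    """Return the highest-precision format that fits, or None if none does.

    `headroom` is a hypothetical safety margin on top of the weight footprint;
    real overhead depends on context length, runtime, and workload.
    """
    for name, size_gb in E2B_FOOTPRINT_GB:
        if size_gb * (1 + headroom) <= available_gb:
            return name
    return None


for budget in (16, 8, 2, 1):
    print(f"{budget} GB available -> {choose_format(budget)}")
```

A real deployment decision would also weigh runtime support and accelerator compatibility, which this memory-only check ignores.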

Methodology and caveats

The BF16 and Q4_0 memory figures are sourced from Google’s official Gemma 4 documentation; the ~1 GB E2B estimate comes from the QAT announcement. Accuracy assessments reflect Google’s stated claims, as no independent Gemma 4 QAT benchmarks were available at launch. We did not run any of the models for this analysis. Developers should test with their own quantization settings and workloads before committing to production builds.

Main points

Q4_0 QAT reduces Gemma 4 E2B to 3.2 GB and E4B to 5 GB, down from 9.6 GB and 15 GB at BF16.
An all-new mobile QAT schema compresses E2B to around 1 GB; a text-only build without Per-Layer Embeddings drops below 1 GB.
QAT boosts accuracy at a given model size rather than changing the size itself; the mobile format accounts for the extra memory savings.
Google asserts superior quality compared to PTQ but released no Gemma 4 QAT benchmark results at launch.
Model weights are already available on Hugging Face, with support across llama.cpp, Ollama, LM Studio, vLLM, MLX, and LiteRT-LM.

Marktechpost’s Interactive Breakdown

Marktechpost · Benchmark

Gemma 4 QAT: Q4_0 vs. the New Mobile Format

Google DeepMind published Quantization-Aware Training checkpoints for Gemma 4. We stacked up three edge-model formats using publicly released numbers.

Formats compared

BF16 (16-bit) · Q4_0 QAT (4-bit) · Mobile QAT

June 5, 2026

The Comparison

Task

What We Evaluated

$ compare gemma-4 --models E2B,E4B \
    --formats BF16,Q4_0-QAT,MOBILE-QAT \
    --rank memory,quality,accessibility \
    --source published-only --no-self-run

Memory figures are from the official Gemma 4 documentation. Quality is based on Google’s published claims. No models were tested locally.

Format 1 of 3 · Reference

BF16 (16-bit)

13 / 25

This represents the full-quality reference version. E2B requires 9.6 GB, and E4B requires 15 GB.

Top note: useful as a benchmark, not intended for running on phones or typical laptops.

Format 2 of 3 · Laptop / GPU

Q4_0 QAT (4-bit)

21 / 25

This versatile format works well for local hardware. E2B shrinks to 3.2 GB, and E4B drops to 5 GB.

Top note: QAT retains more performance quality compared to standard post-training quantization at the same 4-bit level.

Format 3 of 3 · Mobile

Mobile QAT

21 / 25

This lightweight format is optimized for edge devices. E2B is reduced to around 1 GB.

Top note: token-generation layers use 2-bit quantization, while reasoning-critical layers maintain higher precision.

Leaderboard

Overall Rankings

Dimension	BF16	Q4_0 QAT	Mobile QAT
Memory footprint	1	4	5
Quality preservation	5	4	3
Decode speed	2	4	5
Deployment breadth	4	5	3
On-device accessibility	1	4	5
Total	13	21	21

The tie is intentional: Q4_0 is best for laptops and GPUs, while mobile QAT is ideal for mobile phones.

Key Takeaways

Essential Information for Developers

Q4_0 QAT significantly reduces E2B to 3.2 GB and E4B to 5 GB, compared to 9.6 GB and 15 GB with BF16.
The new mobile QAT format pushes E2B down to about 1 GB; text-only mode without PLE falls below 1 GB.
QAT influences quality at a given level of compression; the mobile format delivers the additional reduction in memory usage.
Google states improved quality over PTQ, but has not published specific Gemma 4 QAT benchmarks yet.
Model weights are available now on Hugging Face with support for llama.cpp, Ollama, vLLM, and MLX.

Explore the Model weights (Q4_0 QAT collection, Mobile QAT collection) and Google blog (QAT release).
