Google DeepMind has unveiled Gemma 4 12B, a dense multimodal model that completely eliminates the need for dedicated encoders. Instead, visual and audio data are fed directly into the core language model. This streamlined approach enables the model to handle agentic tasks on a standard consumer laptop equipped with 16 GB of RAM. It is distributed under the permissive Apache 2.0 license.
Model Overview & Access
Gemma 4 12B is a 12-billion-parameter transformer that operates solely as a decoder. It natively processes text, images, audio, and video without relying on external vision or audio encoders. Its architecture mirrors that of the larger Gemma 4 31B Dense model, positioning it as a middle ground between the compact E4B and the more powerful 26B Mixture of Experts version.
- Architecture: A single, unified decoder-only transformer with no separate encoders.
- Modalities: Supports text, image, video, and direct audio input — marking the first mid-range Gemma model to include native audio support.
- Hardware requirement: Requires 16 GB of VRAM or unified memory, making it compatible with consumer-grade GPUs and Apple Silicon Macs.
- License: Released under Apache 2.0, with model weights freely available for download.
- Inference stack: Works seamlessly with popular tools like llama.cpp, MLX, vLLM, Ollama, SGLang, Unsloth, and LM Studio.
- Download: Available on Hugging Face and Kaggle; the instruction-tuned version is listed as
google/gemma-4-12B-it. - Integration: Supported by Hugging Face Transformers, the LiteRT-LM command-line tool, and a local API server compatible with OpenAI standards via
litert-lm serve.
Alongside the main model, a specialized Multi-Token Prediction (MTP) drafter has also been released to help speed up response times on local devices.
Architecture: The Encoder-Free Design
Previous mid-sized Gemma models relied on distinct transformer encoders for handling vision and audio inputs. These added both computational delay and extra parameters. For instance, the medium-tier Gemma 4 models included a 550-million-parameter vision encoder, while the E2B and E4B variants used a 300-million-parameter audio encoder. All such components have been removed in the 12B version.
Vision embedder (35M parameters): Input images are divided into 48×48 pixel segments. Each segment is mapped into the language model’s internal representation using a simple linear transformation—no attention mechanism is involved, and each patch is handled independently. To preserve spatial context, the model uses a factorized coordinate system: separate learned embeddings for the X and Y axes are retrieved and combined to form a positional signal. This signal is added to the patch embedding and then normalized. That completes the entire vision processing pipeline.
Audio wave projection: Audio sampled at 16 kHz is broken into 40-millisecond chunks, each containing 640 data points. These values are directly projected into the same embedding space used for text tokens—bypassing any traditional feature extraction or conformer-based processing. The model leverages its existing Rotary Position Embedding (RoPE) to interpret the sequential nature of the audio stream. In contrast, earlier E2B and E4B models employed 12 conformer layers for audio; these are now entirely absent.
Significance: With a shared weight space across modalities, there’s no need to separately fine-tune frozen encoders. Techniques like LoRA or full fine-tuning can now update vision, audio, and text processing simultaneously in one training pass. This capability is already supported in frameworks like Hugging Face Transformers and Unsloth.
By removing dedicated encoders, multimodal inference becomes faster—the language model begins processing inputs immediately without waiting for an encoder stage to complete.
Capabilities & Performance
At launch, Google DeepMind has not released comprehensive benchmark data. However, the official documentation indicates that the 12B model achieves performance close to the larger 26B MoE variant on standard evaluations, while consuming less than half the memory.
— AI research, model releases & developer tools for 1M+ practitioners.
marktechpost.com
Key Takeaways
- Google DeepMind has launched Gemma 4 12B, a dense multimodal model without encoders, available under the Apache 2.0 license.
- Vision and audio inputs are processed directly by the LLM core — removing the need for dedicated vision (550M) or audio (300M) encoder modules.
- A lightweight 35M vision embedder relies on a single matrix multiplication combined with split X/Y positional indexing; audio is mapped straight from raw 16 kHz signal frames.
- This marks the first mid-range Gemma model to include built-in audio support, alongside video capability, all runnable on a 16 GB laptop.
- Its benchmark results approach those of the 26B MoE model while consuming under half the memory.
Explore the Model Weights and Technical details. Also, feel free to follow us on Twitter and be sure to join our 150k+ ML SubReddit and subscribe to our Newsletter. Still not on Telegram? You can join us there too.
Interested in collaborating with us to showcase your GitHub repository, Hugging Face page, product launch, or webinar? Get in touch



