Zyphra Unveils Zamba2-VL: A Hybrid Mamba2–Transformer Vision-Language Model Slashing Time-to-First-Token by Nearly 10x

Zyphra has unveiled Zamba2-VL, a collection of open-source vision-language models available in three parameter sizes: 1.2B, 2.7B, and 7B. All variants are built upon the Zamba2 hybrid SSM–Transformer architecture.

Vision-language models (VLMs) process both images and text simultaneously, enabling them to answer questions about charts, documents, and photographs. While most open VLMs rely on a dense Transformer as their language backbone, Zamba2-VL substitutes this with a hybrid state-space architecture, aiming to deliver competitive accuracy with reduced latency.

What is Zamba2-VL

Zamba2-VL adopts the widely used LLaVA-style VLM framework. A pre-trained vision encoder converts image patches into feature representations, and a lightweight MLP adapter maps those features into the language model’s embedding space. The language model then processes an interleaved sequence of visual and textual tokens. The models support single-image and multi-image understanding as well as grounding tasks.
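To make the data flow concrete, here is a minimal sketch in plain Python of how a LLaVA-style pipeline might assemble its input. It is a toy illustration, not Zyphra's code: the names `mlp_adapter` and `build_sequence`, the `<image>` placeholder token, the ReLU activation, and the tiny dimensions are all assumptions chosen for readability.

```python
import random

def linear(x, w, b):
    # x has d_in values; w has d_out rows of d_in weights; returns d_out values.
    return [sum(xi * wi for xi, wi in zip(x, row)) + bi for row, bi in zip(w, b)]

def mlp_adapter(feature, w1, b1, w2, b2):
    # Two-layer MLP (ReLU is an assumed activation) mapping one vision
    # feature vector into the language model's embedding dimension.
    hidden = [max(0.0, h) for h in linear(feature, w1, b1)]
    return linear(hidden, w2, b2)

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def build_sequence(tokens, text_embed, image_features, adapter):
    # Replace each "<image>" placeholder with that image's adapted patch
    # embeddings; every other token uses its ordinary text embedding.
    images = iter(image_features)
    seq = []
    for tok in tokens:
        if tok == "<image>":
            seq.extend(adapter(f) for f in next(images))
        else:
            seq.append(text_embed[tok])
    return seq

rng = random.Random(0)
D_VISION, D_HIDDEN, D_MODEL = 6, 8, 4  # hypothetical toy sizes

w1, b1 = rand_matrix(D_HIDDEN, D_VISION, rng), [0.0] * D_HIDDEN
w2, b2 = rand_matrix(D_MODEL, D_HIDDEN, rng), [0.0] * D_MODEL
adapter = lambda f: mlp_adapter(f, w1, b1, w2, b2)

prompt = ["<image>", "what", "is", "shown", "?"]
text_embed = {t: [rng.uniform(-1, 1) for _ in range(D_MODEL)]
              for t in prompt if t != "<image>"}
# One image represented by three patch feature vectors from the vision encoder.
image_features = [[[rng.uniform(-1, 1) for _ in range(D_VISION)] for _ in range(3)]]

sequence = build_sequence(prompt, text_embed, image_features, adapter)
print(len(sequence), len(sequence[0]))  # 7 positions, each of width D_MODEL
```

Multi-image prompts work the same way in this sketch: each additional `<image>` placeholder consumes the next entry of `image_features`.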

Zyphra pairs each Zamba2 backbone with the Vision Transformer from Qwen2.5-VL. This encoder was selected for two key reasons: it employs 2D rotary position embeddings and supports native dynamic-resolution processing. A two-layer MLP adapter bridges the encoder and the backbone.
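Dynamic resolution means the number of visual tokens grows with image size instead of being fixed. The sketch below estimates that count under stated assumptions: 14-pixel patches and a 2x2 patch merge, as in the publicly documented Qwen2.5-VL vision configuration. Whether Zamba2-VL keeps these exact settings is not stated here, and the function name `visual_token_count` is hypothetical. Minimum and maximum pixel budgets are omitted for brevity.

```python
def visual_token_count(height, width, patch=14, merge=2):
    # Each visual token covers a (patch * merge) square of pixels.
    # These defaults are assumptions, not confirmed Zamba2-VL settings.
    unit = patch * merge
    # Snap each side to the nearest multiple of the unit, with at least one unit.
    h = max(unit, round(height / unit) * unit)
    w = max(unit, round(width / unit) * unit)
    return (h // unit) * (w // unit)

for size in [(224, 224), (448, 448), (896, 1344)]:
    print(size, visual_token_count(*size))
```

Under these assumptions, doubling both sides of an image quadruples the number of visual tokens the language backbone has to process during prefill.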

The Architecture

The Zamba2 backbone is where the design departs from conventional VLMs. It combines Mamba2 state-space layers with shared Transformer blocks. The Mamba2 layers run in time linear in sequence length and carry a fixed-size state, while a small number of shared attention blocks are interspersed between them. The shared blocks reuse their weights across depth, and each point in the stack where a shared block is applied has its own LoRA adapter, so the shared weights can specialize by depth at low parameter cost.
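A rough sketch of this layout, and of why weight sharing with LoRA is cheap, follows. All specifics are hypothetical: the layer counts, the spacing of attention blocks, the model width, and the LoRA rank are illustrative values, not Zamba2's actual configuration, and the parameter count covers only the four attention projections.

```python
def hybrid_schedule(n_mamba, attn_every, n_shared=1):
    # Interleave Mamba2 layers with invocations of a shared attention block.
    # Each invocation reuses shared weights but gets its own LoRA adapter.
    layers, invocation = [], 0
    for i in range(n_mamba):
        layers.append(f"mamba2[{i}]")
        if (i + 1) % attn_every == 0:
            layers.append(f"shared_attn[{invocation % n_shared}]+lora[{invocation}]")
            invocation += 1
    return layers

def attention_param_counts(d_model, rank, invocations, n_shared=1):
    # Count only the Q, K, V, and output projections (a simplification).
    full_block = 4 * d_model * d_model
    # LoRA adds two low-rank matrices (d_model x rank) per projection.
    lora_per_invocation = 4 * 2 * d_model * rank
    shared_total = n_shared * full_block + invocations * lora_per_invocation
    unshared_total = invocations * full_block
    return shared_total, unshared_total

print(hybrid_schedule(n_mamba=6, attn_every=3))
shared, unshared = attention_param_counts(d_model=1024, rank=64, invocations=6)
print(shared, unshared)
```

With these illustrative numbers, six invocations of one shared block plus adapters cost about 7.3M attention parameters, against about 25.2M for six independent blocks.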

The Mamba2 layers handle the majority of computation efficiently, while the shared attention layers retain in-context retrieval capabilities that pure-SSM models sacrifice. This hybrid approach balances the expressiveness of full attention against the efficiency of state-space models.
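The efficiency argument comes down to what each layer type must remember. The toy code below contrasts a linear recurrence, whose state stays the same size however long the input is, with an attention key-value cache, which grows with every token. It is a deliberately simplified sketch: the scalar, time-invariant recurrence stands in for Mamba2's input-dependent, much larger state, and the function names and layer sizes are hypothetical.

```python
def ssm_scan(xs, a=0.9, b=0.1):
    # Linear recurrence h_t = a * h_{t-1} + b * x_t.
    # One pass over the input, one scalar of state regardless of len(xs).
    h, ys = 0.0, []
    for x in xs:
        h = a * h + b * x
        ys.append(h)
    return ys

def attention_cache_entries(seq_len, n_attn_layers, n_kv_heads, head_dim):
    # Keys and values (factor of 2) stored for every token in every attention layer.
    return 2 * seq_len * n_attn_layers * n_kv_heads * head_dim

print(ssm_scan([1.0, 1.0, 1.0]))
for seq_len in (1_000, 2_000, 4_000):
    print(seq_len, attention_cache_entries(seq_len, n_attn_layers=6,
                                           n_kv_heads=8, head_dim=64))
```

In this sketch, a hybrid that calls attention in only a few places keeps the growing cache small, while the recurrent layers add no per-token memory at all.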

Zamba2-VL uses the Mistral v0.1 tokenizer and was trained on 100 billion tokens of vision-text and text-only data, all sourced from publicly available web datasets.

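Performance and Evaluation

The table below compares Zamba2-VL-2.7B with open VLMs in the 2B to 4B range on document, chart, OCR, counting, and reasoning benchmarks.

Benchmark	Zamba2-VL-2.7B	InternVL3.5-2B	Qwen3-VL-2B	Molmo2-4B	Qwen3-VL-4B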
DocVQA (test)	90.9	89.4	93.3	87.8	95.3
ChartQA (test)	79.6	81.6	78.7	86.1	81.8
OCRBench	73.6	83.4	84.1	62.0	84.1
CountBenchQA	87.5	70.0	87.9	91.2	87.3
PixMoCount (test)	82.5	32.8	55.7	87.0	89.2
MMMU (val)	37.7	49.9	40.9	48.8	51.4
MathVista (mini)	51.0	61.4	51.8	56.5	63.6
