Meet North Mini Code: Cohere’s Compact 3B-Active MoE Powers a Stellar Agentic Coding Model

This week, the Cohere AI team launched its first coding model designed for developers, called ‘North Mini Code’. This open-weight model is built specifically for software engineers. It uses a mixture-of-experts (MoE) architecture with 30 billion total parameters, though only 3 billion are activated for each token.

The launch centers on the concept of “sovereign” AI. The core idea is straightforward: operate powerful models according to your own requirements. Compact, efficient coding models enable teams to host them independently without needing massive GPU clusters. North Mini Code is designed to fill this exact need.

North Mini Code

North Mini Code is a model with 30 billion total parameters and 3 billion active ones (A3B). Cohere tailored it for three primary functions: generating code, agentic software engineering, and terminal-based tasks. It accepts text input and produces text output only, with no support for images or video.

It supports a context window of 256,000 tokens and can generate up to 64,000 tokens in a single output. Cohere specifies a minimum hardware requirement of one H100 GPU running at FP8 precision. The model weights are available under the Apache 2.0 license on Hugging Face. You can also access it via the Cohere API, Model Vault, and OpenRouter.

Field	North-Mini-Code-1.0
License	Apache 2.0
Model size	30B total; 3B active
Context length	256K total; 64K max generation
Optimized for	Code generation, agentic software engineering, terminal tasks
Availability	Hugging Face, Cohere API, Cohere Model Vault, OpenRouter
Hardware (minimum)	1× H100 @ FP8
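The single-GPU requirement is easier to see with some rough arithmetic. The sketch below estimates weight storage only, assuming 1 byte per parameter at FP8, 2 bytes per parameter at BF16, and an 80 GB H100; these byte counts and the GPU size are assumptions made for illustration, not figures from Cohere. It ignores the KV cache and activations, which also consume memory at long context lengths.

```python
# Back-of-the-envelope weight memory for a 30B-parameter model.
# Assumptions (not from Cohere): 1 byte/param for FP8, 2 bytes/param for BF16,
# and an 80 GB H100. KV cache and activations are deliberately ignored.

TOTAL_PARAMS = 30e9
H100_MEMORY_GB = 80  # assumed 80 GB variant

BYTES_PER_PARAM = {"fp8": 1, "bf16": 2}


def weight_memory_gb(total_params: float, precision: str) -> float:
    """Return approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return total_params * BYTES_PER_PARAM[precision] / 1e9


for precision in ("fp8", "bf16"):
    gb = weight_memory_gb(TOTAL_PARAMS, precision)
    headroom = H100_MEMORY_GB - gb
    print(f"{precision}: ~{gb:.0f} GB of weights, ~{headroom:.0f} GB left on one H100")
```

Note that all 30B parameters have to be resident in memory even though only 3B are active per token, so the total parameter count, not the active count, drives this estimate.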

The Architecture

North Mini Code is a decoder-only Transformer that incorporates sparse MoE layers. Its attention layers interleave sliding-window attention and global attention in a 3:1 ratio. The sliding-window layers use RoPE for positional encoding, while the global layers use no positional embeddings at all. Each MoE feed-forward block contains 128 experts, 8 of which are activated per token, and each expert is a feed-forward network with SwiGLU activation.
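To make the 3:1 attention pattern concrete, here is a minimal sketch of how such a layer schedule and its visibility rules could look. The layer count, the window size, and the exact ordering (three sliding-window layers followed by one global layer) are hypothetical choices for illustration; the article does not state them.

```python
# Sketch of a 3:1 interleaving of sliding-window and global attention layers.
# Hypothetical values: the layer count, window size, and ordering are not
# stated in the article.

def layer_pattern(num_layers: int, sliding_per_global: int = 3) -> list[str]:
    """Every (sliding_per_global + 1)-th layer is global; the rest are sliding-window."""
    period = sliding_per_global + 1
    return ["global" if (i + 1) % period == 0 else "sliding" for i in range(num_layers)]


def can_attend(query_pos: int, key_pos: int, kind: str, window: int) -> bool:
    """Causal visibility rule for one query/key pair."""
    if key_pos > query_pos:               # causal: never look at future tokens
        return False
    if kind == "global":                  # global layers see the whole prefix
        return True
    return query_pos - key_pos < window   # sliding layers see only the last `window` tokens


pattern = layer_pattern(num_layers=8)
print(pattern)
print([k for k in range(6) if can_attend(5, k, "sliding", window=3)])
print([k for k in range(6) if can_attend(5, k, "global", window=3)])
```

The practical effect is that most layers only pay for a fixed-size window, while the occasional global layer lets information travel across the full context.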

The routing mechanism applies a sigmoid function before performing top-k selection. A single dense layer precedes the sparse layers. This combination keeps the active computation low while expanding the overall model capacity. Cohere released the weights in BF16 format.
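The routing step can be sketched in a few lines. The example below applies a sigmoid to each expert's router logit and then keeps the top 8 of 128, matching the description above. The random logits stand in for a learned router, and renormalizing the selected scores so they sum to 1 is an assumption; the article does not say how the weights are scaled.

```python
import math
import random

NUM_EXPERTS = 128   # from the article
TOP_K = 8           # from the article


def route_token(router_logits: list[float], top_k: int = TOP_K) -> dict[int, float]:
    """Sigmoid-then-top-k routing for one token.

    Returns {expert_index: weight}. Renormalizing the selected scores so they
    sum to 1 is an assumption made for this sketch.
    """
    # Independent sigmoid score per expert (no softmax competition across experts).
    scores = [1.0 / (1.0 + math.exp(-x)) for x in router_logits]
    chosen = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    total = sum(scores[i] for i in chosen)
    return {i: scores[i] / total for i in chosen}


rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]  # stand-in for a learned router
weights = route_token(logits)

print(sorted(weights))                                  # indices of the 8 active experts
print(f"active fraction: {TOP_K / NUM_EXPERTS:.2%}")    # 8 of 128 experts
```

With 8 of 128 experts selected, only 6.25% of the experts in each MoE block run for a given token, which is where the gap between total and active parameters comes from.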

Post-training ran in two phases: a two-stage cascaded supervised fine-tuning (SFT) phase, followed by reinforcement learning with verifiable rewards (RLVR). Post-training emphasized agentic coding capabilities. The model also supports interleaved thinking and native tool use.

Benchmarks

Cohere reports a score of 33.4 on the Artificial Analysis Coding Index, describing this as a competitive standing among models of similar size. The company tested on SWE-Bench Verified, SWE-Bench Pro, and Terminal-Bench v2, along with Terminal-Bench Hard, SciCode, and LiveCodeBench v6.

Cohere documents its evaluation setup in detail. SWE-Bench used the SWE-agent harness v1.1.0. Terminal-Bench v2 used a basic ReAct harness with a single terminal tool. Terminal-Bench Hard used the Terminus-2 harness. Each benchmark was run with three different seeds and the results were averaged. Sampling was configured with temperature 1.0 and top_p 0.95.
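For readers unfamiliar with these sampling settings, the sketch below shows what temperature 1.0 and top_p 0.95 do to a next-token distribution. It is a simplified illustration of nucleus (top-p) filtering, not the code of any particular inference library, and the toy logits are made-up values; real implementations can differ in how they handle the boundary token.

```python
import math


def top_p_filter(logits: dict[str, float], temperature: float = 1.0,
                 top_p: float = 0.95) -> dict[str, float]:
    """Return the renormalized nucleus: the smallest set of top tokens whose
    cumulative probability reaches top_p."""
    # Temperature-scaled softmax (subtract the max for numerical stability).
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    norm = sum(exps.values())
    probs = {tok: e / norm for tok, e in exps.items()}

    # Keep tokens in descending probability until the cumulative mass reaches top_p.
    kept, cumulative = {}, 0.0
    for tok, p in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept[tok] = p
        cumulative += p
        if cumulative >= top_p:
            break

    kept_mass = sum(kept.values())
    return {tok: p / kept_mass for tok, p in kept.items()}


# Toy logits for four candidate tokens (made-up values for illustration only).
toy_logits = {"def": 4.0, "class": 2.0, "import": 1.0, "lambda": -3.0}
nucleus = top_p_filter(toy_logits)
print(nucleus)
```

A temperature of 1.0 leaves the model's distribution unscaled, so in this setup the only trimming comes from top_p, which discards the low-probability tail before sampling.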

The Speed

In Cohere’s internal benchmarks against Devstral Small 2, North Mini Code achieved up to 2.8 times higher output throughput under identical concurrency and hardware conditions. It also showed a 30% improvement in inter-token latency. Time-to-first-token was comparable between the two models, with Devstral Small 2 holding a slight advantage.

Metric	North Mini Code vs Devstral Small 2
Output throughput	Up to 2.8x higher (same concurrency and hardware)
Inter-token latency	30% better for North Mini Code
Time-to-first-token	Slightly behind Devstral Small 2
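The two headline ratios measure different things: throughput is aggregate tokens per second across concurrent requests, while inter-token latency is what a single user feels while a reply streams. The arithmetic below illustrates the difference. The baseline numbers are hypothetical placeholders rather than measurements, and reading "30% better" as 30% lower latency is an assumption.

```python
# Worked arithmetic for the reported ratios. The baseline numbers are
# hypothetical placeholders, not measurements of either model.

BASELINE_ITL_MS = 20.0          # hypothetical inter-token latency for the baseline
BASELINE_THROUGHPUT_TPS = 1000  # hypothetical aggregate tokens/second for the baseline

ITL_IMPROVEMENT = 0.30          # "30% better", read here as 30% lower latency (assumption)
THROUGHPUT_RATIO = 2.8          # "up to 2.8x", i.e. a best case


def decode_seconds(num_tokens: int, itl_ms: float) -> float:
    """Time to stream num_tokens for a single request, ignoring time-to-first-token."""
    return num_tokens * itl_ms / 1000.0


improved_itl_ms = BASELINE_ITL_MS * (1.0 - ITL_IMPROVEMENT)
print(f"per-request ITL: {BASELINE_ITL_MS:.1f} ms -> {improved_itl_ms:.1f} ms")
print(f"1,024-token reply: {decode_seconds(1024, BASELINE_ITL_MS):.2f} s -> "
      f"{decode_seconds(1024, improved_itl_ms):.2f} s")
print(f"aggregate throughput: {BASELINE_THROUGHPUT_TPS} -> "
      f"{BASELINE_THROUGHPUT_TPS * THROUGHPUT_RATIO:.0f} tokens/s (best case)")
```

Swap in your own measured baseline to see what the ratios would mean for your deployment.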

Use Cases With Examples

Cohere designed North Mini Code for agentic workflows.

Three key patterns emerge from its own positioning:

Sub-agent orchestration: A primary agent assigns subtasks to supporting agents. For instance, one agent writes unit tests while another resolves failing code.
Systems architecture mapping: The model analyzes a repository and outlines its structure. For example, mapping how services interact with each other before undertaking a major refactor.
Code reviews: The model examines a diff for potential issues. For example, identifying an unguarded null dereference before merging code.

Terminal tasks are also well-suited to the model. For example: listing files, executing a build, and then parsing the output for errors.
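The terminal loop described above (run a command, then parse its output for errors) can be sketched as the kind of tool an agent harness might expose. This is a minimal illustration under stated assumptions: the function names, the error pattern, and the fake build log are all hypothetical, and the build is simulated with a short Python one-liner so the example runs anywhere.

```python
import re
import subprocess
import sys

# Hypothetical pattern for "looks like an error"; real build tools vary.
ERROR_PATTERN = re.compile(r"\b(error|failed|fatal)\b", re.IGNORECASE)


def run_command(argv: list[str]) -> str:
    """Run a command and return stdout and stderr combined as text."""
    result = subprocess.run(argv, capture_output=True, text=True, check=False)
    return result.stdout + result.stderr


def find_error_lines(output: str) -> list[str]:
    """Return the lines of build output that look like errors."""
    return [line for line in output.splitlines() if ERROR_PATTERN.search(line)]


# Stand-in for a real build command such as `make`; it prints a fake log so the
# example runs anywhere Python is installed.
fake_build = [
    sys.executable, "-c",
    "print('compiling main.c'); print('main.c:12: error: unknown type name'); print('done')",
]
log = run_command(fake_build)
print(find_error_lines(log))
```

In an agentic setup, the model would decide which command to run and what to do with the extracted error lines; the harness only executes and reports.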

Getting Started

The quickest way to begin is with Hugging Face Transformers. Install Transformers from source for this particular model. The recommended sampling settings are temperature 1.0 and top_p 0.95.

# Install Transformers from source (required for this model):
# pip install "git+
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "CohereLabs/North-Mini-Code-1.0"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Write a python program to check if a string is a palindrome or not."
messages = [{"role": "user", "content": prompt}]

# return_dict=True yields a dict (input_ids + attention_mask) so **inputs unpacks cleanly
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

gen_tokens = model.generate(
    **inputs,
    max_new_tokens=1024,
    do_sample=True,
    temperature=1.0,
    top_p=0.95,
)

# Decode only the newly generated tokens, not the prompt
output = tokenizer.decode(gen_tokens[0][inputs["input_ids"].shape[-1]:])
print(output)

The model can also be served with vLLM. You will need the latest vLLM main branch along with Cohere’s melody library (cohere_melody), which is needed to parse responses correctly.

uv pip install "git+
uv pip install "cohere_melody>=0.9.0"

vllm serve CohereLabs/North-Mini-Code-1.0 \
  -tp 2 \
  --max-model-len 320000 \
  --tool-call-parser cohere_command4 \
  --reasoning-parser cohere_command4 \
  --enable-auto-tool-choice
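Once the server is up, vLLM exposes an OpenAI-compatible HTTP API. The sketch below builds a chat completion request that includes one tool definition, using only the Python standard library. It assumes the server runs locally on vLLM's default port 8000; the `run_terminal` tool and the helper function names are hypothetical and exist only for this example.

```python
import json
import urllib.request

# Assumptions: the vLLM server runs locally on the default port 8000, and
# `run_terminal` is a hypothetical tool name chosen for this example.
BASE_URL = "http://localhost:8000/v1"
MODEL_ID = "CohereLabs/North-Mini-Code-1.0"


def build_chat_payload(prompt: str) -> dict:
    """Build an OpenAI-style chat completion request with one terminal tool."""
    return {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 1.0,   # recommended sampling settings from the article
        "top_p": 0.95,
        "tools": [{
            "type": "function",
            "function": {
                "name": "run_terminal",  # hypothetical tool
                "description": "Run a shell command and return its output.",
                "parameters": {
                    "type": "object",
                    "properties": {"command": {"type": "string"}},
                    "required": ["command"],
                },
            },
        }],
        "tool_choice": "auto",
    }


def send_chat_request(payload: dict) -> dict:
    """POST the payload to the chat completions endpoint and decode the JSON reply."""
    request = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


payload = build_chat_payload("List the files in the repository and run the build.")
print(json.dumps(payload, indent=2))
# With a server running: reply = send_chat_request(payload)
```

The network call is left commented out so the snippet runs without a server; the tool-call and reasoning parsers configured on the server side determine how tool calls and thinking content appear in the reply.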

Quantized versions are available for Ollama, LM Studio, and llama.cpp. You can also test the model before downloading it. Cohere provides free access through OpenCode and a hosted Hugging Face Space.

Key Takeaways

North Mini Code: Key Facts

Cohere’s debut coding model, North Mini Code, is a 30-billion-parameter mixture-of-experts system that engages only 3 billion parameters for each token.
It operates on a single H100 GPU using FP8 precision, supports a 256K context window, and can generate up to 64K tokens of output.
The model weights are released under the Apache 2.0 license, though the Hugging Face repository includes an additional non-commercial restriction.
According to Cohere’s official announcement, the model scores 33.4 on the Artificial Analysis Coding Index and delivers up to 2.8 times the throughput of Devstral Small 2.
It is designed for agentic coding workflows, including sub-agent coordination, system architecture analysis, and code reviews, and it has built-in tool use.


The open weights were released on June 9, 2026.


License note: Cohere’s blog lists Apache 2.0. The Hugging Face model card includes an acceptable-use addendum and a non-commercial clause. Review both documents before deploying in production.


All performance figures reported here come from Cohere; benchmarking the model on your own workload is still essential.

