This week, the Cohere AI team launched its first coding model designed for developers, called ‘North Mini Code’. This open-weight model is built specifically for software engineers. It uses a mixture-of-experts (MoE) architecture with 30 billion total parameters, though only 3 billion are activated for each token.
The launch centers on the concept of “sovereign” AI. The core idea is straightforward: operate powerful models according to your own requirements. Compact, efficient coding models enable teams to host them independently without needing massive GPU clusters. North Mini Code is designed to fill this exact need.
North Mini Code
North Mini Code is a model with 30 billion total parameters and 3 billion active ones (A3B). Cohere tailored it for three primary functions: generating code, agentic software engineering, and terminal-based tasks. It accepts text input and produces text output only, with no support for images or video.
It supports a context window of 256,000 tokens and can generate up to 64,000 tokens in a single output. Cohere specifies a minimum hardware requirement of one H100 GPU running at FP8 precision. The model weights are available under the Apache 2.0 license on Hugging Face. You can also access it via the Cohere API, Model Vault, and OpenRouter.
| Field | North-Mini-Code-1.0 |
|---|---|
| License | Apache 2.0 |
| Model size | 30B total; 3B active |
| Context length | 256K total; 64K max generation |
| Optimized for | Code generation, agentic software engineering, terminal tasks |
| Availability | Hugging Face, Cohere API, Cohere Model Vault, OpenRouter |
| Hardware (minimum) | 1× H100 @ FP8 |
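The single-GPU figure can be sanity-checked with rough arithmetic on weight storage alone. The sketch below assumes the 80 GB H100 variant, 1 byte per parameter at FP8, and 2 bytes at BF16. It ignores the KV cache, activations, and runtime overhead, so treat the numbers as a lower bound rather than a sizing guide.

```python
# Rough weight-memory estimate; ignores KV cache, activations, and runtime overhead.
TOTAL_PARAMS = 30e9                      # 30B total parameters (from the spec table)
BYTES_PER_PARAM = {"bf16": 2, "fp8": 1}  # storage cost per parameter
H100_MEMORY_GB = 80                      # assumption: the 80 GB H100 variant


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Return approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


for precision in ("bf16", "fp8"):
    needed = weight_memory_gb(TOTAL_PARAMS, precision)
    headroom = H100_MEMORY_GB - needed
    print(f"{precision}: ~{needed:.0f} GB of weights, ~{headroom:.0f} GB left over")
```

All 30B parameters must be resident even though only 3B are active per token, which is why the total count, not the active count, drives the memory estimate.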
The Architecture
North Mini Code is a decoder-only Transformer with sparse MoE layers. Its attention layers interleave sliding-window and global attention in a 3:1 ratio: sliding-window layers use RoPE for positional encoding, while global-attention layers use no positional embeddings at all. The feed-forward block contains 128 experts, of which 8 are activated per token. Each expert is a feed-forward network with SwiGLU activation.
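One way to picture the 3:1 attention pattern is as a repeating layer schedule. The sketch below is illustrative only: it assumes three sliding-window layers are followed by one global layer, and it uses an arbitrary depth of 8. The article states neither the ordering within a group nor the real layer count, and the labels `sliding_rope` and `global_nope` are made up here.

```python
from collections import Counter


def attention_layout(num_layers: int, local_per_global: int = 3) -> list[str]:
    """Repeat `local_per_global` sliding-window layers followed by one global layer."""
    period = local_per_global + 1
    return [
        "global_nope" if (i + 1) % period == 0 else "sliding_rope"
        for i in range(num_layers)
    ]


layout = attention_layout(num_layers=8)  # 8 is an arbitrary depth for illustration
print(layout)
print(Counter(layout))
```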
The routing mechanism applies a sigmoid function before performing top-k selection. A single dense layer precedes the sparse layers. This combination keeps the active computation low while expanding the overall model capacity. Cohere released the weights in BF16 format.
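The routing described above can be sketched in plain Python. This is a toy illustration, not Cohere's implementation: the tiny dimensions, the random weights, and the renormalization of the selected expert weights are all assumptions, and the dense layer that precedes the sparse layers is not modeled. Only the 128-expert count, the top-8 selection, the sigmoid-before-top-k order, and the SwiGLU expert shape come from the article.

```python
import math
import random

NUM_EXPERTS = 128      # from the article
TOP_K = 8              # experts activated per token
D_MODEL, D_FF = 4, 8   # toy sizes, chosen only to keep the example fast

rng = random.Random(0)


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def matvec(w: list[list[float]], x: list[float]) -> list[float]:
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def rand_matrix(rows: int, cols: int) -> list[list[float]]:
    return [[rng.gauss(0, 0.1) for _ in range(cols)] for _ in range(rows)]


def route(router_logits: list[float], k: int = TOP_K) -> dict[int, float]:
    """Sigmoid-score every expert, keep the top-k, then normalize their weights."""
    scores = [sigmoid(z) for z in router_logits]
    top = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:k]
    total = sum(scores[i] for i in top)
    return {i: scores[i] / total for i in top}  # normalization is an assumption


def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward: down(silu(gate(x)) * up(x))."""
    gate = [g * sigmoid(g) for g in matvec(w_gate, x)]  # SiLU(g) = g * sigmoid(g)
    up = matvec(w_up, x)
    return matvec(w_down, [g * u for g, u in zip(gate, up)])


experts = [
    (rand_matrix(D_FF, D_MODEL), rand_matrix(D_FF, D_MODEL), rand_matrix(D_MODEL, D_FF))
    for _ in range(NUM_EXPERTS)
]
router_w = rand_matrix(NUM_EXPERTS, D_MODEL)


def moe_forward(x: list[float]) -> tuple[list[float], list[int]]:
    """Run only the selected experts and mix their outputs by router weight."""
    weights = route(matvec(router_w, x))
    out = [0.0] * D_MODEL
    for idx, w in weights.items():
        y = swiglu_ffn(x, *experts[idx])
        out = [o + w * yi for o, yi in zip(out, y)]
    return out, sorted(weights)


token = [rng.gauss(0, 1) for _ in range(D_MODEL)]
output, active = moe_forward(token)
print(f"active experts: {active}")
print(f"fraction of experts run: {TOP_K / NUM_EXPERTS:.2%}")
```

The loop in `moe_forward` touches 8 of the 128 expert networks per token, which is the source of the low active compute.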
Post-training ran in two phases: two-stage cascaded supervised fine-tuning (SFT), followed by reinforcement learning with verifiable rewards (RLVR). Post-training emphasized agentic coding capabilities. The model also supports interleaved thinking and native tool use.
Benchmarks
Cohere reports a score of 33.4 on the Artificial Analysis Coding Index, describing this as a competitive standing among models of similar size. The company tested on SWE-Bench Verified, SWE-Bench Pro, and Terminal-Bench v2, along with Terminal-Bench Hard, SciCode, and LiveCodeBench v6.
Cohere documents its evaluation setup. SWE-Bench used the SWE-agent harness v1.1.0. Terminal-Bench v2 used a basic ReAct harness with a single terminal tool. Terminal-Bench Hard used the Terminus-2 harness. Each benchmark was run with three different seeds and the results were averaged. Sampling was configured with temperature 1.0 and top_p 0.95.
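For readers unfamiliar with the sampling settings, the sketch below shows what temperature 1.0 and top_p 0.95 mean in a generic nucleus-sampling routine, repeated over three seeds and averaged. It is a textbook formulation, not the harness code Cohere used, and the logits are made-up values for illustration.

```python
import math
import random
import statistics


def sample_top_p(logits, temperature=1.0, top_p=0.95, rng=random):
    """Sample a token index: temperature-scaled softmax, then nucleus (top-p) filtering."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)                        # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the smallest set of highest-probability tokens whose mass reaches top_p.
    order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    nucleus, mass = [], 0.0
    for idx in order:
        nucleus.append(idx)
        mass += probs[idx]
        if mass >= top_p:
            break

    # random.choices renormalizes the weights, so this draws within the nucleus only.
    return rng.choices(nucleus, weights=[probs[i] for i in nucleus], k=1)[0]


toy_logits = [4.0, 3.0, 1.0, -2.0, -6.0]      # made-up values, not model output
per_seed = []
for seed in (0, 1, 2):                        # three seeds, averaged afterwards
    seeded = random.Random(seed)
    draws = [sample_top_p(toy_logits, rng=seeded) for _ in range(1000)]
    per_seed.append(draws.count(0) / len(draws))

print("share of token 0 per seed:", per_seed)
print("mean over seeds:", statistics.mean(per_seed))
```

With these toy logits the two most likely tokens already cover more than 95% of the probability mass, so the remaining three are never sampled.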
The Speed
In Cohere’s internal benchmarks, North Mini Code achieved up to 2.8x higher output throughput than Devstral Small 2 under identical concurrency and hardware conditions. It also demonstrated a 30% improvement in inter-token latency. Time-to-first-token was comparable between the two models, with Devstral Small 2 maintaining a slight advantage in TTFT.
| Metric | North Mini Code vs Devstral Small 2 |
|---|---|
| Output throughput | Up to 2.8x higher (same concurrency and hardware) |
| Inter-token latency | 30% better for North Mini Code |
| Time-to-first-token | Comparable; Devstral Small 2 slightly ahead |
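The three metrics in the table have commonly used definitions that can be computed from per-token arrival times. The sketch below uses those common definitions for a single request. Whether Cohere measured them exactly this way is an assumption, and the timestamps are synthetic placeholders, not benchmark data.

```python
def latency_metrics(request_start: float, token_times: list[float]) -> dict[str, float]:
    """Compute TTFT, mean inter-token latency, and output throughput for one request."""
    if len(token_times) < 2:
        raise ValueError("need at least two tokens to measure inter-token latency")
    ttft = token_times[0] - request_start
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    itl = sum(gaps) / len(gaps)
    throughput = len(token_times) / (token_times[-1] - request_start)
    return {"ttft_s": ttft, "itl_s": itl, "tokens_per_s": throughput}


# Synthetic arrival times in seconds, for illustration only.
metrics = latency_metrics(request_start=0.0, token_times=[0.5, 1.0, 1.5, 2.0])
print(metrics)
```

The reported throughput figure is an aggregate under concurrency, so a real comparison would sum tokens across all in-flight requests rather than measure one request in isolation.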
Use Cases With Examples
Cohere designed North Mini Code for agentic workflows.
Three key patterns emerge from its own positioning:
- Sub-agent orchestration: A primary agent assigns subtasks to supporting agents. For instance, one agent writes unit tests while another resolves failing code.
- Systems architecture mapping: The model analyzes a repository and outlines its structure. For example, mapping how services interact with each other before undertaking a major refactor.
- Code reviews: The model examines a diff for potential issues. For example, identifying an unguarded null dereference before merging code.
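The sub-agent pattern in the first bullet can be sketched as a primary routine that fans subtasks out to worker functions. Everything here is hypothetical: the agent names, the task strings, and the placeholder return values stand in for real model calls.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

Agent = Callable[[str], str]


def write_tests_agent(task: str) -> str:
    # Placeholder: a real sub-agent would call the model here.
    return f"[tests] drafted unit tests for: {task}"


def fix_code_agent(task: str) -> str:
    # Placeholder: a real sub-agent would call the model here.
    return f"[fix] proposed patch for: {task}"


def orchestrate(subtasks: list[tuple[Agent, str]]) -> list[str]:
    """Run each (agent, task) pair concurrently; results keep submission order."""
    with ThreadPoolExecutor(max_workers=max(1, len(subtasks))) as pool:
        futures = [pool.submit(agent, task) for agent, task in subtasks]
        return [future.result() for future in futures]


results = orchestrate([
    (write_tests_agent, "parse_config()"),
    (fix_code_agent, "failing test_parse_config_empty"),
])
for line in results:
    print(line)
```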
Terminal tasks are also well-suited to the model. For example: listing files, executing a build, and then parsing the output for errors.
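A terminal step of this kind, running a command and scanning its output for errors, might look like the following on the tool side. This is a minimal sketch: the function names and the error regex are assumptions, and a one-line Python command stands in for a real build.

```python
import re
import subprocess
import sys

# Assumption: a simple keyword match is enough to flag suspicious lines.
ERROR_PATTERN = re.compile(r"\b(error|failed|traceback)\b", re.IGNORECASE)


def run_command(argv: list[str], timeout: float = 60.0) -> tuple[int, str]:
    """Run a command without a shell; return (exit_code, combined stdout and stderr)."""
    proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    return proc.returncode, proc.stdout + proc.stderr


def find_error_lines(output: str) -> list[str]:
    """Return the output lines that look like errors."""
    return [line for line in output.splitlines() if ERROR_PATTERN.search(line)]


if __name__ == "__main__":
    # Stand-in for a real build command.
    fake_build = [
        sys.executable,
        "-c",
        "print('compiling main.c'); print('main.c:12: error: x undeclared')",
    ]
    exit_code, output = run_command(fake_build)
    print(exit_code, find_error_lines(output))
```

Passing the command as an argument list rather than a shell string avoids shell injection when the command text comes from a model.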
Getting Started
The quickest way to begin is with Hugging Face Transformers. Install Transformers from source for this particular model. The recommended sampling settings are temperature 1.0 and top_p 0.95.
# Install Transformers from source (required for this model):
# pip install "git+
from transformers import AutoTokenizer, AutoModelForCausalLM
model_id = "CohereLabs/North-Mini-Code-1.0"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
prompt = "Write a python program to check if a string is a palindrome or not."
messages = [{"role": "user", "content": prompt}]
# return_dict=True yields a dict (input_ids + attention_mask) so **inputs unpacks cleanly
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)
gen_tokens = model.generate(
    **inputs,
    max_new_tokens=1024,
    do_sample=True,
    temperature=1.0,
    top_p=0.95,
)
# Decode only the newly generated tokens, not the prompt
output = tokenizer.decode(gen_tokens[0][inputs["input_ids"].shape[-1]:])
print(output)
To serve the model, you can use vLLM. You will need the latest vLLM main branch along with Cohere’s melody library, which is essential for accurate response parsing.
uv pip install "git+
uv pip install "cohere_melody>=0.9.0"
vllm serve CohereLabs/North-Mini-Code-1.0 \
  -tp 2 \
  --max-model-len 320000 \
  --tool-call-parser cohere_command4 \
  --reasoning-parser cohere_command4 \
  --enable-auto-tool-choice
Quantized versions are available for Ollama, LM Studio, and llama.cpp. You can also test the model before downloading it. Cohere provides free access through OpenCode and a hosted Hugging Face Space.
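Once a vLLM server is running, it exposes an OpenAI-compatible HTTP API, so a request with a tool definition could be sent using only the standard library. The sketch below assumes the default address `http://localhost:8000`, and the `run_terminal` tool is a hypothetical definition invented for illustration. It only builds and prints the payload; the commented usage at the bottom needs a live server.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"   # assumption: vLLM's default host and port
MODEL = "CohereLabs/North-Mini-Code-1.0"

# Hypothetical tool, written in the OpenAI function-calling schema.
TERMINAL_TOOL = {
    "type": "function",
    "function": {
        "name": "run_terminal",
        "description": "Run a shell command and return its output.",
        "parameters": {
            "type": "object",
            "properties": {"command": {"type": "string"}},
            "required": ["command"],
        },
    },
}


def build_payload(user_message: str) -> dict:
    """Build a chat-completions request using the recommended sampling settings."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": user_message}],
        "tools": [TERMINAL_TOOL],
        "tool_choice": "auto",
        "temperature": 1.0,
        "top_p": 0.95,
        "max_tokens": 1024,
    }


def chat(user_message: str) -> dict:
    """POST the payload to the server and return the decoded JSON response."""
    request = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_payload(user_message)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.load(response)


print(json.dumps(build_payload("List the files in the current directory."), indent=2))

# Usage (requires the server to be running):
#   reply = chat("List the files in the current directory.")
#   message = reply["choices"][0]["message"]
#   print(message.get("tool_calls") or message.get("content"))
```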
Key Takeaways
- Cohere’s debut coding model, North Mini Code, is a 30-billion-parameter mixture-of-experts system that engages only 3 billion parameters for each token.
- It operates on a single H100 GPU using FP8 precision, supports a 256K context window, and can generate up to 64K tokens of output.
- The model weights are released under the Apache 2.0 license, though the Hugging Face repository includes an additional non-commercial restriction.
- According to Cohere’s official announcement, the model scores 33.4 on the Artificial Analysis Coding Index and delivers up to 2.8 times the throughput of Devstral Small 2.
- It is designed for agentic coding workflows, including sub-agent coordination, system architecture analysis, and code reviews, with built-in tool use.
Open weights, released June 9, 2026. Accepts text input and produces text output.
License note: Cohere’s blog lists Apache 2.0. The Hugging Face model card includes an acceptable-use addendum and a non-commercial clause. Review both documents before deploying in production.
Reported performance
All figures come from Cohere. Running your own benchmarks on your specific workload is still essential.