Mistral AI has launched Mistral Small 4, a brand new model in the Mistral Small family designed to consolidate several previously separate capabilities into a single deployment target. The Mistral team describes Small 4 as its first model to combine the roles associated with Mistral Small for instruction following, Magistral for reasoning, Pixtral for multimodal understanding, and Devstral for agentic coding. The result is a single model that can operate as a general assistant, a reasoning model, and a multimodal system without requiring model switching across workflows.
Architecture: 128 Experts, Sparse Activation
Architecturally, Mistral Small 4 is a Mixture-of-Experts (MoE) model with 128 experts and 4 active experts per token. The model has 119B total parameters, with 6B active parameters per token, or 8B including embedding and output layers.
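To make the sparse-activation idea concrete, here is a minimal sketch of top-k expert routing: for each token, a router scores all 128 experts and only the 4 highest-scoring ones run. This is the generic MoE pattern; Mistral has not published Small 4's exact router, so the softmax-over-top-k normalization below is an assumption based on common MoE designs.

```python
import math

def top_k_route(logits, k=4):
    """Pick the top-k experts for one token and renormalize their gate
    weights with a softmax over just the selected logits (a common MoE
    convention; not Mistral's published implementation)."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top)
    exps = [math.exp(logits[i] - m) for i in top]
    z = sum(exps)
    return [(i, e / z) for i, e in zip(top, exps)]

# One token's router logits over 128 experts (toy values).
logits = [0.01 * ((i * 37) % 128) for i in range(128)]
routes = top_k_route(logits, k=4)
print(routes)  # 4 (expert_id, weight) pairs; weights sum to 1
```

Because only 4 of 128 experts execute per token, the compute cost tracks the 6B active parameters rather than the 119B total, which is the efficiency argument behind the design.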
Long Context and Multimodal Support
The model supports a 256k context window, which is a major leap for practical engineering use cases. Long-context capacity matters less as a marketing number and more as an operational simplifier: it reduces the need for aggressive chunking, retrieval orchestration, and context pruning in tasks such as long-document analysis, codebase exploration, multi-file reasoning, and agentic workflows. Mistral positions the model for general chat, coding, agentic tasks, and complex reasoning, with text and image inputs and text output. That places Small 4 in the increasingly important class of general-purpose models that are expected to handle both language-heavy and visually grounded enterprise tasks under one API surface.
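A quick back-of-envelope calculation shows why 256k tokens changes chunking economics. The characters-per-token and characters-per-line figures below are rough rules of thumb, not measured values for Mistral's tokenizer.

```python
# Sketch: how much source code fits in a 256k-token window.
CONTEXT_TOKENS = 256_000
CHARS_PER_TOKEN = 4       # rough rule of thumb for code, an assumption
AVG_CHARS_PER_LINE = 40   # assumed average line length for source code

budget_chars = CONTEXT_TOKENS * CHARS_PER_TOKEN
lines = budget_chars // AVG_CHARS_PER_LINE
print(f"~{budget_chars:,} chars, roughly {lines:,} lines of code")
```

Under these assumptions, a single request can hold on the order of a million characters, which is why many medium-sized codebases or long documents no longer need to be split across retrieval calls at all.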
Configurable Reasoning at Inference Time
A more significant product decision than the raw parameter count is the introduction of configurable reasoning effort. Small 4 exposes a per-request reasoning_effort parameter that allows developers to trade latency for deeper test-time reasoning. In the official documentation, reasoning_effort="none" is described as producing fast responses with a chat style equivalent to Mistral Small 3.2, while reasoning_effort="high" is intended for more deliberate, step-by-step reasoning with verbosity comparable to earlier Magistral models. This changes the deployment pattern. Instead of routing between one fast model and one reasoning model, dev teams can keep a single model in service and vary inference behavior at request time. That is cleaner from a systems perspective and easier to manage in products where only a subset of queries actually need expensive reasoning.
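A request-level sketch of that pattern might look like the following. The payload shape is modeled on an OpenAI-style chat-completions schema, and the model id is a placeholder; check Mistral's documentation for the exact model name, endpoint, and the full set of accepted reasoning_effort values (the article's docs excerpt mentions "none" and "high").

```python
import json

def build_request(prompt: str, effort: str) -> dict:
    """Build a chat payload with per-request reasoning effort.
    Model id and field layout are assumptions for illustration."""
    return {
        "model": "mistral-small-4",  # hypothetical model id
        "messages": [{"role": "user", "content": prompt}],
        # "none" ~= Small 3.2 chat style, "high" ~= Magistral-style
        # step-by-step reasoning, per the documentation described above.
        "reasoning_effort": effort,
    }

fast = build_request("Summarize this support ticket.", "none")
deep = build_request("Check whether this invariant holds.", "high")
print(json.dumps(fast, indent=2))
```

The point is that both requests go to the same deployed model; a product can decide per query, not per deployment, how much reasoning to pay for.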
Performance Claims and Throughput Positioning
The Mistral team also emphasizes inference efficiency. Small 4 delivers a 40% reduction in end-to-end completion time in a latency-optimized setup and 3x more requests per second in a throughput-optimized setup, both measured against Mistral Small 3. Mistral is not presenting Small 4 as merely a bigger reasoning model, but as a system aimed at improving the economics of deployment under real serving loads.
Benchmark Results and Output Efficiency
On reasoning benchmarks, Mistral's launch focuses on both quality and output efficiency. Mistral's evaluation team reports that Mistral Small 4 with reasoning matches or exceeds GPT-OSS 120B across AA LCR, LiveCodeBench, and AIME 2025, while producing shorter outputs. In the numbers published by Mistral, Small 4 scores 0.72 on AA LCR with 1.6K characters, while Qwen models require 5.8K to 6.1K characters for comparable performance. On LiveCodeBench, the Mistral team states that Small 4 outperforms GPT-OSS 120B while producing 20% less output. These are company-published results, but they highlight a more practical metric than benchmark score alone: performance per generated token. For production workloads, shorter outputs can directly reduce latency, inference cost, and downstream parsing overhead.
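The performance-per-token framing can be made explicit with the published AA LCR numbers. The Qwen score below is an assumption: the article says only "comparable performance", so it is set equal to Small 4's 0.72 for the sake of the ratio.

```python
# Company-published figures: AA LCR score and output length (thousands
# of characters). Qwen's score is assumed equal ("comparable") to
# Small 4's, which is the most conservative reading.
small4 = {"score": 0.72, "k_chars": 1.6}
qwen   = {"score": 0.72, "k_chars": 5.8}  # low end of the 5.8K-6.1K range

def score_per_k_chars(m):
    return m["score"] / m["k_chars"]

ratio = score_per_k_chars(small4) / score_per_k_chars(qwen)
print(f"Small 4 output-efficiency advantage: ~{ratio:.1f}x")
```

Under that assumption the advantage works out to roughly 3.6x at the low end of the Qwen range, which is the kind of multiplier that shows up directly in per-request cost and latency rather than only on a leaderboard.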

Deployment Details
For self-hosting, Mistral provides specific infrastructure guidance. The company lists a minimum deployment target of 4x NVIDIA HGX H100, 2x NVIDIA HGX H200, or 1x NVIDIA DGX B200, with larger configurations recommended for best performance. The model card on Hugging Face lists support across vLLM, llama.cpp, SGLang, and Transformers, though some paths are marked work in progress, and vLLM is the recommended option. The Mistral team also provides a custom Docker image and notes that fixes related to tool calling and reasoning parsing are still being upstreamed. That is useful detail for engineering teams because it clarifies that support exists, but some pieces are still stabilizing in the broader open-source serving stack.
Key Takeaways
- One unified model: Mistral Small 4 combines instruct, reasoning, multimodal, and agentic coding capabilities in a single model.
- Sparse MoE design: It uses 128 experts with 4 active experts per token, targeting better efficiency than dense models of comparable total size.
- Long-context support: The model supports a 256k context window and accepts text and image inputs with text output.
- Reasoning is configurable: Developers can adjust reasoning_effort at inference time instead of routing between separate fast and reasoning models.
- Open deployment focus: It is released under Apache 2.0 and supports serving through stacks such as vLLM, with multiple checkpoint variants on Hugging Face.
Check out the Model Card on Hugging Face and the technical details.



