StepFun today unveiled Step 3.7 Flash, a multimodal Mixture-of-Experts (MoE) model tailored for agentic workloads. It brings native vision input and more dependable tool usage compared to its predecessor, Step 3.5 Flash.
What Does Step 3.7 Flash Offer?
Step 3.7 Flash is a 198-billion-parameter sparse Mixture-of-Experts vision-language model. It combines a 196-billion-parameter language backbone with a 1.8-billion-parameter vision encoder (ViT) to natively interpret images.
During inference, the model activates roughly 11 billion parameters per token. In MoE designs, a router selects a small subset of "expert" sub-networks for each token rather than running the entire model, so per-token compute stays close to that of an 11-billion-parameter dense model while the model keeps the capacity of its 198 billion total parameters.
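To make the routing idea concrete, here is a minimal sketch of top-k expert selection for a single token. The expert count, the value of k, and the softmax-over-selected-logits weighting are illustrative assumptions; the article does not describe how Step 3.7 Flash actually routes tokens.

```python
import math
import random

def top_k_route(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Weights are a softmax over the selected logits only, so they sum to 1.
    """
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    max_logit = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - max_logit) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

# Hypothetical sizes for illustration; the article gives no expert counts.
NUM_EXPERTS = 64
TOP_K = 4

rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
routes = top_k_route(logits, TOP_K)

# Only TOP_K of the NUM_EXPERTS expert networks would run for this token.
print(f"experts used: {len(routes)} of {NUM_EXPERTS}")

# Figures from the article: ~11B active parameters out of 198B total.
active_fraction = 11 / 198
print(f"active share of parameters: {active_fraction:.1%}")
```

With the article's figures, roughly one parameter in eighteen takes part in any given token, which is where the compute saving comes from.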
Core specifications at a glance:
| Metric | Value |
|---|---|
| Total parameters | 198B (196B language + 1.8B ViT) |
| Active parameters per token | ~11B |
| Context window | 256k tokens |
| Throughput | Up to 400 tokens/sec |
| Reasoning levels | Low, medium, high |
| License | Apache 2.0 |
Key Architectural Features
The vision encoder operates as a standalone 1.8B ViT module, embedding image data directly into the language backbone's context stream. Step 3.5 Flash was text-only; Step 3.7 Flash adds native image input.
Developers can choose among three reasoning depths—low, medium, and high—balancing processing speed against depth of analysis. Low delivers faster response times and lower costs, while high devotes more computational effort per answer.
On the SWE-Bench Pro benchmark, Step 3.7 Flash achieves 56.26%, stepping up from Step 3.5 Flash’s 51.3%—an improvement of about 5 percentage points. For Terminal-Bench 2.1, it reaches 59.55%, up from 53.37%.
It also posts a 72.42% score on SWE-MTLG, a multi-task, long-form coding benchmark.
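The generation-over-generation gains can be read two ways: as an absolute gain in percentage points or as a relative gain over the older score. The short sketch below computes both from the SWE-Bench Pro and Terminal-Bench 2.1 figures quoted above; the helper name `gains` is only illustrative.

```python
# Scores quoted in the article, in percent: (Step 3.5 Flash, Step 3.7 Flash).
scores = {
    "SWE-Bench Pro": (51.3, 56.26),
    "Terminal-Bench 2.1": (53.37, 59.55),
}

def gains(old, new):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    return new - old, (new - old) / old * 100

for name, (old, new) in scores.items():
    points, relative = gains(old, new)
    print(f"{name}: +{points:.2f} points, +{relative:.1f}% relative")
```

On these numbers, the roughly 5-point gain on SWE-Bench Pro corresponds to a relative improvement of just under 10%.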
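Results are also more consistent across StepFun's internal Step-SWE-Bench harnesses: scores that ranged from 43% to 73% for Step 3.5 Flash fall between 64.5% and 71.5% for Step 3.7 Flash.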
With Advisor Mode active on SWE-Bench Verified, StepFun reports that Step 3.7 Flash hits 97% of Claude Opus 4.6’s coding performance at roughly one-ninth the cost ($0.19 vs. $1.76 per task). These figures come from StepFun’s internal measurements.
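As a quick check on the "one-ninth the cost" claim, the sketch below recomputes the cost ratio from the two per-task prices StepFun reports. The "performance per dollar" figure is a deliberately crude derived number (relative score multiplied by cost ratio), not a metric StepFun publishes.

```python
# Figures reported by StepFun for SWE-Bench Verified with Advisor Mode.
STEP_COST_PER_TASK = 0.19    # USD, Step 3.7 Flash
OPUS_COST_PER_TASK = 1.76    # USD, Claude Opus 4.6
RELATIVE_PERFORMANCE = 0.97  # Step 3.7 Flash score as a share of the baseline

# How many times cheaper one task is on Step 3.7 Flash.
cost_ratio = OPUS_COST_PER_TASK / STEP_COST_PER_TASK

# Crude value measure: performance retained per dollar, relative to the baseline.
value_ratio = RELATIVE_PERFORMANCE * cost_ratio

print(f"cost ratio: {cost_ratio:.2f}x")
print(f"relative performance per dollar: {value_ratio:.2f}x")
```

The ratio works out to a little over 9x, consistent with the "roughly one-ninth" wording.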
Vision Capabilities
Step 3.7 Flash provides two visual processing pathways:
Visual Search Tool: For recognition scenarios where the model's internal knowledge falls short (uncommon entities, recently introduced concepts), it calls a visual search tool to fetch and verify external information. On SimpleVQA (with Search), it scores 79.16%, on par with GPT 5.5 (79.11%) and ahead of Kimi K2.6 (78.24%) and GLM 5V Turbo (78.20%).
Python Tool: For precise visual tasks (high-resolution imagery, targeted inspection, bounding-box analysis), it uses a code-based interface to crop, zoom, and annotate images at the pixel or bounding-box level. On V* (self-tested by StepFun, with the Python tool), it achieves 95.29%. On HR-Bench 4K and HR-Bench 8K, it scores 89.13% and 86.34%, respectively.
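The crop-and-zoom step can be sketched in a few lines. The code below is an illustration of the general technique, not StepFun's tool interface, which the article does not describe; `Box`, `crop_region`, `zoom_factor`, the normalized-coordinate input, and the padding default are all hypothetical names and choices.

```python
from typing import NamedTuple

class Box(NamedTuple):
    left: int
    top: int
    right: int
    bottom: int

def crop_region(norm_box, image_size, pad=0.1):
    """Convert a normalized (x0, y0, x1, y1) box to a padded pixel box.

    pad is the fraction of the box width/height added on each side;
    the result is clamped to the image bounds.
    """
    width, height = image_size
    x0, y0, x1, y1 = norm_box
    pad_x = (x1 - x0) * pad
    pad_y = (y1 - y0) * pad
    return Box(
        left=max(0, int((x0 - pad_x) * width)),
        top=max(0, int((y0 - pad_y) * height)),
        right=min(width, int((x1 + pad_x) * width + 0.5)),
        bottom=min(height, int((y1 + pad_y) * height + 0.5)),
    )

def zoom_factor(box, target_side=1024):
    """Scale that makes the box's longer side equal to target_side pixels."""
    return target_side / max(box.right - box.left, box.bottom - box.top)

# A small region near the lower-right corner of an 8K frame.
box = crop_region((0.90, 0.85, 1.00, 0.95), image_size=(7680, 4320))
print(box, round(zoom_factor(box), 2))
```

The resulting tuple follows the (left, upper, right, lower) order that image libraries such as Pillow expect in `Image.crop`, so it could be passed straight to a real crop call before resizing by the zoom factor.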
During testing, StepFun observed that the model began blending visual tools with non-visual ones without any explicit training to do so. For instance, after writing frontend code, it would render the output on-screen through a GUI tool to inspect the result before making further adjustments. StepFun refers to this as emerging compositional tool use.
On Android Daily, which measures long-horizon smartphone UI task completion, Step 3.7 Flash scores 61.87%, outpacing Kimi K2.6 (53.36%) and GLM 5V Turbo (51.68%). Gemini 3 Flash (63.21%) remains the leader here.
Search and Research Evaluation
StepFun designed the model’s search behavior around planning, evidence filtering, and synthesis—embedding the search process directly into the reasoning loop rather than treating it as a bolt-on component.
| Benchmark | Step 3.7 Flash | Notable comparison |
|---|---|---|
| HLE with Tools (acc) | 47.20% | DeepSeek V4 Flash: 45.10% |
| BrowseComp (acc) | 75.82% | Claude Opus 4.7: 79.30% |
| DeepSearchQA (F1) | 92.82% | Kimi K2.6: 92.50% |
| ResearchRubrics (score) | 71.68% | GPT 5.5: 61.50% |
Note: The HLE with Tools score of 47.20% compares with Step 3.5 Flash's text-only result of 35.68%. Because Step 3.5 Flash did not support tool-augmented evaluation on HLE, the two figures were not obtained under the same conditions.
General Agent Performance
| Benchmark | Step 3.7 Flash | Description |
|---|---|---|
| Toolathlon | 49.51% | Multi-tool coordination |
| ClawEval-1.1 | 67.07% | Daily autonomous task execution in realistic environments |
| GDPval (44 occupations) | 45.8% | General professional task execution |
| Tau2-bench Telecom | >98% | Across different reasoning difficulty tiers |
Key Takeaways
- Step 3.7 Flash is a 198B sparse MoE model with roughly 11B active parameters per token and a 256k context window.
- Native multimodal support (images, GUIs, documents) is new — Step 3.5 Flash was text-only.
- Advisor Mode reaches 97% of Claude Opus 4.6’s SWE-Bench Verified performance at $0.19 per task vs. $1.76.
- Cross-harness coding variance narrowed from a 43–73% range (3.5 Flash) to 64.5–71.5% (3.7 Flash).
- Released under Apache 2.0 with BF16, FP8, NVFP4, and GGUF weights on Hugging Face.
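For anyone planning to self-host, a back-of-the-envelope estimate of weight storage per precision is a useful starting point. The sketch below uses nominal bits per weight only; it ignores quantization scale factors, KV cache, and activations, and it leaves out GGUF because its size depends on the quantization variant chosen. These are rough estimates, not published checkpoint sizes.

```python
TOTAL_PARAMS = 198e9  # total parameter count from the article

# Nominal bits per weight; real checkpoints add overhead for scales and metadata.
BITS_PER_WEIGHT = {"BF16": 16, "FP8": 8, "NVFP4": 4}

def weight_gigabytes(params, bits):
    """Approximate weight storage in GB (10**9 bytes)."""
    return params * bits / 8 / 1e9

for fmt, bits in BITS_PER_WEIGHT.items():
    print(f"{fmt}: ~{weight_gigabytes(TOTAL_PARAMS, bits):.0f} GB")
```

Although only about 11B parameters are active per token, all experts generally have to be resident (or offloaded somewhere reachable), so the total parameter count is what drives the memory budget.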
Where to Run Step 3.7 Flash
StepFun's 198B MoE vision-language model is available through hosted APIs (live now) and as open weights under Apache 2.0.



