StepFun Unveils Step 3.7 Flash: The 198B MoE VL Model Built For Coding Agents And Search Automation

StepFun today unveiled Step 3.7 Flash, a multimodal Mixture-of-Experts (MoE) model tailored for agentic workloads. It brings native vision input and more dependable tool usage compared to its predecessor, Step 3.5 Flash.

What Does Step 3.7 Flash Offer?

Step 3.7 Flash is a 198-billion-parameter sparse Mixture-of-Experts vision-language model. It combines a 196-billion-parameter language backbone with a 1.8-billion-parameter vision encoder (ViT) to natively interpret images.

During inference, the model activates roughly 11 billion parameters per token. In MoE designs, only a select group of “expert” sub-networks engages for each forward pass—not the entire model—keeping computational demands comparable to an 11-billion-parameter dense model while preserving the capacity of a 198-billion-parameter architecture.
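To make the sparse-activation idea concrete, here is a minimal sketch of generic top-k expert routing. The source does not describe Step 3.7 Flash's actual router, expert count, or value of k, so the function names, the four toy experts, and k=2 below are illustrative assumptions only.

```python
import math

def route_top_k(gate_logits, k):
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    top = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)[:k]
    peak = max(gate_logits[i] for i in top)  # subtract the max for numerical stability
    exps = {i: math.exp(gate_logits[i] - peak) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

def moe_forward(x, experts, gate_logits, k):
    """Run only the selected experts and combine their outputs by gate weight."""
    weights = route_top_k(gate_logits, k)
    return sum(w * experts[i](x) for i, w in weights.items())

# Toy experts (hypothetical): each one just scales its input.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
gate_logits = [0.1, 2.0, -1.0, 2.0]  # made-up router scores for one token

weights = route_top_k(gate_logits, k=2)            # experts 1 and 3, weight 0.5 each
output = moe_forward(10.0, experts, gate_logits, k=2)

# Share of parameters touched per token, using the article's figures.
active_fraction = 11 / 198
print(sorted(weights), output, f"{active_fraction:.1%}")
```

Only two of the four toy experts run for this token, which is the same mechanism that lets roughly 11B of 198B parameters (about 5.6%) do the work per token.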

Core specifications at a glance:

Metric	Value
Total parameters	198B (196B language + 1.8B ViT)
Active parameters per token	~11B
Context window	256k tokens
Throughput	Up to 400 tokens/sec
Reasoning levels	Low, medium, high
License	Apache 2.0

Key Architectural Features

The vision encoder operates as a standalone 1.8B ViT module, embedding image data directly into the language backbone’s context stream. Unlike Step 3.5 Flash, which was text-only, Step 3.7 Flash introduces full multimodal capability.

Developers can choose among three reasoning depths—low, medium, and high—balancing processing speed against depth of analysis. Low delivers faster response times and lower costs, while high devotes more computational effort per answer.

On the SWE-Bench Pro benchmark, Step 3.7 Flash achieves 56.26%, stepping up from Step 3.5 Flash’s 51.3%—an improvement of about 5 percentage points. For Terminal-Bench 2.1, it reaches 59.55%, up from 53.37%.
It also posts a 72.42% score on SWE-MTLG, a multi-task, long-form coding benchmark.
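Results are also more consistent across StepFun’s internal Step-SWE-Bench harnesses: Step 3.5 Flash ranged from 43% to 73% depending on the scaffold, while Step 3.7 Flash stays between 64.5% and 71.5%.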
Step 3.7 Flash supports Advisor Mode, StepFun’s take on the advisor strategy originally outlined by Anthropic. The model manages the full agentic loop—invoking tools, reading outputs, iterating—and only defers to a larger advisor model at critical moments such as strategic planning or recovering from persistent errors. The bulk of execution remains at lower executor-level cost.
With Advisor Mode active on SWE-Bench Verified, StepFun reports that Step 3.7 Flash hits 97% of Claude Opus 4.6’s coding performance at roughly one-ninth the cost ($0.19 vs. $1.76 per task). These figures come from StepFun’s internal measurements.
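The escalation pattern described above can be sketched as a small control loop. This is not StepFun's implementation: `run_with_advisor`, the `executor` and `advisor` callables, the result dictionary shape, and the failure threshold of 3 are all hypothetical names and values chosen for illustration.

```python
def run_with_advisor(task, executor, advisor, max_steps=20, failure_threshold=3):
    """Executor drives the loop; the advisor is consulted only for the
    initial plan and after a run of consecutive failures."""
    guidance = advisor(task, history=[])  # strategic planning up front
    advisor_calls = 1
    history = []
    consecutive_failures = 0

    for _ in range(max_steps):
        result = executor(task, guidance, history)  # tool call + read output
        history.append(result)
        if result["done"]:
            return {"solved": True, "steps": len(history), "advisor_calls": advisor_calls}

        consecutive_failures = 0 if result["ok"] else consecutive_failures + 1
        if consecutive_failures >= failure_threshold:
            # Persistent errors: escalate to the larger model for a new plan.
            guidance = advisor(task, history=history)
            advisor_calls += 1
            consecutive_failures = 0

    return {"solved": False, "steps": len(history), "advisor_calls": advisor_calls}

# Stub models (assumptions): the executor only succeeds once the plan is revised.
def stub_advisor(task, history):
    return "initial plan" if not history else "revised plan"

def stub_executor(task, guidance, history):
    if guidance == "initial plan":
        return {"ok": False, "done": False}
    return {"ok": True, "done": True}

outcome = run_with_advisor("fix failing test", stub_executor, stub_advisor)
print(outcome)
```

In this toy run the executor takes four steps and the advisor is called twice, which mirrors the idea that most of the work stays at executor cost.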
Vision Capabilities
Step 3.7 Flash provides two visual processing pathways:
Visual Search Tool: For recognition scenarios where the model’s internal knowledge falls short (uncommon entities, recently introduced concepts), it calls a visual search tool to fetch and verify external information. On SimpleVQA (with Search), it scores 79.16%, on par with GPT 5.5 (79.11%) and ahead of Kimi K2.6 (78.24%) and GLM 5V Turbo (78.20%).
Python Tool: For precise visual tasks (high-resolution imagery, targeted inspection, bounding-box analysis), it uses a code-based interface to crop, zoom, and annotate images at the pixel or bounding-box level. On V* (Python), a self-tested metric, it achieves 95.29%. On HR-Bench 4K and HR-Bench 8K, it scores 89.13% and 86.34%, respectively.
During testing, StepFun observed that the model began combining visual tools with non-visual ones without any explicit training to do so. For instance, after writing frontend code, it would render the output on-screen through a GUI tool to inspect the result before making further adjustments. StepFun refers to this as emergent compositional tool use.
On Android Daily, which measures long-horizon smartphone UI task completion, Step 3.7 Flash scores 61.87%, outpacing Kimi K2.6 (53.36%) and GLM 5V Turbo (51.68%). Gemini 3 Flash (63.21%) remains the leader here.
Search and Research Evaluation
StepFun designed the model’s search behavior around planning, evidence filtering, and synthesis—embedding the search process directly into the reasoning loop rather than treating it as a bolt-on component.
Benchmark	Step 3.7 Flash	Notable comparison
HLE with Tools (acc)	47.20%	DeepSeek V4 Flash: 45.10%
BrowseComp (acc)	75.82%	Claude Opus 4.7: 79.30%
DeepSearchQA (F1)	92.82%	Kimi K2.6: 92.50%
ResearchRubrics (score)	71.68%	GPT 5.5: 61.50%
Note: Step 3.5 Flash scored 35.68% on HLE in text-only mode and did not support tool-augmented evaluation on HLE, so the 47.20% HLE with Tools score is a large jump but not a like-for-like comparison.
General Agent Performance
Benchmark	Step 3.7 Flash	Description
Toolathlon	49.51%	Multi-tool coordination
ClawEval-1.1	67.07%	Daily autonomous task execution in realistic environments
GDPval (44 occupations)	45.8%	General professional task execution
Tau2-bench Telecom	>98%	Across different reasoning difficulty tiers
On ClawEval-1.1,
Slide 1 of 8 — Overview
What Is Step 3.7 Flash?
Step 3.7 Flash is a sparse Mixture-of-Experts (MoE) vision-language model built by StepFun. It pairs a 196B-parameter language core with a 1.8B-parameter Vision Transformer (ViT) encoder for built-in image understanding.
In a MoE model, only a selected group of “expert” sub-networks activates for each token — not the entire network. This keeps inference computation close to an 11B dense model while preserving 198B total parameters.
Context Window: 256k tokens
Reasoning Levels: Low / Med / High
Slide 2 of 8 — Architecture
Architecture Notes
The 1.8B ViT encoder operates as a standalone module and feeds image representations into the language backbone’s context. Step 3.5 Flash was text-only; native multimodal support is new in version 3.7.
Three configurable reasoning depths let developers trade off speed against cost:
Low — Fastest and most affordable. Best for simple completions.
Medium — A middle ground between cost and reasoning depth.
High — Allocates more compute per response. Ideal for complex agent tasks.
MoE routing means you pay for roughly 11B active parameters at inference time, not 198B. This is the central efficiency trade-off in Flash-tier models.
Slide 3 of 8 — Agentic Coding
Agentic Coding Performance
Step 3.7 Flash earns 56.26% on SWE-Bench Pro (up from 51.3% in 3.5 Flash) and 59.55% on Terminal-Bench 2.1 (up from 53.37%). On SWE-MTLG it achieves 72.42%.
Per-harness scores on StepFun’s internal Step-SWE-Bench:
Scaffold	3.7 Flash	3.5 Flash
Hermes Agent	67.5%	60.0%
OpenClaw	67.0%	47.0%
KiloCode	67.5%	59.0%
RooCode	64.5%	43.0%
Claude Code	71.5%	73.0%
OpenCode	64.5%	57.0%
3.5 Flash ranged from 43% to 73% across different harnesses. 3.7 Flash tightens that spread to 64.5%–71.5% — delivering more consistent results across diverse scaffolding setups.
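The spread claim can be checked directly against the per-harness table above. The snippet below is a small verification sketch; the variable and function names are illustrative, and the numbers are copied from the table.

```python
# Per-harness Step-SWE-Bench scores from the table: (3.7 Flash, 3.5 Flash)
scores = {
    "Hermes Agent": (67.5, 60.0),
    "OpenClaw": (67.0, 47.0),
    "KiloCode": (67.5, 59.0),
    "RooCode": (64.5, 43.0),
    "Claude Code": (71.5, 73.0),
    "OpenCode": (64.5, 57.0),
}

def spread(values):
    """Return (lowest, highest, highest - lowest) for a set of scores."""
    values = list(values)
    return min(values), max(values), max(values) - min(values)

spread_37 = spread(new for new, _ in scores.values())
spread_35 = spread(old for _, old in scores.values())

# Harnesses where the newer model scores lower than its predecessor.
regressions = [name for name, (new, old) in scores.items() if new < old]

print(spread_37, spread_35, regressions)
```

The range shrinks from 30 points (43.0 to 73.0) to 7 points (64.5 to 71.5). The same data shows one harness, Claude Code, where 3.7 Flash scores slightly below 3.5 Flash.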
Slide 4 of 8 — Advisor Mode
Advisor Mode
Step 3.7 Flash supports Advisor Mode, StepFun’s take on the advisor strategy outlined by Anthropic. The model runs the full agentic loop — invoking tools, reading results, and iterating — but escalates to a larger advisor model only at key decision points.
Escalates during planning or when recovering from repeated failures
Most of the execution stays at executor (Flash) cost
The larger advisor model is called sparingly
SWE-Bench Verified results with Advisor Mode (StepFun internal figures):
System	Score	Cost per task
Step 3.7 Flash + Advisor	76.3%	$0.19
Claude Opus 4.6	78.7%	$1.76
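The headline "97% of the performance at roughly one-ninth the cost" follows from the reported SWE-Bench Verified scores and the per-task costs of $0.19 and $1.76. A quick arithmetic check, using only StepFun's self-reported figures (variable names are illustrative):

```python
# StepFun's internal SWE-Bench Verified figures, as reported in the article.
flash_score, opus_score = 76.3, 78.7   # percent resolved
flash_cost, opus_cost = 0.19, 1.76     # US dollars per task

relative_score = flash_score / opus_score  # fraction of Opus 4.6's score
cost_ratio = opus_cost / flash_cost        # how many times cheaper per task

print(f"{relative_score:.1%} of the score at 1/{cost_ratio:.1f} of the cost")
```

The score ratio rounds to 97% and the cost ratio to about 9.3, consistent with the article's wording.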
Slide 5 of 8 — Multimodal
Multimodal Capabilities
Step 3.7 Flash offers two visual tool pathways:
Visual Search Tool — Triggered for long-tail entity recognition or recently surfaced concepts where the model’s stored knowledge falls short. SimpleVQA (Search): 79.16%
Python Tool — A code-based interface for cropping, zooming, and performing pixel or bounding-box operations on high-resolution images. V* (Python): 95.29% | HR-Bench 4K: 89.13% | HR-Bench 8K: 86.34%
Android Daily (long-horizon phone UI tasks): Step 3.7 Flash scores 61.87%, outpacing Kimi K2.6 (53.36%) and GLM 5V Turbo (51.68%). Gemini 3 Flash leads at 63.21%.
StepFun observed emergent compositional tool use during testing — the model combined visual and non-visual tools without having been explicitly trained to do so.
Slide 6 of 8 — Search & Research
Search and Research Benchmarks
Search is woven into the model’s reasoning pipeline rather than bolted on as an external add-on. StepFun focused training efforts on search planning, evidence filtering, and synthesis.
Benchmark	3.7 Flash	Comparison
HLE w. Tools (acc)	47.20%	DeepSeek V4 Flash: 45.10%
BrowseComp (acc)	75.82%	Claude Opus 4.7: 79.30%
DeepSearchQA (F1)	92.82%	Kimi K2.6: 92.50%
ResearchRubrics	71.68%	GPT 5.5: 61.50%
HLE comparison: Step 3.5 Flash scored 35.68% in text-only mode. Step 3.7 Flash reaches 47.20% with tool access — keep in mind these are not directly comparable conditions.
Slide 7 of 8 — Deployment
Pricing, Deployment & Ecosystem
Token Type	Price
Input (cache miss)	$0.20 / M tokens
Input (cache hit)	$0.04 / M tokens
Output	$1.15 / M tokens
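To see what these rates mean for a single request, here is a small cost estimator built from the listed prices. The function name, the `cache_hit_ratio` parameter, and the example token counts are assumptions for illustration; actual billing rules (rounding, minimums, how cache hits are counted) are not described in the source.

```python
# Listed prices in US dollars per million tokens.
PRICE_PER_MILLION = {
    "input_miss": 0.20,
    "input_hit": 0.04,
    "output": 1.15,
}

def estimate_cost(input_tokens, output_tokens, cache_hit_ratio=0.0):
    """Estimate request cost in dollars, splitting input tokens into
    cached and uncached portions by cache_hit_ratio (0.0 to 1.0)."""
    hit_tokens = input_tokens * cache_hit_ratio
    miss_tokens = input_tokens - hit_tokens
    total = (
        miss_tokens * PRICE_PER_MILLION["input_miss"]
        + hit_tokens * PRICE_PER_MILLION["input_hit"]
        + output_tokens * PRICE_PER_MILLION["output"]
    )
    return total / 1_000_000

# Hypothetical agent turn: 200k input tokens, half served from cache, 10k output.
example_cost = estimate_cost(200_000, 10_000, cache_hit_ratio=0.5)
print(round(example_cost, 4))
```

For this hypothetical turn the estimate is about $0.0355, with output tokens and uncached input dominating the bill.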
Available on:
StepFun Platform
OpenRouter
NVIDIA NIM
DeepInfra (coming soon)
Fireworks AI (coming soon)
Modal (coming soon)
Inference backends: vLLM, SGLang, Hugging Face Transformers (v5.0+ required), llama.cpp
Quantization formats: BF16, FP8, NVFP4, GGUF
Local minimum: 120 GB unified memory/VRAM
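As a rough guide to why quantization matters for local use, the weights-only footprint can be estimated from the 198B total parameter count. This is a back-of-the-envelope sketch: it ignores KV cache, activations, and per-block scale factors, and the source does not say which format the 120 GB minimum assumes. Note that an MoE model generally needs all experts resident in memory, even though only about 11B parameters are active per token.

```python
TOTAL_PARAMS = 198e9  # total parameters, from the article

def weight_memory_gb(bits_per_param, total_params=TOTAL_PARAMS):
    """Weights-only memory in decimal gigabytes for a given precision."""
    return total_params * bits_per_param / 8 / 1e9

# Nominal bit widths; real quantized files carry extra metadata (assumption).
estimates = {
    "BF16 (16-bit)": weight_memory_gb(16),
    "FP8 (8-bit)": weight_memory_gb(8),
    "4-bit (e.g. NVFP4, approx.)": weight_memory_gb(4),
}

for name, gb in estimates.items():
    print(f"{name}: ~{gb:.0f} GB")
```

Under these assumptions only a roughly 4-bit format (about 99 GB of weights) fits within 120 GB, while BF16 (about 396 GB) and FP8 (about 198 GB) would need multi-GPU or larger-memory setups.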
Slide 8 of 8 — Key Takeaways
Key Takeaways
198B sparse MoE model with ~11B active parameters per token and a 256k context window
Native multimodal support (images, GUIs, documents) — Step 3.5 Flash was text-only
Advisor Mode scores 76.3% on SWE-Bench Verified at $0.19/task vs. Claude Opus 4.6 at $1.76
Cross-harness coding variance narrowed from 43–73% (3.5) to 64.5–71.5% (3.7)
Released under Apache 2.0 with BF16, FP8, NVFP4, and GGUF weights on Hugging Face
Compatible harnesses:
Claude Code
KiloCode
Hermes Agent
OpenClaw