StepFun has released Step-DeepResearch, a 32B parameter end-to-end deep research agent that aims to turn web search into real research workflows with long horizon reasoning, tool use and structured reporting. The model is built on Qwen2.5-32B-Base and is trained to act as a single agent that plans, explores sources, verifies evidence and writes reports with citations, while keeping inference cost low.
From Search to Deep Research
Most existing web agents are tuned for multi-hop question answering benchmarks. They try to match ground truth answers for short questions, which is closer to targeted retrieval than to real research. Deep research tasks are different. They involve latent intent recognition, long horizon decision making, multi-turn tool use, structured reasoning and cross-source verification under uncertainty.
Step-DeepResearch reframes this as sequential decision making over a compact set of atomic capabilities. The research team defines 4 atomic capabilities: planning and task decomposition, deep information seeking, reflection and verification, and professional report generation. Instead of orchestrating many external agents, the system internalizes this loop into a single model that decides the next action at every step.
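The single-model loop described above can be sketched as a small decision loop. This is a minimal illustration, not StepFun's implementation: `run_agent`, `scripted_policy` and the `TOOLS` table are hypothetical names, and the scripted policy stands in for the trained model that would choose each action.

```python
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, str]  # (kind of step, text content)


def run_agent(
    policy: Callable[[List[Step]], Tuple[str, str]],
    tools: Dict[str, Callable[[str], str]],
    task: str,
    max_steps: int = 8,
) -> Tuple[str, List[Step]]:
    """Ask the policy for the next action until it emits a report."""
    history: List[Step] = [("task", task)]
    for _ in range(max_steps):
        name, arg = policy(history)
        if name == "report":
            return arg, history
        observation = tools[name](arg)  # run the chosen capability
        history.append((name, observation))
    return "", history  # step budget exhausted without a report


def scripted_policy(history: List[Step]) -> Tuple[str, str]:
    """Hypothetical stand-in for the model: plan, search, verify, report."""
    done = [kind for kind, _ in history]
    if "plan" not in done:
        return ("plan", history[0][1])
    if "search" not in done:
        return ("search", "example query")
    if "verify" not in done:
        return ("verify", history[-1][1])
    return ("report", "Report based on: " + history[-1][1])


# Toy tools, one per non-report capability (all assumptions for illustration).
TOOLS: Dict[str, Callable[[str], str]] = {
    "plan": lambda task: f"subtasks for '{task}'",
    "search": lambda query: f"snippet about {query}",
    "verify": lambda claim: f"checked: {claim}",
}

report, trace = run_agent(scripted_policy, TOOLS, "demo task")
```

The point of the design is that one policy sees the whole history and picks the next capability itself, so no external orchestrator is needed to route between planner, searcher and writer.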
Data Synthesis around Atomic Capabilities
To teach these atomic capabilities, the research team builds separate data pipelines for each skill. For planning, they start from high quality technical reports, survey papers and financial analysis documents. They reverse-engineer realistic research plans and task trees from titles, abstracts and structure, then generate trajectories that follow these plans. This exposes the model to long horizon project structures, not only short question templates.
For deep information seeking, they construct graph based queries over knowledge graphs such as Wikidata5m and CN-DBpedia. They sample subgraphs, expand them using search, and synthesize questions that require multi-hop reasoning across entities and documents. A separate pipeline uses a Wiki style hyperlink index to force cross-document retrieval and combination of evidence. Easy questions that a strong model can already solve with a simple ReAct style strategy are filtered out, so training focuses on hard search problems.
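To make the graph-based synthesis concrete, here is a small sketch under stated assumptions: a toy triple list stands in for Wikidata5m or CN-DBpedia, a random walk stands in for subgraph sampling, and `baseline_solver` is a hypothetical callable representing the strong model used for filtering. None of these names come from the paper.

```python
import random
from typing import Callable, Dict, List, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail)

# Toy knowledge graph, purely illustrative.
TRIPLES: List[Triple] = [
    ("Marie Curie", "spouse", "Pierre Curie"),
    ("Pierre Curie", "born_in", "Paris"),
    ("Paris", "capital_of", "France"),
    ("Marie Curie", "field", "Physics"),
]


def build_adjacency(triples: List[Triple]) -> Dict[str, List[Tuple[str, str]]]:
    adjacency: Dict[str, List[Tuple[str, str]]] = {}
    for head, relation, tail in triples:
        adjacency.setdefault(head, []).append((relation, tail))
    return adjacency


def sample_path(adjacency, start: str, hops: int, rng: random.Random) -> List[Triple]:
    """Random walk of up to `hops` edges that never revisits an entity."""
    path: List[Triple] = []
    node, seen = start, {start}
    for _ in range(hops):
        options = [(r, t) for r, t in adjacency.get(node, []) if t not in seen]
        if not options:
            break
        relation, tail = rng.choice(options)
        path.append((node, relation, tail))
        seen.add(tail)
        node = tail
    return path


def path_to_question(path: List[Triple]) -> Dict[str, object]:
    """Build a multi-hop question whose answer is the last entity on the path."""
    if not path:
        raise ValueError("path must contain at least one edge")
    relations = " -> ".join(relation for _, relation, _ in path)
    question = f"Starting from {path[0][0]}, follow {relations}. Which entity do you reach?"
    return {"question": question, "answer": path[-1][2], "hops": len(path)}


def filter_hard(items: List[dict], baseline_solver: Callable[[str], str]) -> List[dict]:
    """Keep only the questions the baseline answers incorrectly."""
    return [item for item in items if baseline_solver(item["question"]) != item["answer"]]


adjacency = build_adjacency(TRIPLES)
path = sample_path(adjacency, "Pierre Curie", hops=2, rng=random.Random(0))
item = path_to_question(path)
```

In a real pipeline the question text would be rewritten into natural language and the path expanded with retrieved documents; the sketch only shows the sample, compose and filter steps.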
Reflection and verification data is generated through self-correction loops and multi-agent teacher traces. Teacher agents extract claims, plan checks, verify facts, replan if inconsistencies appear and only then write reports. The resulting trajectories are cleaned and used as supervision for a single student agent. Report generation is trained in 2 phases: mid-training for domain style and depth using query report pairs, then supervised fine-tuning with strict formatting and plan consistency constraints.
Progressive Training on Qwen2.5-32B-Base
The training pipeline has 3 stages: agentic mid-training, supervised fine-tuning and reinforcement learning. In mid-training stage 1, the team injects atomic capabilities without tools, using context lengths up to 32k tokens. The data covers active reading, synthetic reasoning traces, summarization and reflection. The research team reports steady gains on SimpleQA, TriviaQA and FRAMES as training scales up to about 150B tokens, with the largest gains on FRAMES, which stresses structured reasoning.
In stage 2, the context extends to 128k tokens and explicit tool calls are introduced. The model learns tasks such as URL based question answering, deep web search, long document summarization and long dialogue reasoning. This stage aligns the model with real research scenarios where search, browsing and analysis must be mixed in a single trajectory.
During supervised fine-tuning, the 4 atomic capabilities are composed into full deep search and deep research traces. Data cleaning keeps trajectories that are correct and short in terms of steps and tool calls. The pipeline injects controlled tool errors followed by correction to improve robustness, and enforces citation formats so that reports stay grounded in the retrieved sources.
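The cleaning and error-injection steps might look roughly like the sketch below. The trajectory layout (a dict with `correct` and a list of typed `steps`), the thresholds and the error message are all assumptions made for illustration; the article does not describe the actual data format.

```python
import copy
from typing import Dict, List


def keep_trajectory(traj: Dict, max_steps: int = 20, max_tool_calls: int = 15) -> bool:
    """Keep trajectories that are correct and short (thresholds are hypothetical)."""
    tool_calls = sum(1 for step in traj["steps"] if step["type"] == "tool_call")
    return bool(traj["correct"]) and len(traj["steps"]) <= max_steps and tool_calls <= max_tool_calls


def inject_tool_error(
    traj: Dict,
    index: int,
    error_message: str = "TimeoutError: tool did not respond",
) -> Dict:
    """Return a copy where the tool call at `index` first fails, then is retried.

    A failed copy of the call and an error observation are inserted before the
    original call, so the original call and its observation read as the correction.
    """
    steps: List[Dict] = traj["steps"]
    if steps[index]["type"] != "tool_call":
        raise ValueError("index must point at a tool_call step")
    failed_call = copy.deepcopy(steps[index])
    failed_call["injected"] = True
    error_observation = {"type": "observation", "content": error_message, "injected": True}
    augmented = copy.deepcopy(traj)
    augmented["steps"] = (
        copy.deepcopy(steps[:index])
        + [failed_call, error_observation]
        + copy.deepcopy(steps[index:])
    )
    return augmented


example = {
    "correct": True,
    "steps": [
        {"type": "thought", "content": "I should search first."},
        {"type": "tool_call", "content": "search('example query')"},
        {"type": "observation", "content": "3 results"},
        {"type": "report", "content": "Summary with citations."},
    ],
}
augmented = inject_tool_error(example, index=1)
```

Because the injected steps are flagged, a training pipeline could, for example, mask the loss on the failed call while still learning from the recovery. That masking choice is also an assumption, not something the article states.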
Reinforcement learning then optimizes the agent in a real tool environment. The research team builds tasks and checklists through reverse synthesis, and trains a checklist style Rubrics Judge to score reports along fine grained dimensions. The reward design converts ternary rubric labels into asymmetric binary rewards that capture both positive targets and violations. The policy is trained with PPO and a learned critic, using generalized advantage estimation with near zero discounting so that long trajectories are not truncated.
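The reward conversion and advantage computation can be illustrated with a short sketch. Two things here are assumptions rather than details from the article: the exact ternary-to-binary mapping (the labels `full`, `partial`, `none` and the rule that partial credit never earns reward are hypothetical), and the reading of "near zero discounting" as a discount factor and GAE lambda close to 1. The `gae` function itself is the standard generalized advantage estimation recursion.

```python
from typing import List, Tuple


def rubric_reward(label: str, polarity: str) -> int:
    """Map a ternary judge label to a binary reward (assumed mapping).

    label: "full", "partial" or "none", how strongly the rubric item applies.
    polarity: "positive" for a desired property, "negative" for a violation.
    """
    if label not in {"full", "partial", "none"}:
        raise ValueError(f"unknown label: {label}")
    if polarity == "positive":
        return 1 if label == "full" else 0  # partial satisfaction earns nothing
    if polarity == "negative":
        return 1 if label == "none" else 0  # any trace of a violation earns nothing
    raise ValueError(f"unknown polarity: {polarity}")


def report_reward(items: List[Tuple[str, str]]) -> float:
    """Average the binary rewards over all (label, polarity) rubric items."""
    scores = [rubric_reward(label, polarity) for label, polarity in items]
    return sum(scores) / len(scores)


def gae(rewards: List[float], values: List[float], gamma: float = 1.0, lam: float = 1.0) -> List[float]:
    """Generalized advantage estimation for one finished episode."""
    advantages = [0.0] * len(rewards)
    next_value = 0.0  # value after the terminal step
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


score = report_reward([
    ("full", "positive"),
    ("partial", "positive"),
    ("none", "negative"),
    ("full", "negative"),
])
# A reward that arrives only at the end of a 3-step trajectory.
advantages = gae([0.0, 0.0, 1.0], [0.5, 0.5, 0.5])
```

With `gamma` and `lam` both equal to 1, every step's advantage is the undiscounted return minus the critic's value, so an early action in a long trajectory receives the same credit for the final report score as a late one.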
Single Agent ReAct Architecture and Search Stack
At inference time, Step-DeepResearch runs as a single ReAct style agent that alternates thinking, tool calls and observations until it decides to output a report. The tool set includes batch web search, a todo manager, shell commands and file operations. Execution runs in a sandbox with terminal persistence through tmux. A perception oriented browser reduces redundant page captures by using perceptual hash distance. Tools for document parsing, audio transcription and image analysis support multimodal inputs.
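The perceptual hash idea can be sketched as follows. The article does not say which hash the browser uses, so this example assumes a simple average hash over an already downsampled grayscale grid and a hypothetical Hamming distance threshold; `CaptureDeduper` is an illustrative name.

```python
from typing import List

Grid = List[List[int]]  # small grayscale thumbnail, values 0..255


def average_hash(pixels: Grid) -> int:
    """One bit per pixel: 1 if the pixel is brighter than the grid mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


class CaptureDeduper:
    """Skip a page capture when it is perceptually close to one already kept."""

    def __init__(self, max_distance: int = 2) -> None:
        self.max_distance = max_distance  # hypothetical threshold
        self.kept: List[int] = []

    def should_keep(self, pixels: Grid) -> bool:
        digest = average_hash(pixels)
        if any(hamming(digest, seen) <= self.max_distance for seen in self.kept):
            return False
        self.kept.append(digest)
        return True


page_a = [[0, 0, 255, 255] for _ in range(4)]           # dark left, bright right
page_b = [row[:] for row in page_a]
page_b[0][3] = 250                                      # tiny rendering difference
page_c = [[255] * 4, [255] * 4, [0] * 4, [0] * 4]       # different layout

deduper = CaptureDeduper(max_distance=2)
decisions = [deduper.should_keep(page) for page in (page_a, page_b, page_c)]
```

A production version would first resize and grayscale the real screenshot; the sketch starts from the thumbnail to stay dependency free.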
Information acquisition uses 2 related resources. The StepFun team states that its Search API is grounded in more than 20M high quality papers and 600 premium indices. The research team also describes a curated authority indexing strategy that isolates more than 600 trusted domains, including government, academic and institutional sites. Retrieval operates at paragraph level and uses authority aware ranking so that high trust domains are preferred when relevance is similar.
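One way to read "preferred when relevance is similar" is as a tie-break: group paragraphs into relevance buckets, then put trusted domains first inside each bucket. The sketch below assumes exactly that. The bucket width, the `TRUSTED_DOMAINS` set and the example URLs are hypothetical, and the real ranking function is not described in the article.

```python
from typing import Dict, List, Set
from urllib.parse import urlparse

# Hypothetical allow list standing in for the curated trusted domains.
TRUSTED_DOMAINS: Set[str] = {"example.gov", "example.edu"}


def rank_paragraphs(paragraphs: List[Dict], trusted: Set[str], bucket: float = 0.05) -> List[Dict]:
    """Sort by relevance bucket, then trusted domains first, then raw relevance."""

    def sort_key(paragraph: Dict):
        host = urlparse(paragraph["url"]).hostname or ""
        is_trusted = host in trusted
        relevance_bucket = round(paragraph["relevance"] / bucket)
        # False sorts before True, so `not is_trusted` puts trusted hosts first.
        return (-relevance_bucket, not is_trusted, -paragraph["relevance"])

    return sorted(paragraphs, key=sort_key)


candidates = [
    {"url": "https://blog.example.com/post", "relevance": 0.81, "text": "..."},
    {"url": "https://example.gov/report", "relevance": 0.80, "text": "..."},
    {"url": "https://example.edu/paper", "relevance": 0.60, "text": "..."},
]
ranked = rank_paragraphs(candidates, TRUSTED_DOMAINS)
ranked_urls = [paragraph["url"] for paragraph in ranked]
```

Bucketing is a blunt notion of "similar": two scores just either side of a bucket edge are treated as different, so a real system might compare score gaps directly or fold trust into the score as a small bonus.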
The file tools support patch based editing, so the agent can update only the modified sections of a report. A summary aware storage scheme writes full tool outputs to local files and injects only compact summaries into the context. This acts as external memory and avoids context overflow for long projects.
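The storage scheme can be sketched as a small class that persists the full output and hands back only a short summary plus a file path. `SummaryStore` is a hypothetical name, and plain truncation is used as a stand-in for whatever summarization the real system applies.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Dict


class SummaryStore:
    """Write full tool outputs to disk and return compact context entries."""

    def __init__(self, root: str, max_chars: int = 200) -> None:
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)
        self.max_chars = max_chars

    def save(self, tool_name: str, output: str) -> Dict[str, object]:
        digest = hashlib.sha256(output.encode("utf-8")).hexdigest()[:12]
        path = self.root / f"{tool_name}_{digest}.txt"
        path.write_text(output, encoding="utf-8")
        # Truncation is a placeholder for a real summarizer (assumption).
        summary = output[: self.max_chars]
        return {"tool": tool_name, "path": str(path), "summary": summary, "chars": len(output)}

    def load(self, path: str) -> str:
        """Read the full output back when the agent needs the details."""
        return Path(path).read_text(encoding="utf-8")

    @staticmethod
    def as_context(entry: Dict[str, object]) -> str:
        """The short text that would be injected into the model context."""
        return f"[{entry['tool']}] {entry['summary']} (full output: {entry['path']})"


with tempfile.TemporaryDirectory() as workdir:
    store = SummaryStore(workdir, max_chars=50)
    full_output = "word " * 100  # 500 characters of tool output
    entry = store.save("web_search", full_output)
    context_line = store.as_context(entry)
    restored = store.load(str(entry["path"]))
```

The context only grows by the length of `context_line`, while the file path lets the agent reopen the complete output with its file tools later in a long session.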
Evaluation, Cost and Access
To measure deep research behavior, the team introduces ADR-Bench, a Chinese language benchmark with 110 open ended tasks across 9 domains. 70 tasks cover general domains such as education, science and engineering, and social life, evaluated by expert side by side comparison. 40 tasks in finance and law are scored with explicit rubrics that follow atomicity and verifiability constraints.
On Scale AI Research Rubrics, Step-DeepResearch reaches 61.42% rubric compliance, which is comparable to OpenAI-DeepResearch and Gemini-DeepResearch, and clearly ahead of several open and proprietary baselines. On ADR-Bench, expert-based Elo ratings show that the 32B model outperforms larger open models such as MiniMax-M2, GLM-4.6 and DeepSeek-V3.2, and is competitive with systems like Kimi-Researcher and MiniMax-Agent-Pro.
Key Takeaways
- Single agent, atomic capability design: Step-DeepResearch is a 32B parameter single agent built on Qwen2.5-32B-Base. It internalizes 4 atomic capabilities (planning, deep information seeking, reflection and verification, and professional report generation) instead of relying on many external agents.
- Targeted data synthesis for each skill: The research team builds separate data pipelines for planning, deep information seeking, reflection and report writing, using reverse-engineered plans from real reports, graph-based queries over Wikidata5m and CN-DBpedia, multi-agent teacher traces and strict report formatting data.
- Three stage training with long context and RL: Training uses mid-training, supervised fine-tuning and reinforcement learning. Mid-training runs up to 150B tokens at 32k and then 128k context, SFT composes full deep research trajectories, and PPO based RL with a Rubrics Judge optimizes reports against fine grained checklists.
- ReAct architecture with curated search and external memory: At inference time the model runs a ReAct loop that calls tools for batch web search, todo management, shell commands and file operations. It uses a Search API grounded in more than 20M papers and 600 premium indices, along with 600+ trusted domains, and relies on patch editing and summary aware storage to act as external memory.
- Competitive quality with lower cost: On Scale AI Research Rubrics the model reaches 61.42% rubric compliance and is competitive with OpenAI-DeepResearch and Gemini-DeepResearch. On ADR-Bench it achieves a 67.1% win or tie rate against strong baselines.



