1.
Throughout much of 2024 and 2025, the go-to approach was straightforward: assign the job to a single AI agent, make use of the largest context window you could find, and hope for the best. It occasionally succeeded. Just as often, the model silently lost focus midway through the task.
Anthropic addressed this challenge head-on: tasks that span many steps demand sustained coherence, which frequently exceeds what any single context window can handle dependably. Expanding the window improved things, but didn’t fully fix the underlying issue.
Anthropic had already delivered several tools to manage this. Subagents allowed the primary agent to hand off smaller tasks to separate helper agents, each operating with a clean, independent context and feeding condensed summaries back into the main thread. Skills let users bundle recurring procedures into reusable Markdown guides — essentially playbooks Claude could pull up whenever needed. Agent teams pushed the idea further: multiple distinct Claude instances, each maintaining its own context window, working together via a shared to-do list and direct messaging.
All of these tools represented genuine advances. Yet they all shared the same core limitation.
When using subagents, the lead Claude still orchestrates the full plan. Every report or summary returned from a helper flows back into the main session’s context window. With subagents, skills, and agent teams alike, Claude acts as the central coordinator: it determines step-by-step what task to launch or delegate next, and everything piles up in its context. This causes the orchestrator’s context to balloon as more worker agents come online, until it eventually maxes out. At that point, performance deteriorates the same way it always does — the very same breakdown patterns resurface.
Anthropic pinpointed three recurring failure patterns that emerge whenever a single context window — whether tied to one lone agent or a coordinator managing a small crew — shoulders a task too complex to manage cleanly. These are the three predictable ways things fall apart (Figure 1).
- The first is Agentic Laziness — the agent begins a task but fails to see it through completely. It might quit prematurely, overlook certain files, or presume the outstanding pieces are comparable enough not to matter. Then, with misplaced confidence, it declares the entire task finished. Think of someone skimming just a portion of a massive spreadsheet but officially marking the whole document as verified.
- The second is Self-preferential Bias. AI tends to grade its own work with a generous hand. If you prompt it with, “Did you follow the instructions properly?” it frequently responds positively, naturally inclined to give itself the benefit of the doubt. It may overlook errors in its own output or overstate how well it actually performed.
- The third is Goal Drift. As a task stretches longer, the AI steadily lets the original objective slip out of focus. It might still remember the general aim but lose track of specifics like “do not include X,” “don’t skip any file,” or “only use this format.” The more extended the conversation or task grows, the more severe the drift becomes.
These aren’t glitches. They are the natural consequences of treating a plan as a fleeting thought — and thoughts fade.
The price of ignoring this became starkly evident in early 2026, when Jarred Sumner, the creator of Bun, set out to migrate roughly 750,000 lines of Zig code to Rust — one file at a time. Previously, an undertaking of this scale would have consumed a team of developers for months. Sumner’s strategy, however, was elegant in its simplicity: complete one unit of work, run an adversarial review on it, then apply the changes. He would later call Dynamic Workflows “the state of the art today for reliably using agents to complete medium-to-large projects.” The outcome: 750,000 lines of Rust, 99.8% of the original test suite still passing, and just 11 days elapsed from the initial commit to final merge.
The core insight is that Claude never has to carry the entire plan in memory. The workflow externalizes the plan as executable code. The script owns the looping logic, the decision branches, and the intermediate outputs. Claude is responsible only for the current step and the concluding synthesis. The plan exists as a JavaScript file — and a file doesn’t forget, drift, or prematurely declare victory.
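To make "the plan lives in code" concrete, here is a minimal sketch of a script-owned loop in the work, review, apply shape described above. It is not the workflow format Claude actually generates: `runAgent` is a hypothetical stand-in for whatever call spawns a fresh-context agent, and it is stubbed here so the example runs on its own.

```javascript
// Hypothetical stand-in for spawning a fresh-context agent.
// A real harness would call a model here; this stub is deterministic.
async function runAgent(role, input) {
  if (role === 'worker') return { unit: input, patch: `ported ${input}` };
  if (role === 'reviewer') return { approved: !input.unit.includes('broken') };
  throw new Error(`unknown role: ${role}`);
}

// The script, not the model, owns the loop, the branch, and the results.
async function migrate(units, spawn = runAgent) {
  const applied = [];
  const rejected = [];
  for (const unit of units) {
    const result = await spawn('worker', unit);      // one unit of work
    const review = await spawn('reviewer', result);  // independent review
    if (review.approved) applied.push(result.patch); // apply only if approved
    else rejected.push(unit);                        // otherwise record it
  }
  return { applied, rejected };
}

migrate(['a.zig', 'broken.zig', 'c.zig']).then((summary) => {
  console.log(summary);
});
```

Nothing in the loop depends on an agent remembering what came before: the list of units, the approval branch, and the running totals all sit in ordinary variables.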
This is the exact gap Dynamic Workflows were engineered to fill. And this article will walk you through it.
By the time you finish reading, you’ll understand precisely where subagents, skills, and agent teams hit their boundaries — and why — not as a vague gut feeling, but as a clear structural argument you can apply to your own work. You’ll know the six composition patterns that handle the vast majority of practical workflow challenges, how to write a workflow prompt that yields a genuinely effective harness, and how to sidestep the two costliest mistakes people make when getting started. You’ll also recognize when a workflow is the wrong approach — because Dynamic Workflows burn through significantly more tokens than an ordinary session, and pulling them out for the wrong job is its own form of failure.
2. What a Dynamic Workflow Is
A Dynamic Workflow is like swapping out one overwhelmed individual for a small, specialized team.
Rather than loading one AI with the entire project from beginning to end, you break the work into distinct, manageable chunks. One agent tackles one specific job. A second reviews the output. A third advances the process further. In this setup, nobody grows fatigued midway and starts taking shortcuts. Nobody awards themselves top marks simply because they produced the answer. And nobody loses sight of the original instructions, because each agent only needs to focus on its own clearly defined piece of the puzzle.
Claude’s Dynamic Workflow enables exactly this. It distributes the job across a team of Claude instances, each starting with a clean slate of context. Each one handles a discrete segment, a separate layer scrutinizes the results, and everything is consolidated back into a single delivered output for you.
The key concept here is the harness. A harness is the framework wrapped around the model: the layer responsible for deciding how a task is planned, partitioned, verified, and carried out. The default Claude Code harness is designed primarily for software development tasks. Anthropic’s team discovered that these dynamic harnesses can be “sometimes even more powerful for non-technical work.” From there, Anthropic made it possible for Claude to construct a harness tailored to whatever task you hand it.
Before diving deeper, it’s worth clarifying a handful of terms that tend to blur together. Tools, agents, harnesses, and workflows are often tossed around as synonyms. They’re not. The cleanest way to distinguish them — borrowing this framing from AlphaSignal — poses a single question: who holds the plan? (Figure 2)

- A subagent is a helper dispatched by the main Claude to handle a single, well-defined task. The overall plan remains with the main Claude. The subagent completes its assignment, returns the result, and that result shows up in your chat. It’s mostly a fire-and-forget arrangement. As the table below illustrates, a subagent cannot spawn its own helpers or communicate with other subagents.
- An agent team operates differently. It consists of multiple Claudes working alongside each other as coordinated peers. The plan isn’t housed within a single Claude — it exists between them. They can message each other, adapt as the work progresses, and persist across one larger shared objective. It’s more like handing a project over to a small team.
- A dynamic workflow is yet another approach. Claude writes a small JavaScript program tailored to the specific task. Here, the plan lives in code. The agents carry out their work in the background, their outputs get stored in variables, and only the final combined answer is returned to you.
An agent team and a dynamic workflow may appear similar at first glance. However, they are fundamentally different. Refer to the table below to see how.
| | Subagent | Agent team | Dynamic workflow |
|---|---|---|---|
| Who holds the plan | the main Claude (orchestrator), internally | the peers, shared among them | a JavaScript program |
| Lifecycle | fire-and-forget, single task | long-running, ongoing | runs once, returns one answer |
| Talk to each other? | no — the orchestrator routes everything, and a subagent can’t even spawn its own subagents | yes — they coordinate as peers over time | no — agents work in the background through script variables; only the final result is returned |
| Feels like | an intern handed one task | colleagues collaborating on a shared project | an assembly line you’ve designed |
You might also wonder what makes it “dynamic.” How does a dynamic harness differ from a static one?
You could always build a harness on your own. You could connect the Agent SDK, or loop claude -p repeatedly, and construct a fixed system that you reuse each time. That’s a static harness: practical, repeatable, but designed ahead of time.
A dynamic harness is the opposite. Claude writes the harness on the spot, shaped around the specific task you just gave it. It plans the structure, divides the work, runs the agents, evaluates the outputs, and then discards the harness once the job is complete — unless you press s to save it.
Static harnesses are general-purpose; dynamic ones are custom-built and temporary.
Claude can now construct dynamic workflows because Opus 4.8 has become capable enough to build the right harness in real time. As the Anthropic team put it, the model is “intelligent enough to write a custom harness tailor-made for your use case.”
3. The real test
3.1 Patterns that make dynamic workflows useful
Anthropic introduces six workflow patterns, and I ran some tests with them to show you intuitively how they work. They are:
- Fan-out-and-synthesize — divide the work, then merge the results. Each piece gets its own agent with a clean context; a final synthesizer waits for all of them before combining outputs.
- Adversarial verification — for every finding, spawn a separate agent whose sole job is to disprove it. A skeptic scrutinizing the optimist.
- Classify-and-act — use a classifier agent to categorize each item first, then route it to the appropriate handler. Think of it as a front desk.
- Generate-and-filter — brainstorm broadly, then filter by a rubric: deduplicate, verify, and keep only what passes scrutiny.
- Tournament — spawn N agents that each tackle the same task differently, then have a judge agent compare them in pairs until one emerges as the winner. Great for matters of taste and naming.
- Loop-until-done — for tasks of uncertain scope, keep spawning agents until a stop condition is met (no new findings, no remaining errors) rather than a predetermined number of passes.
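Of the six, loop-until-done is the one whose control flow is hardest to picture, so here is a rough sketch. The finder is a hypothetical stub that replays canned passes. In a real workflow each pass would be a freshly spawned agent, and the shape of its return value is my assumption.

```javascript
// Hypothetical finder: returns the issue IDs it noticed on this pass.
// Stubbed with canned passes so the example is deterministic.
function makeStubFinder(passes) {
  let call = 0;
  return async () => passes[Math.min(call++, passes.length - 1)];
}

// Keep spawning passes until one adds nothing new, with a hard cap
// so a noisy finder cannot loop forever.
async function loopUntilDone(findOnce, maxPasses = 10) {
  const seen = new Set();
  let passes = 0;
  while (passes < maxPasses) {
    passes += 1;
    const found = await findOnce();
    const fresh = found.filter((id) => !seen.has(id));
    if (fresh.length === 0) break; // stop condition: no new findings
    fresh.forEach((id) => seen.add(id));
  }
  return { findings: [...seen], passes };
}

const finder = makeStubFinder([['a', 'b'], ['b', 'c'], ['a', 'c']]);
loopUntilDone(finder).then((result) => console.log(result));
```

The `maxPasses` cap is a design choice of this sketch: a stop condition based on "nothing new" needs a backstop, because an agent that keeps rephrasing old findings would otherwise never terminate.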
Fan-out-and-synthesize is probably one of the most commonly seen patterns. One task splits into several agents, each with its own clean context so they can’t interfere with each other, and then a synthesize step — a step that waits for everyone — merges their work into a single result (Figure 3).
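In code, the "step that waits for everyone" is a barrier. Below is a minimal sketch using `Promise.all`, which is standard JavaScript. The `stubAgent` function is a hypothetical placeholder for a real agent call, and the function names are mine rather than anything from Claude's generated scripts.

```javascript
// Hypothetical agent call: each invocation stands for a separate agent
// with its own clean context. Stubbed so the sketch runs offline.
async function stubAgent(prompt) {
  return `objection from ${prompt}`;
}

async function fanOutAndSynthesize(angles, synthesize, spawn = stubAgent) {
  // Fan out: start every agent before awaiting any of them.
  const pending = angles.map((angle) => spawn(angle));
  // Barrier: Promise.all resolves only after every agent has returned.
  const results = await Promise.all(pending);
  // Synthesize: one final step sees all results at once.
  return synthesize(results);
}

fanOutAndSynthesize(
  ['investor', 'customer', 'competitor'],
  (results) => results.join('\n'),
).then((report) => console.log(report));
```

Because the agents only receive their own prompt and never each other's output, isolation comes from the structure of the script rather than from an instruction to ignore each other.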

And Adversarial verification is another frequently used pattern (Figure 4).
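A rough sketch of the idea: every finding gets its own skeptic, and only findings that survive are kept. The skeptic here is a hypothetical stub that refutes anything without evidence. A real one would be a separate agent prompted to disprove the finding, and the `{ refuted }` verdict shape is an assumption of this example.

```javascript
// Hypothetical skeptic: tries to disprove one finding and reports a verdict.
// Stubbed: it "refutes" any finding that carries no evidence.
async function stubSkeptic(finding) {
  return { refuted: finding.evidence.length === 0 };
}

async function verifyAdversarially(findings, skeptic = stubSkeptic) {
  // One independent skeptic per finding, all started in parallel.
  const verdicts = await Promise.all(findings.map((f) => skeptic(f)));
  const confirmed = findings.filter((_, i) => !verdicts[i].refuted);
  const debunked = findings.filter((_, i) => verdicts[i].refuted);
  return { confirmed, debunked };
}

verifyAdversarially([
  { id: 'sql-injection', evidence: ['query built by string concatenation'] },
  { id: 'race-condition', evidence: [] },
]).then(({ confirmed, debunked }) => {
  console.log(confirmed.map((f) => f.id), debunked.map((f) => f.id));
});
```

The verifier never sees the finder's reasoning, only the finding itself, which is what keeps the finder's optimism from leaking into the verdict.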

3.2 Dynamic Workflow on a non-technical problem
The quickest way to grasp dynamic workflows is to apply one to a problem that has nothing to do with code.
So I gave Claude a straightforward business plan for a restaurant subscription model and asked it to attack the idea from three hostile angles simultaneously: a risk-averse investor, a demanding customer, and an incumbent competitor. Each agent worked independently. Then a final synthesizer gathered the results and returned the three strongest objections, along with ways I could address them.
Here’s that run (Figure 5), sped up:

This illustrates the fan-out-and-synthesize approach: multiple agents explore the same challenge from distinct angles, and one agent merges their findings. The entire process wrapped up in approximately thirteen seconds.
The real value wasn’t the speed—it was the independence. Since each agent operated in its own context window, they didn’t subtly sway one another or dilute each other’s judgments. Each returned with a unique perspective.
Here’s what they found:
- The investor challenged the numbers: The margins are too slim to withstand customer attrition. At $29 per month with around 40% margin, the business earns roughly $11.60 in gross profit per subscriber monthly. With a $35 acquisition cost per customer, the company needs users to stick around long enough for lifetime value to comfortably exceed acquisition cost. But food subscription services typically struggle with churn, and even one bad retention month can sink the model. Recommendation: shore up the unit economics before growing: boost revenue per user via annual subscriptions or add-ons, demonstrate strong cohort retention, and calculate LTV-to-CAC ratios clearly.
- The customer questioned the appeal: The proposal placed too much emphasis on features like rotating menus and carbon-neutral shipping. While these may look impressive in a pitch deck, they might not be what matters most to people deciding on dinner. Most customers prioritize speed, flexibility, and reducing daily decision fatigue. Recommendation: ground the value proposition in real benefits: highlight time saved, ease of use, and how the service simplifies weeknight cooking.
- The competitor challenged the defensibility: A rotating menu and eco-friendly delivery are easy to replicate. Neither creates meaningful switching costs. A bigger player could copy the visible features, offer lower pricing, or fold the service into an existing delivery platform. Recommendation: develop a more durable competitive edge: city-level logistics density, personalized recommendations, loyalty incentives, or user habits that make switching inconvenient.
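The investor's arithmetic is worth spelling out. The sketch below uses the three figures from the objection ($29 per month, roughly 40% margin, $35 acquisition cost). The churn rate is my own illustrative assumption, since the plan did not state one, and the lifetime-value formula assumes churn is constant from month to month.

```javascript
// Figures from the investor's objection: $29/month, ~40% margin, $35 CAC.
function unitEconomics({ price, margin, cac, monthlyChurn }) {
  const grossProfit = price * margin;      // per subscriber, per month
  const paybackMonths = cac / grossProfit; // months needed to recover CAC
  // Under constant churn, expected lifetime is 1 / churn months.
  const ltv = grossProfit / monthlyChurn;
  return { grossProfit, paybackMonths, ltv, ltvToCac: ltv / cac };
}

// monthlyChurn = 0.10 is an illustrative assumption, not a figure from the plan.
const economics = unitEconomics({ price: 29, margin: 0.4, cac: 35, monthlyChurn: 0.1 });
console.log(
  economics.grossProfit.toFixed(2),
  economics.paybackMonths.toFixed(1),
  economics.ltvToCac.toFixed(2),
);
```

With these inputs, a subscriber takes about three months to pay back the acquisition cost (35 / 11.60 ≈ 3.0). At the assumed 10% monthly churn the LTV-to-CAC ratio is about 3.3, and doubling churn to 20% drops it to about 1.7, which is the fragility the investor was pointing at.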
That’s what made this workflow powerful. It didn’t just deliver generic “feedback on the business plan.” It surfaced three distinct objections from three critical angles: financial viability, customer appeal, and competitive durability. A single conversation would likely have blended these into one polite, somewhat surface-level review. The workflow made the tensions sharper. And the best part: not a single line of code was needed.
3.3 Enable dynamic workflows
The configuration is straightforward. You set the model to Opus 4.8 (more on this shortly), and you can launch the workflow in three ways. The most dependable method is simply to include the word workflow in your prompt. Alternatively, you can set effort to ultracode, which activates enhanced reasoning and allows Claude to autonomously decide whether to construct a workflow. Just be cautious with ultracode—it burns through more tokens, so reserve it for situations where you want automatic orchestration.
The third option applies when you’ve previously built a useful workflow and want to reuse it via /. Workflows can be saved in two places: .claude/workflows/ (project-level; available to anyone who clones the repo) and ~/.claude/workflows/ (personal; accessible across all your projects, but only by you).
The reason Opus 4.8 is important is that the orchestrator carries the heaviest burden. It’s not simply answering a question—it’s figuring out how to divide the task, drafting the workflow script, delegating work to sub-agents, selecting tools, monitoring outputs, and merging everything into a final result. The strategy, then, is: reserve the most capable model for orchestration, and assign smaller or more cost-effective models to the worker agents when the subtasks are more focused.
3.4 Let’s test them out
3.4.1 Default approach
The goal: I provide a multi-file repository and ask Claude to execute workflows that audit this repo using both Fan-out-and-synthesize and Adversarial verification.
Prompt: audit the repo with a workflow: fan out finders and verify each finding, synthesize a severity-ranked report. use 200k token

As shown in Figure 6, Claude designs a workflow with 3 stages: Find → Verify → Synthesis, and deploys 6 finders across 6 categories: security, correctness, data integrity, accessibility, code quality, and repo hygiene. Since I didn’t specify which areas to examine, it proposed these six on its own.
It began executing the workflow. To monitor progress, use the command /workflows.

Within /workflows (Figure 7), 6 agents are active, and the problem is that they’re all running on Opus 4.8, each consuming around 50k tokens. My budget won’t last long.
After 2 minutes, all finders completed their work and identified 50 candidate issues (Figure 8). This means 50 verification agents will now be spun up—one per issue—to determine whether each finding is legitimate or a false positive. And every single one runs on Opus 4.8.
That’s generally overkill. The orchestrator deserves the strongest model because it must architect the workflow, partition the task, coordinate the agents, and synthesize the output. But many verification tasks are more contained: examine this specific issue, review the evidence, and judge whether it holds up. For that kind of targeted work, a less expensive model is usually sufficient.
So in the next experiment, I reassigned the worker agents to Sonnet. The aim wasn’t to weaken the workflow—it was to keep Opus where it adds the most value—orchestration and synthesis—while delegating the repetitive verification tasks to a more economical model.
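One way to picture that split is a role-to-model table that the script consults whenever it spawns an agent. This is only a sketch under my own assumptions: the model names are placeholder labels rather than verified identifiers, and `planAgents` is a hypothetical helper, not part of any generated workflow.

```javascript
// Model names here are placeholder labels, not verified API identifiers.
const MODEL_BY_ROLE = {
  orchestrator: 'opus',
  synthesizer: 'opus',
  finder: 'sonnet',
  verifier: 'sonnet',
};

// Unknown roles fall back to the strongest model rather than a cheap one.
function modelFor(role, table = MODEL_BY_ROLE) {
  return table[role] ?? table.orchestrator;
}

// One finder per category, one verifier per candidate issue, one synthesizer.
function planAgents(finderCount, candidateCount) {
  const roles = [
    ...Array(finderCount).fill('finder'),
    ...Array(candidateCount).fill('verifier'),
    'synthesizer',
  ];
  return roles.map((role) => ({ role, model: modelFor(role) }));
}

const plan = planAgents(6, 50);
console.log(plan.length, plan.filter((agent) => agent.model === 'opus').length);
```

The point of the table is that the expensive model appears once per run, while the number of cheap agents scales with the number of candidate issues.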

3.4.2 Cheaper model for agents
A fresh attempt using Sonnet as the worker agents and Opus as the orchestrator and synthesizer.
Prompt: audit the repository using a workflow — fan out finder agents, verify each finding, then synthesize a severity-ranked report. Use 200k tokens. Assign Sonnet to all agents and use Opus as the orchestrator and synthesizer
As shown in Figure 9, Claude deployed 7 finder agents running Sonnet 4.6, consuming 254k tokens and taking roughly 5 minutes 17 seconds to surface 71 candidate issues. Sonnet clearly requires more time to execute than Opus.

You can review the verification details for each individual issue in the workflows panel, as illustrated in Figure 10.

Verifying all 71 issues consumed approximately 1.5 million tokens. While the cost is considerably lower than using Opus, the execution time for the finder agents is noticeably longer.
Below is the output from the synthesizer (Opus 4.8), shown in Figure 11.

The key takeaway is that you must carefully read the generated report, review it, and make revisions before instructing Claude to start modifying the code.
The finder agents still flagged several issues, and those were subsequently confirmed as legitimate by the verification agents. However, those issues are inherent to the application’s design — they are supposed to work that way — and flagging them only creates additional review overhead for us. Therefore, I want to introduce some constraints into the workflow before executing it so that these particular issues are excluded from the scan.
3.4.3 Adjusting the workflow before execution
Prompt: audit the repository using a workflow — fan out finder agents, verify each finding, then synthesize a severity-ranked report. Use 200k tokens. Assign Sonnet to all agents and use Opus as the orchestrator and synthesizer. Generate the workflow script and provide me with a link so I can review and adjust it before running.

Excellent — Claude provided the workflow script for me to examine and modify before instructing it to execute (simply by typing run the workflow) (Figure 12).
I used a smaller codebase and a simpler prompt to walk through the structure of the JavaScript workflow file in Figure 13.

For my test codebase, here is the scope I decided to narrow down:
    {
      key: 'correctness',
      prompt: `Audit for CORRECTNESS / LOGIC bugs. Focus: the deterministic date-based daily pick, shuffle behavior, the "last 5 worn excluded" history logic (off-by-one, wraparound, per-wardrobe isolation), wardrobe-gender switching, 2-piece/3-piece filter, theme auto-switch by hour (6am-6pm boundaries), localStorage key handling. Trace edge cases (empty male wardrobe, all outfits recently worn). Read app.js and collection.js.`,
    },
    {
      key: 'docs-accuracy',
      prompt: `Audit DOCUMENTATION ACCURACY. Compare README.md and docs/*.md claims against actual code behavior. Focus: features described that don't match implementation, wrong localStorage keys, stale config, deployment steps that won't work, outdated counts ("all 40 outfits"). Read README.md, docs/codebase-summary.md, docs/deployment-guide.md, then verify against the code.`,
    },

From the 'correctness' prompt I removed the shuffle behavior, the theme auto-switch by hour (6am-6pm boundaries), and the edge-case tracing (empty male wardrobe, all outfits recently worn). I also removed the entire 'docs-accuracy' section, then scanned the rest of the JavaScript file to confirm those items were fully gone.
You could also ask Claude to exclude those items, but since it’s straightforward, I preferred to handle it directly.
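If you would rather not hand-edit, dropping a whole finder can also be done with a one-line filter. The sketch below assumes the finder specs sit in an array of `{ key, prompt }` objects, as the excerpt suggests. The array name `finders`, the `security` entry, and the shortened prompts are placeholders of mine.

```javascript
// Finder specs in the { key, prompt } shape shown in the excerpt above.
// The prompts are shortened placeholders, not the real ones.
const finders = [
  { key: 'correctness', prompt: 'Audit for CORRECTNESS / LOGIC bugs.' },
  { key: 'docs-accuracy', prompt: 'Audit DOCUMENTATION ACCURACY.' },
  { key: 'security', prompt: 'Audit for SECURITY issues.' },
];

// Return a new list without the excluded keys; the input is left untouched.
function dropFinders(specs, excludedKeys) {
  const excluded = new Set(excludedKeys);
  return specs.filter((spec) => !excluded.has(spec.key));
}

const narrowed = dropFinders(finders, ['docs-accuracy']);
console.log(narrowed.map((spec) => spec.key));
```

This only removes whole finders. Trimming individual focus items inside a prompt, as I did for 'correctness', still means editing the prompt text itself.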
So, the finder agents now cover 6 areas instead of 7, and one of those areas has a tighter scope (Figure 14).

Six finder agents identified 44 unique candidate issues, of which 40 were confirmed. The entire process, involving 51 agents, took 9 minutes and 52 seconds and consumed approximately 1.66 million tokens.
3.4.4 Comparison with a single-agent approach
I ran the same codebase with a single agent in one pass — no team, no verification step. It found 47 issues — more than the workflow’s 44 — using a third of the tokens. However, since no verification was performed, those 47 included the same 2 false positives that the workflow’s verifier agents had caught and filtered out. I present the comparison in the chart below for easier reference (Figure 15).


For the Haiku agents:
- 15 agents, ~572k subagent tokens, ~3.5 min wall-clock
- 5 Haiku finders → Haiku adversarial verifiers → Opus synthesizer
- 9 initial findings → 3 validated, 6 debunked. Every “high” severity rating was stripped.
As for the Sonnet agents:
- 23 agents, roughly 1.3 million tokens consumed across both passes, ~2.51 minutes
- 5 Sonnet finders → Sonnet adversarial verifiers → Opus synthesizer
- 18 initial findings → 13 validated, 5 dismissed → consolidated into 8 unique issues. No critical or high-severity findings survived adversarial verification.
One notable observation: all 3 issues confirmed during the Haiku run were also identified in the Sonnet run. That’s a higher level of consistency compared to the earlier run. A likely explanation is that this time the prompt gave each agent a focused area to investigate, rather than asking them to examine the entire system from a broad perspective. That makes intuitive sense. The workflow deployed 5 agents, and each one concentrated on a single security dimension. With a tighter scope, each agent could dig deeper into its assigned problem class instead of dividing attention across too many potential issue categories. When an agent isn’t forced to triage across a wide attack surface, it naturally dedicates more of its reasoning capacity to the specific problem it was assigned — and that produces more thorough, repeatable results.
So even when you’re using Dynamic Workflows with isolated subagents, your prompt still needs to be as precise as possible. Tighter prompts reduce variance and steer agents toward the same conclusions — which is exactly what you need when consistency and reliability are the priority.
6. Save the workflow only if it’s worth keeping
A well-saved workflow should feel like a piece of project automation, not a record of one lucky execution. It should be clean enough that any teammate can open it and immediately grasp: who owns it, what inputs it requires, which tools it’s permitted to call, what each sub-agent is accountable for, and what standard of evidence is needed before the workflow can mark the task complete.
If the workflow performed well and you’d like to reuse it, hit s in the workflow menu to save it to ~/.claude/workflows. Alternatively, you can extract the script into a skill if your goal is to share the approach with your team and make it straightforward to apply across similar tasks.
But resist the urge to save a workflow just because the first run went well. A single successful execution only proves it worked once. Save it when the orchestration itself carries value: when the script is easier to review, reuse, and iterate on than writing a fresh Claude Code prompt from the ground up.
Below are several prompt templates for your reference. Fill in your own specifics when you’re ready to use one:
Stress-test a plan: “Take the plan below and run a workflow where independent agents try to break it — a skeptical investor, a demanding customer, an entrenched competitor — each working alone. Then distill the three toughest objections and craft the strongest counterargument for each.”
Audit a repo: “Run a workflow to audit this repository. Spin up agents to hunt for logic bugs, unsafe routes, weak authentication, missing authorization, exposed secrets, risky dependencies, and data leaks. For each finding, launch a separate agent to adversarially verify it: attempt to prove it’s not real. Produce a severity-ranked report with file paths and suggested fixes. Use 200k tokens.”
Make it cheap: “Design it so the finder agents run on model: 'haiku' while the orchestrator stays on Opus 4.8 and handles the final synthesis. Report token usage and wall-clock time.”
Reproduce a flaky test: “This test fails roughly 1 out of 50 runs. Set up a workflow to reproduce it: generate competing theories and adversarially test them in worktrees. /goal don’t stop until one theory is confirmed.”
Verify a draft: “Go through this draft and use a workflow to verify every technical claim against the codebase and source materials. I don’t want to ship anything inaccurate.”
Rank by real priority (tournament): “I have a list of findings/options. Use a workflow to rank them by [real exploitability / impact / whatever matters] — but instead of scoring each one individually, run a pairwise tournament and rank by win count. Then show me the top three and the reasoning behind each.”
Root-cause a heisenbug: “This bug is intermittent and the obvious explanation doesn’t hold up. Use a workflow: divide the investigation by evidence — one agent on symptoms, one on the code, one on data/logs — then have separate agents attempt to disprove each theory, and synthesize the cause that withstands all challenges.”
Triage a backlog safely: “Use a workflow to triage this backlog: classify each item (fix-now / escalate / needs-a-decision), group duplicates into families, and route accordingly. Anything that reads untrusted input must be read-only — keep it isolated from anything that proposes changes.”
Route by task shape: “Use a workflow with a classifier that examines each task and routes it to the cheapest capable model — smaller models for mechanical work, Opus for ambiguous, security-critical reasoning — then executes each on its assigned model.”
Check house rules: “Use a workflow to check this code against our rules in CLAUDE.md — one verifier per rule, plus a skeptic that hunts for false positives. I care more about avoiding false alarms than catching every minor issue.”
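The tournament template above ranks by win count rather than by individual scores. Here is a minimal sketch of that bookkeeping as a round-robin in which every pair is judged once. The judge is a hypothetical stub that compares a numeric `score` field. In a real workflow it would be an agent comparing two options side by side.

```javascript
// Hypothetical judge: given two options, returns the one it prefers.
// Stubbed with a numeric score so the sketch is deterministic.
async function stubJudge(a, b) {
  return a.score >= b.score ? a : b;
}

// Round-robin: every pair is judged once, and options are ranked by wins.
async function rankByWins(options, judge = stubJudge) {
  const wins = new Map(options.map((option) => [option, 0]));
  for (let i = 0; i < options.length; i++) {
    for (let j = i + 1; j < options.length; j++) {
      const winner = await judge(options[i], options[j]);
      wins.set(winner, wins.get(winner) + 1);
    }
  }
  // Sort a copy so the caller's array keeps its original order.
  return [...options].sort((x, y) => wins.get(y) - wins.get(x));
}

rankByWins([
  { name: 'A', score: 1 },
  { name: 'B', score: 3 },
  { name: 'C', score: 2 },
]).then((ranked) => console.log(ranked.map((option) => option.name)));
```

A full round-robin needs n(n-1)/2 judge calls, so for long lists a bracket or a sampled subset of pairs may be the more economical choice.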
Sources
- Thariq Shihipar & Sid Bidasaria (Anthropic), “A harness for every task: dynamic workflows in Claude Code” — the rationale, the patterns, the prompting guidance, save/share.
- Factory.ai, “The Context Window Problem: Scaling Agents Beyond Token Limits”.
- Engineering at Anthropic, “Effective context engineering for AI agents”.
- Chroma Technical Report, “Context Rot: How Increasing Input Tokens Impacts LLM Performance”.
- Anthropic, “Building effective agents” — background on the underlying orchestration patterns.
- Anthropic, “Introducing dynamic workflows in Claude Code”.



