Several weeks ago, we shared our early results from Project Glasswing, examining what happens when leading-edge security models are applied to a corporate codebase. We also discussed how our defensive frameworks adjust to shield our systems and customers from risks introduced by advanced AI. Since that publication, the AI landscape has continued evolving quickly — developers who constructed their workflows around a single model have already faced the consequences when that model becomes outdated or is replaced by a stronger alternative. These industry changes further confirm our fundamental belief: regardless of which model ranks highest on any given day, the path forward for agent-based workflows won’t be found in individual models, prompts, or single-agent interactions.
Transitioning from a specific security “skill” to an ongoing, organization-wide scanning system demands an architecture where models function as swappable components. Depending on a single model inherently narrows your defensive reach, since the same system will consistently examine code through an identical perspective. To address this, models should be regularly swapped and tested against one another. By rotating models throughout the pipeline — for instance, using one model for initial detection and a completely separate one for confirmation — we can verify that vulnerabilities are assessed by different logical frameworks. Moreover, a genuinely enterprise-grade system must extend beyond individual repositories to track vulnerabilities across inter-repo dependencies, ultimately narrowing thousands of initial results down to a reliable, organized queue of actionable fixes.
This article offers a hands-on examination of how to construct that model-neutral layer, with a focus on how we handle state management, remove false positives, and coordinate end-to-end triage at scale.
Our earlier post explained why standard coding agents aren’t suited for this task. The core problem is that agents maintain only one hypothesis at a time, exhaust their context window after examining a small fraction of an actual repo, and then lose information during context compression. For more details, see that post.
Before continuing, we’d like to address two common questions.
**“Why not use subagents instead of a harness?”** Subagents are valuable and serve as a solid starting point. However, security analysis requires hundreds of independent investigations that persist across sessions, operate without a shared context window, and can be refined and cross-referenced later. It requires persistence, deduplication, resumability, and ultimately fleet-wide dependency tracking. That’s an orchestration challenge, and a prompt alone cannot solve it.
**“Is this blog post just promoting frontier models?”** No. Our approach is built around the harness, not any particular model. For vulnerability discovery, we use whichever frontier model currently excels at the task at hand. When we direct different models at the same target, each one uncovers a different subset of the flaws. The harness is what endures. If you’re building your own system, design it to be model-independent from the start. This gives you the flexibility to choose any model without limitation.
It all starts with a skill
We began with a roughly 450-line security-audit skill that we executed on a single repository, refining the prompts until we uncovered genuine bugs. Later, we introduced the orchestration layer that became the backbone of the entire system. The true value resides in the prompts, and our prompts still retain the original skill’s attack scenarios, bug categories, and anti-pattern detection logic nearly intact.
The skill was designed to conduct a 7-phase audit within one session:
– Three parallel research agents perform reconnaissance and produce an architecture.md file.
– One **Hunter** agent runs per attack category, attempting to exploit the code rather than merely reviewing it.
– Adversarial validators attempt to challenge each finding.
– The findings that withstand scrutiny are compiled into a human-readable vulnerability report.
– They’re also output as findings.json conforming to a schema, and an automated check validates that file.
– Finally, a new agent independently re-checks every finding against the source code.
– The findings that survive re-verification are submitted to the ingest API.
That initial skill maps almost directly onto the later harness:
| **Skill phase** | **Harness stage** |
|---|---|
| Recon agents produce architecture.md | Recon |
| Hunters execute per attack class | Hunt |
| Validators challenge findings | Validate |
| Surviving findings become a report | Report |
| findings.json is mechanically verified for schema compliance, not correctness | Mechanical validation of line numbers and functions in findings |
| New agent re-verifies findings | Independent validation |
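The mechanical validation row in the table can be approximated in plain code before any model is involved. The sketch below is an illustration, not the harness's actual validator: the `Finding` fields and the `mechanical_check` helper are hypothetical names, and sources are passed in as an in-memory mapping to keep the example self-contained.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    # Hypothetical fields; the real findings.json schema is not shown in this post.
    file: str
    line: int
    function: str


def mechanical_check(finding: Finding, sources: dict[str, str]) -> list[str]:
    """Return a list of problems; an empty list means the references resolve."""
    text = sources.get(finding.file)
    if text is None:
        return [f"file not found: {finding.file}"]
    problems = []
    line_count = len(text.splitlines())
    if not 1 <= finding.line <= line_count:
        problems.append(f"line {finding.line} outside 1..{line_count}")
    if finding.function not in text:
        problems.append(f"function not found: {finding.function}")
    return problems


sources = {"src/parse.c": "int parse(char *p) {\n  return 0;\n}\n"}
print(mechanical_check(Finding("src/parse.c", 2, "parse"), sources))   # []
print(mechanical_check(Finding("src/parse.c", 99, "decode"), sources))
```

A check like this only confirms that a finding points at something real. Whether the finding is correct is still a question for the later validation stages.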
The skill performed well, but it quickly exposed its limitations. The coverage data shows that a single execution identifies roughly half the bugs you’d detect across multiple runs, and in our experience the ones it does uncover tend to be the more straightforward, less nuanced flaws. Once your workflow essentially becomes “execute it ten times and compare results manually,” it’s time to consider a proper harness.
While operating and refining the skill, we encountered several obstacles, including:
– **Context exhaustion**: After about an hour, the context window is full and the model starts overwriting its own memory, instantly losing track of the bugs it spent all morning pursuing. We solved this by moving all state outside the model, treating the LLM as a stateless processor.
– **Cross-repo reasoning**: A single-repo session is entirely unaware of the connections between applications that depend on it, and the number of bugs that emerge when examining the interfaces between components is likely higher than most people anticipate.
ADVICE: A practical yet minimal harness consists of just Recon, Hunt, and Validate stages stored in a database, plus a separate Validator that cannot submit its own findings. You should bypass cross-repo tracking entirely until you have multiple repositories that are important. Hold off on a dedicated Deduplication agent until you’re actually overwhelmed by false positives. Begin with a skill in your development environment, make sure your prompts perform well, and only introduce the next architectural stage when its absence is the specific bottleneck holding you back.
Codifying the skill into a pipeline
Most AI security write-ups in this space are about a single repo or a curated benchmark; running a whole fleet this way, with cross-repo tracing, isn’t something we’ve seen written up elsewhere. Our codebase spans a massive mix of languages — Rust, Go, C, Lua, TypeScript and Python, alongside various configuration management systems, static configs, and all sorts of additional context. So we had to come up with something new that worked for us. Going from that first slash-command run to a fleet scanner that could cover 128 distinct repos, automatically finding and interrogating relevant dependencies, took about six weeks. Turning the process into code was largely straightforward: each phase of the skill became its own agent, backed by a database and fronted with an orchestrator. The mapping was nearly one-to-one.
The entire fleet runs on a single unified harness with no per-language tweaking. Offloading syntax parsing to a model makes the system language-agnostic, but what truly sets it apart is its ability to track dependencies across repos. The harness doesn’t care whether it’s examining C pointers or a TypeScript file; it concentrates on the higher-level logic of security orchestration. This lets us scale across hundreds of different codebases without having to build custom language parsers.
A two-stage vulnerability research workflow
Our full vulnerability research workflow is structured around a two-stage operational framework: the Vulnerability Discovery Harness (VDH) and the Vulnerability Validation System (VVS).
The VDH serves as our discovery engine, actively scanning codebases to identify potential security vulnerabilities. Once bugs are identified, they flow into the VVS — a system that can receive input from multiple harnesses — where they pass through Deduplication, Judgment, and ultimately Fixing, as described below.
We use one model for the VDH and a separate, entirely different model for the VVS, so the models essentially cross-check each other. The security advantage is clear: by requiring Model B (VVS) to evaluate the output of Model A (VDH), every finding is judged by a completely different set of learned weights and training data — serving as an impartial, adversarial reviewer whose sole purpose is to rigorously challenge Model A’s conclusions. On the operational side, this also lets us treat model providers as interchangeable commodities. Providers can adjust temperature, caching, and inference effort budgets at any time, even within the same model version. Rather than building a system that depends on a model behaving consistently over time, our harness is designed to absorb changes from downstream providers without falling apart.
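One way to picture the model-neutral layer is to reduce every model to a "prompt in, text out" callable and have the pipeline refuse to use the same model for discovery and validation. This is a minimal sketch under that assumption; `ModelSlot`, `CrossCheckPipeline`, the prompts, and the stub models are all hypothetical and do not reflect the harness's real interfaces.

```python
from dataclasses import dataclass
from typing import Callable

# A model is reduced to "prompt in, text out" so providers can be swapped freely.
ModelFn = Callable[[str], str]


@dataclass(frozen=True)
class ModelSlot:
    name: str
    call: ModelFn


class CrossCheckPipeline:
    def __init__(self, discovery: ModelSlot, validation: ModelSlot):
        if discovery.name == validation.name:
            raise ValueError("discovery and validation must use different models")
        self.discovery = discovery
        self.validation = validation

    def run(self, code: str) -> dict[str, str]:
        candidate = self.discovery.call(f"Find a vulnerability in:\n{code}")
        verdict = self.validation.call(f"Try to refute this finding:\n{candidate}")
        return {
            "candidate": candidate,
            "verdict": verdict,
            "found_by": self.discovery.name,
            "judged_by": self.validation.name,
        }


# Stub models stand in for real provider clients.
model_a = ModelSlot("model-a", lambda prompt: "possible overflow in copy()")
model_b = ModelSlot("model-b", lambda prompt: "confirmed")
result = CrossCheckPipeline(model_a, model_b).run("void copy(char *dst, char *src);")
print(result["found_by"], "->", result["judged_by"])
```

Because the pipeline depends only on the callable, replacing a provider means constructing a different `ModelSlot`; nothing else in the orchestration changes.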
Stage 1: Vulnerability Discovery Harness (VDH)
The first post covered what each agent and stage is responsible for, so here we’ll focus on the parts it didn’t: the glue connecting the stages, and the handful of details that make or break the whole system.
| Agent/stage | Primary Role | Sub-agents / Tooling |
|---|---|---|
| Recon | Maps out the target architecture and identifies potential threat vectors | 3 parallel Recon sub-agents write |
| Hunt | Executes per-class attacks, compiles fragments, probes binaries | Spawns sibling agents (about 9% of fleet-wide tasks on average, ranging from near zero to roughly 20% depending on the model). It also interacts with and writes to the Wishlist tool. |
| Validate | Mechanically verifies the finding, then attempts to adversarially refute it | Runs in two passes: plain code handles initial schema and path checks, then a single isolated agent attempts to refute the finding before it can be filed. |
| Gapfill | Creates new hunt tasks for gaps in coverage | Queues fresh hunt tasks for any under-tested (area × attack-class) cells that still appear thin |
| Dedup | Detects and consolidates overlapping findings | Combines deterministic code and agents to group findings by root cause and merge them together in real time |
| Trace | Walks the dependency graph and spawns tasks in consumer repos | Traverses the graph to add hunt tasks inside every identified consumer repo, ensuring cross-repo vulnerabilities are caught |
| Feedback | Learns from prior reports and improves future runs | Takes validation failures, shallow scans, and repeated misses, and immediately rewrites queued prompts to make future tasks sharper |
| Report | Produces a human-readable report | Just a script; no model needed |
Table 1: Vulnerability Discovery Harness (VDH)
Stages four through eight operate as a continuous producer-consumer loop. As the initial hunt progresses, the Gapfill, Feedback, and Trace agents generate new tasks; Dedup merges overlapping findings back together, and the rest of the loop keeps processing the queue. This ensures that a vulnerability discovered late in the cycle still gets validated, reported, and checked against other codebases to confirm the same bug doesn’t exist elsewhere — all within the same run.
Dividing the pipeline this way guarantees tight context controls. When the context window fills up, the model starts hallucinating. Each agent’s job is kept extremely focused, holding context usage below 25% of the total window. A naive “read everything” approach will blow past this limit every time.
One lesson that caught us off guard: persistence needs to be built in before parallelism. You don’t want to lose a five-hour run because of an unexpected error. Every stage writes to a single SQLite database keyed by (run_id, repo, stage). Any stage can resume, retry, or get pulled into …
Results are streamed and saved incrementally, so if a crash occurs, only the current task is lost and no prior work needs to be repeated.
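The persistence idea can be sketched with Python's built-in `sqlite3` module. The (run_id, repo, stage) key comes from the text above; the table name, the `status` and `payload` columns, and the helper functions are assumptions made for illustration.

```python
import json
import sqlite3


def open_state(path: str = ":memory:") -> sqlite3.Connection:
    db = sqlite3.connect(path)
    db.execute(
        """CREATE TABLE IF NOT EXISTS stage_state (
               run_id TEXT NOT NULL,
               repo TEXT NOT NULL,
               stage TEXT NOT NULL,
               status TEXT NOT NULL,
               payload TEXT NOT NULL,
               PRIMARY KEY (run_id, repo, stage)
           )"""
    )
    return db


def save_stage(db, run_id, repo, stage, status, payload):
    # INSERT OR REPLACE keeps exactly one row per (run_id, repo, stage).
    db.execute(
        "INSERT OR REPLACE INTO stage_state VALUES (?, ?, ?, ?, ?)",
        (run_id, repo, stage, status, json.dumps(payload)),
    )
    db.commit()  # commit per write so a crash loses only the in-flight task


def pending_stages(db, run_id, repo, stages):
    rows = db.execute(
        "SELECT stage FROM stage_state WHERE run_id = ? AND repo = ? AND status = 'done'",
        (run_id, repo),
    )
    done = {row[0] for row in rows}
    return [stage for stage in stages if stage not in done]


db = open_state()
save_stage(db, "run-1", "parser", "recon", "done", {"areas": ["lexer"]})
save_stage(db, "run-1", "parser", "hunt", "failed", {"error": "timeout"})
todo = pending_stages(db, "run-1", "parser", ["recon", "hunt", "validate"])
print(todo)  # ['hunt', 'validate']
```

On restart, the orchestrator in this sketch would only need to ask for the pending stages of each (run_id, repo) pair and pick up from there.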
ADVICE: Occasionally, a temporary API error appears as plain text inside a 200 OK response stream instead of raising an exception. To the orchestrator, this looks identical to a successfully completed task. You must inspect the response content rather than relying solely on exception types; otherwise, you risk recording failed runs as successful ones.
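One hedged way to apply this advice is to define success by the shape of the output rather than by the absence of an exception. The sketch below assumes, purely for illustration, that a completed task returns a JSON object with a `findings` list; that contract and the `task_succeeded` helper are hypothetical.

```python
import json


def task_succeeded(status: int, body: str) -> bool:
    """Treat a task as successful only if the body has the expected structure."""
    if status != 200:
        return False
    try:
        data = json.loads(body)
    except json.JSONDecodeError:
        # Plain-text error banners inside a 200 response end up here.
        return False
    return isinstance(data, dict) and isinstance(data.get("findings"), list)


print(task_succeeded(200, '{"findings": []}'))                  # True
print(task_succeeded(200, "upstream error: please retry later"))  # False
```

A clean run with zero findings still passes, because an empty list is a valid result, while an error message that arrived with a success status does not.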
During the Recon phase, the agent creates its own threat model instead of receiving a pre-defined one. In addition to around ten built-in attack categories (including various types of injection attacks, memory corruption, protocol parsing issues, timing side channels, and more), the Recon agent can generate repository-specific attack classes on the fly, each with its own tailored approach. It produces a custom classification system designed specifically for that codebase, which helps narrow the focus of the Hunter agents.
Simply reading source code isn’t sufficient to understand how it performs under pressure, particularly for subtle undefined-behavior issues in C and similar low-level languages. The Hunter agents go beyond passive code review and engage in active testing. They compile code fragments, construct minimal test versions, and probe them for weaknesses. The most significant improvement in detection quality came from providing Hunters with a sandbox environment (built using unshare) where they can safely crash binaries.
ADVICE: If the harness itself runs inside Docker, that sandbox requires seccomp=unconfined and apparmor=unconfined, or it will fail to start without any error message. This single-line configuration change can save you a full day of troubleshooting if you’re not deeply familiar with nested container setups, as we weren’t.
Micro-forks and the wishlist
Beyond the main pipeline stages, we introduced two specialized features that give Hunters considerable freedom to adjust their focus and request external resources without disrupting an ongoing analysis:
– **Sibling forking**: This keeps a Hunter agent from going off-track when it discovers an interesting code path outside its current scope. The Hunter uses a tool call to spawn a sibling agent with a specific structural seed. Across the entire fleet, this accounts for approximately 9% of all tasks, though the frequency varies significantly by model, ranging from nearly zero to about one-fifth depending on which model is performing the hunt.
– **The wishlist**: When an agent requires a tool it doesn’t currently have, it adds an entry to a shared wishlist. Typical cases are a Validator confirming a Proof of Concept (PoC) or a Hunter needing to set up something, such as a particular build environment, a virtual machine, or production configuration files. Each entry includes enough context for the system to automatically retry that exact task once a human supplies the missing dependency. Some of these requests can be partially self-resolving: if the container needs to be rebuilt with certain modifications, this can happen automatically after the run by having a generic coding harness monitor the logs.
The wishlist has been used 25,472 times across 128 repositories since its introduction, and it serves as the primary way the agents communicate their needs back to us. One request that came in while we were writing this: “I need a FreeBSD VM to verify this PoC end-to-end.”
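A wishlist of this kind can be modeled as a list of requests that each remember which task to retry. The following is a small sketch, not the harness's implementation; the `WishlistEntry` fields and the exact-match resolution rule are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class WishlistEntry:
    task_id: str   # task to retry once the need is met
    role: str      # e.g. "hunter" or "validator"
    need: str      # free-text description of the missing resource
    resolved: bool = False


@dataclass
class Wishlist:
    entries: list[WishlistEntry] = field(default_factory=list)

    def request(self, task_id: str, role: str, need: str) -> None:
        self.entries.append(WishlistEntry(task_id, role, need))

    def resolve(self, need: str) -> list[str]:
        """Mark every open entry for `need` as resolved and return the tasks to retry."""
        retry = []
        for entry in self.entries:
            if entry.need == need and not entry.resolved:
                entry.resolved = True
                retry.append(entry.task_id)
        return retry


wishlist = Wishlist()
wishlist.request("validate-17", "validator", "FreeBSD VM")
wishlist.request("hunt-42", "hunter", "FreeBSD VM")
wishlist.request("hunt-43", "hunter", "production config files")
to_retry = wishlist.resolve("FreeBSD VM")
print(to_retry)  # ['validate-17', 'hunt-42']
```

Grouping by need means one human action, such as provisioning a VM, can unblock several queued tasks at once.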
Fleet-wide cross-repo tracing
After the initial cleanup, a Tracer agent examines how different software components interact. It searches for a specific pathway: can a potential attacker deliver malicious input from an external source to a vulnerable component within the system? If the answer is yes, the Tracer agent automatically initiates new hunt tasks within the consuming repository. To enable this, you need a unified symbol index spanning all repositories and a precise dependency graph. This approach allows you to uncover deep, systemic vulnerabilities that a standard single-repository scan would overlook.
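The graph-walking part of tracing can be sketched as a breadth-first search over an inverted dependency graph. This example covers only that step: it assumes the dependency graph is already available as a mapping, it does not attempt the reachability judgment the Tracer agent makes, and the function names and task fields are hypothetical.

```python
from collections import deque


def consumers_of(repo: str, depends_on: dict[str, set[str]]) -> list[str]:
    """Return every repo that directly or transitively depends on `repo`."""
    used_by: dict[str, set[str]] = {}
    for consumer, deps in depends_on.items():
        for dep in deps:
            used_by.setdefault(dep, set()).add(consumer)

    seen, order, queue = {repo}, [], deque([repo])
    while queue:
        current = queue.popleft()
        for consumer in sorted(used_by.get(current, ())):
            if consumer not in seen:
                seen.add(consumer)
                order.append(consumer)
                queue.append(consumer)
    return order


def trace_tasks(finding_id: str, repo: str, depends_on: dict[str, set[str]]) -> list[dict]:
    """Queue one hunt task per consumer repo, seeded with the original finding."""
    return [
        {"repo": consumer, "stage": "hunt", "seed_finding": finding_id}
        for consumer in consumers_of(repo, depends_on)
    ]


graph = {"api": {"parser"}, "edge": {"api"}, "cli": {"parser"}, "docs": set()}
print(consumers_of("parser", graph))  # ['api', 'cli', 'edge']
```

The `seen` set keeps the walk from looping on circular dependencies, which are common in large fleets.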
Running our harness across an entire fleet of repositories revealed two key lessons that only became apparent at scale.
First, deduplication is a substantial challenge that requires its own dedicated agents. When scanning just a few repositories, you can manually inspect overlapping findings. However, simple string matching or file-path comparisons won’t suffice. Determining whether two complex logic flaws stem from the same underlying bug may sound straightforward, but it isn’t. It demands so much analytical reasoning that we had to deploy specialized Dedup agents to filter out the noise, complete with their own heuristics and workload reduction strategies.
Second, avoid integrating static analysis too early. We fully integrated Semgrep throughout the pipeline, yet the Hunters never used it once during an entire month of runs. They prefer to read and execute the code directly. In contrast, the wishlist was the single most frequently used tool in the system. It’s important to observe what the agents actually choose to use, rather than assuming what they’ll need.
Making findings you can trust
The agent might modify the source code to make its own exploit work, then enthusiastically report the bug it just introduced. It could write a test that proves something completely self-evident like “exec() executes things, therefore critical vulnerability”. Or it might build an exploit that runs without errors but demonstrates nothing meaningful, because the underlying threat model is flawed. If your harness doesn’t actively guard against this, all you’ve created is a faster way to generate low-quality results.
A Hunter must articulate the threat model before it’s permitted to submit any finding. It needs to clearly define who the attacker is, what boundary the vulnerability crosses, or which assumption it violates. The output schema’s structure enforces this requirement. This prerequisite eliminates meaningless findings, such as scenarios where “if a user has database …”
The initial validation rejection rate fell from 40% to 11%, while the proportion of high-integrity findings rose from 35% to 58% (representing approximately 12,057 lifetime findings).
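The threat-model requirement on Hunter findings can be enforced structurally, so a finding without one is rejected before any model reviews it. The sketch below is one possible reading of that rule: it assumes an attacker must always be named and that at least one of the boundary or the violated assumption must be stated. The field and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThreatModel:
    attacker: str
    boundary_crossed: str = ""
    assumption_violated: str = ""


def threat_model_errors(tm: ThreatModel) -> list[str]:
    """Return the reasons a threat model is too thin to accompany a finding."""
    errors = []
    if not tm.attacker.strip():
        errors.append("attacker must be named")
    if not (tm.boundary_crossed.strip() or tm.assumption_violated.strip()):
        errors.append("state the boundary crossed or the assumption violated")
    return errors


complete = ThreatModel(
    attacker="unauthenticated remote client",
    boundary_crossed="network request to privileged parser",
)
vague = ThreatModel(attacker="someone")
print(threat_model_errors(complete))  # []
print(threat_model_errors(vague))
```

A structural gate like this cannot tell whether the stated attacker is realistic; it only guarantees the validator has a concrete claim to refute.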
Below is the lifetime breakdown from raw candidates to actionable findings, captured at the time this blog post was published.
Vulnerability Discovery Harness (VDH)
- Raw candidates: All outputs produced by the discovery harness before any independent validation took place.
- Needs repro: Findings that looked promising but had to be manually reproduced before they could be trusted.
- Rejected at validation: The validator disproved the threat model, exploit path, affected code, or supporting evidence.
- Duplicates: Candidates that overlapped with another finding from the same harness and were merged.
- Survived validation: Findings that cleared the independent validation gate and advanced into the VVS.
- Bugs that went elsewhere: Findings intentionally directed outside this workflow.
Vulnerability Validation System (VVS)
- Another vulnerability harness: Additional automated sources feeding into the same validation system.
- Total bugs in system: The combined pool after all sources were ingested.
- Duplicates: Findings flagged by the dedup pass as already represented by another canonical finding or ticket.
- Wrong repo / other / not a risk: The noise bucket — misattributed findings, defense-in-depth items, or latent risks.
- Bugs sent to teams: Finalized, clean findings prepared for remediation.
- Judged Internet-exploitable: High-urgency findings that a realistic attacker could trigger in a production environment.
- Not judged Internet-exploitable: Lower-urgency but still actionable bugs (production issues, dependency risks, or configuration errors).
- Final severity split: The priority categorization used to assign work to engineering teams.
The harness’s core metric isn’t a speculative recall score — it’s keeping the number of unconfirmed findings reaching human reviewers as close to zero as possible. The architecture must act as a relentless filtering funnel.
Of the 20,799 raw candidates generated by VDH, only around 12,057 survived validation.
When these were pushed into the VVS, joining findings from another harness, the central pool reached 13,841.
The Dedup agent collapsed 5,442 findings as duplicates.
1,154 were routed to the queue as ‘wrong-repo’ or ‘low-risk’ and recycled back into the system where appropriate.
This ultimately left 7,245 actionable findings for engineering teams to address.
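The funnel figures above can be checked with a few lines of arithmetic. The numbers are taken directly from the text; only the variable names are invented.

```python
raw_candidates = 20_799
survived_validation = 12_057
vvs_pool = 13_841        # after findings from another harness joined
duplicates = 5_442
noise = 1_154            # wrong-repo or low-risk

actionable = vvs_pool - duplicates - noise
survival_rate = survived_validation / raw_candidates

print(actionable)              # 7245
print(f"{survival_rate:.0%}")  # 58%
```

The 58% survival rate matches the proportion of high-integrity findings quoted earlier, and the subtraction reproduces the 7,245 actionable findings.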
Traditional compliance rules impose arbitrary remediation windows based solely on a static CVSS score (for example, “Fix all Highs in 30 days”). Our contextual judgment layer transforms this compliance checkbox into genuine risk management.
The architecture can trace findings back to their origin, so fixing a single root cause resolves an entire cluster of findings rather than patching individual issues one by one. VDH system performance is also measured by dividing repos into (area × attack-class) cells and running the Gapfill agent iteratively until it stops producing findings. Whenever we update an underlying prompt, we test it against a held-out repository to see whether that total coverage cell number actually improves.
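The (area × attack-class) grid lends itself to a simple coverage calculation. This is a sketch of how thin cells might be identified for Gapfill, assuming coverage is tracked as a count of completed hunts per cell; the threshold and all names are illustrative.

```python
from itertools import product


def thin_cells(areas, attack_classes, hunts_done, min_hunts=1):
    """Return the (area, attack_class) cells with fewer than `min_hunts` hunts."""
    return [
        cell
        for cell in product(areas, attack_classes)
        if hunts_done.get(cell, 0) < min_hunts
    ]


areas = ["parser", "auth"]
attack_classes = ["injection", "memory"]
hunts_done = {("parser", "injection"): 2, ("auth", "memory"): 1}

gaps = thin_cells(areas, attack_classes, hunts_done)
coverage = 1 - len(gaps) / (len(areas) * len(attack_classes))
print(gaps)      # [('parser', 'memory'), ('auth', 'injection')]
print(coverage)  # 0.5
```

Each returned cell would become a fresh hunt task, and the loop repeats until a pass over the grid yields no new findings.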
The harness also uses automated health signals to catch system failures early in the pipeline. If a hunt finishes suspiciously fast and fails to spawn sub-hunts or gap tasks, it usually signals a crashed dependency rather than a clean codebase. To address this, the system flags any Hunter agent that finishes with zero findings as “shallow” and immediately requeues it for a fresh run.
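The shallow-run rule can be expressed as a small decision function. The retry cap below is an assumption added so that a genuinely clean area does not requeue forever; the text does not say how the harness bounds retries, and the names here are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HuntResult:
    task_id: str
    findings: int
    attempts: int


def next_action(result: HuntResult, max_attempts: int = 3) -> str:
    """Decide what to do with a finished hunt."""
    if result.findings > 0:
        return "validate"
    if result.attempts < max_attempts:
        return "requeue"  # zero findings: flagged as shallow, run again
    return "accept-empty"  # assumed cap so empty areas eventually settle


print(next_action(HuntResult("hunt-7", findings=0, attempts=1)))  # requeue
print(next_action(HuntResult("hunt-7", findings=0, attempts=3)))  # accept-empty
print(next_action(HuntResult("hunt-8", findings=2, attempts=1)))  # validate
```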
Finally, the system’s robustness is reinforced by the independent triage pass described earlier. By re-judging all submissions with a different model and separate logical weights, we ensure an unbiased, adversarial verification that is decoupled from the specific model used for discovery — providing a trust layer that holds regardless of which model is in use.
None of this is finished. We continuously evolve our system, and it is far from a perfect science. But raw candidate findings are inexpensive now, and the only work that truly matters is converting them into sound, verifiable code fixes.
Building your own harness means accepting that AI models are volatile, but your orchestration layer doesn’t have to be. By decoupling your security logic from any single provider, enforcing adversarial verification, and automating your triage pipeline, you can transform a mountain of LLM noise into a reliable, fleet-wide defense engine.
Our “North Star” metrics: measuring real-world velocity
Every codebase is a bit different, so to illustrate how this works in practice, we mapped out a realistic benchmark based on a standard repo run. Keep in mind that this represents a single pass on one repo; over time, as the continuous fleet-wide loop deduplicates, filters, and recycles findings, it reduces the volume of lifetime candidates by roughly 65%.
Engineering hours saved via automated patching: Rather than focusing on static baselines, we gauge the health of our pipeline by its technical throughput, processing velocity, and its ability to eliminate the manual triage bottleneck:
Initial Validation Cut: For a standard repository (~30k lines of code), a full run produces about 100 initial findings and takes 3–4 hours, maintaining a hyperfocused context window throughout.
Compression: The Deduplication and Contextual Judgment Layers process these candidates in parallel. Within 3 hours, the system compresses and refines the batch from ~100 raw candidates down to 80 distinct, high-fidelity bugs.
Remediation: The automated Fixer processes these 80 distinct bugs at an average rate of 5 minutes per bug. In total, the system can discover, validate, deduplicate, and open functional pull requests in approximately 14 hours.
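The 14-hour figure follows from the three stage timings if they are added up. The calculation below assumes the stages run back to back and that the Fixer handles bugs one at a time; neither assumption is stated in the text.

```python
hunt_hours = (3, 4)          # full run, low and high estimate
compression_hours = 3
distinct_bugs = 80
fix_minutes_per_bug = 5

fix_hours = distinct_bugs * fix_minutes_per_bug / 60
total_low = hunt_hours[0] + compression_hours + fix_hours
total_high = hunt_hours[1] + compression_hours + fix_hours

print(f"{fix_hours:.1f} h fixing, {total_low:.1f} to {total_high:.1f} h end to end")
# 6.7 h fixing, 12.7 to 13.7 h end to end
```

Under those assumptions the upper estimate rounds to the quoted 14 hours, with fixing accounting for roughly half of the total.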
Accelerating the resolution of critical vulnerabilities: Obviously, pushing 80 patches into production simultaneously would be a recipe for chaos. To ensure safe deployments, our approach relies on a phased rollout strategy:
Containing Critical Exposure: The system identifies the most severe, high-risk, and actively exploitable flaws (roughly 10 out of the 80 on average). These are fast-tracked for manual review and slotted into upcoming release cycles, ensuring they are fully remediated in production within 5 days.
Gradual Hardening: The remaining latent issues, minor configuration irregularities, and lower-priority bugs are slowly introduced into production over a 15-20 day period to maintain overall platform stability.
Our approach to managing this volume of patching
These results stem from a contained, sandboxed research experiment built to rigorously test our codebase. They do not reflect any live, unresolved vulnerabilities currently affecting our production environment.
Since the testing harness operates continuously within our staging environments, these particular figures are already outdated by the time this article is published. Each vulnerability flagged by the pipeline was accompanied by a reproducible test case demonstrating the issue along with a proposed fix. Our security teams are methodically working through the reports and implementing the corresponding corrections, which means the Cloudflare products you rely on daily are already being actively fortified against these attack vectors.
Alongside this blog post, we are publishing the foundational skill we used to build the harness. It has been lightly refined for public release to make it more accessible and straightforward to adopt, though the core functionality remains largely unchanged. We hope to release the full harness itself soon. This can serve as a launching point for building your own vulnerability scanning harness, developing your own custom skill, or adapting it however best fits your requirements:
github.com/cloudflare/security-audit-skill
If your organization is tackling similar challenges and you’d like to exchange insights, contact us at [email protected].



