Code review is a fantastic mechanism for catching bugs and sharing knowledge, but it is also one of the most reliable ways to bottleneck an engineering team. A merge request sits in a queue, a reviewer eventually context-switches to read the diff, they leave a handful of nitpicks about variable naming, the author responds, and the cycle repeats. Across our internal projects, the median wait time for a first review was often measured in hours.
When we first started experimenting with AI code review, we took the path that most other people probably take: we tried out a few different AI code review tools and found that a lot of them worked quite well, and a number of them even offered a decent amount of customisation and configurability! Unfortunately, though, the one recurring theme that kept coming up was that they simply didn't offer enough flexibility and customisation for an organisation the size of Cloudflare.
So, we jumped to the next most obvious path, which was to grab a git diff, shove it into a half-baked prompt, and ask a large language model to find bugs. The results were exactly as noisy as you might expect, with a flood of vague suggestions, hallucinated syntax errors, and helpful advice to "consider adding error handling" on functions that already had it. We realised quite quickly that a naive summarisation approach wasn't going to give us the results we wanted, especially on complex codebases.
Instead of building a monolithic code review agent from scratch, we decided to build a CI-native orchestration system around OpenCode, an open-source coding agent. Today, when an engineer at Cloudflare opens a merge request, it gets an initial pass from a coordinated smörgåsbord of AI agents. Rather than relying on one model with a massive, generic prompt, we launch up to seven specialised reviewers covering security, performance, code quality, documentation, release management, and compliance with our internal Engineering Codex. These specialists are managed by a coordinator agent that deduplicates their findings, judges the actual severity of the issues, and posts a single structured review comment.
We have been running this system internally across tens of thousands of merge requests. It approves clean code, flags real bugs with impressive accuracy, and actively blocks merges when it finds genuine, serious problems or security vulnerabilities. This is just one of the many ways we're improving our engineering resiliency as part of Code Orange: Fail Small.
This post is a deep dive into how we built it, the architecture we landed on, and the actual engineering problems you run into when you try to put LLMs in the critical path of your CI/CD pipeline, and more critically, in the way of engineers trying to ship code.
The architecture: plugins all the way to the moon
When you are building internal tooling that has to run across thousands of repositories, hardcoding your version control system or your AI provider is a great way to guarantee you'll be rewriting the whole thing in six months. We needed to support GitLab today and who knows what tomorrow, alongside different AI providers and different internal standards requirements, without any component needing to know about the others.
We built the system on a composable plugin architecture, where the entry point delegates all configuration to plugins that compose together to define how a review runs. Here is what the execution flow looks like when a merge request triggers a review:
Each plugin implements a ReviewPlugin interface with three lifecycle phases. Bootstrap hooks run concurrently and are non-fatal, meaning that if a template fetch fails, the review simply continues without it. Configure hooks run sequentially and are fatal, because if the VCS provider can't connect to GitLab, there is no point in continuing the job. Finally, postConfigure runs after the configuration is assembled, to handle asynchronous work like fetching remote model overrides.
The ConfigureContext gives plugins a controlled surface to affect the review. They can register agents, add AI providers, set environment variables, inject prompt sections, and change fine-grained agent permissions. No plugin has direct access to the final configuration object. They contribute through the context API, and the core assembler merges everything into the opencode.json file that OpenCode consumes.
Because of this isolation, the GitLab plugin doesn't read Cloudflare AI Gateway configurations, and the Cloudflare plugin doesn't know anything about GitLab API tokens. All VCS-specific coupling is isolated in a single ci-config.ts file.
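The lifecycle is easier to see in code. Here is a minimal sketch of the shape described above; the interface and hook names are illustrative approximations, not our actual internal API:

```typescript
// Illustrative sketch of the plugin lifecycle described above.
interface ConfigureContext {
  registerAgent(name: string, prompt: string): void;
  addProvider(name: string, config: Record<string, unknown>): void;
  setEnv(key: string, value: string): void;
}

interface ReviewPlugin {
  name: string;
  bootstrap?(): Promise<void>;                          // concurrent, non-fatal
  configure?(ctx: ConfigureContext): Promise<void>;     // sequential, fatal
  postConfigure?(ctx: ConfigureContext): Promise<void>; // after assembly
}

async function runPlugins(plugins: ReviewPlugin[], ctx: ConfigureContext) {
  // Bootstrap hooks run concurrently; failures are tolerated, so a dead
  // template server cannot take the whole review down.
  await Promise.allSettled(plugins.map((p) => p.bootstrap?.()));
  // Configure hooks run sequentially; any rejection aborts the job.
  for (const p of plugins) await p.configure?.(ctx);
  // postConfigure handles async follow-up work once the config exists.
  for (const p of plugins) await p.postConfigure?.(ctx);
}
```

The asymmetry is the point: `Promise.allSettled` swallows bootstrap failures, while a rejection in the sequential configure loop propagates and kills the run.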
Here is the plugin roster for a typical internal review, by responsibility:
- GitLab VCS provider, MR data, MCP comment server
- AI Gateway configuration, model tiers, fallback chains
- Internal compliance checking against engineering RFCs
- Distributed tracing and observability
- Verifying that the repo's AGENTS.md is up to date
- Remote per-reviewer model overrides from a Cloudflare Worker
- Fire-and-forget review tracking
How we use OpenCode under the hood
We picked OpenCode as our coding agent of choice for a few reasons:
We use it extensively internally, meaning we were already very familiar with how it works
It's open source, so we can contribute features and bug fixes upstream, as well as investigate issues really easily when we spot them (at the time of writing, Cloudflare engineers have landed over 45 pull requests upstream!)
It has a great open source SDK, allowing us to easily build plugins that work flawlessly
But most importantly, because it's structured as a server first, with its text-based user interface and desktop app acting as clients on top. This was a hard requirement for us, because we needed to create sessions programmatically, send prompts through an SDK, and collect results from multiple concurrent sessions without hacking around a CLI interface.
The orchestration works in two distinct layers:
The Coordinator Process: We spawn OpenCode as a child process using Bun.spawn. We pass the coordinator prompt through stdin rather than as a command-line argument, because if you have ever tried to pass a massive merge request description full of logs as a command-line argument, you have probably met the Linux kernel's ARG_MAX limit. We discovered this pretty quickly when E2BIG errors started showing up on a small percentage of our CI jobs for particularly large merge requests. The process runs with --format json, so all output arrives as JSONL events on stdout:
const proc = Bun.spawn(
  ["bun", opencodeScript, "--print-logs", "--log-level", logLevel,
   "--format", "json", "--agent", "review_coordinator", "run"],
  // Prompt goes in via stdin (not argv) to stay clear of ARG_MAX
  { stdin: "pipe", stdout: "pipe", stderr: "pipe" },
);
The Review Plugin: Inside the OpenCode process, a runtime plugin provides the spawn_reviewers tool. When the coordinator LLM decides it's time to review the code, it calls this tool, which launches the sub-reviewer sessions through OpenCode's SDK client:
const createResult = await this.client.session.create({ title: input.agent });

// Send the prompt asynchronously (non-blocking)
this.client.session.promptAsync({
  path: { id: createResult.data.id },
  body: {
    parts: [{ type: "text", text: prompt }],
    agent: input.agent,
    model: input.model,
  },
});
Each sub-reviewer runs in its own OpenCode session with its own agent prompt. The coordinator doesn't see or control what tools the sub-reviewers use. They're free to read source files, run grep, or search the codebase as they see fit, and they simply return their findings as structured XML when they finish.
What's JSONL, and what do we use it for?
One of the big challenges you typically face when working with systems like this is the need for structured logging, and while JSON is a perfectly good structured format, it requires everything to be "closed out" before it is a valid JSON blob. That is especially problematic if your application exits early, before it has a chance to close everything out and write valid JSON to disk, and that is often exactly when you need the debug logs most.
That is why we use JSONL (JSON Lines), which does exactly what it says on the tin: it's a text format where every line is a valid, self-contained JSON object. Unlike a standard JSON array, you don't have to parse the whole document to read the first entry. You read a line, parse it, and move on. This means you don't have to worry about buffering massive payloads into memory, or hoping for a closing ] that may never arrive because the child process ran out of memory.
In practice, it looks like this:
Stripped: authorization, cf-access-token, host
Added: cf-aig-authorization: Bearer
cf-aig-metadata: {"userId": ""}
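A minimal consumer of such a stream is only a few lines. This sketch (not our production parser) shows the core trick: keep the trailing partial line buffered, and parse every complete line as chunks arrive:

```typescript
// Minimal JSONL consumer: parse each complete line as it streams in.
function createJsonlParser(onEvent: (event: Record<string, unknown>) => void) {
  let buffer = "";
  return (chunk: string) => {
    buffer += chunk;
    const lines = buffer.split("\n");
    buffer = lines.pop() ?? ""; // keep the incomplete tail for the next chunk
    for (const line of lines) {
      if (!line.trim()) continue;
      try {
        onEvent(JSON.parse(line));
      } catch {
        // A corrupt line costs one event, not the entire log
      }
    }
  };
}
```

Even if the child process dies mid-write, every line it fully flushed before the crash is still parseable.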
Every CI system that needs to parse structured output from a long-running process eventually lands on something like JSONL, but we didn't want to reinvent the wheel. (And OpenCode already supports it!)
We process the coordinator's output in real time, though we buffer and flush every 100 lines (or 50 ms) to save our disks from a slow but painful appendFileSync death.
We watch for specific triggers as the stream flows in and pull out relevant data, like token usage from step_finish events to track costs, and we use error events to kick off our retry logic. We also make sure to keep an eye out for output truncation: if a step_finish arrives with reason: "length", we know the model hit its max_tokens limit and got cut off mid-sentence, so we should automatically retry.
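In sketch form, with the event field names ("type", "reason", "usage") assumed from the description above rather than taken from OpenCode's documented schema:

```typescript
// Illustrative trigger handling for the coordinator's event stream.
interface StreamStats {
  totalTokens: number; // cost tracking from step_finish events
  truncated: boolean;  // model hit max_tokens and was cut off
  errors: number;      // feeds the retry logic
}

function handleEvent(event: Record<string, any>, stats: StreamStats): void {
  switch (event.type) {
    case "step_finish":
      stats.totalTokens += event.usage?.totalTokens ?? 0;
      if (event.reason === "length") stats.truncated = true; // auto-retry
      break;
    case "error":
      stats.errors += 1;
      break;
  }
}
```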
One of the operational headaches we didn't predict was that large, advanced models like Claude Opus 4.7 or GPT-5.4 can sometimes spend quite a while thinking through a problem, and to our users this can look exactly like a hung job. We found that users would frequently cancel jobs and complain that the reviewer wasn't working as intended, when in reality it was working away in the background. To counter this, we added a very simple heartbeat log that prints "Model is thinking… (Ns since last output)" every 30 seconds, which almost entirely eliminated the problem.
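The heartbeat itself is about as simple as logging gets. A sketch, with the interval made injectable purely for testability (production uses the 30 seconds described above):

```typescript
// Report time since the last stream output, so long "thinking" pauses
// do not look like a hung job.
function startHeartbeat(
  lastOutputAt: () => number,
  log: (msg: string) => void,
  intervalMs = 30_000,
): ReturnType<typeof setInterval> {
  return setInterval(() => {
    const silentSecs = Math.round((Date.now() - lastOutputAt()) / 1000);
    log(`Model is thinking... (${silentSecs}s since last output)`);
  }, intervalMs);
}
```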
Specialised agents instead of one big prompt
Instead of asking one model to review everything, we split the review into domain-specific agents. Each agent has a tightly scoped prompt telling it exactly what to look for and, more importantly, what to ignore.
The security reviewer, for example, has explicit instructions to only flag issues that are "exploitable or concretely dangerous":
## What to Flag
- Injection vulnerabilities (SQL, XSS, command, path traversal)
- Authentication/authorisation bypasses in changed code
- Hardcoded secrets, credentials, or API keys
- Insecure cryptographic usage
- Missing input validation on untrusted data at trust boundaries
## What NOT to Flag
- Theoretical risks that require unlikely preconditions
- Defense-in-depth suggestions when primary defenses are sufficient
- Issues in unchanged code that this MR doesn't affect
- "Consider using library X" style suggestions
It turns out that telling an LLM what not to do is where the real prompt engineering value lies. Without these boundaries, you get a firehose of speculative theoretical warnings that developers will immediately learn to ignore.
Every reviewer produces findings in a structured XML format with a severity classification: critical (will cause an outage or is exploitable), warning (measurable regression or concrete risk), or suggestion (an improvement worth considering). This ensures we're dealing with structured data that drives downstream behaviour, rather than parsing advisory text.
Because we split the review into specialised domains, we don't need a super-expensive, highly capable model for every task. We assign models based on the complexity of the agent's job:
Top-tier: Claude Opus 4.7 and GPT-5.4: Reserved exclusively for the Review Coordinator. The coordinator has the hardest job: reading the output of seven other models, deduplicating findings, filtering out false positives, and making a final judgment call. It needs the best reasoning capability available.
Standard-tier: Claude Sonnet 4.6 and GPT-5.3 Codex: The workhorses for our heavy-lifting sub-reviewers (Code Quality, Security, and Performance). These are fast, relatively cheap, and excellent at spotting logic errors and vulnerabilities in code.
Kimi K2.5: Used for lightweight, text-heavy tasks like the Documentation Reviewer, Release Reviewer, and the AGENTS.md Reviewer.
These are the defaults, but every single model assignment can be overridden dynamically at runtime via our reviewer-config Cloudflare Worker, which we'll cover in the control plane section below.
Prompt injection prevention
Agent prompts are built at runtime by concatenating the agent-specific markdown file with a shared REVIEWER_SHARED.md file containing mandatory rules. The coordinator's input prompt is assembled by stitching together MR metadata, comments, previous review findings, diff paths, and custom instructions into structured XML.
We also had to sanitise user-controlled content. If someone puts one of our prompt boundary tags in their MR description, they could theoretically break out of the XML structure and inject their own instructions into the coordinator's prompt. We strip these boundary tags out entirely, because we have learned over time to never underestimate the creativity of Cloudflare engineers when it comes to testing a new internal tool:
const PROMPT_BOUNDARY_TAGS = [
"mr_input", "mr_body", "mr_comments", "mr_details",
"changed_files", "existing_inline_findings", "previous_review",
"custom_review_instructions", "agents_md_template_instructions",
];
const BOUNDARY_TAG_PATTERN = new RegExp(
  `</?(?:${PROMPT_BOUNDARY_TAGS.join("|")})[^>]*>`, "gi"
);
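With the tag list and pattern in place, the sanitisation itself is a single replace call. A self-contained version, with an abbreviated tag list for illustration:

```typescript
// Strip any opening or closing prompt boundary tag (attributes included)
// from user-controlled text before it is stitched into the prompt.
// Tag list abbreviated here for illustration.
const PROMPT_BOUNDARY_TAGS = ["mr_body", "mr_comments", "previous_review"];
const BOUNDARY_TAG_PATTERN = new RegExp(
  `</?(?:${PROMPT_BOUNDARY_TAGS.join("|")})[^>]*>`, "gi",
);

function stripBoundaryTags(userText: string): string {
  return userText.replace(BOUNDARY_TAG_PATTERN, "");
}
```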
Saving tokens with shared context
The system doesn't embed full diffs in the prompt. Instead, it writes per-file patch files to a diff_directory and passes the path. Each sub-reviewer reads only the patch files relevant to its domain.
We also extract a shared context file (shared-mr-context.txt) from the coordinator's prompt and write it to disk. Sub-reviewers read this file instead of having the full MR context duplicated in each of their prompts. This was a deliberate decision, as duplicating even a moderately sized MR context across seven concurrent reviewers would multiply our token costs by 7x.
The coordinator keeps things focused
After spawning all sub-reviewers, the coordinator performs a judge pass to consolidate the results:
Deduplication: If the same issue is flagged by both the security reviewer and the code quality reviewer, it is kept once, in the section where it fits best.
Re-categorisation: A performance issue flagged by the code quality reviewer gets moved to the performance section.
Reasonableness filter: Speculative issues, nitpicks, false positives, and findings contradicted by project conventions get dropped. If the coordinator isn't sure, it uses its tools to read the source code and verify.
The overall approval decision follows a strict rubric, with conditions escalating from full approval to a blocking review:
- All LGTM ("looks good to me"), or only trivial suggestions
- Only suggestion-severity items
- Some warnings, no production risk
- Multiple warnings suggesting a risk pattern
- Any critical item, or production safety risk
The bias is explicitly toward approval, meaning a single warning in an otherwise clean MR still gets approved_with_comments rather than a block.
Because this is a production system that sits directly in the path of engineers shipping code, we made sure to build an escape hatch. If a human reviewer comments break glass, the system forces an approval regardless of what the AI found. Sometimes you just need to ship a hotfix, and the system detects this override before the review even begins, so we can track it in our telemetry and aren't caught out by latent bugs or LLM provider outages.
Risk tiers: don't send the dream team to review a typo fix
You don't need seven concurrent AI agents burning Opus-tier tokens to review a one-line typo fix in a README. The system classifies every MR into one of three risk tiers based on the size and nature of the diff:
// Simplified from packages/core/src/risk.ts
function assessRiskTier(diffEntries: DiffEntry[]): "trivial" | "lite" | "full" {
  const totalLines = diffEntries.reduce(
    (sum, e) => sum + e.addedLines + e.removedLines, 0
  );
  const fileCount = diffEntries.length;
  const hasSecurityFiles = diffEntries.some(
    (e) => isSecuritySensitiveFile(e.newPath)
  );
  if (hasSecurityFiles || totalLines > 100 || fileCount > 50) return "full";
  if (totalLines > 10) return "lite";
  return "trivial";
}
Security-sensitive files: anything touching auth/, crypto/, or file paths that sound even remotely security-related always triggers a full review, because we'd rather spend a little extra on tokens than potentially miss a security vulnerability.
Each tier gets a different set of agents:

| Tier | Lines Changed | Files | Agents | What Runs |
|---|---|---|---|---|
| Trivial | ≤10 | ≤20 | 2 | Coordinator + one generalised code reviewer |
| Lite | ≤100 | ≤20 | 4 | Coordinator + code quality + documentation + (more) |
| Full | >100 or >50 files | Any | 7+ | All specialists, including security, performance, release |

The trivial tier also downgrades the coordinator from Opus to Sonnet, for example, since a two-reviewer check on a minor change doesn't require an extremely capable and expensive model to evaluate it.
Diff filtering: eliminating the noise
Before the agents see any code, the diff goes through a filtering pipeline that strips out noise like lock files, vendored dependencies, minified assets, and source maps:
const NOISE_FILE_PATTERNS = [
"bun.lock", "package-lock.json", "yarn.lock",
"pnpm-lock.yaml", "Cargo.lock", "go.sum",
"poetry.lock", "Pipfile.lock", "flake.lock",
];
const NOISE_EXTENSIONS = [".min.js", ".min.css", ".bundle.js", ".map"];
We also filter out generated files by scanning the first few lines for markers like // @generated or /* eslint-disable */. However, we explicitly exempt database migrations from this rule, since migration tools often stamp files as generated even though they contain schema changes that absolutely need to be reviewed.
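A sketch of that check, with an illustrative marker list and a hypothetical migration-path pattern standing in for the real exemption logic:

```typescript
// Scan the head of a file for generated-code markers, but never skip
// migrations: those must always be reviewed even if stamped "generated".
const GENERATED_MARKERS = ["// @generated", "/* eslint-disable */"];

function isGeneratedFile(path: string, content: string): boolean {
  if (/(^|\/)migrations?\//.test(path)) return false; // always review migrations
  const head = content.split("\n").slice(0, 5);       // first few lines only
  return head.some((line) => GENERATED_MARKERS.some((m) => line.includes(m)));
}
```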
The spawn_reviewers tool manages the lifecycle of up to seven concurrent reviewer sessions with circuit breakers, fallback chains, per-task timeouts, and retry logic. It acts essentially as a tiny scheduler for LLM sessions.
Knowing when an LLM session is actually "done" is surprisingly tricky. We rely primarily on OpenCode's session.idle events, but we back that up with a polling loop that checks the status of all running tasks every three seconds. This polling loop also implements inactivity detection: if a session has been running for 60 seconds with no output at all, it is killed early and marked as an error, which catches sessions that crash on startup before producing any JSONL.
Timeouts operate at three levels:
Per-task: 5 minutes (10 for code quality, which reads more files). This prevents one slow reviewer from blocking the rest.
Overall: 25 minutes. A hard cap for the entire spawn_reviewers call. When it hits, every remaining session is aborted.
Retry budget: 2 minutes minimum. We don't bother retrying if there isn't enough time left in the overall budget.
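A sketch of how those layers can compose, assuming each reviewer session is already wrapped in a promise (the helper names are ours, not the real scheduler's):

```typescript
// Budgets mirroring the three levels described above.
const PER_TASK_TIMEOUT_MS = 5 * 60_000;
const OVERALL_TIMEOUT_MS = 25 * 60_000;
const MIN_RETRY_BUDGET_MS = 2 * 60_000;

// Reject a task's promise if it outlives its budget; the timer is cleared
// on settlement so a fast task leaves no stray rejection behind.
function withTimeout<T>(task: Promise<T>, ms: number, label: string): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${ms}ms`)), ms);
  });
  return Promise.race([task, timeout]).finally(() => clearTimeout(timer));
}

// Skip a retry entirely when the overall budget is nearly spent.
function canRetry(overallDeadlineMs: number, now = Date.now()): boolean {
  return overallDeadlineMs - now >= MIN_RETRY_BUDGET_MS;
}
```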
Resilience: circuit breakers and fallback chains
Running seven concurrent AI model calls means you are absolutely going to hit rate limits and provider outages. We implemented a circuit breaker pattern inspired by Netflix's Hystrix, adapted for AI model calls. Each model tier has independent health monitoring with three states:
When a model's circuit opens, the system walks a fallback chain to find a healthy alternative. For example:
const DEFAULT_FALLBACK_CHAIN = {
  "opus-4-7": "opus-4-6", // Fall back to the previous generation
  "opus-4-6": null,       // End of chain
  "sonnet-4-6": "sonnet-4-5",
  "sonnet-4-5": null,
};
Each model family is isolated, so if one model is overloaded, we fall back to an older generation of the same model rather than crossing streams. When a circuit opens, we allow exactly one probe request through after a two-minute cooldown to see whether the provider has recovered, which prevents us from stampeding a struggling API.
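A minimal per-tier circuit breaker with that probe behaviour might look like this. The two-minute cooldown and single-probe rule come from the description above; the failure threshold is an illustrative assumption:

```typescript
// Per-model-tier circuit breaker with the three health states described
// in the text: closed, open, and half-open.
type CircuitState = "closed" | "open" | "half-open";

class ModelCircuit {
  private state: CircuitState = "closed";
  private failures = 0;
  private openedAt = 0;

  constructor(
    private failureThreshold = 3,     // assumption, not the real value
    private cooldownMs = 2 * 60_000,  // two-minute cooldown from the text
  ) {}

  canRequest(now = Date.now()): boolean {
    if (this.state === "closed") return true;
    if (this.state === "open" && now - this.openedAt >= this.cooldownMs) {
      this.state = "half-open"; // let exactly one probe through
      return true;
    }
    return false; // still cooling down, or a probe is already in flight
  }

  recordSuccess(): void {
    this.state = "closed";
    this.failures = 0;
  }

  recordFailure(now = Date.now()): void {
    this.failures += 1;
    if (this.state === "half-open" || this.failures >= this.failureThreshold) {
      this.state = "open"; // failed probe, or threshold crossed
      this.openedAt = now;
    }
  }
}
```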
When a sub-reviewer session fails, the system needs to decide whether to trigger model fallback, or whether the problem is one that a different model won't fix. The error classifier maps OpenCode's error union type to a shouldFallback boolean:
switch (err.name) {
  case "APIError":
    // Only retryable API errors (429, 503) trigger fallback
    return { shouldFallback: Boolean(data.isRetryable), ... };
  case "ProviderAuthError":
    // Auth failure (a different model won't fix bad credentials)
    return { shouldFallback: false, ... };
  case "ContextOverflowError":
    // Too many tokens (a different model has the same limit)
    return { shouldFallback: false, ... };
  case "MessageAbortedError":
    // User/system abort (not a model problem)
    return { shouldFallback: false, ... };
}
Only retryable API errors trigger fallback. Auth errors, context overflow, aborts, and structured output errors do not.
Coordinator-level fallback
The circuit breaker handles sub-reviewer failures, but the coordinator itself can also fail. The orchestration layer has a separate fallback mechanism: if the OpenCode child process fails with a retryable error (detected by scanning stderr for patterns like "overloaded" or "503"), it hot-swaps the coordinator model in the opencode.json config file and retries. It is a file-level swap that reads the config JSON, replaces the review_coordinator.model key, and writes it back before the next attempt.
The control plane: Workers for config and telemetry
If a model provider goes down at 8 a.m. UTC when our colleagues in Europe are just waking up, we don't want to wait for an on-call engineer to make a code change to switch out the models the reviewer is using. Instead, the CI job fetches its model routing configuration from a Cloudflare Worker backed by Workers KV.
The response contains per-reviewer model assignments and a providers block. When a provider is disabled, the plugin filters out all models from that provider before picking the primary:
function filterModelsByProviders(models, providers) {
  return models.filter((m) => {
    const provider = extractProviderFromModel(m.model);
    if (!provider) return true; // Unknown provider → keep
    const config = providers[provider];
    if (!config) return true; // Not in config → keep
    return config.enabled; // Disabled → filter out
  });
}
This means we can flip a switch in KV to disable an entire provider, and every running CI job will route around it within five seconds. The config format also carries fallback chain overrides, allowing us to reshape the entire model routing topology from a single Worker update.
We also use a fire-and-forget TrackerClient that talks to a separate Cloudflare Worker to track job starts, completions, findings, token usage, and Prometheus metrics. The client is designed to never block the CI pipeline, using a 2-second AbortSignal.timeout and pruning pending requests if they exceed 50 entries. Prometheus metrics are batched on the next microtask and flushed right before the process exits, forwarding to our internal observability stack via Workers Logging, so we know exactly how many tokens we're burning in real time.
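A sketch of what such a non-blocking client can look like. The endpoint, payload shape, and drop-newest pruning strategy are assumptions; the real client differs in detail:

```typescript
// Fire-and-forget telemetry: a 2-second timeout per request and a hard cap
// on in-flight requests, so telemetry can never block the CI pipeline.
class TrackerClient {
  private pending = new Set<Promise<unknown>>();

  constructor(private endpoint: string, private maxPending = 50) {}

  track(event: Record<string, unknown>): void {
    // If telemetry backs up, drop events rather than queue them.
    if (this.pending.size >= this.maxPending) return;
    const req = fetch(this.endpoint, {
      method: "POST",
      headers: { "content-type": "application/json" },
      body: JSON.stringify(event),
      signal: AbortSignal.timeout(2_000), // never wait more than 2s
    }).catch(() => {
      // Telemetry failures must be invisible to the pipeline
    });
    this.pending.add(req);
    req.finally(() => this.pending.delete(req));
  }
}
```

The key property is that `track` returns immediately and can never throw, no matter what the network does.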
Re-reviews: not starting from scratch
When a developer pushes new commits to an already-reviewed MR, the system runs an incremental re-review that is aware of its own previous findings. The coordinator receives the full text of its last review comment and a list of inline DiffNote comments it previously posted, along with their resolution status.
The re-review rules are strict:
Fixed findings: Omitted from the output, and the MCP server auto-resolves the corresponding DiffNote thread.
Unfixed findings: Must be re-emitted even if unchanged, so the MCP server knows to keep the thread alive.
User-resolved findings: Respected unless the issue has materially worsened.
User replies: If a developer replies "won't fix" or "acknowledged", the AI treats the finding as resolved. If they reply "I disagree", the coordinator will read their justification and either resolve the thread or argue back.
We also made sure to build in a small Easter egg: the reviewer will handle one lighthearted question per MR. We figured a little personality helps build rapport with developers who are being reviewed (sometimes brutally) by a robot, so the prompt instructs it to keep the answer brief and warm before politely redirecting back to the review.
Keeping AI context fresh: the AGENTS.md Reviewer
AI coding agents rely heavily on AGENTS.md files to understand project conventions, but these files rot incredibly fast. If a team migrates from Jest to Vitest but forgets to update their instructions, the AI will stubbornly keep trying to write Jest tests.
We built a special reviewer just to assess the materiality of an MR and yell at developers if they make a significant architectural change without updating the AI instructions. It classifies changes into three tiers:
High materiality (strongly recommend update): package manager changes, test framework changes, build tool changes, major directory restructures, new required env vars, CI/CD workflow changes.
Medium materiality (worth considering): major dependency bumps, new linting rules, API client changes, state management changes.
Low materiality (no update needed): bug fixes, feature additions using existing patterns, minor dependency updates, CSS changes.
It also penalises anti-patterns in existing AGENTS.md files, like generic filler ("write clean code"), files over 200 lines that cause context bloat, and tool names without runnable commands. A concise, helpful AGENTS.md with commands and boundaries is always better than a verbose one.
The system ships as a fully contained internal GitLab CI component. A team adds it to their .gitlab-ci.yml:
include:
  - component: $CI_SERVER_FQDN/ci/ai/opencode@~latest
The component handles pulling the Docker image, setting up Vault secrets, running the review, and posting the comment. Teams can customise behaviour by dropping an AGENTS.md file in their repo root with project-specific review instructions, or they can provide a URL to an AGENTS.md template that gets injected into all agent prompts, so their standard conventions apply across all of their repositories without multiple AGENTS.md files to keep up to date.
The entire system also runs locally. The @opencode-reviewer/local plugin provides a /fullreview command inside OpenCode's TUI that generates diffs from the working tree, runs the same risk assessment and agent orchestration, and posts results inline. It's the exact same agents and prompts, just running on your laptop instead of in CI.
We have been running this system for about a month now, and we track everything through our review-tracker Worker. Here is what the data looks like across 5,169 repositories, from March 10 to April 9, 2026.
In the first 30 days, the system completed 131,246 review runs across 48,095 merge requests in 5,169 repositories. The average merge request gets reviewed 2.7 times (the initial review, plus re-reviews as the engineer pushes fixes), and the median review completes in 3 minutes and 39 seconds. That's fast enough that most engineers see the review comment before they've finished context-switching to another task. The metric we're proudest of, though, is that engineers have only needed to "break glass" 288 times (0.6% of merge requests).
On the cost side, the average review costs $1.19 and the median is $0.98. The distribution has a long tail of expensive reviews: big refactors that trigger full-tier orchestration. The P99 review costs $4.45, which means 99% of reviews come in under five dollars.
| Percentile | Cost per review | Review duration |
|---|---|---|
| Median | $0.98 | 3m 39s |
| P90 | $2.36 | 6m 27s |
| P95 | $2.93 | 7m 29s |
| P99 | $4.45 | 10m 21s |
The system produced 159,103 total findings across all reviews, broken down as follows:
That's about 1.2 findings per review on average, which is deliberately low. We biased hard for signal over noise, and the "What NOT to Flag" prompt sections are a big part of why the numbers look like this rather than 10+ findings per review of dubious quality.
The code quality reviewer is the most prolific, producing nearly half of all findings by volume. Security and performance reviewers produce fewer findings but at higher average severity, and the security reviewer flags the highest proportion of critical issues, at 4%:
| Reviewer | Critical | Warning | Suggestion | Total |
|---|---|---|---|---|
| Code Quality | 6,460 | 29,974 | 38,464 | 74,898 |
| Documentation | 155 | 9,438 | 16,839 | 26,432 |
| Performance | 65 | 5,032 | 9,518 | 14,615 |
| Security | 484 | 5,685 | 5,816 | 11,985 |
| Codex (compliance) | 224 | 4,411 | 5,019 | 9,654 |
| AGENTS.md | 18 | 2,675 | 4,185 | 6,878 |
| Release | 19 | 321 | 405 | 745 |
Over the month, we processed roughly 120 billion tokens in complete. The overwhelming majority of these are cache reads, which is strictly what we wish to see — it means the immediate caching is working, and we’re not paying full enter pricing for repeated context throughout re-reviews.
Our cache hit price sits at 85.7%, which saves us an estimated 5 figures in comparison with what we’d pay at full enter token pricing. That is partially because of the shared context file optimisation — sub-reviewers studying from a cached context file quite than every getting their very own copy of the MR metadata, but additionally through the use of the very same base prompts throughout all runs, throughout all merge requests.
Here's how the token usage breaks down by model and by agent:
| Model | Input | Output | Cache Read | Cache Write | % of Total |
|---|---|---|---|---|---|
| Top-tier models (Claude Opus 4.7, GPT-5.4) | 806M | 1,077M | 25,745M | 5,918M | 51.8% |
| Standard-tier models (Claude Sonnet 4.6, GPT-5.3 Codex) | 928M | 776M | 48,647M | 11,491M | 46.2% |
| Kimi K2.5 | 11,734M | 267M | 0 | 0 | 0.0% |
Top-tier and standard-tier models split the cost roughly 52/48, which makes sense given that the top-tier models do far more complex work (one session per review, but with expensive extended thinking and large output) while the standard-tier models handle three sub-reviewers per full review. Kimi processes the most raw input tokens (11.7B) but costs "nothing" since it runs through Workers AI.
The per-agent breakdown shows where the tokens actually go:
| Agent | Input | Output | Cache Read | Cache Write |
|---|---|---|---|---|
| Coordinator | 513M | 1,057M | 20,683M | 5,099M |
| Code Quality | 428M | 264M | 19,274M | 3,506M |
| Engineering Codex | 409M | 236M | 18,296M | 3,618M |
| Documentation | 8,275M | 216M | 8,305M | 616M |
| Security | 199M | 149M | 8,917M | 2,603M |
| Performance | 157M | 124M | 6,138M | 2,395M |
| AGENTS.md | 4,036M | 119M | 2,307M | 342M |
| Release | 183M | 5M | 231M | 15M |
The coordinator produces by far the most output tokens (1,057M) because it has to write the full structured review comment. The documentation reviewer has the highest raw input (8,275M) because it processes every file type, not just code. The release reviewer barely registers because it only runs when release-related files appear in the diff.
The risk tier system is doing its job. Trivial reviews (typo fixes, small doc changes) cost 20 cents on average, while full reviews with all seven agents average $1.68. The spread is exactly what we designed for:
| Tier | Reviews | Avg Cost | Median | P95 | P99 |
|---|---|---|---|---|---|
| Trivial | 24,529 | $0.20 | $0.17 | $0.39 | $0.74 |
| Lite | 27,558 | $0.67 | $0.61 | $1.15 | $1.95 |
| Full | 78,611 | $1.68 | $1.47 | $3.35 | $5.05 |
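For illustration, the routing decision can be sketched roughly like this. The thresholds and signals here are hypothetical; the real classifier looks at more than diff size:

```python
# A deliberately simplified sketch of risk-tier routing. Thresholds and
# the touches_code signal are illustrative assumptions, not the real rules.
from dataclasses import dataclass

@dataclass
class Diff:
    files_changed: int
    lines_changed: int
    touches_code: bool  # False for docs/typo-only changes

def pick_tier(diff: Diff) -> str:
    """Map a merge request to a review tier (illustrative thresholds)."""
    if not diff.touches_code and diff.lines_changed < 20:
        return "trivial"  # typo fixes, small doc changes
    if diff.files_changed <= 3 and diff.lines_changed < 100:
        return "lite"     # reduced reviewer set
    return "full"         # all seven agents

print(pick_tier(Diff(1, 4, touches_code=False)))   # → trivial
print(pick_tier(Diff(12, 900, touches_code=True))) # → full
```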
So, what does a review look like?
We're glad you asked! Here's an example of a particularly egregious review:
As you can see, the reviewer doesn't beat around the bush, and calls out problems when it sees them.
Limitations we're honest about
This isn't a replacement for human code review, at least not yet with today's models. AI reviewers regularly struggle with:
Architectural awareness: The reviewers see the diff and surrounding code, but they don't have the full context of why a system was designed a certain way, or whether a change moves the architecture in the right direction.
Cross-system impact: A change to an API contract might break three downstream consumers. The reviewer can flag the contract change, but it can't verify that all consumers have been updated.
Subtle concurrency bugs: Race conditions that depend on specific timing or ordering are hard to catch from a static diff. The reviewer can spot missing locks, but not all the ways a system can deadlock.
Cost scales with diff size: A 500-file refactor with seven concurrent frontier-model calls costs real money. The risk tier system manages this, but when the coordinator's prompt exceeds 50% of the estimated context window, we emit a warning. Large MRs are inherently expensive to review.
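That context-window guard amounts to a simple budget check. The token estimate heuristic and window size below are illustrative assumptions, not our production values:

```python
# Minimal sketch of a context-budget guard: warn when the coordinator's
# prompt exceeds 50% of the estimated context window. The 4-chars-per-token
# heuristic and 200K window are assumptions for illustration.
import warnings

CONTEXT_WINDOW_TOKENS = 200_000  # assumed model context window
WARN_FRACTION = 0.50             # warn past 50% of the window

def estimate_tokens(prompt: str) -> int:
    # Crude heuristic: roughly 4 characters per token for English and code.
    return len(prompt) // 4

def within_budget(prompt: str) -> bool:
    used = estimate_tokens(prompt) / CONTEXT_WINDOW_TOKENS
    if used > WARN_FRACTION:
        warnings.warn(f"coordinator prompt uses {used:.0%} of the context window")
        return False
    return True

print(within_budget("x" * 1_000))    # → True (tiny prompt)
print(within_budget("x" * 500_000))  # → False (warns, ~62% of the window)
```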
We're just getting started
For more on how we're using AI at Cloudflare, read our post on our internal AI engineering stack, and check out everything we shipped during Agents Week.
Have you integrated AI into your code review process? We'd love to hear about it. Find us on Discord, X, and Bluesky.
Interested in building cutting-edge projects like this, on cutting-edge technology? Come build with us!