For early-stage open-source projects, the "Getting started" guide is often the first real interaction a developer has with the project. If a command fails, an output doesn't match, or a step is unclear, most users won't file a bug report; they'll simply move on.
Drasi, a CNCF sandbox project that detects changes in your data and triggers immediate reactions, is supported by our small team of four engineers in Microsoft Azure's Office of the Chief Technology Officer. We have comprehensive tutorials, but we're shipping code faster than we can manually test them.
The team didn't realize how large this gap was until late 2025, when GitHub updated its Dev Container infrastructure, bumping the minimum Docker version. The update broke the Docker daemon connection, and every single tutorial stopped working. Because we relied on manual testing, we didn't immediately know the extent of the damage. Any developer trying Drasi during that window would have hit a wall.
This incident forced a realization: with advanced AI coding assistants, documentation testing can be turned into a monitoring problem.
The problem: Why does documentation break?
Documentation usually breaks for two reasons:
1. The curse of knowledge
Experienced developers write documentation with implicit context. When we write "wait for the query to bootstrap," we know to run `drasi list query` and watch for the `Running` status, or better yet to run the `drasi wait` command. A new user has no such context. Neither does an AI agent. They read the instructions literally and don't know what to do. They get stuck on the "how," while we only document the "what."
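To make that implicit context concrete, here is a minimal sketch of the polling loop a reader is expected to infer from "wait for the query to bootstrap." The `check_status` function is a placeholder standing in for parsing the status column of `drasi list query`:

```shell
# Placeholder for the real check: a run against Drasi would parse
# the status column of `drasi list query` instead.
check_status() {
  echo "Running"
}

# Poll until the query reports "Running", or give up after 30 tries.
wait_until_running() {
  attempts=0
  while [ "$attempts" -lt 30 ]; do
    if check_status | grep -q "Running"; then
      echo "query is Running"
      return 0
    fi
    attempts=$((attempts + 1))
    sleep 2
  done
  echo "timed out waiting for query" >&2
  return 1
}

wait_until_running
```

In the docs themselves, `drasi wait` collapses all of this into a single explicit command, which is exactly the kind of "how" a literal reader needs.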
2. Silent drift
Documentation doesn't fail loudly like code does. When you rename a configuration file in your codebase, the build fails immediately. But when your documentation still references the old filename, nothing happens. The drift accumulates silently until a user reports confusion.
This is compounded for tutorials like ours, which spin up sandbox environments with Docker, k3d, and sample databases. When any upstream dependency changes (a deprecated flag, a bumped version, or a new default), our tutorials can break silently.
The solution: Agents as synthetic users
To solve this, we treated tutorial testing as a simulation problem. We built an AI agent that acts as a "synthetic new user."
This agent has three critical traits:
- It's naïve: It has no prior knowledge of Drasi; it knows only what's explicitly written in the tutorial.
- It's literal: It executes every command exactly as written. If a step is missing, it fails.
- It's unforgiving: It verifies every expected output. If the doc says, "You should see 'Success'", and the command line interface (CLI) just returns silently, the agent flags it and fails fast.
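The "unforgiving" check boils down to a few lines of shell. In this sketch, the step and the expected string are hypothetical; `documented_step` stands in for any tutorial command whose docs promise the output "Success":

```shell
# Stand-in for a documented tutorial step; here it returns
# silently, like the CLI case described above.
documented_step() {
  true
}

# Capture the output and fail fast unless the documented string appears.
output=$(documented_step 2>&1)
if printf '%s' "$output" | grep -q "Success"; then
  verdict="pass"
else
  verdict="fail: expected 'Success', got '${output:-<no output>}'"
fi
echo "$verdict"
```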
The stack: GitHub Copilot CLI and Dev Containers
We built our solution using GitHub Actions, Dev Containers, Playwright, and the GitHub Copilot CLI.
Our tutorials require heavy infrastructure:
- A full Kubernetes cluster (k3d)
- Docker-in-Docker
- Real databases (such as PostgreSQL and MySQL)
We needed an environment that exactly matches what our human users experience. If users run in a specific Dev Container on GitHub Codespaces, our test must run in that same Dev Container.
The architecture
Inside the container, we invoke the Copilot CLI with a specialized system prompt (view the full prompt here):
This prompt, used with the prompt mode (`-p`) of the CLI agent, gives us an agent that can execute terminal commands, write files, and run browser scripts, just like a human developer sitting at their terminal. For the agent to simulate a real user, it needs these capabilities.
To enable the agents to open webpages and interact with them as any human following the tutorial steps would, we also install Playwright in the Dev Container. The agent also takes screenshots, which it then compares against those provided in the documentation.
Security model
Our security model is built around one principle: the container is the boundary.
Rather than trying to restrict individual commands (a losing game when the agent needs to run arbitrary Node scripts for Playwright), we treat the entire Dev Container as an isolated sandbox and control what crosses its boundaries: no outbound network access beyond localhost, a Personal Access Token (PAT) with only "Copilot Requests" permission, ephemeral containers destroyed after each run, and a maintainer-approval gate for triggering workflows.
Dealing with non-determinism
One of the biggest challenges with AI-based testing is non-determinism. Large language models (LLMs) are probabilistic: sometimes the agent retries a command; other times it gives up.
We handled this with a three-stage retry with model escalation (start with Gemini Pro, on failure try Claude Opus), semantic comparison for screenshots instead of pixel-matching, and verification of core data fields rather than volatile values.
We also have a list of tight constraints in our prompts that prevent the agent from going on a debugging journey, directives to control the structure of the final report, and skip directives that tell the agent to bypass optional tutorial sections such as setting up external services.
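The escalation logic itself is simple. This is a sketch, not our actual implementation: `run_agent` is a placeholder wrapper around the real agent invocation, and the exact staging of attempts (two tries on the first model, one on the second) is an assumption made for illustration:

```shell
# Placeholder for the real agent invocation; for illustration,
# pretend only the strongest model completes the tutorial.
run_agent() {
  [ "$1" = "claude-opus" ]
}

# Up to three attempts, escalating the model on failure.
result="failed after all attempts"
for model in gemini-pro gemini-pro claude-opus; do
  if run_agent "$model"; then
    result="succeeded with $model"
    break
  fi
  echo "attempt with $model failed, escalating" >&2
done
echo "$result"
```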
Artifacts for debugging
When a run fails, we need to know why. Since the agent is running in a transient container, we can't just Secure Shell (SSH) in and look around.
So our agent preserves evidence of every run: screenshots of web UIs, terminal output of critical commands, and a final markdown report detailing its reasoning, as shown here:

These artifacts are uploaded to the GitHub Actions run summary, allowing us to "time travel" back to the exact moment of failure and see what the agent saw.

Parsing the agent's report
With LLMs, getting a definitive "Pass/Fail" signal that a machine can understand can be difficult. An agent might write a long, nuanced conclusion like:

To make this actionable in a CI/CD pipeline, we had to do some prompt engineering. We explicitly instructed the agent:

In our GitHub Action, we then simply grep for this specific string to set the exit code of the workflow.
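As a sketch (the marker string and report filename here are invented for illustration, not the ones we actually use), that final step looks like this:

```shell
# A fake agent report, standing in for the markdown the agent writes.
cat > agent-report.md <<'EOF'
Most steps passed; step 4 required one retry before succeeding.
FINAL VERDICT: TUTORIAL_SUCCESSFUL
EOF

# Reduce the nuanced report to a binary CI signal.
if grep -q "TUTORIAL_SUCCESSFUL" agent-report.md; then
  ci_status=0   # workflow passes
else
  ci_status=1   # workflow fails and an issue gets filed
fi
echo "exit status: $ci_status"
rm -f agent-report.md
```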

Simple techniques like this bridge the gap between AI's fuzzy, probabilistic outputs and CI's binary pass/fail expectations.
Automation
We now have an automated version of the workflow that runs weekly. This version evaluates all our tutorials every week in parallel; each tutorial gets its own sandbox container and a fresh perspective from the agent acting as a synthetic user. If any tutorial evaluation fails, the workflow is configured to file an issue on our GitHub repo.
This workflow can optionally also be run on pull requests, but to prevent attacks we have added a maintainer-approval requirement and a `pull_request_target` trigger, which means that even on pull requests from external contributors, the workflow that executes will be the one in our main branch.
Running the Copilot CLI requires a PAT, which is stored in the environment secrets for our repo. To make sure this doesn't leak, every run requires maintainer approval, except the automated weekly run, which only runs on the `main` branch of our repo.
What we found: Bugs that matter
Since implementing this system, we have run over 200 "synthetic user" sessions. The agent identified 18 distinct issues, including some serious environment issues and other documentation issues like these. Fixing them improved the docs for everyone, not just the bot.
- Implicit dependencies: In one tutorial, we told users to create a tunnel to a service. The agent ran the command, and then, following the next instruction, killed the process to run the next command.
  The fix: We realized we hadn't told the user to keep that terminal open. We added a warning: "This command blocks. Open a new terminal for subsequent steps."
- Missing verification steps: We wrote: "Verify the query is running." The agent got stuck: "How, exactly?"
  The fix: We replaced the vague instruction with an explicit command: `drasi wait -f query.yaml`.
- Format drift: Our CLI output had evolved. New columns were added; older fields were deprecated. The documentation screenshots still showed the 2024 version of the interface. A human tester might gloss over this ("it looks mostly right"). The agent flagged every mismatch, forcing us to keep our examples up to date.
AI as a force multiplier
We often hear about AI replacing humans, but in this case, the AI is providing us with a workforce we never had.
To replicate what our system does (running six tutorials across fresh environments every week), we would need a dedicated QA resource or a large budget for manual testing. For a four-person team, that's not feasible. By deploying these synthetic users, we have effectively hired a tireless QA engineer who works nights, weekends, and holidays.
Our tutorials are now validated weekly by synthetic users. Try the Getting Started guide yourself and see the results firsthand. And if you're facing the same documentation drift in your own project, consider GitHub Copilot CLI not just as a coding assistant, but as an agent: give it a prompt, a container, and a goal, and let it do the work a human doesn't have time for.



