In mid-December, I began building KubeStellar Console from the ground up. It’s a multi-cluster management dashboard for Kubernetes, part of the KubeStellar project within the Cloud Native Computing Foundation (CNCF) Sandbox. The technology stack uses Go for the backend, React and TypeScript for the frontend, and Helm for packaging. There was no team — just me and two AI coding agents working in parallel terminal sessions.
The initial two weeks matched the honeymoon phase that everyone in this field talks about. The agents produced code faster than I could review it. Tasks I would have estimated at three days were completed in two hours. I maintained a mental checklist of features I’d always wanted to implement and kept ticking them off, one by one.
Then the problems hit.
Builds started failing in ways that were difficult to diagnose. Architectural decisions from the previous day were silently overwritten. Scope crept forward without any prompting. The agent kept modifying files I hadn’t directed it to touch. The cascading failures were the most damaging issue: fixing one thing caused three others to break. I found myself spending more time rolling back changes than reviewing them. The promised 10x productivity boost started to feel like a net loss, and I came close to abandoning the entire approach.
The real surprise in building KubeStellar Console with coding agents wasn’t how capable the model was, but how much heavy lifting the surrounding codebase had to do.
That trajectory — from excitement to grinding frustration — seems to be a universal experience. The standard industry advice is to give the agent more freedom: let it run longer, modify more files, and self-correct. In my experience, that usually amplifies the failure mode rather than fixing it. The leverage works the other way. The intelligence in an AI-assisted codebase resides less in the model itself and more in the feedback loops the codebase builds around it. If you want the agent to take on more, the surrounding code needs to measure more.
Four months later, KubeStellar Console is in a much stronger position. There are 63 CI/CD workflows, 32 nightly test suites, and test coverage at 91% across twelve shards. Over 82 days, PR acceptance stabilized around 81%. Community bug reports are progressing to merged fixes in roughly thirty minutes. Feature requests are turning into pull requests in about an hour. None of that came from switching to a better model. What changed was what the codebase itself had learned to measure.
Five tightening feedback loops got me there. I think of them as the stages of what I’ve been calling the AI Codebase Maturity Model — Assisted, Instructed, Measured, Adaptive, and Self-Sustaining. I’ll walk through them in the order they emerged, because I don’t believe they can be rearranged.
1. Document what you keep correcting (Instructed)
The simplest intervention, and likely the highest-impact, is to externalize your own preferences. I started with a CLAUDE.md file at the repository root, followed by a .github/copilot-instructions.md file for pull request conventions. After that came a card-level development guide that cataloged the top reasons I was rejecting AI-generated PRs.
That single guide ended up covering about 90% of my rejection criteria. Sessions became more consistent. The same mistakes stopped repeating across different agents. I wouldn’t call this measurement — at this stage, I was still relying on gut feeling — but it filtered out enough noise to make proper measurement possible.
2. Treat tests as the trust layer, not just the correctness layer (Measured)
This was the pivotal shift. Testing for an autonomous workflow is fundamentally different from testing for a human workflow. Test results are the only signal the agent has for knowing whether it’s improving the system or degrading it.
Over four weeks, I added 32 nightly test suites and pushed coverage to 91% across twelve parallel shards. The suites covered compliance, performance, nil safety, accessibility, internationalization, and visual regression. Alongside that, I began logging PR acceptance rates per category into auto-qa-tuning.json. That file turned out to be foundational for everything that followed.
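As a minimal sketch of that kind of logging, here is one way per-category acceptance rates could be computed from closed PRs. The `PrOutcome` shape, the field names, and the sample data are assumptions for illustration; they are not the actual schema of auto-qa-tuning.json.

```typescript
// Hypothetical record for one closed PR; the field names are assumptions.
interface PrOutcome {
  category: string;
  merged: boolean; // false means the PR was closed without merging
}

interface CategoryStats {
  merged: number;
  closed: number;
  acceptanceRate: number; // merged / (merged + closed)
}

// Aggregate PR outcomes into per-category acceptance rates.
function acceptanceByCategory(outcomes: PrOutcome[]): Record<string, CategoryStats> {
  const stats: Record<string, CategoryStats> = {};
  for (const pr of outcomes) {
    if (!stats[pr.category]) {
      stats[pr.category] = { merged: 0, closed: 0, acceptanceRate: 0 };
    }
    const s = stats[pr.category];
    if (pr.merged) {
      s.merged += 1;
    } else {
      s.closed += 1;
    }
    s.acceptanceRate = s.merged / (s.merged + s.closed);
  }
  return stats;
}

// Illustrative data only.
const sampleOutcomes: PrOutcome[] = [
  { category: "accessibility", merged: true },
  { category: "accessibility", merged: false },
  { category: "operator", merged: false },
];

// Serialized in a form that could be written to a tuning file.
console.log(JSON.stringify(acceptanceByCategory(sampleOutcomes), null, 2));
```

Keeping the raw merged and closed counts next to the rate makes it possible to tell a category with two PRs apart from one with two hundred.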
Coverage volume matters. So does breadth. But the thing that nearly derailed me — and the issue I’d flag most strongly for anyone attempting this — is determinism.
One Playwright end-to-end test for drag-and-drop passed about 85% of the time. In a human workflow, that’s tolerable — you re-run it and move on. In an autonomous workflow where test results gate merges, an 85% pass rate is a disaster. Strong PRs were being randomly blocked, and weak ones were slipping through. I spent three days on that single test, and it turned out to be an animation-completion timing issue in CI. The lesson generalized: you can’t build automation on top of an unreliable signal. A flaky test in a human workflow is an annoyance. In an autonomous one, it’s a slow, quiet erosion of the entire trust model.
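The arithmetic behind that is worth spelling out. Assuming each run of the flaky test is independent (a simplification), a correct PR is blocked 15% of the time on a single run, and about 39% of the time if the test has to pass on three separate runs. A small sketch, with a hypothetical function name:

```typescript
// Probability that at least one of `runs` independent executions of a flaky
// test fails even though the change under test is correct.
// Assumes runs are independent, which real CI flakiness may not be.
function falseBlockProbability(passRate: number, runs: number): number {
  if (passRate < 0 || passRate > 1) {
    throw new RangeError("passRate must be between 0 and 1");
  }
  if (!Number.isInteger(runs) || runs < 1) {
    throw new RangeError("runs must be a positive integer");
  }
  return 1 - Math.pow(passRate, runs);
}

// Print the false-block probability for one to four gated runs at an 85% pass rate.
for (let runs = 1; runs <= 4; runs++) {
  console.log(runs, falseBlockProbability(0.85, runs).toFixed(3));
}
```

The same function shows why the fix has to happen in the test itself: any pass rate below 1 compounds with every additional gated run.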
3. Don’t automate until you can measure (Adaptive)
With acceptance rates being tracked, automation became a safer bet. Auto-QA began running four times a day across eight layers of quality checks. The rotation weights that determine which work categories the system prioritizes started adjusting themselves based on the data. Accessibility PRs were landing at 62% acceptance, so their weight increased to 0.93. Operator-category PRs were landing at 8% (11 merges against 129 closed), so that weight dropped to zero and CI cycles were redirected elsewhere.
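To make the idea concrete, here is a rough sketch of a weight rule driven by acceptance data. It is not the actual Auto-QA logic: the thresholds, the default weight, and the linear mapping are all assumptions, and this rule does not reproduce the 0.93 figure above. It only shows the shape of the decision, including the cutoff that sends a category like Operator to zero.

```typescript
interface CategoryRecord {
  merged: number;
  closed: number; // closed without merging
}

// All three constants are assumed values for illustration.
const MIN_SAMPLES = 20;     // below this, there is too little data to adjust
const CUTOFF_RATE = 0.1;    // below this acceptance rate, stop scheduling work
const DEFAULT_WEIGHT = 0.5; // weight used until enough data exists

// Map a category's acceptance history to a rotation weight in [0, 1].
function rotationWeight(rec: CategoryRecord): number {
  const total = rec.merged + rec.closed;
  if (total < MIN_SAMPLES) {
    return DEFAULT_WEIGHT;
  }
  const rate = rec.merged / total;
  if (rate < CUTOFF_RATE) {
    return 0; // redirect CI cycles elsewhere
  }
  return Math.min(1, rate); // simple linear mapping, an assumption
}

// 11 merges against 129 closed is about 8%, which falls under the cutoff.
console.log(rotationWeight({ merged: 11, closed: 129 }));
```

The minimum-sample guard matters as much as the cutoff: without it, a category could be switched off after a handful of unlucky PRs.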
Several more loops closed around that core:
- A triage process scanned four repositories every 15 minutes.
- A PR monitor polled build status every 60 seconds.
- An error-recovery workflow used exponential backoff to handle stuck agents.
- A GA4 query ran hourly against production analytics and filed GitHub issues for error spikes before users reported them.
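For the error-recovery loop, the backoff schedule itself is simple. A minimal sketch, assuming a one-second base delay and a one-minute cap (both values and both function names are illustrative, not taken from the actual workflow):

```typescript
// Delay before retry number `attempt` (0-based): base * 2^attempt, capped.
function backoffDelayMs(attempt: number, baseMs: number = 1000, capMs: number = 60000): number {
  return Math.min(capMs, baseMs * Math.pow(2, attempt));
}

// The full list of delays for a given number of retries.
function backoffSchedule(retries: number): number[] {
  const delays: number[] = [];
  for (let attempt = 0; attempt < retries; attempt++) {
    delays.push(backoffDelayMs(attempt));
  }
  return delays;
}

console.log(backoffSchedule(7).join(","));
// prints "1000,2000,4000,8000,16000,32000,60000"
```

The cap keeps a stuck agent from being retried hours later, and a production version would typically add random jitter so that several stuck agents do not retry in lockstep.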
The pattern across all of these is the same: measurement first, automation second. Reversing that order is how autonomous systems go off the rails. Automation without measurement isn’t maturity — it’s failure at scale.
4. Transform the codebase into a living, self-sustaining operating manual
There was no single moment when the system became able to run on its own. Gradually, its behavior came to be governed by its own artifacts: the instruction files, test suites, workflow rules, and acceptance rate history. Community members began submitting issues around the clock, and those reports were automatically triaged, assigned, fixed, tested, and lined up for review before I even started my day.
One incident illustrated this transition. In April, a user reported a bug where a cluster appeared “healthy” even though pods were stuck in ImagePullBackOff. Before I could investigate, the system had already explained that cluster health is based on infrastructure status (node readiness, API availability), which is fundamentally different from workload status. This wasn’t a bug. It was a gap between how the user expected health to be reported and what the dashboard actually displays. The design rationale was already captured in the tests, health-check logic, and documentation; the agent could explain it because the system already contained that knowledge.
This practical example demonstrates what “the code embodies the model” truly means.
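The distinction in that incident can be sketched in a few lines. The types and function names below are hypothetical and much simpler than the real health-check logic; they only show infrastructure health and workload health being computed from different inputs.

```typescript
interface NodeStatus {
  name: string;
  ready: boolean;
}

interface PodStatus {
  name: string;
  waitingReason?: string; // e.g. "ImagePullBackOff"; undefined when running
}

interface ClusterSnapshot {
  apiAvailable: boolean;
  nodes: NodeStatus[];
  pods: PodStatus[];
}

type Health = "healthy" | "degraded";

// Infrastructure view: API reachable and every node ready.
// Pod state is deliberately not consulted here.
function clusterHealth(c: ClusterSnapshot): Health {
  const nodesOk = c.nodes.length > 0 && c.nodes.every((n) => n.ready);
  return c.apiAvailable && nodesOk ? "healthy" : "degraded";
}

// Workload view: any pod stuck in a waiting state counts as degraded.
function workloadHealth(c: ClusterSnapshot): Health {
  return c.pods.some((p) => p.waitingReason !== undefined) ? "degraded" : "healthy";
}

const reported: ClusterSnapshot = {
  apiAvailable: true,
  nodes: [{ name: "node-1", ready: true }],
  pods: [{ name: "web-0", waitingReason: "ImagePullBackOff" }],
};

console.log(clusterHealth(reported), workloadHealth(reported));
// prints "healthy degraded"
```

A test that pins down this exact case is the kind of artifact that lets an agent explain the behavior instead of "fixing" it.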
5. Ask “why” instead of “what”
A single prompting habit proved incredibly effective. Rather than instructing the AI to “fix this bug,” I began questioning, “Why did this error occur?” The first approach results in a quick patch. The second often leads to a thorough root-cause analysis and, as a bonus, creates new tests, guidelines, or rules that prevent similar issues in the future.
Asking “what” produces isolated fixes and incremental improvements. Asking “why” compounds: over time, each answer feeds new tests, guidelines, and instruction files back into the codebase, which is how it turns into a self-improving system. The habit is especially useful when you are starting from scratch.
Implications for maintainers and leaders
If you’re in charge of engineering teams, don’t focus on choosing the right model. Models are interchangeable, and switching between them is simple. The real effort lies in building the feedback system around them. The lasting value is in that surrounding infrastructure: instruction files, test suites, metrics, and workflow rules.
For open source maintainers, this speaks directly to the burnout problem raised in CNCF community discussions. If a codebase can capture a maintainer’s expertise, agents can handle triage, create pull requests, and explain design choices to users, and the community can guide the project primarily by submitting issues.
Maintainers transition from daily operators to system architects. This isn’t just a theory for KubeStellar Console—it’s a reality. The broader community will need to test its effectiveness beyond a solo-maintained Sandbox project. I’m eager to see the results.
Many teams are still in the initial phase, writing prompts and evaluating results. Everyone begins there. The goal isn’t to rush to the final stage, but to identify and address the current bottleneck.
The codebase stores my knowledge, and the tests handle what I can’t remember. What remains uniquely mine, and I believe always will, is deciding what’s valuable, what to reject, and what counts as success.



