As we shared in our earlier post on FluxCD, RBC Capital Markets has been on a deliberate journey to modernize our Kubernetes platform. GitOps with FluxCD gave us a solid deployment foundation. But as our platform grew (today we operate more than 50 clusters spanning on-premises VMware environments and multiple clouds), we hit a set of problems that no single off-the-shelf tool was designed to solve together: How do you manage the lifecycle of the clusters themselves? How do you ensure every node is reproducible and tamper-evident at boot? And how do you integrate Kubernetes service discovery with enterprise DNS infrastructure without every record change going through a ticket queue?
This post is about the projects that answered those questions for us, and what we learned building with them inside a regulated financial institution.
The challenge: Platform engineering at scale in a regulated environment
Managing 50+ Kubernetes clusters across hybrid infrastructure is not just an operational challenge; in capital markets it is also a compliance challenge. SOX, PCI-DSS, and Basel III create real requirements around auditability, configuration drift prevention, and network segmentation. Our platform teams cannot afford snowflake nodes, undocumented cluster state, or manual DNS records that accumulate over years.
When we stepped back and looked at what we were spending engineering effort on, three gaps stood out:
- Node configuration drift: VM-based nodes that had been patched and mutated over time were becoming impossible to reason about.
- Cluster provisioning: spinning up new clusters for trading desks or risk teams was a multi-day manual exercise with no single source of truth.
- DNS integration: every new service or ingress endpoint required a manual ticket to our network team, creating a bottleneck and an audit trail that lived outside our GitOps workflow.
We decided to solve each of these from the ground up, using cloud-native projects where they existed and building our own where they did not.
Kairos: Immutable OS for nodes you can trust
The first piece of the puzzle was node immutability. We evaluated several approaches, but Kairos, a CNCF Sandbox project, aligned most directly with what we needed: a Linux distribution designed from first principles to be immutable, declaratively configured, and reproducible.
With Kairos, every node in our fleet boots from an OCI image. That image is built from a known base (in our case RHEL-derived), baked with our approved security configuration, and published to our internal registry. The cloud-config model lets us define node behavior (SSH keys, network configuration, SSSD authentication against our Active Directory, Kubernetes agent registration) as versioned YAML that flows through FluxCD just like any other platform component.
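A minimal sketch of what such a cloud-config looks like; the user name, SSH key, and boot stage below are illustrative placeholders, and our production configuration (SSSD, network profiles, registry trust) is considerably larger:

#cloud-config
# Illustrative placeholders only; the real config lives in Git and is reconciled by FluxCD.
users:
  - name: "platform-admin"              # hypothetical break-glass local account
    groups:
      - "admin"
    ssh_authorized_keys:
      - "ssh-ed25519 AAAA... ops@example"
install:
  device: "auto"                        # let the installer choose the target disk
  reboot: true                          # reboot straight into the immutable image
stages:
  boot:
    - name: "Refresh the corporate CA trust store"
      commands:
        - update-ca-trust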
A CI/CD pipeline for operating system images
One of the less-discussed challenges of immutable infrastructure is the discipline it demands around image build and validation. We treat our Kairos images exactly like application container images: every change triggers a GitHub Actions pipeline that builds the image, runs integration tests against a live VM, and publishes a new OCI tag only on a clean pass. Nightly builds catch upstream regressions in base packages or the Kairos framework itself before they reach production.
This means our node image pipeline has the same properties we expect from application CI:
- Every commit is tested end-to-end, not just linted or statically analyzed.
- Nightly runs validate that the current pinned base image and package set still produces a bootable, correctly configured node.
- OCI tags are immutable artifacts. A tag that passed integration tests is never modified; rollback is a matter of pointing to a prior tag.
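As a rough sketch of such a workflow (the registry path, runner labels, and test script are placeholders for our internal equivalents, and the real pipeline also handles signing and promotion):

name: kairos-node-image
on:
  push:
    branches: [ main ]
  schedule:
    - cron: "0 2 * * *"        # nightly rebuild against the pinned base image
jobs:
  build-test-publish:
    runs-on: [ self-hosted ]
    steps:
      - uses: actions/checkout@v4
      - name: Build the OS image as an OCI artifact
        run: docker build -t registry.internal.example/platform/kairos-node:${{ github.sha }} .
      - name: Boot a disposable VM from the image and run integration checks
        run: ./ci/integration-test.sh registry.internal.example/platform/kairos-node:${{ github.sha }}
      - name: Publish the immutable tag only after a clean pass
        run: docker push registry.internal.example/platform/kairos-node:${{ github.sha }}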
Kubernetes-native VM provisioning with VirtRigaud
The other half of the VMware story is how we actually provision VMs from our Kairos images. Rather than reaching for imperative vSphere tooling, we use VirtRigaud, a Kubernetes operator that provides declarative VM management across multiple hypervisors (vSphere, Libvirt/KVM, and Proxmox) through a unified CRD API.
The model is straightforward: our Kairos-built OCI image is registered as a VMImage CRD, and VMs are expressed as VirtualMachine CRDs referencing that image. FluxCD reconciles these manifests like any other platform resource. The result is that provisioning a new Kairos node on vSphere is semantically identical to deploying a workload: it is a pull request, reviewed, merged, and reconciled automatically.
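In simplified form, that pairing looks roughly like the following; the API group, field names, and sizing are illustrative rather than the exact VirtRigaud schema:

# Field names are illustrative; consult the VirtRigaud CRD reference for the exact schema.
apiVersion: virtrigaud.io/v1alpha1       # assumed API group/version
kind: VMImage
metadata:
  name: kairos-node-v1-2-3
spec:
  source:
    oci: registry.internal.example/platform/kairos-node:v1.2.3
---
apiVersion: virtrigaud.io/v1alpha1
kind: VirtualMachine
metadata:
  name: trading-node-01
spec:
  providerRef:
    name: vsphere-prod                   # hypervisor credentials stay isolated in the provider pod
  imageRef:
    name: kairos-node-v1-2-3
  cpu: 8
  memoryGiB: 32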
VirtRigaud’s remote provider architecture also fits our security requirements well: provider credentials are isolated to their own pods, and the controller communicates with them over gRPC/TLS rather than embedding hypervisor credentials centrally.
The operational shift this created was significant:
- Drift is eliminated by design. There is no apt or yum running on production nodes. If a configuration change is needed, a new image is built, integration-tested, and nodes are rolled.
- Audit trails become trivial. Because every node’s configuration is an OCI digest in a registry and every VM is a versioned CRD in Git, we can answer “what was running on that node on that date?” with precision.
- VMware integration is fully GitOps-native. Nodes are provisioned, updated, and decommissioned through the same GitOps workflow as everything else on the platform.
The learning curve was real: getting kernel modules, NetworkManager, and enterprise authentication (SSSD/AD) right inside an immutable image took iteration. But once solved, the result is a node foundation we can genuinely trust, which matters when regulators ask questions.
k0rdent: Cluster lifecycle management as a platform
Immutable nodes solved the “what is running” problem. But we still needed to answer “how do clusters get created, updated, and decommissioned?” consistently across our entire fleet.
k0rdent, built on Cluster API (CAPI), gave us a Kubernetes-native control plane for managing Kubernetes clusters. Rather than treating cluster provisioning as a bespoke scripting exercise, k0rdent models clusters as CRDs. Combined with k0smotron for in-cluster control planes, we can now express our entire cluster topology declaratively, and FluxCD reconciles that state continuously.
Our choice of Kubernetes distribution for workload clusters was k0s, a CNCF Sandbox project.
k0s is a fully self-contained, single-binary Kubernetes distribution that relies on nothing beyond the host kernel. This becomes especially valuable when your nodes run an immutable OS: k0s installs cleanly into a Kairos image without needing package managers, runtime systemd unit file changes, or the host-level assumptions that tools like kubeadm typically require. Together, Kairos and k0s give us a complete node-to-cluster stack where every piece is declaratively defined, packaged as OCI artifacts, and reproducible from a fresh boot.
k0smotron takes this further by letting Kubernetes control planes run as workloads inside the management cluster, so even the control plane itself is expressed as a CRD, reconciled by FluxCD, with no out-of-band state.
The architecture we adopted follows a hub-and-spoke model:
- A management cluster runs k0rdent, k0smotron, and the CAPI controllers.
- Workload clusters run k0s, provisioned and decommissioned through CRD manifests stored in Git.
- MetalLB handles load-balancing on bare-metal segments; Traefik provides ingress with consistent configuration across all spoke clusters.
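A hedged sketch of what one of those workload-cluster manifests looks like in Git; the template name, sizing, and config keys are approximations of the k0rdent model rather than a manifest to copy verbatim:

# Approximate shape only; the exact k0rdent schema evolves with the project.
apiVersion: k0rdent.mirantis.com/v1alpha1   # assumed API group/version
kind: ClusterDeployment
metadata:
  name: risk-compute-eu-01
  namespace: kcm-system
spec:
  template: vsphere-k0s-standard            # a cluster template we maintain and review in Git
  credential: vsphere-prod-credential
  config:
    controlPlaneNumber: 3
    workersNumber: 12
    version: v1.31.1+k0s.0                  # bumping this line is how an upgrade begins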
Beyond initial setup, this approach changed how we handle ongoing operations:
- Cluster upgrades are a pull request. You update the desired Kubernetes version in a manifest, review it, and FluxCD applies it. There’s no ambiguity about who ran what command on which cluster.
- Cluster templates let us standardize configurations for common use cases (trading desk clusters, risk compute clusters, tooling clusters) and spin up new instances in minutes instead of days.
- Compliance posture is consistent by default. Since every cluster is expressed as code, our CEL-based admission webhooks and RBAC policies are applied uniformly at cluster creation time rather than added after the fact; a minimal example of such a CEL rule follows below.
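To make that last point concrete, here is one way a CEL admission rule can be expressed, shown with the in-tree ValidatingAdmissionPolicy API rather than our actual policy set; the label name and scope are illustrative, and a ValidatingAdmissionPolicyBinding (not shown) attaches the policy to its targets:

apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: require-team-label            # illustrative policy, not our production rule set
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
      - apiGroups: ["apps"]
        apiVersions: ["v1"]
        operations: ["CREATE", "UPDATE"]
        resources: ["deployments"]
  validations:
    - expression: "has(object.metadata.labels) && 'team' in object.metadata.labels"
      message: "every Deployment must carry a team label for audit attribution"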
We’re also using k0rdent as the foundation for a spot-computing scheduler that lets donated physical server capacity be absorbed dynamically into our platform, a capability we plan to share more about in a future post.
bindy: Kubernetes-native DNS operations
The last gap, and the one where no existing project fully met our needs, was DNS. In capital markets, DNS isn’t a commodity concern. Our trading applications, market data feeds, and risk systems rely on DNS heavily, and the enterprise infrastructure behind them has been built and maintained over decades.
At RBC Capital Markets, that infrastructure is Infoblox, an enterprise DDI platform deeply integrated into our network operations. But the integration model was designed for a pre-Kubernetes world: every DNS record request went through a ticketing workflow, routed to the network team, and processed on a timescale of hours or days. As our platform grew to 50+ clusters, each spinning up dozens of services and ingress endpoints, that provisioning delay became a real operational bottleneck, and the audit trail for DNS changes lived entirely outside our GitOps workflow.
bindy, a Kubernetes operator written in Rust using kube-rs, was created by Erick Bourgeois to close this gap. It manages DNS zones and records as first-class Kubernetes resources. The core design philosophy was to make DNS a GitOps citizen, with the same reconciliation guarantees we apply to everything else on the platform:
- Zones and records are CRDs. A DNSZone or ARecord manifest in Git is the source of truth, continuously reconciled by bindy’s controllers.
- RFC 2136 dynamic updates let bindy push record changes to the DNS backend without manual intervention or ticket queues.
- bindcar, a sidecar REST API, provides an RNDC interface that bindy’s controllers use for zone lifecycle operations (zone creation, deletion, reload) alongside dynamic updates.
- A multi-controller architecture with strict write boundaries prevents split-brain scenarios: selection controllers and sync controllers are separated, and sync state is stored on the synced resource to support force-reconciliation patterns.
The impact has been immediate. DNS records for new services are created automatically as part of the same GitOps workflow that deploys the service itself, provisioning time drops from hours to seconds, and the audit trail is Git history, not a ticket system. The rigid integration boundary that previously required human coordination on every DNS change is replaced by a reconciliation loop.
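In practice, the manifests that ride along with a service deployment look something like this; the API group, field names, zone, and address are placeholders meant to show the shape of the workflow, not the exact bindy schema:

# Placeholder fields for illustration; see the bindy CRD definitions for the real schema.
apiVersion: bindy.io/v1alpha1          # assumed API group/version
kind: DNSZone
metadata:
  name: apps-internal-example-com
spec:
  zone: apps.internal.example.com.
---
apiVersion: bindy.io/v1alpha1
kind: ARecord
metadata:
  name: risk-api
spec:
  zoneRef:
    name: apps-internal-example-com
  hostname: risk-api                   # resolves as risk-api.apps.internal.example.com
  ipv4: 10.20.30.40
  ttl: 300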
bindy is currently being expanded to support compliance scoring (a CRD-based model for zone health) and a future MCP server interface for integration with AI-driven platform tooling.
How the three fit together
What makes this stack coherent is that each layer builds on the same foundational principle: everything is code, reconciled continuously, with no manual state.
Git (source of truth)
└── FluxCD (reconciliation engine)
├── k0rdent / CAPI manifests → cluster lifecycle
├── Kairos cloud-config → node configuration
└── bindy CRDs → DNS records
Kairos ensures every node boots from a known, auditable image. k0rdent ensures every cluster is expressed and managed declaratively. bindy ensures every DNS record is a versioned artifact. FluxCD ties them all together as the single reconciliation plane. The result is a platform where drift at the node, cluster, or network level is structurally prevented rather than operationally managed.
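The FluxCD side of that diagram is ordinary Flux configuration; a simplified pair of manifests might look like the following, with the repository URL and directory layout standing in for our internal structure:

apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: platform
  namespace: flux-system
spec:
  interval: 1m
  url: https://git.internal.example/platform/fleet.git    # placeholder URL
  ref:
    branch: main
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: clusters
  namespace: flux-system
spec:
  interval: 5m
  prune: true                 # anything removed from Git is removed from the cluster
  sourceRef:
    kind: GitRepository
    name: platform
  path: ./clusters            # k0rdent/CAPI manifests; ./nodes and ./dns follow the same pattern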
Challenges and lessons learned
Building this platform taught us several things we wish we’d known earlier:
- Immutable OS adoption requires patience with enterprise integration. SSSD, NetworkManager, and corporate CA trust chains all need explicit attention when baking immutable images. Document everything; the day-two operator debugging a boot failure at 2 AM is often not the person who built the image.
- CRD-based cluster management shifts responsibility left. When cluster provisioning is a pull request, platform teams need to invest in review processes and template governance up front, or the simplicity of “just a YAML file” becomes its own source of drift.
- Building operators in Rust is the right long-term call, but the ecosystem is still maturing. kube-rs is excellent, but patterns for multi-controller architectures with reflector/store caching require deliberate design decisions that the community is still converging on.
Looking ahead
Our platform continues to evolve. Some of the areas we’re actively developing:
- SPIRE/SPIFFE integration for workload identity across all 50+ clusters, replacing certificate-per-service approaches with a hub-and-spoke SPIRE architecture that satisfies our zero-trust requirements.
- Foundry, an internal self-service API layer built in Rust that will expose cluster and DNS provisioning capabilities to development teams through a governed, event-driven interface.
- Kairos-based spot computing using k0smotron and Kata Containers to absorb donated physical server capacity dynamically.
We’re proud to be building on and contributing back to the CNCF ecosystem, and we look forward to continuing to share what we learn. If you’re working through similar challenges in a regulated environment, we’d love to connect: find us in the Kairos, k0rdent, and FluxCD Slack communities, or reach out directly on LinkedIn.
Erick Bourgeois is Director and Head of Kubernetes Platform Engineering at RBC Capital Markets, managing 50+ Kubernetes clusters across multi-cloud and on-premises environments. He is a KubeCon and FluxCon speaker, FINOS Common Cloud Control member, and open-source developer at github.com/firestoned.



