I’ll never forget our first time staying up all night for something that wasn’t a software bug.
It happened on a Tuesday. Grafana dashboards displayed empty panels for Cilium network metrics. Hubble was functioning perfectly — DNS visibility, TCP flows, and HTTP latency all appeared correctly in the Hubble UI. But the engineer on call at 2 AM couldn’t access any of this data through Grafana. The problem? Prometheus wasn’t configured with ServiceMonitors connected to Cilium’s agent and operator pods. Two Cloud Native Computing Foundation (CNCF) projects were both set up properly, yet they couldn’t see each other at all.
This situation illustrates what we call the integration tax. It represents the hidden expense of operating multiple CNCF projects side by side in production, and this is where platform teams dedicate 80% of their effort — not deploying projects or fine-tuning them separately, but connecting them so they can actually communicate.
Every team builds the same stack. Each team encounters unique failures.
The CNCF landscape features roughly 250 projects, but most real-world Kubernetes platforms rely on the same core set of 20–30 cloud native tools. Prometheus for monitoring. ArgoCD for GitOps. Cilium for networking. cert-manager for TLS certificates. Velero for backups. Sealed Secrets for credential management. Kyverno for governance policies. You install them and write the configuration files. Then the integration work begins, and with it come problems that never show up in any single project’s bug tracker.
Where CNCF projects collide
cert-manager conflicts with ingress controllers. We encountered this problem across three different cloud providers. cert-manager’s HTTP-01 ACME challenge requires serving a token over plain HTTP. However, if your ingress controller applies a blanket HTTP-to-HTTPS redirect (which is the security best practice), every ACME verification request receives a 301 redirect before it ever reaches cert-manager’s solver pod. Certificate renewals fail silently. You discover the problem when visitors see expired-certificate warnings in their browsers. The solution is to switch to DNS-01 challenges through Route53, Cloud DNS, or Azure DNS. However, the cloud-specific IAM configuration this requires isn’t included in any default Helm chart setup. You only learn about these constraints after something goes wrong.
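As a rough illustration of the DNS-01 switch, the sketch below builds a cert-manager ClusterIssuer manifest with a Route53 solver. Python is used only so the manifest can be generated and inspected. The issuer name, email address, region, hosted zone ID, and the choice of Let’s Encrypt as the ACME server are all placeholder assumptions, and the IAM permissions that allow cert-manager to write Route53 records still have to be set up separately.

```python
import json

def route53_cluster_issuer(name, email, region, hosted_zone_id):
    """Build a ClusterIssuer that answers ACME challenges over DNS-01
    via Route53, so no plain-HTTP request has to reach a solver pod."""
    return {
        "apiVersion": "cert-manager.io/v1",
        "kind": "ClusterIssuer",
        "metadata": {"name": name},
        "spec": {
            "acme": {
                # Assumption: Let's Encrypt production endpoint.
                "server": "https://acme-v02.api.letsencrypt.org/directory",
                "email": email,
                "privateKeySecretRef": {"name": f"{name}-account-key"},
                "solvers": [
                    {
                        "dns01": {
                            "route53": {
                                "region": region,
                                "hostedZoneID": hosted_zone_id,
                            }
                        }
                    }
                ],
            }
        },
    }

# All argument values below are placeholders.
issuer = route53_cluster_issuer(
    "letsencrypt-dns", "ops@example.com", "eu-central-1", "Z0000000EXAMPLE"
)
print(json.dumps(issuer, indent=2))
```

Because Kubernetes accepts JSON manifests, the printed output could be applied directly or templated into a chart.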
Prometheus conflicts with kubelet. This particular issue required weeks of troubleshooting. kubelet exposes metrics through four different scrape endpoints. Two of them, /metrics and /metrics/probes, both output process_start_time_seconds with exactly the same timestamps because they belong to the same process. Prometheus faithfully collects from both sources, detects duplicate samples, and fires PrometheusDuplicateTimestamps. The alert creates constant noise. The underlying cause only becomes apparent by examining the kubelet source code itself, and the solution requires a Jsonnet relabeling rule that eliminates one entire scrape endpoint. Neither of these problems is a bug. Each project behaves precisely as specified in its documentation. The issues emerge in the spaces between them.
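The kubelet collision described above can be illustrated with a toy model. This is not Prometheus’s actual ingestion logic, only a simplified sketch: samples are treated as duplicates when metric name and timestamp match, and the fix is modeled as discarding everything scraped from one endpoint. The sample data, including the second metric name, is illustrative.

```python
from collections import Counter

def duplicate_samples(samples):
    """Return (metric, timestamp) pairs that occur more than once."""
    counts = Counter((s["metric"], s["timestamp"]) for s in samples)
    return sorted(key for key, n in counts.items() if n > 1)

def drop_endpoint(samples, endpoint):
    """Model a rule that discards everything scraped from one endpoint."""
    return [s for s in samples if s["endpoint"] != endpoint]

# Illustrative scrape results: the same process metric arrives twice.
scraped = [
    {"endpoint": "/metrics", "metric": "process_start_time_seconds",
     "timestamp": 1700000000},
    {"endpoint": "/metrics/probes", "metric": "process_start_time_seconds",
     "timestamp": 1700000000},
    {"endpoint": "/metrics/probes", "metric": "prober_probe_total",
     "timestamp": 1700000000},
]

print(duplicate_samples(scraped))
# [('process_start_time_seconds', 1700000000)]

cleaned = drop_endpoint(scraped, "/metrics/probes")
print(duplicate_samples(cleaned))
# []
```

The model also makes the trade-off visible: dropping the endpoint removes the duplicate, but every other metric served only by that endpoint disappears with it.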
Cluster API delivered a unified workflow across four clouds
Prior to Cluster API (CAPI), creating clusters meant relying on each cloud provider’s command-line tools: eksctl for AWS, gcloud container clusters create for GCP, az aks create for Azure. Every platform had its own lifecycle patterns, upgrade procedures, and disaster recovery approaches. The lock-in went beyond the cloud provider itself; you were bound to that provider’s specific way of managing Kubernetes.
Cluster API transformed the approach. Your cluster becomes a collection of Kubernetes-native resources (Cluster, MachineDeployment, MachinePool), with a cloud-specific provider converting these into actual infrastructure. We operate CAPA on AWS, CAPG on GCP, CAPZ on Azure, and CAPH on Hetzner bare metal systems. The bootstrap process stays consistent throughout: k3d management cluster → install provider → generate workload cluster → clusterctl move to make the workload cluster self-managing.
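The bootstrap flow above might look roughly like the following command sequence. The sketch only assembles the commands as strings and executes nothing. The cluster name, Kubernetes version, and file names are placeholders, and provider credentials and environment variables are deliberately left out.

```python
# Map the provider acronyms to clusterctl infrastructure provider names.
PROVIDERS = {"CAPA": "aws", "CAPG": "gcp", "CAPZ": "azure", "CAPH": "hetzner"}

def bootstrap_commands(provider, cluster, k8s_version, mgmt_cluster="bootstrap"):
    """Return the shell commands for the bootstrap-and-pivot flow."""
    infra = PROVIDERS[provider]
    manifest = f"{cluster}.yaml"
    kubeconfig = f"{cluster}.kubeconfig"
    return [
        # 1. Temporary local management cluster.
        f"k3d cluster create {mgmt_cluster}",
        # 2. Install Cluster API plus the infrastructure provider.
        f"clusterctl init --infrastructure {infra}",
        # 3. Render and apply the workload cluster resources.
        f"clusterctl generate cluster {cluster} "
        f"--kubernetes-version {k8s_version} > {manifest}",
        f"kubectl apply -f {manifest}",
        # 4. Pivot: install the providers on the workload cluster, then
        #    move the CAPI resources there so it manages itself.
        f"clusterctl get kubeconfig {cluster} > {kubeconfig}",
        f"clusterctl init --kubeconfig {kubeconfig} --infrastructure {infra}",
        f"clusterctl move --to-kubeconfig {kubeconfig}",
    ]

for command in bootstrap_commands("CAPH", "prod", "v1.29.4"):
    print(command)
```

Only the infrastructure name changes between clouds; the sequence itself is the same for all four providers.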
The genuine advantage appears with ongoing operations. Upgrading a Kubernetes version involves modifying a single line in a MachineDeployment. Cluster API takes care of cordoning, draining, and rolling node replacement. A MachineHealthCheck replaces faulty nodes without manual intervention. Disaster recovery requires rebuilding the management cluster, restoring Velero backups from cloud storage, and allowing CAPI resources to reconcile. The full cluster reconstructs itself from Git state. This is the real test of whether your integration work holds up when things go wrong.
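To make the single-line upgrade concrete, here is a minimal sketch that bumps the version field of a MachineDeployment represented as a Python dictionary. The resource shown is a trimmed, illustrative manifest; the names and version numbers are assumptions.

```python
import copy

def bump_kubernetes_version(machine_deployment, new_version):
    """Return a copy of the MachineDeployment with the node version changed.

    spec.template.spec.version is the field Cluster API watches; changing
    it triggers the rolling replacement of the machines."""
    updated = copy.deepcopy(machine_deployment)
    updated["spec"]["template"]["spec"]["version"] = new_version
    return updated

# Trimmed, illustrative MachineDeployment.
workers = {
    "apiVersion": "cluster.x-k8s.io/v1beta1",
    "kind": "MachineDeployment",
    "metadata": {"name": "prod-workers"},
    "spec": {
        "clusterName": "prod",
        "replicas": 3,
        "template": {"spec": {"clusterName": "prod", "version": "v1.29.4"}},
    },
}

upgraded = bump_kubernetes_version(workers, "v1.30.1")
print(workers["spec"]["template"]["spec"]["version"])   # v1.29.4
print(upgraded["spec"]["template"]["spec"]["version"])  # v1.30.1
```

In a GitOps setup the same change would be a one-line diff in the committed manifest rather than a function call.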
The architecture that ultimately resolved the issues
Following years of addressing integration problems across multiple cloud environments, we identified an approach that brought stability: a two-repository GitOps structure. This strategy works whether you’re utilizing commercial solutions or constructing your own stack from open-source tools.
Platform repository: Contains 100+ Helm charts with configurations tested in production. Cilium NetworkPolicies integrated within each chart. Prometheus ServiceMonitors pre-configured. cert-manager annotations set for appropriate challenge methods. These settings apply universally across every cluster and cloud environment.
Configuration repository: Single instance per customer or environment. Includes only the values that truly differ across clusters: domain names, node quantities, GCP project identifiers, AWS account permissions, and Hetzner server configurations.
ArgoCD monitors both repositories. When we resolved the Prometheus duplicate timestamps problem in the platform repository, that correction distributed to every cluster (AWS, GCP, Azure, bare metal) through a single version update. One pull request. No individual cluster tickets. No need for anyone to remember applying the relabeling rule adjustment across multiple systems; the integration logic exists in code.
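The split between the two repositories can be pictured as layered values: platform defaults first, per-cluster overrides on top. The sketch below is one possible way to express that merge in Python; the keys and values are hypothetical and not taken from any real chart.

```python
def deep_merge(defaults, overrides):
    """Recursively merge overrides into defaults without mutating either.

    Nested dictionaries are merged key by key; any other value
    (including a list) is replaced wholesale by the override."""
    merged = dict(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

# Platform repository: settings shared by every cluster (hypothetical keys).
platform_defaults = {
    "certManager": {"challenge": "dns01"},
    "prometheus": {"serviceMonitors": True, "retention": "15d"},
}

# Configuration repository: only what differs for this cluster.
cluster_values = {
    "domain": "cluster-a.example.com",
    "nodeCount": 5,
    "prometheus": {"retention": "30d"},
}

effective = deep_merge(platform_defaults, cluster_values)
print(effective["prometheus"])
# {'serviceMonitors': True, 'retention': '30d'}
```

A fix committed to the platform defaults reaches every cluster on the next version bump, while the per-cluster file stays small enough to review at a glance.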
Essential lessons learned from production experience
Automate your monitoring creation rather than building it manually. We rely on Jsonnet to generate the complete kube-prometheus stack from individual per-cluster variables files. Monitoring rules don’t have to be hand-written YAML. Custom alerting configurations (Velero backup age checks, CloudNativePG replication lag alerts, and kubelet certificate expiry) are Jsonnet libraries that live side by side with upstream rules. A single build.sh script generates everything. The output is reproducible, diffable, and version-controlled. When a Prometheus upgrade breaks your custom rules, the change shows up in the diff immediately, and you can test the fix before it ever hits production.
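The idea of generating rules from per-cluster variables can be sketched without Jsonnet. The example below uses Python as a stand-in and is not the authors’ library. The alert name, thresholds, and cluster names are hypothetical, and the expression assumes Velero’s velero_backup_last_successful_timestamp metric is being scraped.

```python
import json

def backup_age_rule(max_age_hours, severity="warning"):
    """Build an alerting rule that fires when the newest successful
    Velero backup is older than max_age_hours."""
    max_age_seconds = max_age_hours * 3600
    return {
        "alert": "VeleroBackupTooOld",  # hypothetical alert name
        "expr": "time() - velero_backup_last_successful_timestamp "
                f"> {max_age_seconds}",
        "for": "15m",
        "labels": {"severity": severity},
        "annotations": {
            "summary": f"No successful Velero backup in {max_age_hours}h"
        },
    }

def prometheus_rule(name, rules):
    """Wrap alerting rules in a PrometheusRule custom resource."""
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "PrometheusRule",
        "metadata": {"name": name},
        "spec": {"groups": [{"name": f"{name}.rules", "rules": rules}]},
    }

# Hypothetical per-cluster variables.
cluster_vars = {
    "cluster-a": {"backup_max_age_hours": 24},
    "cluster-b": {"backup_max_age_hours": 48},
}

generated = {
    cluster: prometheus_rule(
        "velero-backups", [backup_age_rule(v["backup_max_age_hours"])]
    )
    for cluster, v in cluster_vars.items()
}
print(json.dumps(generated["cluster-a"], indent=2, sort_keys=True))
```

Serializing with sorted keys keeps the generated output stable between runs, which is what makes the diff meaningful after an upgrade.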
Embed NetworkPolicies inside your charts; don’t stash them in post-deployment runbooks. Across more than 20 Helm charts, we include Cilium NetworkPolicy templates directly in each release. Every chart declares exactly which external APIs it calls and which internal services it depends on. Trying to reconstruct network rules from Hubble flow logs after deployment is like writing tests after shipping to production. Policies drift. Security turns into a guessing game. By baking policies into the charts, you guarantee each rule lives exactly where it’s maintained.
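A chart-level policy that declares its dependencies might be generated along these lines. This is a sketch under assumptions: the app label key, the application names, and the external hostname are hypothetical, and the DNS rule follows the common pattern of allowing lookups through kube-dns so that FQDN-based rules can be resolved.

```python
import json

def cilium_egress_policy(app, external_fqdns, internal_apps):
    """Build a CiliumNetworkPolicy that limits an app's egress to the
    external hosts and internal services it declares."""
    egress = [
        {
            # Allow DNS via kube-dns so the FQDN rules below can resolve.
            "toEndpoints": [{"matchLabels": {
                "k8s:io.kubernetes.pod.namespace": "kube-system",
                "k8s:k8s-app": "kube-dns",
            }}],
            "toPorts": [{
                "ports": [{"port": "53", "protocol": "ANY"}],
                "rules": {"dns": [{"matchPattern": "*"}]},
            }],
        }
    ]
    if external_fqdns:
        egress.append({
            "toFQDNs": [{"matchName": host} for host in external_fqdns],
            "toPorts": [{"ports": [{"port": "443", "protocol": "TCP"}]}],
        })
    if internal_apps:
        egress.append({
            # Assumption: workloads are labelled with an "app" key.
            "toEndpoints": [{"matchLabels": {"app": name}}
                            for name in internal_apps],
        })
    return {
        "apiVersion": "cilium.io/v2",
        "kind": "CiliumNetworkPolicy",
        "metadata": {"name": f"{app}-egress"},
        "spec": {
            "endpointSelector": {"matchLabels": {"app": app}},
            "egress": egress,
        },
    }

policy = cilium_egress_policy("billing", ["api.example.com"], ["postgres"])
print(json.dumps(policy, indent=2))
```

Keeping the two dependency lists next to the chart’s own values makes a new outbound call show up as a reviewable diff instead of a surprise in flow logs.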
Automate disaster recovery right at bootstrap time. When provisioning begins, we create cloud storage buckets — S3, GCS, Azure Blob — for Velero backups as part of the initial cluster setup, not as a follow-up task that sits in a Jira backlog for six months. Once the bootstrap is complete, recovering from total cluster loss is possible with the tools already in place. Disaster recovery shifts from wishful thinking to a testable, repeatable process.
Encrypt secrets first — then commit them. Every credential — deploy keys, cloud IAM tokens, TLS certificates — is encrypted via Sealed Secrets before entering Git. The decryption key gets safely backed up to cloud storage. Your Git repo then becomes a fully auditable snapshot of every cluster’s complete state, secrets included. Drift detection works. Recovery comes down to one pull request and one clusterctl move.
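One way to back up the encrypt-first rule is a check that refuses plain Secret manifests before they are committed. The helper below is a hypothetical sketch that works on already-parsed manifests; how the files are loaded and where the check runs (pre-commit hook, CI job) is left open.

```python
def plaintext_secrets(manifests):
    """Return the names of manifests that are plain Kubernetes Secrets.

    SealedSecret resources pass, because their payload is already
    encrypted; a bare Secret is only encoded, not encrypted."""
    return [
        m.get("metadata", {}).get("name", "<unnamed>")
        for m in manifests
        if m.get("kind") == "Secret"
    ]

# Illustrative manifests, as they might look after parsing the repo.
manifests = [
    {
        "apiVersion": "bitnami.com/v1alpha1",
        "kind": "SealedSecret",
        "metadata": {"name": "deploy-key"},
        "spec": {"encryptedData": {"ssh-privatekey": "AgB...placeholder"}},
    },
    {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": "cloud-iam-token"},
        "data": {"token": "cGxhY2Vob2xkZXI="},
    },
]

offenders = plaintext_secrets(manifests)
print(offenders)
# ['cloud-iam-token']
```

A non-empty result would fail the commit, which keeps unencrypted credentials out of Git history in the first place.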
Let machines enforce policy, not humans. Kyverno blocks any deployment missing resource limits. Kubescape continuously scans CIS benchmarks and feeds violations into Prometheus alerts. Paired with Cilium network segmentation, your security posture becomes something auditors can verify directly from Git history and live cluster state — not from a spreadsheet that last saw an update two quarters ago.
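The resource-limits rule can be expressed as a small predicate. The sketch below shows, in plain Python, the kind of check such a policy performs on a Pod spec. It is not Kyverno’s policy syntax, which is declared as Kubernetes resources, and the pod and container names are illustrative.

```python
def containers_missing_limits(pod, required=("cpu", "memory")):
    """Return the names of containers lacking any of the required limits."""
    missing = []
    for container in pod.get("spec", {}).get("containers", []):
        # Treat an absent or empty resources/limits section as no limits.
        limits = (container.get("resources") or {}).get("limits") or {}
        if any(resource not in limits for resource in required):
            missing.append(container["name"])
    return missing

# Illustrative Pod: one compliant container, one without limits.
pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "api"},
    "spec": {
        "containers": [
            {"name": "app", "image": "example/app:1.0",
             "resources": {"limits": {"cpu": "500m", "memory": "256Mi"}}},
            {"name": "sidecar", "image": "example/sidecar:1.0"},
        ]
    },
}

violations = containers_missing_limits(pod)
print(violations)
# ['sidecar']
```

An admission policy applies the same test at deploy time and rejects the request when the list is not empty, so nobody has to remember to check.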
The compounding cost
The integration tax isn’t a one-time expense. Every Kubernetes version bump, every Helm chart upgrade, and every new CNCF project adds fresh integration work. If your monitoring stack relies on hand-written YAML, upgrading kube-prometheus from v0.13 to v0.17 means painstakingly diffing hundreds of generated files. If it’s managed as Jsonnet, the change is a single line. Ignore the tax, and the debt keeps compounding.
The CNCF ecosystem is genuinely powerful. But power without deliberate integration is just a folder of Helm installs. The work that truly matters (drift detection, coordinated upgrades, disaster recovery automation) happens in the wiring between components. That’s where your platform either survives into its second year or becomes a collection of tools you no longer trust.



