May 29, 2026 By Abu Hena Mostafa Kamal, CNCF Kubestronaut & Senior Software Engineer
Modern software delivery is no longer limited by application code; it is shaped by the platform running it. This post explains how we designed a cloud-native Internal Developer Platform (IDP) using Kubernetes and tools from the CNCF ecosystem. You’ll see how Infrastructure as Code (IaC), GitOps, and security-first pipelines can work together to form a unified, operationally reliable platform. While some examples use managed AKS (Azure Kubernetes Service), these architectural patterns apply equally to any CNCF-conformant Kubernetes distribution.
Distributed systems today often struggle with several operational issues that inspired this platform design:
Inconsistent deployments caused by manual processes
No version control for infrastructure or drift management, leading to differences between environments
Hardcoded secrets and weak security practices baked into CI/CD pipelines
Scaling strategies that waste resources and increase costs
Limited ability to recover from failed deployments or roll back changes
Disjointed observability that slows down troubleshooting and root cause analysis
The architecture presented here tackles all of these problems through declarative, automated, and policy-driven controls.
Design Principles
Our platform was built following key CNCF-aligned principles that guided every decision:
Declarative infrastructure — Every resource is version-controlled and reproducible
GitOps-based deployments with Argo CD — Git serves as the single source of truth for the cluster
Immutable infrastructure and containerized workloads — No manual changes to live systems
Security-by-design — Built into threat modeling, CI/CD, and runtime
Observability as a core capability — Not tacked on after deployment
Clear separation of concerns — Modular design across infrastructure, platform, and application layers
The platform is organized into three distinct layers, each with clear responsibilities. Merging them too early led to unnecessary complexity, so the codebase now keeps infrastructure, platform, and application components in separate repositories or directories. The Infrastructure Layer installs the Argo CD GitOps controller. Once running, Argo CD takes over, continuously syncing both Platform Layer components and Application Layer resources to match what’s defined in Git.
1. Infrastructure Layer
This layer provisions all cloud resources using Terraform, organized into reusable modules:
Virtual Networks (VNet), subnets, and Network Security Groups
Managed Kubernetes cluster
Container Registry
Identity, access settings, and Secret Stores
2. Platform Layer
Built on top of Kubernetes and powered by CNCF tools, this layer is installed and managed declaratively in its own repository or dedicated directories:
Argo CD — GitOps engine for continuous reconciliation
Istio — Service mesh handling traffic routing, mTLS, and service-level observability
Prometheus — Metrics collection and alerting
Grafana — Visualization dashboards
Loki — Centralized log aggregation
Kyverno — Enforces Policy as Code at admission time
3. Application Layer
Microservices run as containerized workloads and are independently managed through Git:
Independently deployable services — No shared release schedules
Helm-packaged for smooth environment promotion
Git-driven deployment lifecycle with full audit history
End-to-End Deployment Workflow
The platform uses a multi-stage delivery process that enforces strict separation between app building, security checks, and infrastructure setup. Here’s how everything flows — from static analysis to build to deployment.
Stage 1: Platform Prerequisites
Everything starts with a few essential components needed to power automation and pipelines:
A container image registry for storing signed, versioned artifacts
A Terraform remote backend for state management and team collaboration
A secure cloud provider service connection for running pipelines
Stage 2: Application pipeline
The application pipeline runs with every commit made to application codebases (Java or Angular services). Its main task is to build a secure, tested, and deployable container image. Every update moves through these steps:
Source code compilation and build
Running unit and integration tests
Conducting static code analysis via SAST (Static Application Security Testing)
Checking third-party dependencies for known vulnerabilities with Trivy
Building the container image
Signing the image with Cosign to prove it is authentic and unaltered
Uploading the final signed image to the container registry
Only validated, versioned, and signed (tamper-evident) images are deployed into the environment. The sample pipeline config below illustrates the Cosign signing process used during CI.
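A minimal sketch of such a signing step, not the exact pipeline used on the platform. It assumes Azure Pipelines syntax, key-based signing with a private key file named cosign.key already available on the build agent, and hypothetical pipeline variables (imageRepository, imageDigest, cosignKeyPassword). Because Cosign stores signatures in the registry alongside the image, the sketch signs an image that has already been pushed, and it signs by digest rather than by tag.

```yaml
# Sketch only: Azure Pipelines syntax and all variable names are assumptions.
steps:
  - script: |
      set -euo pipefail
      # Sign the pushed image by digest so the signature is bound to
      # immutable content rather than to a tag that can be moved.
      # --yes skips interactive confirmation prompts in CI.
      cosign sign --yes --key cosign.key "$IMAGE_REF"
    displayName: Sign container image with Cosign
    env:
      # Hypothetical variables set by the earlier build-and-push steps.
      IMAGE_REF: $(imageRepository)@$(imageDigest)
      # Cosign reads the private key password from COSIGN_PASSWORD.
      COSIGN_PASSWORD: $(cosignKeyPassword)
```

A keyless setup would drop the --key flag and authenticate with an OIDC identity instead; whether the CI system can supply a suitable identity token is environment-specific.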
Stage 3: Security validation pipeline
Before any infrastructure update or deployment proceeds, a dedicated security validation pipeline adds another layer of verification. It checks both images and deployment configurations:
Confirming container image signatures with Cosign
Scanning images for vulnerabilities via Trivy against a minimum severity threshold
Validating Kubernetes manifests with KubeSec to spot misconfigurations and insecure settings
Only workloads passing all three steps get approval for deployment.
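The three checks above could be chained in a single pipeline step along these lines. This is a sketch under assumptions: Azure Pipelines syntax, a public key file named cosign.pub, a HIGH/CRITICAL severity threshold, and a hypothetical manifest path. None of these specifics are stated in the article.

```yaml
# Sketch only: CI syntax, file names, threshold and paths are assumptions.
steps:
  - script: |
      set -euo pipefail
      # 1. Fail unless the image carries a signature that verifies
      #    against the expected public key.
      cosign verify --key cosign.pub "$IMAGE_REF"
      # 2. Fail (exit code 1) on any HIGH or CRITICAL vulnerability.
      trivy image --exit-code 1 --severity HIGH,CRITICAL "$IMAGE_REF"
      # 3. Scan the manifest for risky settings; this prints a JSON report.
      #    Turning the score into a pass/fail gate is left out of this sketch.
      kubesec scan manifests/deployment.yaml
    displayName: Validate image and manifests
    env:
      # Hypothetical variables carried over from the application pipeline.
      IMAGE_REF: $(imageRepository)@$(imageDigest)
```

Because the script runs with set -e, the first failing check stops the step, so a later stage only runs when every preceding check has succeeded.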
Stage 4: Infrastructure provisioning pipeline
Once security checks pass, the infrastructure provisioning pipeline runs. This phase sets up the Kubernetes foundation:
Setting up virtual networks (VNets, subnets, routing)
Deploying a managed Kubernetes cluster with auto-scaling node pools
Installing Argo CD as the central GitOps controller, a core platform feature
Creating Argo CD Application resources during initial setup
Linking infrastructure Git repositories to Argo CD
The Terraform module below for the Kubernetes cluster shows the setup, including Key Vault integration via CSI driver and Calico network policies:
After infrastructure is ready, the platform follows a GitOps approach where Git serves as the single source of truth. Argo CD continuously synchronizes both platform and application components by tracking Kubernetes manifests and Helm charts. Updates committed to Git are automatically reflected on running clusters, keeping environments in sync. This approach offers:
Automatic synchronization — no need to manually run kubectl commands
Complete audit trail via Git history and sync status
Simple rollbacks using standard Git procedures
The Argo CD Application manifest below shows a microservice configured for automated syncing with self-healing and pruning enabled:
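A minimal sketch of such an Application, assuming a Helm chart stored in a Git repository. The service name, repository URL, chart path, values file, and target namespace are all hypothetical placeholders.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: orders-service              # hypothetical service name
  namespace: argocd                 # namespace where Argo CD is installed (assumed)
spec:
  project: default
  source:
    repoURL: https://git.example.com/platform/apps.git   # hypothetical repository
    targetRevision: main
    path: charts/orders-service     # hypothetical chart path
    helm:
      valueFiles:
        - values-staging.yaml       # hypothetical per-environment values
  destination:
    server: https://kubernetes.default.svc   # the cluster Argo CD runs in
    namespace: orders
  syncPolicy:
    automated:
      prune: true      # delete resources that were removed from Git
      selfHeal: true   # revert manual changes made directly on the cluster
    syncOptions:
      - CreateNamespace=true
```

With selfHeal enabled, a manual kubectl edit on a managed resource is reverted on the next reconciliation, which is what makes Git the effective source of truth.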
With infrastructure and workloads live, external users reach the platform through a cloud load balancer. Requests are forwarded to the API Gateway or Ingress layer, which routes traffic to the correct Kubernetes Services. These services distribute traffic evenly across available application Pods, where requests are processed and responses are sent back.
Security architecture
Security is woven into every stage of the platform lifecycle — not tacked on at the end. It covers supply chain integrity, policy enforcement, runtime protection, and secret management.
1. Supply Chain Security
Security starts at the artifact level by guaranteeing that only trusted and verified components make their way into the system:
Trivy checks container images and their dependencies for any known vulnerabilities
KubeSec examines Kubernetes manifests to catch insecure configurations as early as possible
Cosign enables cryptographic signing and verification of container images, safeguarding both integrity and origin through keyless signing based on OIDC
Running these checks together guarantees that only scanned, validated, and signed artifacts proceed to deployment.
2. Enforcing Policies with Kyverno
Within the cluster, Kyverno applies policies during admission, blocking non-compliant workloads from ever being scheduled. One example of our baseline rules is preventing pods from using the “latest” tag:
Kyverno ClusterPolicy — Blocking the Latest Tag
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-latest-tag
  annotations:
    policies.kyverno.io/title: Disallow Latest Tag
    policies.kyverno.io/description: >-
      Enforce image tags tied to a specific version.
      Using the 'latest' tag is unpredictable since it can change without warning.
spec:
  # Reject non-compliant Pods at admission instead of only reporting them.
  validationFailureAction: Enforce
  # Also evaluate already-running Pods during background scans.
  background: true
  rules:
    # Rule 1: every container image must carry an explicit tag.
    - name: require-image-tag
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: >-
          An image tag is required. Use a versioned tag instead.
        pattern:
          spec:
            containers:
              - image: "*:*"
    # Rule 2: the tag must not be 'latest'.
    - name: disallow-latest-tag
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "Use of the 'latest' image tag is forbidden."
        pattern:
          spec:
            containers:
              - image: "!*:latest"
3. Runtime Security
While controls before deployment are important, they aren’t enough on their own. Runtime security tools track system behavior and flag anomalies while workloads are running:
Falco detects suspicious activity in real time within containers and on the host, with alerts feeding directly into the monitoring stack
AppArmor applies kernel-level security profiles that limit container capabilities and shrink the overall attack surface
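As an illustration of the kind of detection Falco provides, the following is a hypothetical custom rule, not one taken from the platform. It assumes Falco's stock rules file is loaded first, since the spawned_process and container macros are defined there.

```yaml
# Hypothetical custom Falco rule; relies on macros from Falco's default rules.
- rule: Shell Spawned In Application Container
  desc: Detect a shell process starting inside a running container
  condition: >
    spawned_process and container
    and proc.name in (bash, sh, zsh)
  output: >
    Shell started in container
    (user=%user.name container=%container.name
    image=%container.image.repository cmd=%proc.cmdline)
  priority: WARNING
  tags: [container, shell]
```

A rule like this tends to be noisy for images that legitimately run shell entrypoints, so in practice it would usually be scoped with extra conditions on namespace or image.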
4. Handling Secrets
Sensitive data is kept outside of application code and deployment files to avoid any risk of exposure:
Key Vault, connected through the CSI Secrets Store driver, dynamically injects secrets into workloads when each pod starts
Secrets are never placed in Git repositories or baked into container images
Rotation is managed centrally in Key Vault and automatically picked up by active workloads
This method keeps secret handling centralized, auditable, and secure by design.
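A sketch of how a workload could mount a Key Vault secret through the Secrets Store CSI driver. It assumes the Azure provider with a user-assigned managed identity. The vault name, tenant ID, identity client ID, secret name, image, and namespace are hypothetical placeholders.

```yaml
apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: orders-keyvault             # hypothetical name
  namespace: orders
spec:
  provider: azure
  parameters:
    usePodIdentity: "false"
    useVMManagedIdentity: "true"
    userAssignedIdentityID: "<identity-client-id>"   # placeholder
    keyvaultName: "<key-vault-name>"                 # placeholder
    tenantId: "<tenant-id>"                          # placeholder
    objects: |
      array:
        - |
          objectName: db-password   # hypothetical secret name in Key Vault
          objectType: secret
---
apiVersion: v1
kind: Pod
metadata:
  name: orders-service
  namespace: orders
spec:
  containers:
    - name: app
      image: registry.example.com/orders-service:1.4.2   # hypothetical image
      volumeMounts:
        - name: secrets-store
          mountPath: /mnt/secrets   # secret appears as the file /mnt/secrets/db-password
          readOnly: true
  volumes:
    - name: secrets-store
      csi:
        driver: secrets-store.csi.k8s.io
        readOnly: true
        volumeAttributes:
          secretProviderClass: orders-keyvault
```

Rotated values reach the mounted files only when the driver's rotation feature is enabled, and an application that reads the file once at startup still needs a reload or restart to pick up the new value.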
5. Networking and Traffic Control
The networking layer merges Kubernetes-native components with Istio’s service mesh features to deliver secure, observable, and policy-based traffic management:
Kubernetes Services internally expose workloads with stable DNS-based discovery
Azure Load Balancer manages external traffic with built-in DDoS protection at the network edge
Istio handles traffic routing, mTLS encryption between services, and service-level observability
Calico CNI applies network policies that block lateral movement between namespaces
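One common way to limit lateral movement between namespaces is a standard Kubernetes NetworkPolicy, which Calico enforces. The sketch below uses a hypothetical orders namespace and is not taken from the platform's actual policy set.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-same-namespace-only
  namespace: orders                 # hypothetical namespace
spec:
  podSelector: {}                   # applies to every pod in the namespace
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector: {}           # allow traffic only from pods in this same namespace
```

A policy this strict also blocks an ingress gateway running in another namespace, so a real setup would add an explicit rule allowing that gateway's namespace.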
A notable lesson from enabling Istio mTLS was that turning on Strict mode across the entire cluster too soon caused outages, because not all workloads had their sidecars injected yet. Two of Istio’s mTLS modes matter here: Permissive (accepts both plaintext and mTLS) and Strict (accepts mTLS only). The solution was to begin in Permissive mode and then progressively switch each namespace to Strict mode via PeerAuthentication, only after confirming that every workload in that namespace had its sidecar injected.
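The staged rollout described above can be sketched with two PeerAuthentication resources. It assumes istio-system is the mesh root namespace and uses a hypothetical orders namespace as the first one to be switched.

```yaml
# Mesh-wide default: accept both plaintext and mTLS while sidecars roll out.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system           # root namespace, so this applies mesh-wide
spec:
  mtls:
    mode: PERMISSIVE
---
# Namespace override: enforce mTLS once every workload here has a sidecar.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: orders                 # hypothetical namespace
spec:
  mtls:
    mode: STRICT
```

The namespace-level policy takes precedence over the mesh-wide one, so namespaces can be moved to Strict one at a time and reverted individually if a workload turns out to be missing its sidecar.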
Monitoring and Observability
Observability is built as a cohesive system from three complementary tools, with metrics and logs displayed through a unified Grafana interface:
| Tool | Signal Type | Primary Use |
| --- | --- | --- |
| Prometheus | Metrics | Resource tracking, SLO monitoring, alerts |
| Grafana | Visualization | Dashboards, SLA reporting, incident response |
| Loki | Logs | Centralized log collection, correlation with traces |
We chose Prometheus, Grafana, and Loki to match a Kubernetes-native observability model. Prometheus captures metrics, Loki gathers logs using lightweight label-based indexing, and Grafana ties everything together in a single visualization layer. This setup cuts operational cost and complexity significantly compared to managing a separate Elasticsearch and Kibana stack.
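As an example of the alerting side, a Prometheus rule file might contain a group like the one below. This is a sketch: the alert name, threshold, and labels are hypothetical, and the metric assumes kube-state-metrics is being scraped.

```yaml
groups:
  - name: workload-health           # hypothetical rule group
    rules:
      - alert: ContainerRestartingFrequently
        # More than 3 restarts of the same container within 15 minutes.
        expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
        for: 5m                     # condition must hold for 5 minutes before firing
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.container }} in {{ $labels.namespace }}/{{ $labels.pod }} is restarting repeatedly"
```

The namespace and pod labels on the alert can then be used in Grafana to jump to the matching Loki log stream for the same workload.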
Infrastructure as Code Strategy
Terraform is organized into modular components that mirror the platform’s layered design, allowing each one to be versioned and tested independently:
staging.tfvars — mirrors production topology with synthetic load testing
prod.tfvars — full-scale node pools, strict policies, backup schedules active
This approach ensures consistency across environments, maximizes reusability, and supports controlled, environment-specific adjustments without duplicating any module code.
Key Results
The following results were recorded in our internal lab and staging environments after fully adopting the platform:
| Metric | Observed Change |
| --- | --- |
| Deployment reliability | Rose to around 95% success rate (up from roughly 70% under manual processes) |
| Infrastructure provisioning time | Dropped from hours or days to under 15 minutes thanks to Terraform automation |
| Deployment frequency | Grew from weekly to multiple releases daily |
| Configuration drift incidents | Nearly eliminated through GitOps continuous reconciliation |
| Pre-production vulnerability detection | 80% of issues identified before reaching staging |
| Manual kubectl operations | Practically eliminated for routine deployments |
Challenges and Lessons Learned
Working through the CNCF ecosystem highlighted the risk of onboarding too many overlapping tools too soon. The key takeaway was letting architectural needs drive tooling choices and postponing additions like OpenTelemetry until the platform had stabilized.
Keeping a clean separation between infrastructure, platform, and application layers was critical for long-term maintainability. Early on, tightly coupling tools such as Argo CD and Istio with application code created unnecessary complexity; this was later addressed by reorganizing repositories into distinct folders.
GitOps greatly improved consistency and traceability but brought synchronization challenges during repository restructuring, which were overcome using Argo CD app-of-apps patterns and application health checks.
Shifting security scans earlier in the pipeline, by running Trivy and KubeSec right after the build step, sped up feedback and cut down on late-stage failures.
Conclusion
This architecture demonstrates how Kubernetes and CNCF tools can be woven together to create a secure, automated, and scalable platform — where the true value lies in how deployment, security, and observability function as a unified whole. The guiding design principles are to define clear layer boundaries early on, bake security in from the start, and adopt GitOps with Argo CD from day one. Looking ahead, planned improvements include multi-cluster management using Argo CD ApplicationSets, tighter policy enforcement with Kyverno, deeper zero-trust networking through Istio, and integrating distributed tracing via OpenTelemetry into the observability stack.