In March, I presented a session at KubeCon + CloudNativeCon Europe 2026 in Amsterdam. Afterwards, the same questions kept coming up on the CNCF Slack and in face-to-face discussions: Why build AI agents using cloud native principles? Which specific CNCF tools provide the core functionality? What is the role of human oversight, and how should teams structure themselves? This post provides a concise overview based on a system we are actively building and implementing at Orange Innovation.
The Background: We are developing an internal security operations platform to safeguard a highly regulated production environment. The system uses the A2A protocol (released in 2025, now managed by the Linux Foundation) for coordination between agents and the MCP standard (under the Agentic AI Foundation) for integrating various environment tools. Falco employs eBPF to intercept system calls on monitored workloads; these events pass through Kafka into an isolation forest model to screen anomalies before they alert the AI agents. Our primary focus is improving the speed of threat detection and response while reducing the manual effort required for rule creation by security experts. These insights are tailored for this scenario, but they are adaptable to any cloud-native infrastructure shared between a security operations center (SOC) and platform engineering teams.
Figure 1: System overview. A Coordinator Agent (LangGraph + A2A) orchestrates four specialised agents: Detect (Falco + ML), Analyse (Threat Analyst), Remediate, and Notify (Mattermost), plus a Human-in-the-Loop branch and a feedback loop that retrains the anomaly model. Adapted from my talk at KubeCon + CloudNativeCon Europe 2026.
Here are five key technical insights we have gathered so far, along with how we structured our team collaboration and community engagement, and a final thought on why the CNCF and Linux Foundation ecosystem is the ideal foundation for this type of architecture.
1. Treat Each Agent as a Separate Kubernetes Workload
We deploy every agent as an independent Kubernetes Deployment, complete with its own resource limits, identity, and restart policy. Most agents utilize LangGraph for their internal reasoning and tool-use loops, though a few are custom-built without a framework for scenarios requiring stricter control. The agent layer functions similarly to a standard microservice mesh, allowing for canary rollouts, Horizontal Pod Autoscaling (HPA), and namespace isolation without any custom engineering. The alternative—running all agents within a single process—might be easier to prototype locally but is unsuitable for production. If one agent hangs due to a model API timeout, it shouldn’t impact the others.
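As a rough illustration of the one-agent-per-Deployment idea, the sketch below builds a Deployment manifest for a single agent as a plain Python dict. The namespace, label values, image name, and resource figures are all placeholders chosen for the example, not values from our platform.

```python
import json


def agent_deployment(name: str, image: str, namespace: str = "soc-agents") -> dict:
    """Build an apps/v1 Deployment manifest for one agent.

    The namespace, labels, and resource figures are illustrative assumptions.
    """
    selector = {"app.kubernetes.io/name": name}
    labels = {**selector, "app.kubernetes.io/part-of": "soc-agents"}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "namespace": namespace, "labels": labels},
        "spec": {
            "replicas": 1,
            "selector": {"matchLabels": selector},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    # One ServiceAccount per agent gives each agent its own identity.
                    "serviceAccountName": name,
                    "containers": [
                        {
                            "name": "agent",
                            "image": image,
                            # Per-agent limits keep one hung agent from starving the others.
                            "resources": {
                                "requests": {"cpu": "250m", "memory": "256Mi"},
                                "limits": {"cpu": "1", "memory": "1Gi"},
                            },
                        }
                    ],
                },
            },
        },
    }


if __name__ == "__main__":
    manifest = agent_deployment(
        "threat-analyst", "registry.example.com/agents/threat-analyst:0.1.0"
    )
    print(json.dumps(manifest, indent=2))
```

Because kubectl accepts JSON as well as YAML, the printed output can be applied directly or committed to the GitOps repository.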
2. Secure Inter-Agent Communication with mTLS Instead of a Service Mesh
A2A messages contain proposed security rules and response actions; our threat model considers these just as sensitive as the data plane itself.
We chose not to implement a service mesh. Instead, cert-manager handles the issuance of unique identities for each agent, and agents perform mTLS directly at their gRPC/HTTP transport layer without sidecars. Cilium serves as the network foundation, and CiliumNetworkPolicy controls which agent identities can access specific MCP servers. This combination (cert-manager + agent-level mTLS + CiliumNetworkPolicy) is significantly simpler to manage than a mesh while providing the same level of security.
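To make "mTLS at the transport layer without sidecars" concrete, here is a minimal sketch using only Python's standard `ssl` module. It assumes the certificate, key, and CA bundle issued by cert-manager are mounted into the pod as files; the function names and any paths passed to them are hypothetical. gRPC and HTTP libraries expose their own credential APIs, so treat this as the shape of the configuration rather than our actual transport code.

```python
import ssl


def require_client_certs(ctx: ssl.SSLContext) -> ssl.SSLContext:
    """Turn a server-side context into a mutual-TLS one."""
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    # Reject any peer that does not present a certificate signed by our CA.
    ctx.verify_mode = ssl.CERT_REQUIRED
    return ctx


def server_context(cert_file: str, key_file: str, ca_file: str) -> ssl.SSLContext:
    """Context for an agent accepting connections from other agents."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    ctx.load_verify_locations(cafile=ca_file)
    return require_client_certs(ctx)


def client_context(cert_file: str, key_file: str, ca_file: str) -> ssl.SSLContext:
    """Context for an agent calling another agent or an MCP server."""
    # PROTOCOL_TLS_CLIENT verifies the server certificate and hostname by default.
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    # Presenting our own certificate is what makes the handshake mutual.
    ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    ctx.load_verify_locations(cafile=ca_file)
    return ctx
```

Both sides trust only the CA bundle passed in, so an agent without a certificate from that CA fails the handshake before any A2A message is exchanged.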
Of all our architectural decisions, A2A is the one I would make again without a second thought. Being open-sourced in 2025 and governed by the Linux Foundation, it isn’t tied to a single framework, allowing organizations to plan for long-term deployments of 3 to 5 years. Using A2A (LF) alongside the CNCF stack (LF) places the entire infrastructure under a single open governance model, which is a significant advantage for procurement in regulated industries.
3. Enforce Agent Safety Using Policy-as-Code, Not LLM Prompts
In our setup, a reviewer agent evaluates whether a proposed action is safe to proceed, such as deploying a detection rule, initiating containment, or modifying a firewall. The temptation is to embed these safety rules within the reviewer’s system prompt. Avoid this approach.
Before reaching the reviewer, a threat-analyst agent categorizes each escalation using the MITRE ATT&CK framework, providing the reviewer with structured data rather than unstructured text. We translated the reviewer’s safety constraints into OPA policies and Kyverno admission rules. The reviewer queries OPA via MCP, receives a definitive verdict, and proceeds accordingly. The reviewer’s prompt itself is intentionally simple and concise. The underlying policy is version-controlled, unit-tested, and reviewed like any other code artifact. If you make only one change to your architecture, make it this one.
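The following sketch shows what "ask OPA, get a verdict" can look like from the caller's side. Our reviewer reaches OPA through MCP; for brevity this example talks to OPA's REST Data API directly, which is a simplification. The policy path `soc/reviewer/allow`, the host name, and the fields of the input document are hypothetical.

```python
import json
from urllib import request


def build_opa_request(opa_url: str, policy_path: str, action: dict) -> request.Request:
    """Build a POST to OPA's Data API with the proposed action as input."""
    body = json.dumps({"input": action}).encode("utf-8")
    return request.Request(
        url=f"{opa_url.rstrip('/')}/v1/data/{policy_path.strip('/')}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_verdict(response_body: bytes) -> bool:
    """Return True only for an explicit allow.

    OPA omits "result" when the rule is undefined, so a missing or
    non-boolean result is treated as a denial (fail closed).
    """
    document = json.loads(response_body)
    return document.get("result") is True


if __name__ == "__main__":
    # Hypothetical structured input, as produced by the threat-analyst agent.
    proposal = {
        "action": "deploy_detection_rule",
        "attack_technique": "T1059",
        "target_namespace": "payments",
    }
    req = build_opa_request("http://opa.example.internal:8181", "soc/reviewer/allow", proposal)
    print(req.get_method(), req.full_url)
    # Sending it would be: request.urlopen(req, timeout=2).read()
```

The point of the design is visible in `parse_verdict`: the reviewer's code path has no room for interpretation, and anything other than an explicit allow is a denial.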
4. Leverage A2A trace_id for Observability and GitOps for Configuration
The A2A protocol includes a trace_id with every task, which serves as the backbone of our observability. Agents produce structured JSON logs containing the trace_id, agent identity, MCP calls made, and LLM token usage. Prometheus collects per-agent metrics (request rates, MCP-call latency, and reviewer auto-execute/auto-reject/escalate ratios). Cilium Hubble provides network flow visibility to verify that the correct pods are communicating with the right services.
During development, when we first needed to explain a specific automated decision to an internal stakeholder, we retrieved all logs associated with that specific trace_id. The entire decision-making process was reviewed in approximately fifteen minutes. Without trace_id propagation through A2A, this investigation would have taken an entire day.
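A minimal sketch of that logging pattern, using only the standard library: a formatter that emits one JSON object per log line, and a helper that pulls every entry for a given trace_id back out. The field names (`agent`, `mcp_calls`, `llm_tokens`) are assumptions made for the example; in practice the filtering step would usually be a query in the log backend rather than a Python loop.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object, carrying the trace fields if present."""

    FIELDS = ("trace_id", "agent", "mcp_calls", "llm_tokens")  # assumed field names

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        for field in self.FIELDS:
            if hasattr(record, field):
                entry[field] = getattr(record, field)
        return json.dumps(entry)


def filter_by_trace(lines, trace_id: str) -> list:
    """Return the parsed entries for one trace_id, skipping malformed lines."""
    matches = []
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(entry, dict) and entry.get("trace_id") == trace_id:
            matches.append(entry)
    return matches


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    log = logging.getLogger("reviewer")
    log.addHandler(handler)
    log.setLevel(logging.INFO)
    # The `extra` dict attaches the trace fields to the record.
    log.info(
        "verdict: escalate",
        extra={"trace_id": "example-trace", "agent": "reviewer", "llm_tokens": 512},
    )
```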
Every agent’s system prompt, tool list, and output schema is defined as a Kubernetes Custom Resource, managed by Argo CD from a Git repository. The reviewer’s policy bundle is stored in the same location. Deploying a change simply requires a pull request that is code-reviewed, audited, and reversible. This is a common failure point for early multi-agent systems: prompts scattered across notebooks and config files until an agent produces an unexpected result.
5. Filter Events with a Classical Anomaly Model Before Reaching the LLM
If every single event triggered the full suite of agents, the cost of the LLM tier would become the primary economic driver of the platform. A scikit-learn Isolation Forest model sits upstream of the agents, evaluating each incoming sample across 17 features in microseconds. Only those scoring above a carefully tuned threshold are forwarded to the agent fan-out stage. The LLM is called exclusively on the narrow subset that appears truly novel, which is precisely the kind of triage work a human detection engineer would historically perform. Both per-event latency and token expenditure remain predictable, and right-sizing the LLM tier becomes a routine capacity-planning exercise.
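The pre-filter can be sketched in a few lines of scikit-learn. This is an illustration under assumptions, not our production pipeline: the baseline is synthetic Gaussian data, the hyperparameters are library defaults, and the threshold in the usage example is picked by hand. One detail worth knowing is that `score_samples` returns lower values for more abnormal samples, so the sketch negates it to get a score where higher means more anomalous.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

N_FEATURES = 17  # feature count from the text; the features themselves are not shown


def train_prefilter(baseline: np.ndarray, seed: int = 0) -> IsolationForest:
    """Fit an Isolation Forest on a baseline of normal feature vectors."""
    model = IsolationForest(n_estimators=100, random_state=seed)
    model.fit(baseline)
    return model


def anomaly_scores(model: IsolationForest, events: np.ndarray) -> np.ndarray:
    """Higher means more anomalous (score_samples is lower for abnormal samples)."""
    return -model.score_samples(events)


def select_for_agents(
    model: IsolationForest, events: np.ndarray, threshold: float
) -> np.ndarray:
    """Keep only the events whose anomaly score exceeds the threshold."""
    return events[anomaly_scores(model, events) > threshold]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    baseline = rng.normal(size=(500, N_FEATURES))  # synthetic stand-in for real traffic
    model = train_prefilter(baseline)

    typical = np.zeros(N_FEATURES)
    unusual = np.full(N_FEATURES, 10.0)
    events = np.vstack([typical, unusual])

    scores = anomaly_scores(model, events)
    threshold = scores.mean()  # hand-picked for the demo, not a tuned value
    forwarded = select_for_agents(model, events, threshold)
    print(f"scores={scores}, forwarded {len(forwarded)} of {len(events)} events")
```

Passing the threshold in as an argument, rather than baking it into the model, is what allows it to be adjusted without retraining or redeploying anything.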
The Isolation Forest retrains on a weekly schedule by design, and shifts in feature distribution are themselves surfaced as a paged Prometheus alert. The anomaly threshold is not a hard-coded constant; it is a policy parameter that the reviewing agent consults at decision time. We can tighten or loosen it under load without redeploying any agents.
Keep the human in the loop — by protocol, not by culture
Every consequential decision terminates in one of three states: auto-execute, auto-reject, or escalate to a human SOC analyst via Mattermost with the full reasoning chain attached and ChatOps commands to approve, dismiss, or investigate inline. The third state is not an error path — it is a first-class output of the reviewer. It is designed to trigger under three conditions: reviewer confidence drops below its threshold; the asset sits on an always-escalate list (control-plane components, identity stores, anything customer-facing or compliance-sensitive); or the proposed action would exceed a configured blast radius.
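The three-state outcome and its escalation conditions are simple enough to express as a pure function. The sketch below is an assumption-laden illustration: the type and field names are invented, blast radius is modelled as a count of affected workloads, and the choice between auto-execute and auto-reject is reduced to a single boolean standing in for the policy verdict. The order of the checks (escalation conditions first) is also a choice made for this example.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    AUTO_EXECUTE = "auto-execute"
    AUTO_REJECT = "auto-reject"
    ESCALATE = "escalate"


@dataclass(frozen=True)
class Proposal:
    asset: str
    confidence: float    # reviewer confidence in [0, 1]
    blast_radius: int    # assumed unit: number of workloads the action touches
    policy_allows: bool  # stand-in for the policy engine's verdict


@dataclass(frozen=True)
class EscalationPolicy:
    min_confidence: float
    max_blast_radius: int
    always_escalate: frozenset


def decide(proposal: Proposal, policy: EscalationPolicy) -> Outcome:
    """Map a proposal to exactly one of the three terminal states."""
    if proposal.asset in policy.always_escalate:
        return Outcome.ESCALATE
    if proposal.confidence < policy.min_confidence:
        return Outcome.ESCALATE
    if proposal.blast_radius > policy.max_blast_radius:
        return Outcome.ESCALATE
    return Outcome.AUTO_EXECUTE if proposal.policy_allows else Outcome.AUTO_REJECT


if __name__ == "__main__":
    policy = EscalationPolicy(
        min_confidence=0.8,
        max_blast_radius=5,
        always_escalate=frozenset({"kube-apiserver", "identity-store"}),
    )
    proposal = Proposal("kube-apiserver", confidence=0.99, blast_radius=1, policy_allows=True)
    print(decide(proposal, policy).value)
```

Because `decide` has no hidden state, the same inputs always produce the same outcome, which is what makes the escalation path testable and auditable.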
“Should this case escalate?” is a deterministic policy verdict, version-controlled in Git, with its own SLO and dashboards. It does not depend on which analyst is on shift. If your human-in-the-loop story is “we’ll add an approval step later” or “the analyst can always intervene,” you don’t really have one yet.
How development and rollout actually go
As we move from development into rollout, the operational model already resembles any Kubernetes platform we have run before. Alerts are structural: policy bundle failed admission, MCP server p99 latency, anomaly-pre-filter drift, A2A queue depth above watermark — not “agent X gave a weird answer.” When an agent regresses during iteration, we treat it like any production microservice regression: roll back the Custom Resource via Argo CD, open a ticket, ship a fix through GitOps. No special agent-incident runbook to invent, and that is the point.
What changes for the SOC team is the nature of their work. Rule authorship has been the structural bottleneck for years; offloading it to the agent layer is the explicit goal. Engineers will curate the reviewer’s safety policy and spot-audit deployed rules instead of writing them. The day-to-day artefacts (CRDs, policies, GitOps pull requests) are ones the SOC and platform teams already know how to handle together.
How the work is organised across teams and the community
None of this works without joined-up teams. Three groups touch this system every week: the SOC, who own detection outcomes and the reviewer’s safety policy; the platform team, who own the cluster, GitOps pipeline, and agent runtime; and a small AI engineering group, who own the agent contracts and the anomaly model. We deliberately kept the contracts between them narrow and machine-readable (CRDs, OPA bundles, A2A schemas), so a change in one area never depends on a meeting in another.
The operational gain we are after is not just speed — it is capacity. Scaling detection coverage used to mean hiring more analysts to write more rules. With the agent layer, it means deploying more agent replicas and tightening the reviewer’s policy bundle — a meaningful lever on what was a headcount-bound problem, and time back for analysts on cases that genuinely need a human.
Externally, the system also exists in a community. The CNCF Landscape and the maturity signals attached to it (Sandbox, Incubating, Graduated, plus adoption and governance data) actively shape our technical choices: when we evaluated network policy enforcement, identity issuance, or anomaly tooling, the Landscape gave us a vendor-neutral starting point and the project maturity told us what we could responsibly run in a regulated production environment. The same lens decides where we contribute back. We track upstream issues in the A2A and MCP repositories, file what we hit, and feed lessons back into CNCF working groups. KubeCon talks and CNCF Slack threads are part of the loop, not afterthoughts. Picking cloud native and LF-governed protocols means we are not the only ones improving the substrate.
Why this stack
If any of this looks tractable on paper, it is because the CNCF and broader Linux Foundation projects we built on are simply that good. They let us treat agentic AI as a normal cloud native workload rather than a special case. Kubernetes makes deployment boring in the best way. Falco gives us a syscall-level detection substrate we did not have to write. Cilium and Hubble take identity-aware network policy seriously. cert-manager turns per-agent mTLS into a configuration. OPA and Kyverno make policy-as-code the default. Argo CD makes GitOps for agent CRDs a one-day implementation. Prometheus is the metrics layer the cloud native world runs on.
On the agentic-AI side, AAIF gives MCP a neutral home and A2A is governed under the Linux Foundation. LangGraph is the agent runtime we settled on after trying alternatives, but it is not the only path: frameworks like CrewAI, AutoGen, and LlamaIndex sit in the same space, and for some of our agents we deliberately keep the logic hand-written without any framework at all when we want full control over state machines, retry semantics, and tool-call sequencing. The protocols (A2A, MCP) are what we treat as the durable interfaces; the runtime is a choice we can revisit.
The two questions I keep getting are why cloud native at all for agentic AI, and where the human and the team sit in the loop. Agentic AI inherits all the operational problems cloud native already solved (identity, isolation, policy, observability, GitOps); inventing parallel substrates is wasted motion. And the human path and the team contracts have to be normal outputs of the system, not exceptions bolted on. Find me on the CNCF Slack, or at KubeCon.
About the author
Willem Berroubache is Lead Security Architect at Orange Innovation, where he leads cloud native security architecture for Orange’s 5G core. He is a CNCF Golden Kubestronaut and was selected in 2026 for the Orange Expert Group in Security. He has spoken at KubeCon + CloudNativeCon Europe 2026 in Amsterdam.



