Auto-diagnosing Kubernetes Alerts With HolmesGPT And CNCF Instruments

Posted on April 21, 2026
by Grace Park and Ihyeok Music, DevOps Engineer, STCLab SRE Staff

CNCF tasks highlighted on this publish

What a two-person SRE group discovered constructing an AI investigation pipeline. Spoiler: the runbooks mattered greater than the mannequin.

Why we constructed this

At STCLab, our SRE group helps a number of Amazon EKS clusters working high-traffic manufacturing workloads. We’ve received the complete observability stack in place: OpenTelemetry feeding into Mimir, Loki, and Tempo. Robusta OSS enriches Prometheus alerts with error logs, Grafana hyperlinks, and group mentions earlier than dropping them into Slack.

So the info was by no means the issue. The issue was what occurred subsequent. Each alert meant the identical drill: verify the pod, question Prometheus, dig via Loki, pull traces, attempt to correlate. Fifteen to twenty minutes, each single time. We wished that first move to occur robotically and present up in the identical Slack thread.

HolmesGPT: Letting the LLM determine what to analyze

Alerts Pipeline

We went with HolmesGPT (CNCF Sandbox) due to the way it works: the ReAct sample. The LLM reads an alert, picks a device, reads the outcome, then decides what to verify subsequent. If a pod restarts, it would begin with the exit code, pull Loki logs throughout clusters via VPC peering, then have a look at CPU stress in Prometheus. The trail isn’t scripted it relies on what the mannequin really finds.

That issues in our case, as a result of not each namespace appears to be like the identical. Some have the complete image: centralized logs, distributed traces, the works. Multi-tenant workloads typically have none of that; for these namespaces, it’s kubectl and Prometheus solely. We seize these variations in markdown runbooks, every with a metadata header:

## Meta
scope: namespace= solely
instruments: kubectl, prometheus, loki, tempo
warning: some containers excluded from log assortment → use kubectl logs

Holmes calls fetch_runbook early in its investigation. The metadata tells it which instruments can be found and which of them to skip.

Making it work with Robusta

Our customized playbook is about 200 traces of Python. It covers what HolmesGPT doesn’t.

Robusta posts the alert to Slack earlier than Holmes is completed investigating, so our playbook has to seek out the appropriate thread after the very fact and publish outcomes as a reply. When Prometheus fires one alert per pod throughout a rollout, the playbook fingerprints on the workload degree and suppresses repeats for half-hour. And since Robusta routes to completely different Slack channels by namespace, the playbook replicates that mapping to seek out the place to publish.

Runbooks modified every thing

We began by specializing in mannequin choice. What really decided investigation high quality was the runbooks.

With out runbooks, the mannequin simply guesses. It’d verify Istio metrics in namespaces that haven’t any sidecars, or question Loki the place nothing is being collected. Finally it loops, burns via its step finances, and comes again with “I need more information.”

What fastened this wasn’t a greater mannequin. It was telling the mannequin what not to do. As soon as we added exclusion guidelines to our runbooks (“no Loki, no Tempo, no Istio here; use kubectl and PromQL only”), wasted device calls dropped from 16 to 2 per investigation.

We ran a managed comparability to verify this: the identical ClickHouse handshake alert, examined 4 methods. With runbooks, Holmes matched the recognized error sample in 3 to 4 device calls and used the remainder of its finances to confirm. With out runbooks, it chased three totally completely different hypotheses (proxy scaling, schema mismatch, port misconfiguration) and burned via 20+ steps earlier than reaching a conclusion. Identical mannequin, similar alert. The runbook didn’t hand it the reply. It simply narrowed the search house sufficient {that a} 12-step finances was loads.

We now preserve seven runbooks, organized by namespace and alert kind. When an investigation comes again improper, the primary query we ask is “does the runbook cover this?” Not “do we need a better model?”

Screenshots: Same alert, same model. Left: with runbook guidance. Right: without runbooks.

Identical alert, similar mannequin. Left: with runbook steerage. Proper: with out runbooks.

The mannequin journey

We examined seven fashions throughout self-hosted and managed internet hosting.

Self-hosted got here first, working on Spot GPUs managed by KubeAI (CNCF). The 7B mannequin couldn’t produce legitimate device calls. The 9B mannequin’s considering mode clashed with the ReAct loop and returned empty responses. A 14B appeared promising, however Spot evictions stored killing our runs, and chilly begins took 5 to eight minutes whereas Karpenter spun up nodes.

Then we tried managed APIs via VPC endpoints, which retains cluster information inside our infrastructure. Most fashions didn’t work; a number of choked on HolmesGPT’s immediate caching markers. Just one mannequin household handed every thing we would have liked: Korean output, Slack formatting, runbook retrieval, and cross-cluster log correlation. We additionally contributed a three-line upstream repair for pod identification authentication (PR #1850, merged).

As we speak we run a hybrid setup: self-hosted in staging, managed API in manufacturing. Switching between them is one YAML block:

modelList:

 major:

   mannequin: "provider/model-name"  # swap supplier and mannequin ID

   api_base: " # managed API or self-hosted

   temperature: 0

Price comes out to about $0.04 per investigation, roughly $12 a month. Pipeline, playbook, runbooks — all unchanged no matter backend.

What really mattered

Some numbers. Workload-level deduplication takes round 40 uncooked day by day alerts right down to about 12 distinctive investigations. Engineers learn a threaded abstract in below two minutes as a substitute of spending 15 to twenty on guide triage. Roughly 40% of investigations resolve on their very own: OOMKilled, ImagePullBackOff, and different recognized patterns the place Holmes matches a runbook and the basis trigger is clear.

Right here’s what we’d inform one other group beginning this.

Runbooks over fashions. We ran a managed take a look at the place the identical mannequin scored 4.6 out of 5 with runbooks and three.6 with out, on the very same alert. The exclusion guidelines we wrote into our runbooks moved the needle greater than any mannequin swap ever did.

Glue code is actual work. That 200-line playbook handles timing, dedup, routing, and thread matching. HolmesGPT handles reasoning. You want each.

Design for mannequin migration. We’ve swapped backends 3 times now with out touching the pipeline. The playbook is the secure core. The mannequin is the half you substitute.

What’s subsequent: we’re Inspektor Gadget (CNCF) to feed eBPF-level community metrics — TCP retransmits, connection latency — into the identical pipeline via Prometheus. The structure stays the identical. Holmes simply will get higher information to work with.

About STCLab: STCLab builds high-performance on-line site visitors administration software program. NetFUNNEL (Digital Ready Room) and BotManager (bot mitigation) serve hundreds of thousands of concurrent customers throughout 200 nations.

Questions? grace@stclab.com | stclab.com

Top Posts

Whales Are Swallowing Bitcoin’s Plunge

Cybersecurity M&A in May 2026: 26 Deals Unpacked

VA EHR Expansion Accelerates: Four New Deployments Signal Nationwide Digital Health Push

Auto-diagnosing Kubernetes alerts with HolmesGPT and CNCF instruments

VA EHR Expansion Accelerates: Four New Deployments Signal Nationwide Digital Health Push

Unleashing Speed at Scale: KubeVirt Performance Reimagined with VirtBench

AWS Weekly Roundup: BYOM for Amazon RDS for SQL Server, AWS IoT Device SDK for Swift, and More

Transforming Cloudflare Threat Signals into Instant WAF Protections

Beyond One Data Center: Mastering Geo-Distributed AI with the k0smos Platform

Disabled Federal Workers Take Legal Action Against Justice Department Over Denied Accessibility Rights

Whales Are Swallowing Bitcoin’s Plunge

Cybersecurity M&A in May 2026: 26 Deals Unpacked

VA EHR Expansion Accelerates: Four New Deployments Signal Nationwide Digital Health Push

Decades of Remote Work: The 7 Laptop-Bag Essentials I Never Leave Home Without

AI Agents Outpace Traditional Search by 48x in Groundbreaking Harvard-Perplexity Study

Chromatix: A Differentiable, GPU-Accelerated Wave-Optics Library

Xiaomi’s MiMo Stuns with Breakthrough: Outpacing ChatGPT and Claude by 15 Times

Taming the Flood: Strategies to Lighten Your Tier 1 Burden

Trending

Whales Are Swallowing Bitcoin’s Plunge

Cybersecurity M&A in May 2026: 26 Deals Unpacked

Latest Posts

Not More Data, but Better World Models – Unite.AI

OpenAI Is Hiring Head of Preparedness, Amid AI Cyberattack Fears

Subscribe to Updates

Top Posts

Auto-diagnosing Kubernetes alerts with HolmesGPT and CNCF instruments

Why we constructed this

HolmesGPT: Letting the LLM determine what to analyze

Making it work with Robusta

Runbooks modified every thing

The mannequin journey

What really mattered

Related Posts