Evaluating AI Agent Retrieval Approaches For Kubernetes Bug Resolution

Posted on May 8, 2026
by Brandon Foley

CNCF projects highlighted in this post

I’ve been incorporating AI coding agents into my everyday engineering routine and wanted to gauge how effectively they handle genuine, real-world bugs.

To find out, I designed a series of structured experiments using bug reports pulled from the Kubernetes repository. The goal was to see whether these agents could deliver accurate, thorough fixes with no human guidance inside a massive, multi-million-line codebase. My starting hypothesis was straightforward: success would come down to retrieval. Whether through retrieval-augmented generation (RAG) or filesystem search, a model that locates the relevant code should be able to craft the correct fix.

That hypothesis only partially held up. Even when agents identified the correct files, they frequently struggled to link changes across multiple files, misjudged the actual scope of the bug, or produced fixes that made sense in isolation but were wrong in the broader context. The real bottleneck wasn’t just locating code — it was reasoning about that code within the full picture.

A flow chart showing the journey from input, through agent sessions, to output

The setup

I pulled open pull requests from the Kubernetes GitHub repo — real bugs being actively addressed by real contributors. I extracted only the issue description (not the PR description, not the diff, nothing that would reveal the solution) and fed each issue to three distinct agent configurations:

RAG Only: Hybrid retrieval over an indexed copy of the Kubernetes codebase via the KAITO RAG Engine (Qdrant), combining BM25 keyword matching with embedding-based semantic search. KAITO also offers an auto-indexing controller, which is ideal for indexing enormous git repos with support for incremental indexing. Results are merged and ranked before being returned as context snippets. No local files or web access — the agent only sees retrieved chunks.
Hybrid (RAG + Local): Same RAG index, but also equipped with a full local clone of the Kubernetes repository. The agent must begin with RAG for discovery, then can read local files for precision.
Local Only: Full clone, grep, find, cat. No RAG, no web. The agent navigates the codebase through direct filesystem traversal.

Each agent operated in a completely isolated session. Same model (Claude Opus 4.6), same timeout (5 minutes), same output format. The only variable was how they could access code.

One critical constraint: the RAG and Hybrid agents were instructed to run RAG queries before producing any fix. Early experiments revealed that without this requirement, hybrid agents would bypass RAG entirely when the issue referenced specific filenames, undermining the entire purpose of the comparison.

The test cases (Benchmark)

I assembled a benchmark from real, in-flight bug fixes in the Kubernetes repository. Each test case is an open pull request where the agent receives only the issue description and is asked to produce a patch. They span kubelet, scheduler, networking, storage, and apps — ranging from a one-line guard clause to a 900-line multi-file refactor.

PR	Bug	SIG	Size	Files
#134540	SubPath volume mount race condition (EEXIST)	sig/storage	XS	3
#138211	Image pull record poisoning (preloaded images)	sig/node	M	5
#138244	NUMA topology stall on GB200 systems (O(2^n))	sig/node	L	2
#136013	OOM score panic on zero memory capacity	sig/node	S	2
#138269	Scheduler SchedulingGates missing Pod/Update event	sig/scheduling	S	2
#138158	ServiceCIDR deletion race in allocator	sig/network	M	3
#135970	Missing initializer in StatefulSet slice builder	sig/apps	M	4
#138000	Windows kube-proxy stale remote endpoint cleanup	sig/network	XL	5
#138191	Container status sort by attempt (clock rollback)	sig/node	M	5

Each generated diff was compared against the corresponding pull request diff and graded across five dimensions (0–4 each):

Files: Did the diff touch the correct files?
Location: Was the fix applied in the correct function or layer?
Mechanism: Did the fix preserve correct system-level invariants and layering, rather than just patching symptoms?
Tests: Were existing tests updated or meaningful new tests added?
Completeness: Did the patch propagate changes across all required call sites and dependent paths?

This evaluation uses the PR diff as the reference implementation. That’s not a perfect ground truth — PRs reflect iterative discussion and may not represent the only or final correct solution. Here, it serves as a proxy for “what human contributors converged on under review constraints.”

Timing

Every session had a 5-minute ceiling. Here’s how long each actually took:

PR	RAG	Hybrid	Local
#134540	55s	3m00s	5m00s
#138211	2m30s	3m30s	2m30s
#138244	41s	44s	2m00s
#136013	31s	55s	41s
#138269	44s	1m12s	1m23s
#138158	1m41s	1m48s	52s
#135970	1m23s	2m43s	1m32s
#138000	1m28s	4m03s	4m57s
#138191	1m34s	3m54s	2m52s
Average	1m16s	2m25s	2m24s

RAG is consistently the fastest. It doesn’t read files, navigate directories, or wait for grep results. It queries the index, receives snippets, and generates. Average wall-clock: 1 minute 16 seconds. This suggests retrieval

While this approach cuts down on the effort of exploring the codebase, it also limits how deeply the agent can investigate the overall structure.

On average, the Hybrid method is the slowest, clocking in at 2 minutes and 25 seconds. It requires a mandatory RAG-first step, which adds delay before any local file exploration can start. For particularly complex pull requests like #138000 and #138191, Hybrid took over 3–4 minutes. The back-and-forth switching between tools ends up consuming more time than the actual reading of files.

Local exploration matches Hybrid’s average speed, but for entirely different reasons. Hybrid spends its time running RAG queries and then reading specific files. Local, on the other hand, burns through time with broad searches—grepping, locating files, and navigating directories. How much time this takes depends heavily on how clearly the issue description points the agent in the right direction.

Token economics

Token consumption isn’t just about how much code the agent reads—it’s primarily driven by how many times it calls the model.

Claude’s API is stateless, meaning every call re-sends the entire conversation history. This makes the number of calls the main factor multiplying your costs.

Here are a few key terms used below:

Calls: The number of times the model is invoked during a session
New Tokens: Tokens introduced for the first time (retrieved code, file contents, prompts)
Output: Tokens produced by the model in response
Cached (replay): Tokens re-sent on every call because the full conversation is replayed each time
Total: The sum of all tokens processed throughout the session

Averages across all runs:

Three key patterns emerge:

Hybrid is the most expensive approach—not because it reads more content, but because it makes the most model calls. The loop between RAG queries and local exploration creates repeated round-trips, and each one replays an ever-growing context window.
RAG and Local end up at similar total costs, but for opposite reasons. RAG pulls in more new context through retrieval, while Local burns tokens on exploratory calls. The totals happen to converge.
Fewer calls beats fewer tokens. Sessions that stayed under roughly 5 calls were consistently cheaper, no matter which approach was used.

Across every run, the number of model calls was the single biggest driver of both cost and latency.

What I actually learned

Agents fix locally, not systemically

Agents excel at fixing the specific problem right in front of them, but they struggle to reason about the broader system.

PR #134540 illustrates this well. The correct fix required preserving an error (%w) so the caller could handle it, and then updating the caller to tolerate that specific case. Every agent instead swallowed the error at the source. The end result was functionally similar but architecturally incorrect.

This pattern is consistent: agents address symptoms in isolation but miss the contracts and relationships between components.

Scope discovery is the real bottleneck

The most common failure mode wasn’t wrong fixes—it was incomplete ones.

Agents frequently fixed the “main” bug but overlooked adjacent changes that were also needed:

On #138191, RAG-only fixed one sort implementation but missed a second one.
On #138000, all approaches fixed the core bug but missed required changes in proxier.go and related integration logic.
On #134540, Local and Hybrid noticed a partial fix already in the codebase and stopped early, missing the second required change.

The recurring pattern: agents don’t ask “what else needs to change?” They stop as soon as the immediate issue appears resolved. This is especially visible in multi-file changes, where agents fix the local bug but fail to propagate required updates across integration points.

Retrieval changes discovery, not reasoning

RAG meaningfully affects how agents locate code, but it doesn’t change how they reason about it.

Enforcing RAG usage improved results in some cases. On #138211, mandatory retrieval pushed the agent to discover the policy evaluation layer before implementing a fix, leading to a better architectural choice.

But the limitation remains: once the relevant code is found, the agent still reasons locally. Retrieval helps with navigation, not with understanding system-wide implications.

Agents prefer adding over reusing

When given a choice, agents tend to introduce new abstractions rather than reuse existing ones.

On #138191, the correct fix used the existing RestartCount field. All agents instead introduced a new Attempt field. Functionally correct, but architecturally heavier.

This pattern appeared repeatedly. Agents don’t reliably recognize when the system already contains the concept they need. They solve the problem in isolation rather than integrating with existing design. This also shows up in tests—agents tend to add new tests rather than update existing ones, even when behavior changes require modifying assertions.

Issue quality dominates everything

Well-specified issues flatten the differences between approaches.

On PRs like #136013 and #138269, the issue descriptions named the exact file, function, and expected behavior. All approaches converged to high scores, with Local slightly ahead simply due to speed and direct access.

When the issue effectively acts as a spec, retrieval method matters far less. When the issue is vague, performance diverges significantly.

The bottom line

These results come from large codebases (millions of lines), where correctness depends on system-wide understanding rather than local edits.

They are not about raw model capability, but about how agent workflows behave under scale.

Indexing helps, but doesn’t solve scope

Semantic search (RAG) improves discovery in large repositories, but does not ensure coverage of all affected components.

Workflow matters more than model choice

Hybrid setups perform best only when retrieval is enforced. Otherwise, agents default to local reasoning and miss broader context.

Scope discovery is the core bottleneck

Agents reliably fix visible issues, but fail to consistently identify all dependent changes across the system.

This is the dominant failure mode in multi-file and integration-heavy changes.

Issue quality is the strongest lever

Well-defined bug reports reduce performance variance more than tooling or retrieval strategy.

Skills can help, but they don’t remove the system problem

This experiment did not use any explicit coding “skills” or curated agent playbooks—only prompts and tool access.

In principle, structured skills (such as repo exploration strategies or architectural summarization) could improve performance by guiding agents toward better system-level reasoning.

However, in large, evolving codebases, these skills would need to be continuously maintained and updated to stay aligned with the actual repository structure. That makes them less like a one-time improvement and more like an additional system to operate and maintain.

As a result, while skills may improve outcomes, they do not remove the core bottleneck observed in this study: scope discovery and system-level reasoning under scale.

The benchmark used Claude Opus via OpenClaw with KAITO RAG Engine indexing the Kubernetes/Kubernetes repository at HEAD. All sessions ran in isolated subagents with no shared state, April 2026. Ground truth was the actual PR difference from each open pull request.

Carter

Approach	Calls	New Tokens	Output	Cached (replay)	Total
RAG	4	44k	2k	141k	187k
Hybrid	8	27k	3k	234k	264k
Local	6	23k	4k	162k	189k

Top Posts

General Dynamics Fires Back: DISA’s Enclave Cloud Expansion Sparks Contract Clash

Wireless Logic Bolsters US IoT Reach with Strategic SIMETRY Acquisition

The Blackout Test: Crucial Mistakes I Made With Backup Power (And How You Can Avoid Them)

Evaluating AI Agent Retrieval Approaches for Kubernetes Bug Resolution

General Dynamics Fires Back: DISA’s Enclave Cloud Expansion Sparks Contract Clash

Hidden Fallout: The Lingering Echoes of the State Department RIF

Chaos in the Cloud: Flipkart’s Wild Ride Through KubeCon 2026

Beyond Hype: How Azure Databricks Quantifies Real Business Wins

Senate Targets TRICARE Pharmacy Audit Amid Conflict of Interest Fears

Beyond the Ruling: Navigating the Future After the Supreme Court’s Landmark Decision

General Dynamics Fires Back: DISA’s Enclave Cloud Expansion Sparks Contract Clash

Wireless Logic Bolsters US IoT Reach with Strategic SIMETRY Acquisition

The Blackout Test: Crucial Mistakes I Made With Backup Power (And How You Can Avoid Them)

Unlock Peak Performance: Your Command Protocol for GPT-5.6 Synergy

Iran Hunts US Military Phones: CrashStealer macOS Malware & the CVD Blueprint Unmasked

Benjamin Cowen’s Bold Q4 Forecast: Bitcoin’s $44K Bottom is Imminent!

Hidden Fallout: The Lingering Echoes of the State Department RIF

Dell XPS 16: The Sleek Powerhouse Redefining Creativity for Pros

Trending

General Dynamics Fires Back: DISA’s Enclave Cloud Expansion Sparks Contract Clash

Wireless Logic Bolsters US IoT Reach with Strategic SIMETRY Acquisition

Latest Posts

Not More Data, but Better World Models – Unite.AI

OpenAI Is Hiring Head of Preparedness, Amid AI Cyberattack Fears

Subscribe to Updates

Top Posts

Evaluating AI Agent Retrieval Approaches for Kubernetes Bug Resolution

The setup

The test cases (Benchmark)

Timing

Token economics

What I actually learned

Agents fix locally, not systemically

Scope discovery is the real bottleneck

Retrieval changes discovery, not reasoning

Agents prefer adding over reusing

Issue quality dominates everything

The bottom line

Indexing helps, but doesn’t solve scope

Workflow matters more than model choice

Scope discovery is the core bottleneck

Issue quality is the strongest lever

Skills can help, but they don’t remove the system problem

Related Posts