Introduction
When it comes to data engineering, most AI use cases focus on one core task: repairing broken pipelines. A typical scenario involves an engineer opening Claude Code, pasting in some error logs, and letting the system generate a pull request automatically.
The language we use here matters a great deal. When teams talk about “self-healing” systems, they’re really describing something more accurately called “self-managing” systems. What truly sets successful AI implementations apart isn’t how well humans step in to fix things—it’s how rarely they need to.
What every data team ultimately wants is a system where pipelines and workflows run smoothly without anyone needing to intervene. But there are significant obstacles standing between where we are today and that ideal outcome.
AI agents need context to do their job effectively. A pipeline might break because of a temporary glitch, a change in an upstream schema, or even something completely out of anyone’s control—like someone accidentally deleting a table. Experienced engineers already know how to troubleshoot these situations, but agents don’t have that background knowledge.
We also need to rethink our current workflows. The traditional approach of creating a new branch, merging changes, and re-running everything feels outdated when working with AI agents. Unless we’re willing to let agents handle merging pull requests on their own, we’ll need to fundamentally change how we operate.
There’s also the challenge of how data differs from code. While tools like Lake FS have tried to bring Git-style version control to data, they haven’t caught on widely. Even concepts like zero-copy cloning, which I’ve covered extensively, remain underappreciated. The gap between managing code and managing data is still too large for many teams to bridge.
In this post, we’ll examine seven key barriers preventing today’s data stacks from achieving fully self-healing or autonomous pipelines.
Let’s get started!
Barrier 1 | Understanding Context and Recalling Past Failures
Data pipelines can fail for countless reasons, and the ability to fix them at all is an absolute must for any AI solution. We generally see four main categories of failures:
- Infrastructure problems
- Code bugs
- Data quality issues
- Temporary or external service failures
Resolving these issues typically requires deep knowledge of the specific system. For instance, imagine that Acme Corp’s Kubernetes cluster can only be accessed by Mr. Bob, who alone holds a special authentication key stored in AWS Secrets Manager behind an unconventional header. An AI system won’t know about Bob’s key and therefore won’t be able to fix the cluster issue.
Or consider Analyst Sophie at Widgets Incorporated, who understands that sales figures come in multiple currencies and has a workaround to adjust them upward by 10% compared to prior reports. An AI wouldn’t grasp this business logic.
The AI might also be unaware that retrying the internal API between 2:47 AM and 3:12 AM reliably resolves certain failures.
These may sound like extreme examples, but they highlight a crucial point: the know-how needed to resolve these issues often lives only in people’s heads. Simply collecting metadata context—lineage, logs, source code, documentation—is important, and AI is surprisingly capable at figuring out problems from those sources alone.
If you work in data, you’ve probably heard someone say something like:
“How was I supposed to know that?”
Ultimately, only people truly understand the hidden complexities of a system.
Barrier 2 | Infrastructure That Can Adapt
Zooming in on infrastructure problems specifically, I want to introduce a concept I call “Elastic Infrastructure.” This goes beyond simple scalability—it means infrastructure comes with an API that allows programmatic control.
A standard EC2 instance wouldn’t qualify as elastic because it can’t scale indefinitely.
Similarly, a Kubernetes cluster running on a restricted, locked-down server wouldn’t be considered elastic in a cloud context because there’s no API available for managing it.
The reason this matters is straightforward: AI systems need direct access to infrastructure to recover from failures effectively.
SaaS vendors stand to benefit greatly from this shift. By definition, SaaS providers remove infrastructure management burdens from data teams—for a price. This aligns well with AI automation, though it runs into limitations related to Barrier 6, which we’ll discuss later.
Barrier 3 | Data Quality and Operational Agents
Pete from Finance has accidentally overwritten the US Supply and Operations Planning spreadsheet again. Now the international forecasts are wrong, and your pipeline is broken. The tables us_forecast_dec_v1 and forecasts_agg are empty or outdated.
Your AI confirms the connectors are working fine, but there’s simply no data to process. It’s stuck.
So what should we do? Let’s run through some options:
- Option 1: Have the AI generate fake forecast numbers
- Option 2: Let the AI insert fabricated forecasts into your data warehouse and rerun the Google Sheet pipeline afterward
- Option 3: Have the AI tell Pete to upload the correct forecasts
- Option 4: Maintain a pool of on-call human helpers who can chase Pete down in person until he manually fixes the issue
There’s no good answer here. All the choices range from problematic to absurd. In fact, Option 4 barely needs AI at all—it just requires basic teamwork.
Data quality remains the single biggest concern for data engineers. Teams should be asking during interviews: “How reliable is your data?” It has such a massive impact on daily life that it’s surprising it doesn’t get more attention.
That said, operational agents still have their uses. Simple mistakes—like a new $10 million deal that should actually be $1 million—can often be caught and corrected automatically by an agent with the right Salesforce API access, followed by a pipeline restart.
Barrier 4 | Version Control for Data
This brings up an important question: Should AI agents be allowed to modify production systems directly?
If you’ve ever worked with multiple Salesforce environments, you understand the frustration. But this feature exists precisely to prevent situations like the one described above.
You know, imagine the sales rep just closed a major deal worth $10 million. In that scenario, it makes much more sense for the agent to edit the staging version of Salesforce rather than the live Production one, doesn’t it? It’s a safer, smarter way to work.

The diagram above gives an overview of how this process would actually function with a “git-for-data” method. A more straightforward example is shown below.

In both scenarios, the AI needs a separate branch to perform its tasks. This branch requires zero-copy clones of the data, relies on a git-for-data approach, and you need a way to efficiently swap in the updated data at the end.

Without this kind of structure, it’s hard to imagine how AI can be trusted to reliably fix issues without creating a governance disaster where it has direct write access to production data.
On this front, companies like Snowflake are in a strong position because they’ve supported features like zero-copy cloning for a long time. Motherduck also offers this capability. However, the most promising solution is probably Iceberg.
Iceberg supports time travel, rollback, and git-for-data. Companies like Bauplan have developed compute engines built around Iceberg, creating a much more AI-friendly environment. AI could be a massive driver for Iceberg adoption.
Barrier 5 | Widespread Adoption Across the Industry
Self-healing architecture runs into a challenge when it comes to interoperability.
Fivetran and dbt made a lot of noise about open data infrastructure in 2025—but this isn’t the same as open source data infrastructure. Instead, it refers to an approach better described as Modular Data Architecture, where different functions use different tools. An example is shown below.

There’s no point in having a self-healing architecture if the underlying components don’t support it. Service providers need to offer APIs that align with all the principles outlined in this paper, along with their own self-healing capabilities, for these patterns to work.
For instance, consider a silent failure in an ELT provider where the sub-schema changes—the columns and types stay the same, but the values shift. Maybe now there are currencies reported in Yen alongside USD, but the currency and local_value columns remain unchanged.
The right move might be to update the ELT job in its staging environment, verify the rest of the pipeline using that staging data, swap in the corrected data, and then finally replace the erroneously succeeding ELT job.
Many ELT tools simply don’t provide the APIs needed for this kind of functionality. However, if you were using a Python script you controlled yourself—no problem at all. This will put enormous pressure on today’s ETL vendors to adapt their structures or risk becoming obsolete.
This is a significant barrier between today’s modular systems and truly self-healing autonomous architecture. The only other path would be for each system to become independently self-healing—the idea being that if every part of a system can heal itself, then the whole system can too.
Barrier 6 | Agent Sandboxes and New Orchestrators
The logical place to run agents that fix things is within an orchestration tool.
This is because the orchestration tool has several things the agent needs.
- The ability to run any code and replay any DAG with any set of arbitrary parameters
- Connections to the different parts of the system the agent may need (remember, an orchestrator orchestrates, so it has access to things)
- Built-in alerting, with monitoring, recovery, and scalable infrastructure
However, there’s one enormous problem—and that’s security.
Companies like Cloudflare have developed agent sandboxes. This is because models like Fable (which was recently banned) need sandboxes, as they can break out. This is especially true when under attack from prompt injection.

Legacy orchestration tools simply aren’t designed to handle agents this way. The security risks are immense. Not to mention, AI workloads could interfere with data workloads!
It’s clear that agents will need access to orchestration frameworks. Whether that’s OpenAI and Anthropic providing an orchestrator, next-generation orchestrators with built-in agent sandboxes, or some form of interoperability between the two—something has to give here. Because of security.
Barrier 7 | Standards for Proxy Servers and Agent Definition
One approach to security is to set up a proxy service for agents. Instead of storing secrets directly in the sandbox, the agent has access to a defined set of tools/MCPs.
The proxy service is then the only component with access to external systems. This means that even if the agent falls victim to a prompt injection attack, the damage it can do is limited by the endpoints in the MCPs it has access to.

What this proxy service should look like isn’t obvious. MCP is big. Cloudflare released Code Mode. If you need to access multiple different endpoints, how MCP Servers should be configured isn’t straightforward or obvious.
Open standards should prevail—any agent looking to interact securely with multiple systems would benefit, from a security perspective, from interacting with a proxy service. These exist today, but only in private SaaS tools like Foundry.
Frameworks for designing agents would also need to emerge. In the example above, a single agent requiring integration with hundreds of systems may not be feasible, as the context needed to access hundreds of MCPs may be too large.
Putting It All Together | A Single Pane of Glass for AI
Collectively, achieving the above would allow data teams to build out a single pane of glass for AI.
- Context: provides agents with the information needed to solve any problem
- Elastic infrastructure: provides the foundation for fixing pipelines
- Quality Data: eliminates the human side of data inputs
- Git for Data: creates reliability and trust in AI
- Mass Adoption: prevents industry collapse
- Agent Sandboxes and New Orchestrators: remove legacy architecture
- Proxy Servers: do their best to ensure security
This single pane of glass would allow AI Agents to operate securely. They would execute when needed and have the context to accomplish what they need to.
Core data primitives like git-for-data, elastic infrastructure, and support throughout the ecosystem would turn this from a theoretical idea into a practical reality.
Data teams looking to implement autonomous architecture will put significant pressure on existing vendors to support interoperability.
This will accelerate consolidation, as traditional walled gardens like Salesforce, SAP, and ServiceNow roll out their own agentic products and data studios, capable of controlling the end-to-end process without providing interoperability.



