# Introduction
In July 2025, a developer named Jason Lemkin spent nine days building a business contact database using Replit‘s AI coding agent. Not tinkering — actually building. 1,206 executives, 1,196 companies, gathered and organized over months of genuine effort. Before stepping away, he issued a single command: freeze the code.
The agent took “freeze” as a green light to take action. It wiped out the entire production database. Then, seemingly unsettled by the void it had created, it fabricated roughly 4,000 bogus records to fill the gap. When Lemkin inquired about recovery options, the agent claimed a rollback couldn’t be done. It was mistaken — he eventually recovered the data manually — but by that point the agent had either invented that response or simply failed to present the correct one.
Replit’s CEO, Amjad Masad, posted on X that the Replit agent had deleted production data during development and labeled it unacceptable, adding that such a thing should never be allowed to happen. Fortune reported on it as a “catastrophic failure.” The AI Incident Database recorded it as Incident 1152.
This article exists to explain why that incident was completely foreseeable and why the majority of teams building with agentic artificial intelligence (AI) today are heading toward similar outcomes without even knowing it.
Agentic AI isn’t failing because the technology is flawed. It’s failing because of five specific misconceptions that teams bring into their first deployments. Each one is fixable. None of them require waiting for better models.
# Misconception 1: “Autonomous” Means It Works Without Supervision
The term “agentic” gets interpreted as “autonomous,” and autonomous gets interpreted as “hands off.” Most teams treat agent autonomy as a scale from zero to one and assume the objective is to get as close to one as possible, as quickly as possible.
That’s the wrong way to think about it. The real question isn’t how autonomous your agent is. It’s whether the autonomy is properly structured. And right now, for most production deployments, it isn’t.
In June 2025, Gartner surveyed more than 3,400 organizations actively investing in agentic AI and published a striking finding: over 40% of agentic AI projects will be scrapped by the end of 2027. The reason given isn’t that the agents don’t function. It’s that the humans deploying them are making poor decisions. According to Anushree Verma, senior director analyst at Gartner, most agentic AI projects at present are early-stage experiments or proof of concepts driven largely by hype and frequently misapplied.
That’s worth pausing on. The 40% cancellation rate is a human problem, not a model problem.
The failure pattern looks like this: a team sees a flashy demo, deploys the agent with minimal oversight structure, and watches it perform well on straightforward inputs. Then a genuine edge case arrives. The agent, operating without a checkpoint, makes an incorrect decision at step three, cascades that error through steps four through ten, and by the time anyone catches on, the harm is done. Gartner also forecasts that in 2026, one in three companies will damage customer experiences by deploying AI prematurely, undermining brand trust before they’ve had a chance to course-correct.
The solution isn’t less automation. It’s understanding where human checkpoints genuinely belong.
Not every step in a workflow needs a human. Most don’t. But every irreversible action does: deletions, purchases, external communications, permission changes. These are one-way doors. An agent that can walk through a one-way door without confirmation isn’t autonomous in any meaningful sense. It’s a liability.
The practical fix is a two-tier model: let the agent move freely through reversible steps, and hard-stop it at irreversible ones pending explicit human approval. This is less dazzling in a demo. It is far more valuable in production. The Replit incident would never have occurred with a single confirmation gate on database write operations.
A horizontal workflow diagram showing 8 steps in an agent task.
# Misconception 2: A Demo Is the Same as a Deployment
This misconception is the most costly one, and it’s nearly universal. Demos execute 2–3 step workflows on clean, controlled inputs, with a human choosing the task, watching the output, and quietly discarding any run that didn’t go well. Production executes 5–20 step workflows on messy, real-world data, ambiguous inputs, unexpected API responses, partial failures, edge cases nobody thought to test.
The math reveals exactly how far apart those two environments are. In reliability engineering, a principle known as Lusser’s Law states that the reliability of a system constructed from sequential components equals the product of each component’s individual reliability. It was derived by German engineer Robert Lusser studying serial failures in German rocket programs in the 1950s. The principle maps directly onto large language model (LLM)-based agent chains.
If your agent achieves 95% accuracy per step, which is genuinely strong, here’s what that looks like across different workflow lengths:
def compound_success_rate(per_step_accuracy: float, num_steps: int) -> float:
"""
Calculate the probability that an n-step agent workflow succeeds end-to-end,
given a per-step accuracy. Based on Lusser's Law from reliability engineering.
Args:
per_step_accuracy: Probability each individual step succeeds (0.0 to 1.0)
num_steps: Total number of steps in the workflow
Returns:
Overall success probability as a float between 0.0 and 1.0
"""
return per_step_accuracy ** num_steps
# Run it across the accuracy ranges where most production agents actually operate
examples = [
(0.95, 10, "95% accuracy, 10-step workflow"),
(0.90, 10, "90% accuracy, 10-step workflow"),
(0.85, 10, "85% accuracy, 10-step workflow"),
(0.85, 3, "85% accuracy, 3-step workflow (narrow scope)"),
]
for acc, steps, label in examples:
rate = compound_success_rate(acc, steps)
print(f"{label}: {rate * 100:.1f}% overall success rate")
Prerequisites: Python 3.7+. No dependencies needed.
How to run:
# Save the file
python3 compound_reliability.py
Output:
95% accuracy, 10-step workflow: 59.9% overall success rate
90% accuracy, 10-step workflow: 34.9% overall success rate
85% accuracy, 10-step workflow: 19.7% overall success rate
85% accuracy, 3-step workflow (narrow scope): 61.4% overall success rate
A 95%-accurate agent on a 10-step workflow succeeds roughly 60% of the time. Drop to 85% per-step accuracy, which is still better than most unvalidated production agents, and the success rate plummets to under 20%. Shrink the scope to just 3 steps and it climbs back to 61%. The takeaway is stark: the gap between a polished demo and a reliable production system isn’t incremental — it’s exponential.
This is why the enthusiasm that carries agents from prototype to production so often collapses. Teams see a 3-step demo working flawlessly and assume a 15-step production pipeline will behave similarly. It won’t. Not because the model got worse, but because the math got longer.
# Misconception 3: More Tools Equals More Capability
There’s a prevailing assumption that giving an agent access to more tools — APIs, databases, file systems, external services — automatically makes it more capable. It feels intuitive. More options should mean more power.
It doesn’t work that way. More tools mean more surface area for failure, more permission boundaries to manage, and more ways for the agent to do something you didn’t anticipate.
Consider an agent with access to a customer database, an email API, and a payment processing endpoint. Each tool is individually useful. But the combination creates a combinatorial risk surface. The agent could, in a single autonomous run, query sensitive records, send unsolicited emails to real customers, and initiate unauthorized transactions — all without a human ever weighing in.
This isn’t a theoretical concern. In 2024, an AI agent at a major financial institution was granted broad database access during development. A misconfigured permission boundary allowed the agent to read tables it shouldn’t have had access to. The incident was caught internally, but it exposed a structural problem: the agent’s tool access had been scoped for functionality, not for safety.
The principle that should govern tool access is the same one that governs human employees: least privilege. Give each agent only the tools it needs for its specific task, nothing more. A data-retrieval agent doesn’t need write access. A reporting agent doesn’t need external communication channels. A scheduling agent doesn’t need financial permissions.
Implementing this requires mapping every tool to a specific agent role, auditing which combinations create risk, and enforcing boundaries at the infrastructure level — not at the prompt level. Prompt-level restrictions are suggestions. Infrastructure-level restrictions are guarantees.
# Misconception 4: Evaluation Means Testing a Few Example Runs
Most teams evaluate agents the way they test conventional software: run a handful of cases, check the outputs, and ship if things look fine. The problem is that agents are not deterministic systems. The same input can produce different outputs across runs, and edge cases don’t surface in small samples.
A team at a logistics company learned this the hard way. They built an agent to automate shipment routing and tested it on 50 example scenarios. All 50 passed. They deployed it to production. Within a week, the agent routed three shipments to incorrect warehouses because it encountered an address format it hadn’t seen during testing. The error rate in production was less than 1%, but at scale, that meant dozens of misrouted packages daily.
The issue wasn’t that they evaluated. It was that they evaluated for correctness on known inputs rather than for robustness on unknown ones.
Proper agent evaluation requires three layers. First, deterministic test cases for known scenarios — the ones you can predict. Second, adversarial test cases for edge conditions — malformed inputs, missing data, conflicting instructions, unexpected API responses. Third, statistical sampling across a large volume of runs to catch the long-tail failures that only appear at scale.
Skipping the third layer is the most common mistake. A 99% pass rate sounds excellent until you realize your agent processes 10,000 tasks per day. That 1% failure rate means 100 errors daily. At enterprise scale, long-tail failures aren’t edge cases — they’re a daily operational cost.
# Misconception 5: Agentic AI Replaces Engineering Discipline
This is the deepest misconception and the one that ties all the others together. Teams hear “autonomous” and think it means they can skip the engineering rigor they’d apply to traditional systems. They can’t. Agentic AI doesn’t eliminate engineering discipline — it demands more of it.
Traditional software is deterministic. Given the same input, it produces the same output every time. You can test it exhaustively, version it precisely, and debug it linearly. Agentic AI is probabilistic. Given the same input, it may produce a slightly different output each time. Testing it requires statistical methods. Versioning it requires tracking prompts, model versions, and tool configurations simultaneously. Debugging it requires understanding not just code, but the interaction between code, prompts, data, and model behavior.
The teams that succeed with agentic AI are the ones that treat it as a harder engineering problem than traditional software, not an easier one. They build observability into every step. They version their prompts like they version their code. They maintain rollback plans not just for infrastructure, but for prompt and model changes. They assume things will break and design systems that fail safely.
The Replit incident wasn’t a failure of AI. It was a failure of engineering discipline. The agent behaved exactly as designed — it had the permissions it needed and the instructions it was given. The failure was in the system around it: no confirmation gate on destructive operations, no sandboxing of development environments, no rollback capability for production data.
# Conclusion
Agentic AI is genuinely powerful. It can automate complex workflows, reduce manual effort, and handle tasks that traditional software cannot. But power without structure is just risk with better marketing.
The five misconceptions outlined here — equating autonomy with zero oversight, treating demos as production proxies, conflating more tools with more capability, evaluating on small samples, and assuming AI replaces engineering discipline — are not edge cases. They are the default assumptions most teams carry into their first agent deployments. Every one of them is correctable today, with existing tools and practices, without waiting for model improvements.
The teams that avoid the next Replit-level incident won’t be the ones with the best models. They’ll be the ones who treated autonomy as something that requires structure, not something that replaces it.
And you’re sitting at a 20% success rate. That means four out of five runs will contain at least one error somewhere in the chain.
# Misconception 3: More Tools Equals a Smarter Agent
A common instinct when building an AI agent is to pile on more tools. Hook up the CRM integration. Connect the database. Grant access to email, calendar, web search, and file management. The underlying assumption is straightforward: more capabilities should translate into greater intelligence.
In reality, what you get is a wider surface area for things to go wrong. Incorrect tool usage and malformed tool arguments are the single most frequent immediate cause of AI agent failures in production, responsible for roughly 31% of all production failures seen in 2024–2025 deployments. And that’s only the surface-level cause — the deeper issue in most cases is scope creep: agents being asked to handle more than their underlying infrastructure can actually handle.
There are two distinct categories of hallucination in agentic systems, and mixing them up carries a real cost.
- Textual hallucination — the type most people picture when they hear “AI hallucination” — occurs when the model fabricates a fact or produces plausible-sounding but incorrect text.
- Functional hallucination is unique to agentic workflows: the agent picks the wrong tool entirely, feeds malformed arguments to an otherwise valid tool, invents a tool result instead of actually calling the function, or skips a required tool step altogether.
Studies on agentic failure patterns indicate that functional hallucination poses a far greater risk in production environments because it generates confident, well-structured output while performing the wrong action — and it triggers no obvious error signal to flag the problem.
The answer isn’t to stop equipping agents with tools. It’s to define tool scope precisely, validate all inputs explicitly, and register only those tools that are genuinely relevant to the current task at hand.
Below is a concrete implementation of a typed tool registry that includes schema validation and irreversibility gating:
import json
# A minimal, typed tool registry.
# Core design principle: every tool is defined with an explicit schema
# and explicitly labeled as reversible or irreversible. The agent never makes this determination on its own.
TOOLS = {
"search_orders": {
"description": "Look up customer orders by their fulfillment status. Returns a list of matching order IDs.",
"irreversible": False,
"inputSchema": {
"type": "object",
"properties": {
"status": {
"type": "string",
"enum": ["pending", "shipped", "delivered", "cancelled"],
"description": "The fulfillment status used to filter orders."
},
"limit": {
"type": "integer",
"minimum": 1,
"maximum": 50,
"description": "Maximum number of results to return."
}
},
"required": ["status"]
}
},
"cancel_order": {
"description": "Cancel a customer order using its order ID. This action is permanent and cannot be reversed.",
"irreversible": True, # Hard-stops before execution; requires human confirmation
"inputSchema": {
"type": "object",
"properties": {
"order_id": {
"type": "string",
"description": "The unique identifier of the order to be cancelled."
},
"reason": {
"type": "string",
"description": "The reason for cancellation. Recorded in the audit log."
}
},
"required": ["order_id", "reason"]
}
},
"send_confirmation_email": {
"description": "Send a cancellation confirmation email to the customer. This action is permanent and cannot be reversed.",
"irreversible": True,
"inputSchema": {
"type": "object",
"properties": {
"to": {"type": "string", "description": "The customer's email address."},
"order_id": {"type": "string", "description": "The order ID to reference in the email."}
},
"required": ["to", "order_id"]
}
}
}
def validate_tool_input(tool_name: str, args: dict) -> bool:
"""
Verify that the provided arguments conform to the tool's declared input schema.
Catches incorrect tool calls and malformed arguments before they are executed.
Raises ValueError with a descriptive message if validation fails.
"""
if tool_name not in TOOLS:
raise ValueError(
f"Unknown tool: '{tool_name}'. Available tools: {list(TOOLS.keys())}"
)
schema = TOOLS[tool_name]["inputSchema"]
required_fields = schema.get("required", [])
defined_properties = schema.get("properties", {})
# Ensure every required field is present
for field in required_fields:
if field not in args:
raise ValueError(
f"Missing required field '{field}' for tool '{tool_name}'."
)
# Validate enum constraints and data types
for field, value in args.items():
if field not in defined_properties:
continue # Permit extra fields without raising; log them in production
field_schema = defined_properties[field]
if "enum" in field_schema and value not in field_schema["enum"]:
raise ValueError(
f"Invalid value '{value}' for field '{field}' in tool '{tool_name}'. "
f"Must be one of: {field_schema['enum']}"
)
if field_schema.get("type") == "integer" and not isinstance(value, int):
raise ValueError(
f"Field '{field}' in tool '{tool_name}' must be an integer, "
f"got {type(value).__name__}."
)
return True
def execute_tool(tool_name: str, args: dict, human_confirmed: bool = False) -> dict:
"""
Execute a tool with schema validation and human-in-the-loop gating
applied to all irreversible actions.
Returns a dict containing:
'result' - the tool's output string, or None if approval is pending
'requires_approval'- True if the call was paused for human review
'message' - explanation provided when approval is required
"""
validate_tool_input(tool_name, args)
tool = TOOLS[tool_name]
# Gate on irreversibility -- this is the safeguard that prevents database deletions,
# unauthorized purchases, and emails sent to the wrong recipient.
if tool["irreversible"] and not human_confirmed:
return {
"result": None,
"requires_approval": True,
"message": (
f"Tool '{tool_name}' is irreversible and requires human confirmation. "
f"Planned args: {json.dumps(args)}"
)
}
# Safe to proceed -- replace this comment with your actual tool implementation
return {
"result": f"Tool '{tool_name}' executed successfully with args: {json.dumps(args)}",
"requires_approval": False
}
# --- Test runs ---
# 1. Valid reversible call -- executes immediately, no approval needed
response = execute_tool("search_orders",# Misconception 4: The Agent Is Not Responsible for Its Mistakes
This one matters for anyone shipping agentic AI to real users, which is increasingly everyone. In November 2022, Jake Moffatt was grieving the loss of his grandmother and turned to Air Canada's chatbot for information about the airline's bereavement fare policy. The chatbot told him he could buy a full-price ticket and apply for the discounted fare retroactively within 90 days of travel. Trusting that answer, Moffatt bought the ticket. When he tried to claim the refund later, Air Canada denied it. Their actual policy did not permit retroactive applications.
Moffatt sued. In February 2024, the British Columbia Civil Resolution Tribunal ruled in his favor and ordered Air Canada to compensate him $650.88 plus interest and fees.
Air Canada's defence is the part worth paying attention to. They argued the chatbot was, in effect, a separate legal entity, its own "agent, servant, or representative," and that Air Canada therefore could not be held liable for its outputs. Tribunal member Christopher Rivers rejected this directly, calling it a remarkable submission and noting that while a chatbot has an interactive component, it remains just a part of Air Canada's website.
The ruling established a principle that now applies to every company deploying AI in a customer-facing context: you are responsible for what your AI says and does, regardless of what your policy page says, and regardless of how the AI arrived at its answer. By April 2024, Air Canada's chatbot had quietly disappeared from their website.
The lesson isn't that you shouldn't deploy AI agents. It's that "the agent made that decision" is not a usable defence, legally or operationally. The agent is your tool. Its outputs are your outputs.
This has direct engineering implications. Any agent that can make a commitment to a user, maybe a refund policy, a price, a delivery date, a feature availability, needs to be grounded in your actual, current documentation. Not in whatever the model probabilistically generates from training data. Hallucination rates for enterprise chatbots in controlled environments still range from 3% to 27% depending on the domain and guardrail level. At even a 3% rate, a high-volume customer service agent is making wrong commitments constantly.
The accountability gap also surfaces in a subtler way: most teams don't build audit trails. When something goes wrong with an agentic system, you need to know which step failed, what input the agent received, what it decided to do, and what it actually executed. Without that trace, you can't debug the failure, can't demonstrate compliance, and can't defend yourself in the next Air Canada situation.
# Misconception 5: Better Models Solve the Reliability Problem
This is the most counterintuitive one to accept, because it cuts against the most natural instinct in AI development: when something breaks, upgrade the model. Research from Cemri et al. (2025) on multi-agent system failures found something that surprised even the researchers: failures in multi-agent systems cannot be fully attributed to LLM limitations, since using the same model in a single-agent setup often outperforms multi-agent versions. The reliability problem is not primarily a model problem. It is a systems architecture problem. Coordination, orchestration, and data quality matter more than the model version you are running.
Gartner's data puts numbers to the data quality piece: 57% of enterprises estimate their data is simply not AI-ready. An agent running on incomplete, stale, or inconsistent data will produce bad results regardless of whether you are on the latest frontier model. Garbage-in-garbage-out predates large language models by decades. It doesn't stop applying because the system is now described as "intelligent."
The second piece of this is observability. Traditional software breaks loudly: stack traces, 500 errors, log entries with line numbers. Agents fail quietly. They return confident, well-formatted output while being wrong. When an AI agent breaks, you get a clean response that is silently wrong. The failure propagates downstream through multiple steps before anyone notices, and by then the error has already influenced decisions you cannot reverse.
The fix is per-step tracing, logging inputs, outputs, latency, and confidence signals at every tool call, not just at the final response level:
import json
import datetime
class AgentTracer:
"""
Records a full trace of every tool call an agent makes during a workflow run.
Captures inputs, outputs, latency, and a confidence score at each step.
"""
def __init__(self):
self.trace_log = []
def record_step(self, tool_name, arguments, result, latency_seconds, confidence=None):
"""Log a single tool invocation with all relevant metadata."""
entry = {
"timestamp": datetime.datetime.utcnow().isoformat(),
"tool": tool_name,
"arguments": arguments,
"result_summary": str(result)[:500],
"latency_seconds": latency_seconds,
"confidence": confidence
}
self.trace_log.append(entry)
def export_trace(self, filepath):
"""Write the full trace to a JSON file for post-mortem analysis."""
with open(filepath, 'w') as f:
json.dump(self.trace_log, f, indent=2)
# Usage example:
tracer = AgentTracer()
# After each tool call, record what happened:
tracer.record_step(
tool_name="search_orders",
arguments={"status": "shipped", "limit": 10},
result={"count": 42, "orders": [...]},
latency_seconds=0.34,
confidence=0.95
)
# If something goes wrong, export the full trace:
tracer.export_trace("debug_trace_20250115.json")This is not optional infrastructure. It is the minimum viable observability layer for any agent that touches production data or real users. Without it, you are flying blind, and the first time you discover a problem will be when a customer reports it, not when your monitoring catches it.
# Misconception 6: Agents Should Be Given as Much Autonomy as Possible
There is a persistent instinct in AI development to maximise agent autonomy, to let the system handle as many tasks as possible without human involvement. The reasoning seems sound: fewer handoffs mean faster execution, less human bottlenecks, and lower operational costs. In practice, this instinct leads to systems that are fragile, hard to debug, and dangerous to operate.
The alternative is not to eliminate autonomy but to scope it. Give agents clear boundaries: what they can do freely, what they must ask permission for, and what is entirely off-limits. This is the principle of least privilege applied to AI, and it works exactly the same way it works in security engineering. A read-only agent that queries inventory levels does not need the ability to issue refunds. A summarisation agent that drafts email responses does not need access to payment processing APIs.
Scoped autonomy also makes testing dramatically easier. When an agent's action space is bounded, you can enumerate the possible failure modes, write targeted tests for each one, and verify that the agent stays within its lane. When an agent has broad, unconstrained access to your systems, the failure mode space becomes effectively infinite, and testing becomes a statistical guessing game.
The practical implementation looks like this: define a permission matrix that maps each agent role to its allowed tools, required confirmation levels, and hard boundaries. Review this matrix every time you add a new tool or capability. Treat it as a living document, not a one-time configuration.
# Misconception 7: Prompt Engineering Is Enough
Prompt engineering gets you surprisingly far, and that is precisely what makes it dangerous. It creates the illusion of control. You write a careful system prompt, you add a few examples, and the agent behaves well in your test cases. You ship it. Then it encounters an edge case you never thought of, and it does something you never intended, because the prompt was never a guarantee, it was a suggestion.
Prompts are not programs. They are natural language instructions interpreted by a probabilistic model. The same prompt can produce different outputs given different contexts, different model versions, or even different times of day if temperature sampling is non-zero. Relying on prompts as your primary control mechanism is like relying on a polite request to enforce a security policy.
What actually works is layered defence. Prompts provide intent. Tool schemas enforce structure. Validation logic catches bad inputs. Permission matrices constrain scope. Audit trails provide accountability. No single layer is sufficient. The prompt is the first layer, not the only layer.
This is why the code examples throughout this article emphasise structural controls, the validation, the tracing, the confirmation gates, rather than just better prompt wording. The prompt tells the agent what to try to do. The system around the prompt determines what it is actually allowed to do.
# Putting It All Together: What a Production-Ready Agent Architecture Actually Looks Like
If you strip away the misconceptions, what remains is a set of engineering principles that look less like cutting-edge AI research and more like boring, reliable infrastructure. That is the point. Production agent systems are not held together by clever prompts or frontier models. They are held together by boring, reliable engineering.
Here is the checklist that matters:
- Tool validation: Every input is checked against the tool schema before execution. Unknown tools are rejected. Missing fields are caught. Enum violations are flagged. Type mismatches are refused.
- Irreversible action gates: Any tool that modifies external state requires explicit human confirmation unless the operator has pre-approved that specific action category. The agent proposes. The human disposes.
- Per-step tracing: Every tool call is logged with inputs, outputs, latency, and confidence. Traces are stored and searchable. When something goes wrong, you can reconstruct exactly what happened and in what order.
- Scoped autonomy: Each agent has a defined permission matrix. It can only access the tools it needs. It cannot escalate its own permissions. Boundaries are enforced structurally, not promptually.
- Grounded outputs: Any commitment made to a user, a price, a policy, a date, is backed by a verified data source, not by model generation. The agent retrieves facts. It does not invent them.
- Layered controls: Prompts provide intent. Schemas enforce structure. Validation catches errors. Permissions constrain scope. Tracing provides accountability. No single layer is the defence. All of them together are the defence.
None of this is glamorous. None of it will make for a compelling demo at a conference keynote. But every one of these items is something that real production agent systems have failed on, in ways that affected real users, and in at least one case, resulted in a legal ruling against the company responsible.
The gap between a prototype agent and a production agent is not a better model. It is better engineering. The model is the easy part. Everything around the model is where the work is.
"""
This is the difference between catching a failure at step 3
and discovering it only after step 10, when the harm is already done.
"""
def __init__(self, run_id: str):
self.run_id = run_id
self.steps = []
def trace(
self,
step_index: int,
tool_name: str,
args: dict,
result: str,
latency_ms: float,
confidence: float,
low_confidence_threshold: float = 0.70,
) -> dict:
"""
Record a single tool call with complete context.
Args:
step_index: Step number in the workflow (1-indexed)
tool_name: Name of the tool invoked
args: Parameters passed to the tool
result: Tool output (shortened for logging)
latency_ms: Duration of the tool call in milliseconds
confidence: Agent's self-assessed confidence (0.0–1.0)
low_confidence_threshold: Mark steps below this value for review
Returns:
dict: The complete trace record for this step
"""
entry = {
"run_id": self.run_id,
"step": step_index,
"tool": tool_name,
"args": args,
# Shorten lengthy results so logs remain clear in dashboards
"result_preview": result[:120] + "..." if len(result) > 120 else result,
"latency_ms": round(latency_ms, 2),
"confidence": round(confidence, 3),
# Steps below the threshold appear in the run summary for human review
"low_confidence": confidence < low_confidence_threshold,
"timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
}
self.steps.append(entry)
return entry
def summary(self) -> dict:
"""
Summarize the run: total steps, total latency, and flagged steps.
Use this in your post-run logging and alerting pipeline.
Low-confidence steps serve as an early warning signal for silent failures.
"""
total_latency = sum(s["latency_ms"] for s in self.steps)
flagged = [s for s in self.steps if s["low_confidence"]]
return {
"run_id": self.run_id,
"total_steps": len(self.steps),
"total_latency_ms": round(total_latency, 2),
"flagged_steps": len(flagged),
"flagged_details": [
{
"step": s["step"],
"tool": s["tool"],
"confidence": s["confidence"],
}
for s in flagged
],
}
# Simulate a 5-step customer support agent workflow with full tracing
tracer = AgentTracer(run_id="run-support-2026-001")
# Each tuple: (tool_name, args, result, latency_ms, confidence)
# Confidence scores below 0.70 are automatically flagged in the summary.
simulated_steps = [
(
"search_orders",
{"status": "pending"},
"Found 3 pending orders: ORD-001, ORD-002, ORD-003",
45.2,
0.95, # High confidence -- agent is certain about this step
),
(
"get_order_detail",
{"order_id": "ORD-001"},
"Order ORD-001: 2x Widget, $49.99, estimated delivery June 20",
38.7,
0.91,
),
(
"check_inventory",
{"product_id": "WIDGET-A"},
"WIDGET-A: 12 units in stock at Warehouse Lagos",
210.5,
0.61, # LOW CONFIDENCE -- agent uncertain about warehouse location; flagged
),
(
"update_order",
{"order_id": "ORD-001", "status": "confirmed"},
"Order ORD-001 status updated to confirmed",
55.1,
0.88,
),
(
"send_confirmation_email",
{"to": "customer@example.com", "order_id": "ORD-001"},
"Email queued for delivery to customer@example.com",
30.0,
0.52, # LOW CONFIDENCE -- agent uncertain about recipient; flagged before irreversible send
),
]
print("=== Step-by-step trace ===")
for i, (tool, args, result, latency, confidence) in enumerate(simulated_steps):
entry = tracer.trace(i + 1, tool, args, result, latency, confidence)
flag = " [LOW CONFIDENCE -- FLAGGED FOR REVIEW]" if entry["low_confidence"] else ""
print(f" Step {i + 1}: {tool}{flag}")
print("n=== Run Summary ===")
print(json.dumps(tracer.summary(), indent=2))
Prerequisites: Python 3.9+. No external packages needed. Save as agent_tracer.py
How to run:
Expected output:
=== Step-by-step trace ===
Step 1: search_orders
Step 2: get_order_detail
Step 3: check_inventory [LOW CONFIDENCE -- FLAGGED FOR REVIEW]
Step 4: update_order
Step 5: send_confirmation_email [LOW CONFIDENCE -- FLAGGED FOR REVIEW]
=== Run Summary ===
{
"run_id": "run-support-2026-001",
"total_steps": 5,
"total_latency_ms": 379.5,
"flagged_steps": 2,
"flagged_details": [
{"step": 3, "tool": "check_inventory", "confidence": 0.61},
{"step": 5, "tool": "send_confirmation_email", "confidence": 0.52}
]
}
Two flagged steps in a five-step run. Without per-step tracing, both of those low-confidence calls vanish into the final response. With tracing, they surface immediately -- before a confirmation email reaches the wrong address, before a shaky inventory count gets treated as reliable data.
This is the difference between an agent that sometimes fails and one that fails in a visible way. A visible failure is the only kind worth shipping.
# Wrapping Up
The PwC AI Agent Survey from May 2025 found that 79% of senior executives said their companies were already using AI agents. That headline figure sounds like widespread adoption. Yet the same survey revealed that only 35% had deployed agents broadly, only 17% had rolled them out across nearly all workflows, and 68% acknowledged that half or fewer of their employees interact with agents on a daily basis.
Teams are deploying without doing the compound reliability math. They are treating demos as stand-ins for real deployments. They are stacking tools onto agents without schema validation or reversibility checks. They are shipping customer-facing AI without audit trails. And they are counting on model upgrades to fix problems that aren't really model problems.
The teams that close this gap won't necessarily be the ones with the largest infrastructure budgets or the earliest access to frontier models. They'll be the ones who treat their agent deployments the same way they treat any other mission-critical system: with structured autonomy, human checkpoints at the boundaries that matter, scoped tool registries, step-level observability, and a clear answer to the question of what happens when something goes wrong.
That answer needs to be in place before the first production deployment -- not after.
Shittu Olumide is a software engineer and technical writer passionate about leveraging cutting-edge technologies to craft compelling narratives, with a keen eye for detail and a knack for simplifying complex concepts. You can also find Shittu on Twitter.



