We built a text-to-SQL tool. The initial version relied on a single-agent setup: a user submits a natural-language question, it passes through a validation checkpoint, and then a SQL query is produced. Once development was finished, we began testing and quickly discovered that a one-agent design wasn’t sufficient for our needs. It performed adequately on straightforward queries requiring just one or two operations, but it began to break down when users posed intricate questions involving multiple steps and schema exploration.
After extensive testing, we recognized that no single agent could handle every responsibility effectively. The agent was attempting to interpret the user’s intent, align it with the database schema, craft a working SQL statement, and verify its own result — all within the same pass. By the third retry attempt, the context window was so cluttered with prior failures and self-corrections that the agent began contradicting its own outputs.
Following that experience, we chose to rethink our architecture from the ground up and adopted a multi-agent approach, assigning dedicated agents to specific tasks instead of expecting one agent to manage the entire pipeline.
In This Article
- Why single agents fall short on complex tasks
- What a multi-agent architecture looks like in real-world practice
- How to design specialized agents and connect them into a pipeline
- Orchestration and state management using LangGraph
- How agent nodes are implemented under the hood
- The feedback mechanism and retry strategy
- What tends to break in a production environment
- When this approach isn’t the right fit
- Conclusion
Why a Single Agent Falls Short
When you first start working with LLMs, there’s a common assumption that a sufficiently powerful model, paired with a well-crafted prompt, can handle anything. That holds true for basic tasks, but as complexity increases, you’re no longer just asking the model to reason harder — you’re asking it to juggle multiple competing mental models within the same context window at once.
Consider the text-to-SQL challenge as a concrete example. Translating a natural-language question into an accurate SQL query requires several distinct steps:
- Interpret what the user is really asking for (intent decomposition)
- Align those intents with the actual database schema (schema awareness)
- Produce syntactically correct SQL (query generation)
- Verify the resulting SQL for correctness (validation)
Intent decomposition is fundamentally an NLP problem — the agent must grasp the user’s question and unpack its underlying meaning. Schema mapping demands precise, factually grounded knowledge of the database so that each intent is tied to the right tables and columns. Query generation calls for thorough familiarity with both the schema and SQL syntax. Validation requires a skeptical, adversarial mindset to scrutinize the generated query for flaws.
A lone agent attempting all four responsibilities will deliver mediocre results across the board and excel at none. The mistakes tend to be deceptively simple: a join condition that’s nearly correct but mismatched for the user’s scenario, a filter that overlooks a time-based constraint, or an aggregate that appears sound until you test it against edge cases and it falls apart.
Every failed attempt lingers in the model’s memory. It sees its previous tries and makes progressively smaller tweaks rather than stepping back to reconsider the problem from scratch. After three retries, you’re no longer getting a fresh attempt — you’re getting a slightly polished version of a flawed first draft.
What Multi-Agent Actually Means in Practice
A multi-agent system is a set of agents, each assigned a well-defined responsibility, coordinated by an orchestrator that controls the flow between them. Instead of one agent doing everything at a mediocre level, you deploy several agents that each specialize in a single task.
In practice, there are two primary patterns for building a multi-agent system. The first is sequential, where agents execute in a defined order, each consuming the output of its predecessor as its input. The second is parallel, where agents run at the same time on independent sub-tasks and an aggregator merges their results.
Most production systems blend both patterns. Our text-to-SQL pipeline is sequential by necessity: you can’t map a schema before parsing the intent, and you can’t validate a query before one exists. But within that sequential flow, you might run parallel lookups against different portions of the schema at the same time.
One important thing to note is that the orchestrator itself can either be an LLM-based agent or a set of deterministic routing rules. For a pipeline where the flow is well-defined and predictable, deterministic routing is almost always the smarter choice. There’s no reason to involve an LLM to decide “run schema mapping after intent parsing” — that decision is fixed and never varies.
Designing the Agents
For a text-to-SQL system, the way you break down agents maps directly to the failure modes you observe with a single-agent approach. We posed the question “extracting the top customers in each category and comparing their purchase trend against the category average over the last 12 months,” and it helped us define every agent in our pipeline.
The Intent Parser Agent takes the raw user question and breaks it apart into distinct, individual intents. The question about top customers contains at least three separate intents: rank customers within each category, calculate their purchase trend, and compare that trend against a category-level baseline. A single agent handling this inline tends to partially decompose the intent and then jump straight to generating SQL for an incomplete interpretation. This agent’s sole responsibility is decomposing the user’s query into its constituent intents.
The Schema Agent receives the decomposed intents and maps them to concrete table names, column names, join conditions, and data types from your database schema. This is precisely where a single-agent system breaks down: without explicit schema grounding, the agent fabricates column names that sound plausible but don’t actually exist. For instance, customer_purchase_value sounds like a legitimate column in a transaction database, but it doesn’t exist in ours. Isolating this as a dedicated agent with the schema injected directly into its context resolves the issue cleanly.
The Query Builder Agent takes the schema-mapped intents and produces the SQL. By the time this agent runs, the ambiguity has already been resolved
And it’s performing a focused generation task, not interpretation. The output quality gap compared to a single agent handling everything together is substantial.
The Critic Agent is the one most teams overlook and later wish they hadn’t. The query builder may produce a query that is syntactically sound, but that doesn’t mean it’s semantically accurate. The critic agent takes the generated query and evaluates it on its own against the underlying intents: does it genuinely address what was requested? Are there edge cases being overlooked? Is the time window filter aligned with the user’s specification? You can’t reliably do this in the same context where the query was generated, because the model latches onto what it just produced and tends to justify its own output instead of questioning it. A separate agent, armed with a clean context and an explicitly adversarial system prompt, catches things the builder never could.
The Response Agent presents the formatted query to the user, appends a plain-language explanation of what it does, and calls out any assumptions made along the way.
Each of these agents operates with a distinct system prompt, a distinct role, and most importantly, its own fresh context window.
Orchestrating with LangGraph
LangGraph works well here because it hands you direct control over state transitions and routing logic. You sketch out the graph, place the nodes, and define precisely how information moves between them. There’s no magic hiding anything from you.
A shared state object travels through the full pipeline. Every agent reads from it and writes results back:
from typing import TypedDict, List, Optional
class PipelineState(TypedDict):
user_query: str
intents: Optional[List[dict]]
schema_mapping: Optional[dict]
generated_query: Optional[str]
critique: Optional[dict]
final_response: Optional[str]
retry_count: int
failure_source: Optional[str] # pinpoints which agent triggered a failureThe failure_source field is a critical addition to the agent state. When problems surface in production, you need to know whether the critic is rejecting output because the query builder dropped the ball, or because the intent parser pushed bad data downstream.
Setting up the graph itself is straightforward (simplified for clarity):
from langgraph.graph import StateGraph, END
def should_retry(state: PipelineState) -> str:
critique = state.get("critique", {})
if critique.get("passed"):
return "respond"
if state["retry_count"] >= 3:
return "respond" # deliver the best attempt, avoid an infinite loop
return "rebuild"
graph = StateGraph(PipelineState)
graph.add_node("parse_intent", intent_parser_node)
graph.add_node("map_schema", schema_agent_node)
graph.add_node("build_query", query_builder_node)
graph.add_node("critique", critic_agent_node)
graph.add_node("respond", response_agent_node)
graph.set_entry_point("parse_intent")
graph.add_edge("parse_intent", "map_schema")
graph.add_edge("map_schema", "build_query")
graph.add_edge("build_query", "critique")
graph.add_conditional_edges("critique", should_retry, {
"rebuild": "build_query",
"respond": "respond"
})
graph.add_edge("respond", END)
app = graph.compile()The retry_count hard limit is essential here. Without one, a pipeline that keeps tripping the critic’s checks will spin forever. Three retries gives you four total passes, and if the system still hasn’t delivered a satisfactory query, it surfaces the best effort for manual review rather than hemorrhaging tokens in a loop.
How Agent Nodes Work Under the Hood
The graph scaffold above lays out the skeleton, but the real labor happens inside each node.
Each node is simply a function: it reads the current state, processes it, and returns whichever fields it needs to update. Below is what the intent parser and critic agent look like in real code (simplified for clarity):
from langchain_core.messages import SystemMessage, HumanMessage
from langchain_google_vertexai import ChatVertexAI
import json
llm = ChatVertexAI(
model="gemini-2.5-flash",
temperature=0,
max_tokens=None
)
def intent_parser_node(state: PipelineState) -> dict:
system = SystemMessage(content="""You are an intent decomposition specialist.
Your sole job is to break a natural language question into discrete analytical intents.
Return a JSON list. Each intent should have: 'description', 'metric', 'filters', 'time_range'.
Do not generate SQL. Do not reference any database schema.""")
human = HumanMessage(content=state["user_query"])
response = llm.invoke([system, human])
intents = json.loads(response.content)
return {"intents": intents}
def critic_agent_node(state: PipelineState) -> dict:
system = SystemMessage(content="""You are a SQL query critic. Your mission is to find flaws.
Given the original intents and the generated SQL query, assess:
1. Does the query address all intents, or only a subset?
2. Are there missing filters, wrong aggregations, or incorrect joins?
3. Does the time range handling match the original request?
Return JSON with: 'passed' (bool), 'issues' (list of strings), 'severity' (low/medium/high).""")
human = HumanMessage(content=f"""
Original intents: {json.dumps(state['intents'], indent=2)}
Generated query: {state['generated_query']}
""")
response = llm.invoke([system, human])
critique = json.loads(response.content)
return {
"critique": critique,
"retry_count": state["retry_count"] + 1,
"failure_source": "query_builder" if not critique["passed"] else None
}A couple of things worth calling out: the intent parser’s system prompt explicitly instructs do not generate SQL and do not reference any database schema. That constraint is deliberate — without it, the model will gradually drift and start guessing schema details even when schema information hasn’t been provided to it.
The Feedback Loop and Retry Wiring
The conditional edge in the graph routes the pipeline based on what the critic returns. But when the query builder gets sent a retry, it needs to know why it’s being asked to rebuild — simply bouncing it back with the same state isn’t enough.
When the critic flags a query as inadequate, the builder on the following pass needs to see the critique (which lives in one of the state keys). So the state carries it forward:
def query_builder_node(state: PipelineState) -> dict:
previous_critique = state.get("critique") system_content = """You are a SQL query generation specialist.
Given schema-mapped intents, generate a single correct SQL query.
Return only the SQL. No explanation."""
# On a retry, attach the critic's feedback
if previous_critique and not previous_critique.get("passed"):
issues = "n".join(previous_critique.get("issues", []))
system_content += f"nnPrevious attempt
Caching and Query Reuse
In production, many users ask the same underlying questions or very similar ones. Adding a query cache that maps normalized natural-language queries to either validated SQL or to the execution result can cut LLM costs by half or more.
However, cache invalidation is harder than it looks. When the underlying data is refreshed, cached results become stale. Tie your cache TTL to the data pipeline’s update frequency rather than using a fixed lifespan.
Evaluation and Feedback Loops
Connecting Tableau to LLM-generated SQL creates a dangerous gap: the SQL executes successfully but returns misleading results because the generated query doesn't match what the user actually meant.
What helps: log every generated SQL, its execution results, and whether the dashboard rendered something that looks reasonable. Over time, this builds a dataset that the critic model can train on or be evaluated against.
Additionally, consider lightweight human review for a sample of queries — especially when the LLM has low confidence. A simple Approve/Reject workflow in your BI tooling can catch errors that automated validation misses.
Handling Large Schemas
When your database has hundreds of tables and thousands of columns, the retrieve step becomes the bottleneck. Simple vector search over table descriptions helps, but it can miss relevant tables that use different vocabulary than the user's query.
Solutions that have been proven effective:
- Graph-based table relationships — encode foreign-key graphs and join paths so that if a relevant table is retrieved, its connected tables are included automatically.
- Query-log priors — use historical query logs to boost tables that are frequently queried together. Past co-occurrence is a strong signal for relevance.
- Hierarchical schema representation — first retrieve at the schema or subject-area level, then drill down to the relevant tables within that area. This narrows the search space early.
Let the Critic Maintain Its Own Execution Context
In the architecture described earlier, the critic evaluates generated SQL in isolation. In practice, this misses cases where the SQL is syntactically correct but semantically wrong.
A stronger critic actually executes the generated SQL against a staging database with a time-limited plan, examines the results, and flags queries that return unexpected row counts or null patterns. This adds a few hundred milliseconds per validation cycle but catches errors that static analysis cannot.
Where Single-Agent Still Wins
Despite all of the multi-agent tooling and caching layers described above, there is a large class of problems where a single well-tuned agent outperforms the full pipeline: when the schema is small, user questions are predictable, and execution correctness can be verified deterministically.
The multi-agent approach excels when:
- The schema is large and spans many business domains.
- Ambiguity in natural-language questions is high.
- Different types of queries (analytical vs. lookup vs. ranking) require different generation strategies.
- Cost per query is less critical than accuracy and maintainability.
If your use case does not match these conditions, start with a single agent and only decompose when you have measured evidence that it is underperforming.
Summary of Practical Optimizations
Problem Optimization Trade-off Repeated queries Query-level cache (normalized NL → SQL/result) Stale cache risk; must align with data refresh cadence Biased evaluation Periodic critic re-training on production logs Requires ongoing labeling effort Large-schema retrieval Graph-based + hierarchical subject-area pruning Higher engineering complexity in the retriever Semantic correctness Critic with live execution plan and row-count validation Added latency (~200-500ms) per validation cycle Misleading dashboards Thin BI validation layer (e.g., Google Sheets sanity check, metric bounds) Does not eliminate upstream errors, only catches obvious anomalies Inappropriate architectureComplexity Single-agent first, decompose only on evidence Must have clear measurement criteria to trigger decomposition
The key takeaway: multi-agent text-to-SQL systems are powerful but adding them on top of a BI stack introduces new classes of failure. The optimizations above are not theoretical; they come from real production experience where delegating SQL generation to an LLM silently broke dashboards for weeks before anyone noticed. Measure everything, cache aggressively, and always let the critic verify execution rather than syntax alone.



