It takes a five-minute conversation and delivers eight neatly organized sections: Decisions, Action Items, Risks, Open Questions, and four more. Each section reads as though it was crafted by someone who was truly listening.
But look at the original transcript, and you’ll discover that two of those sections were drawn from a single vague statement, one was completely fabricated, and three were filled in based on what the model expected a meeting summary should look like. The result is polished, well-formatted, and structurally identical to a summary of a meeting where those things genuinely occurred.
This isn’t the typical hallucination issue. The model isn’t inventing a real-world fact. It’s inventing a fact about the meeting itself. And the flaw isn’t apparent in the output. It’s just confident, well-organized text that the reader can’t easily check against the original source.
There’s a term for this kind of failure in another discipline, and it predates language models entirely. It’s what happens when you estimate without first establishing identification.
This article isn’t introducing a new summarization benchmark. It’s making the case for a design pattern I haven’t seen treated as a first-class principle in the AI engineering literature: treat LLM-generated summaries as structured claims tied to a source, require each claim to state its support category, and restrict review stages to weakening unsupported claims rather than polishing the output. I’ll walk through how this works in practice, what it produces, and where it falls short.
The missing step
Causal inference is the analytical tradition that formalizes the distinction between identifying a quantity and estimating one. Identification is the reasoning that shows the data you have can justify the claim you want to make. Estimation is the process that generates a number once identification is established. The sequence is fixed. You can’t estimate a treatment effect you haven’t first argued is identifiable from your observational data, because the resulting number is meaningless. It resembles an effect. It isn’t an effect.
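For readers outside causal inference, the canonical example helps. If a confounder set $X$ satisfies the backdoor criterion, the effect of a treatment $T$ on an outcome $Y$ is identified from observational data by

$$P(Y \mid do(T=t)) = \sum_{x} P(Y \mid T=t,\, X=x)\, P(X=x),$$

and only once that equality has been argued does estimating the right-hand side mean anything. The adjustment formula is the estimation recipe; the backdoor argument is the identification.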
Analysts working in observational settings dedicate a significant portion of their effort to identification. They map out causal graphs. They debate confounders. They separate what the data can justify from what it can’t. The estimation step, when it finally arrives, is often the straightforward part.
Now think about what an LLM summarizer does. It takes a transcript. It generates structured claims about that transcript’s content: decisions reached, commitments made, risks identified, next steps outlined. Each claim is, in a meaningful sense, an estimate of an underlying truth. The decision was reached or it wasn’t. The commitment was made or it wasn’t. The summary is stating a value for each of these truths.
There is no identification step. The model doesn’t check whether the transcript holds enough evidence to back the claim. It generates the claim because the format demands one.
LLM summarization operates like observational analysis, yet it’s frequently deployed without anything resembling an identification step.
The AI engineering literature hasn’t ignored the underlying issue. Hallucination detection, calibrated uncertainty, selective prediction and abstention, RAG grounding, citation verification, factual consistency, and claim verification: each represents a meaningful research direction, and each tackles a genuine aspect of the problem. What unites them is that they approach fabrication as a model behavior to be measured, scored, or filtered after the fact.
Identification operates at a different level. It doesn’t evaluate the output for reliability. It restricts what the model is permitted to assert from the start by requiring every claim to state what it is and where it originated. The two levels complement each other. A pipeline that handles identification well still gains from calibration and grounding efforts downstream. A pipeline that only addresses downstream issues is filtering output that should never have been generated in the form it was generated.
What identification looks like for a transcript
Identification in observational data is a question about what the data can justify. Identification for a transcript is the same question, focused on a specific source. Given this transcript, what can be seen directly, what can be reasoned with stated assumptions, and what can’t be backed at all?
That’s the entire shift. Every claim a summarizer generates should state which of those three categories it falls into. Observed claims reference a specific portion of the transcript and assert nothing beyond what that portion states. Inferred claims state the assumption being made and the evidence the inference is connecting. Recommendations state that they are the model’s suggestion, not the participants’ decision.
A summarizer that can’t assign a claim to one of those categories has no reason to produce the claim. The correct output in that case isn’t a more polished claim. It’s no claim at all.
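To make that concrete, here is a minimal sketch of the rule in Python. The names are illustrative rather than taken from the actual repository; the point is that a claim either states its label and points at evidence, or it never comes into existence.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Label(Enum):
    OBSERVED = "observed"              # asserts only what a specific span states
    INFERRED = "inferred"              # states an assumption plus the evidence it connects
    RECOMMENDATION = "recommendation"  # explicitly the model's suggestion


@dataclass
class Claim:
    text: str
    label: Label
    evidence_refs: list[str]          # ids of facts extracted from the transcript
    assumption: Optional[str] = None  # required when label is INFERRED


def make_claim(text: str, label: Label, evidence_refs: list[str],
               assumption: Optional[str] = None) -> Optional[Claim]:
    """Return a Claim, or None when the claim cannot be identified.

    Refusal is a normal return value, not an error: the correct output
    for an unsupported claim is no claim at all.
    """
    if label is not Label.RECOMMENDATION and not evidence_refs:
        return None  # no evidence pointer means the claim is not identifiable
    if label is Label.INFERRED and assumption is None:
        return None  # an inference must state the assumption it rests on
    return Claim(text, label, evidence_refs, assumption)
```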
This is unsettling for the summary consumer, because it means many sections will be empty when the underlying conversation was sparse. That discomfort is the point. It’s information. It tells the reader that the meeting didn’t actually yield eight sections of substance, regardless of what the summarizer wished to write.
A pipeline that enforces the discipline
The architecture follows from the conceptual framing. Three LLM stages paired with a deterministic renderer.
[Figure: the three LLM stages and the deterministic renderer. Image by Author]
The first stage pulls structured facts from the transcript. Speaker turns, explicit commitments, explicit decisions, explicit quantities. This stage is intentionally cautious. It’s permitted to overlook things. It’s not permitted to fabricate them.
The second stage assembles those facts into claim objects across eight sections. Each claim carries a label: observed, inferred, or recommendation. Each claim carries a reference to the evidence in the extracted facts. Synthesis is where the analytical effort happens, and it’s also where the model is most prone to drift.
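Concretely, the objects flowing between the two stages might look like this. The transcript content and field names here are invented for illustration:

```python
# One fact pulled by the extraction stage. Facts carry ids so that
# every downstream claim can point back into the source.
fact = {
    "id": "F12",
    "speaker": "B",
    "turn": 47,
    "quote": "Let's go with the second pricing option, then.",
}

# One claim produced by the synthesis stage. The label and the
# evidence reference are what make the claim auditable later.
claim = {
    "section": "Decisions",
    "label": "observed",        # observed | inferred | recommendation
    "text": "The group selected the second pricing option.",
    "evidence_refs": ["F12"],   # must resolve to extracted facts
}
```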
The third stage audits. This is the stage that performs the identification work, and the restriction on it is the aspect of the design that matters most.
The audit stage can’t rewrite the analysis into something more polished. It can’t add a more compelling recommendation. It can’t fabricate missing context.
It’s given a limited set of operations and barred from doing anything else. It can remove a claim. It can downgrade a claim from observed to inferred, or from inferred to recommendation. It can relocate a claim to a more suitable section. It can substitute a claim with an explicit insufficient-evidence placeholder. It can collapse an entire section when nothing in it survives review.

Anything not on this list is forbidden, including writing better claims.
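One way to make that prohibition structural rather than prompt-deep: have the audit stage emit its operations as data, and let deterministic code refuse anything outside the whitelist. A sketch, reusing the claim shape from earlier:

```python
# The audit stage's whole vocabulary. Everything else is rejected,
# including any operation that would strengthen or rewrite a claim.
ALLOWED_OPS = {
    "remove_claim",
    "downgrade_label",
    "move_to_section",
    "replace_with_insufficient_evidence",
    "collapse_section",
}

DOWNGRADE = {"observed": "inferred", "inferred": "recommendation"}

PLACEHOLDER = "[Insufficient evidence in the source to support a claim here.]"


def apply_audit(claims: dict, ops: list) -> dict:
    """claims maps claim id -> claim dict; ops come from the audit LLM call.

    Every branch below weakens or removes; nothing can add or strengthen.
    """
    for op in ops:
        kind = op["op"]
        if kind not in ALLOWED_OPS:
            raise ValueError(f"forbidden audit operation: {kind}")
        if kind == "collapse_section":
            claims = {cid: c for cid, c in claims.items()
                      if c["section"] != op["section"]}
        elif kind == "remove_claim":
            claims.pop(op["claim_id"], None)
        elif kind == "downgrade_label":
            c = claims[op["claim_id"]]
            c["label"] = DOWNGRADE[c["label"]]  # KeyError if already at the weakest label
        elif kind == "move_to_section":
            claims[op["claim_id"]]["section"] = op["section"]
        elif kind == "replace_with_insufficient_evidence":
            c = claims[op["claim_id"]]
            c["text"], c["label"] = PLACEHOLDER, "insufficient_evidence"
    return claims
```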
[Figure: the audit stage’s allowed operations. Image by Author]
The replace_with_insufficient_evidence operation deserves its own line. It is the system literally typing a placeholder into the output where a confident claim used to be. That is identification work made operational. The reader sees, in prose, exactly where the synthesis stage produced a claim that the source could not support.
Why the asymmetry matters. A reviewer that is allowed to improve the analysis becomes another source of the same problem the system is trying to solve. A reviewer that is only allowed to weaken or remove can only fail in one direction: by being too cautious. That is a tolerable failure mode. The opposite is not.
What the design produces, and what it refuses to produce
This is not a benchmark. It is a small fixture-based stress test designed to check whether the architecture produces the behavior it was built to produce. Three transcripts are not enough to make general claims about LLM summarization. They are enough to check whether a specific design choice has the consequences the design predicted.
The fixtures are: a decision meeting in which a pricing model was selected among three real alternatives, a working session that surfaced a measurement problem without resolving it, and a thin two-person sync that contained almost no decision content.
What did not happen. Across the three runs, the pipeline produced zero fabricated commitments and zero ungrounded quantities. This is what the architecture is designed to make harder. A claim cannot survive the pipeline if it does not have a pointer to evidence, and the audit stage cannot manufacture evidence to keep a claim alive. The result is not a guarantee. The deterministic renderer is the only stage that gives guarantees. Extraction, synthesis, and audit are still LLM calls and can still fail. The point is that the architecture pushes their failures toward removal rather than toward fabrication, and the fixtures are consistent with that.
What did happen. The result that I find more interesting is the abstention rate.

[Figure: across the three fixture transcripts, the share of empty section slots rose from 17% to 58%, with 0 fabricated commitments and 0 ungrounded quantities across all three runs. Image by Author]
On the rich decision meeting, the pipeline left seventeen percent of section slots empty or replaced by the insufficient-evidence placeholder. On the working session, the figure rose to twenty-five percent. On the thin sync, it reached fifty-eight percent. The system produced roughly three and a half times as many empty sections when the input signal was thin as when it was rich.
That is the behavior the design is trying to produce. A summarizer that fills the same eight sections regardless of input is not summarizing. It is generating output that conforms to a template. The template is doing the work, and the model is the cosmetic finish.
A summarizer that abstains in proportion to the thinness of the input is doing something different. It is treating the transcript as a source whose content varies, and it is letting that variation show up in the output. The empty sections are not failures of the model. They are the model declining to assert what the source does not support.
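The numbers above come from a simple metric: the share of the template’s section slots the pipeline declined to fill. A sketch, under the same conventions as the earlier snippets:

```python
def abstention_rate(summary: dict[str, list[dict]],
                    section_names: list[str]) -> float:
    """Share of the template's section slots the pipeline declined to fill.

    A slot counts as abstained if its section was collapsed (absent),
    left empty, or contains only insufficient-evidence placeholders.
    """
    abstained = sum(
        1 for name in section_names
        if not summary.get(name)
        or all(c["label"] == "insufficient_evidence" for c in summary[name])
    )
    return abstained / len(section_names)
```

On the thin sync, this comes out near 0.58; on the rich decision meeting, near 0.17.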

[Figure: excerpts from the decision-meeting fixture, with the categorical labels surfaced inline. Image by Author]
Reading the result. The labels are not decoration. They change what the reader does with the output. An observed claim invites verification against the transcript. An inferred claim invites scrutiny of the assumption that produced it. An insufficient-evidence placeholder invites the reader to either look at the source themselves or accept that the meeting did not, in fact, produce a claim of that shape.
The objection from the consumer
There is an argument that empty sections are a usability problem. The reader expected a summary. The reader got a partial summary with explicit gaps. The reader has to do more work.
That objection deserves a direct answer. The reader who got a fluent eight-section summary of a five-minute exchange was already doing more work, just invisibly. They were going to read the summary, act on it, and at some point discover that two of the action items were not actually agreed to and one of the risks was never raised. The cost of that discovery is high. It is paid in misallocated meetings, missed commitments, and the slow erosion of trust in the tooling.
Honest emptiness pushes the cost forward. The reader sees the gap immediately and can decide how to handle it. Open the transcript. Ask a participant. Treat the meeting as inconclusive. Each of those is a better response than acting on a summary whose confidence the source did not earn.
This is the same trade observational analysts make when they refuse to report a point estimate without identification. The consumer would prefer a number. The analyst declines. The decision the consumer makes from no number is, on average, better than the decision they would have made from a number the data could not support.
Generalizing the pattern
The architecture transfers. Any LLM workflow that produces structured claims from a source can be reframed as observational analysis and given an identification layer.
Document review for legal discovery. Patient note summarization. Customer call analysis. Code review summaries. Each of these is currently deployed as a one-shot generation problem, with a model producing structured output from a source and the consumer trusting the result. Each of them has a version of the same failure mode the meeting summarizer has, and each can be made more auditable with a similar architecture: an extraction stage that is conservative about what it pulls from the source, a synthesis stage that produces labeled claims with evidence pointers, and an audit stage that is forbidden from adding or strengthening anything. The implementation and the risk profile differ across these domains. The underlying pattern carries over; the specific details do not.
Labels and evidence references are not add-ons. They are the practical mechanism by which identification happens. A claim that lacks a label cannot be tracked. A claim that lacks an evidence reference cannot be verified. The audit stage’s rule of only weakening, never strengthening, is what keeps a model from erasing the identification work in pursuit of more polished prose.
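That verification can itself be deterministic rather than another LLM call. A sketch, assuming the fact and claim shapes from earlier:

```python
def verify_evidence(claims: list[dict], facts: list[dict]) -> None:
    """Deterministic downstream check: every observed or inferred claim
    must cite evidence that resolves to an actually extracted fact."""
    fact_ids = {f["id"] for f in facts}
    for c in claims:
        if c["label"] in ("recommendation", "insufficient_evidence"):
            continue  # these label themselves; there is nothing to resolve
        refs = c.get("evidence_refs", [])
        missing = [r for r in refs if r not in fact_ids]
        if not refs or missing:
            raise ValueError(
                f"unverifiable claim {c['text']!r}: missing evidence {missing or 'entirely'}"
            )
```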
What this means for the teams building these systems
Well-calibrated uncertainty scores are useful. Hallucination evaluation suites are useful. Grounding and citation methods are useful. But none of them replace the core discipline of declining to state something the source material does not actually support.
That discipline is absent from many LLM-based systems, and the reasons are partly cultural. The field emerged from machine learning, where a model is expected to generate a response for every input. The idea that the correct response is sometimes no response at all is not unheard of in the research, but it runs against the grain of a generative model trained to always predict what comes next. Yet that idea is second nature in observational analysis, where the honest answer to many questions is simply that the available data cannot support one.
So the methods that will make LLM analytical tools trustworthy may not originate mainly within the LLM research community. They may come from fields that have already figured out how to conduct rigorous analysis when the source material is the hard constraint. Causal inference is one such field. Survey methodology is another. Forensic accounting is another.
The practitioners who already understand how to decline an estimate without a valid identification strategy have a uniquely clear view of what is broken in today’s LLM analytical tools, and how to fix it.
Causal inference trained a generation of researchers not to estimate anything they had not first identified. LLM summarizers commit the same error, just in natural language rather than in statistical form. The solution is not merely a better model. The solution is to restore the step that observational analysis never abandoned, and to enforce it through an architecture that cannot be persuaded to skip it.
A few common pitfalls to avoid
- Treating labels as surface decoration. If labels are not enforced at the point of creation, they are just ornamentation. They must be assigned during synthesis alongside a reference to supporting evidence, and then verified downstream against that reference. A synthesis step that generates a label without pointing to evidence is not performing identification. It is generating a category that merely resembles identification.
- Allowing the audit stage to be helpful. This is the tempting mistake. A reviewer that can insert a recommendation, fill in missing context, or rephrase an awkward claim feels productive. But it is precisely the same failure mode the synthesis stage already exhibits, now disguised as quality assurance. Restrict the audit to a predetermined set of weakening operations. Anything beyond that is the system debating with itself.
- Mistaking abstention for poor quality. A summarizer that returns mostly blank sections when given a sparse meeting transcript is not underperforming. A summarizer that produces a confident, eight-section summary from that same thin transcript is underperforming, just in a way that is hard to see. The correct way to evaluate these systems is not by how complete the summary is, but by whether the rate of abstention rises and falls in proportion to the amount of signal in the source (see the test sketch after this list).
- Drawing broad conclusions from a handful of test cases. Three transcripts are sufficient to verify whether a design choice yields the behavior it was intended to produce. They are not sufficient to support sweeping claims about LLM summarization as a whole. If you build your own version of this, you will need your own set of test cases and your own criteria for what constitutes the appropriate level of abstention in your particular context.
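The third pitfall’s evaluation criterion can be written down as a test. Everything named here (run_pipeline, SECTION_NAMES, the fixture files) is a placeholder for whatever your own harness provides; abstention_rate is the hypothetical helper sketched earlier:

```python
SECTION_NAMES = ["Decisions", "Action Items", "Risks", "Open Questions"]  # plus your other four


def run_pipeline(fixture_path: str) -> dict:
    """Placeholder: wire this to your own extraction, synthesis, and audit stages."""
    raise NotImplementedError


def test_abstention_tracks_signal():
    # Fixtures ordered from richest to thinnest input signal.
    fixtures = ["decision_meeting.txt", "working_session.txt", "thin_sync.txt"]
    rates = [abstention_rate(run_pipeline(f), SECTION_NAMES) for f in fixtures]
    # The property under test is monotonicity, not any particular value:
    # a thinner source should produce more abstention, never less.
    assert rates[0] < rates[1] < rates[2]
```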
The asymmetry that counts
A pipeline that can only weaken its outputs has exactly one failure mode: excessive caution. A pipeline that can strengthen its outputs is vulnerable to every failure mode the research literature has catalogued over the past several years.
Opting for the first type over the second is not a purely technical choice. It is a choice about the system’s purpose. If the goal is to generate fluent prose, the second type wins on every measure. If the goal is to produce claims a reader can verify before taking action, only the first type is defensible.
Most existing tools are engineered for the first goal and then deployed as though they were built for the second. Recognizing that gap as a methodological problem rather than a model-quality problem is what opens up different paths to a solution.
The repository, evaluation harness, and sample outputs are available on GitHub. The complete notebook walks a single transcript through every stage and runs the evaluation harness across all three test cases.
Staff Data Scientist specializing in causal inference, experimentation, and decision science. I write about transforming ambiguous business questions into analysis that is ready to inform decisions.