A recurring question in the payments space is whether a large-language-model agent can take over from a gradient-boosted decision-tree scorer on the synchronous authorization hot path. The question makes intuitive sense — agents are already handling investigation queues that previously required a seasoned analyst flipping across five dashboards, so extending that capability to real-time transaction scoring feels like a natural next step.
I put together a compact benchmark to find out. The entire setup runs on a laptop. There’s no GPU, no API key, and no cloud account needed. The source is available on GitHub at github.com/sandeepmb/fraud-agents-benchmark. Every chart and figure in this article is produced by that same Python repository, so you’re welcome to reproduce the results yourself.
The headline finding: classical machine learning still belongs on the synchronous critical path, while agents fit the asynchronous cold path. The remainder of this article walks through the three measurements that define the boundary between those two tiers, and the hybrid architecture I ultimately settled on.
TL;DR
- Running on a single CPU core, the gradient-boosted scorer achieves p99 latency of 0.15 ms. A calibrated LLM-latency simulator (not a live API call) places the LLM scorer’s p99 at roughly 1,200 ms. The ISO 8583 authorization window is about 100 ms.
- Processing 50,000 transactions per second for an hour costs approximately $54 with the GBDT scorer. A gpt-4o-mini-class model racks up $16,200. A frontier-tier model (Claude Sonnet 4.6) reaches $351,000. These numbers cover scoring alone; agent-driven reasoning would push them even higher.
- Across 500 calls with byte-identical input, the GBDT outputs exactly 1 unique score. The simulated non-deterministic LLM scorer produces 498. Hosted LLM inference can remain non-deterministic even with temperature set to zero, which makes validating a hot-path scorer under regulatory scrutiny very difficult.
- Agents deliver genuine value on the asynchronous cold path: drafting Suspicious Activity Reports, collecting evidence through MCP-typed tools, and running an agent-as-a-judge pass before a human reviews and signs off.
Scope and Limits
Four candid caveats before diving into results.
This is not an argument that LLMs have nothing to offer fraud teams. The latter half of this article covers several areas where their contribution is clearly strong. Nor is it aimed at fine-tuned tabular transformers or deep-learning tabular models. The comparison pits a deterministic gradient-boosted scorer against LLM-style scoring within a synchronous authorization context.
The GBDT path is measured directly on a local CPU. The LLM latency path comes from a calibrated distribution rather than live API measurements. Cost estimates are derived from publicly listed per-token pricing. Determinism is demonstrated two ways — empirically for the GBDT, and for the LLM through simulator output plus corroborating external evidence.
| Component | Measured, simulated, or calculated | Rationale |
| --- | --- | --- |
| GBDT latency | Measured | Local single-core CPU benchmark |
| LLM latency | Simulated | Calibrated log-normal distribution; no API or GPU needed |
| Cost | Calculated | Publicly available May-2026 per-token pricing |
| Determinism | Measured (GBDT) and cited evidence (LLM) | Local benchmark plus published references |
The Setup
My goal was a benchmark anyone could re-run without access to an A100 or an OpenAI API key. That constraint drove three design decisions.
The dataset is synthetic but shaped like ISO 8583 traffic. Each transaction carries twenty features — the sort of fields a card-not-present hot path scorer actually sees: amount, MCC risk score, device age, geolocation distance, velocity counters at one-hour and twenty-four-hour windows, chargeback history, and a handful of binary flags. The fraud rate sits at 1.5%. The generator exposes a stealth-fraud parameter so that about 15% of fraudulent rows are drawn from the legitimate-class distribution. This models sophisticated mimicry and gives the benchmark an irreducible Bayes-optimal error floor. Without that parameter, a tree ensemble would hit PR-AUC around 0.999, which would make the entire benchmark look contrived.
```python
# src/fraud_benchmark/data.py (abridged)
def generate(n_rows, fraud_rate=0.015, seed=42, stealth_rate=0.15):
    rng = np.random.default_rng(seed)
    n_fraud = int(round(n_rows * fraud_rate))
    n_stealth = int(round(n_fraud * stealth_rate))
    legit = _draw_class(rng, n_rows - n_fraud, is_fraud=False)
    overt = _draw_class(rng, n_fraud - n_stealth, is_fraud=True)
    stealth = _draw_class(rng, n_stealth, is_fraud=False)  # mimicry
    ...
```

After training a HistGradientBoostingClassifier on 200,000 rows sampled from this distribution, the model reaches PR-AUC 0.847 and ROC-AUC 0.931 on a 50,000-row holdout set. These are realistic figures for a production-grade card-not-present scorer.
The scorer itself uses an optimized batch=1 inference path. Invoking sklearn’s predict_proba on a single row takes roughly 14 ms on this laptop — largely because of Python-level input validation overhead. That figure is not representative of how XGBoost or LightGBM perform in production, so for a fair comparison I extracted the trained model’s internal tree structures into per-field numpy arrays and wrote a compact traversal routine. It matches sklearn to float64 precision and runs about 100 times faster.
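To make the flattening idea concrete, here is a minimal sketch of an array-based traversal in plain Python. The node layout (`feature`, `threshold`, `left`, `right`, `value`), the `FlatTree` name, and the toy two-tree ensemble are illustrative assumptions, not the repository's actual arrays; a real port would read these fields from the fitted model and store them in numpy arrays.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class FlatTree:
    """One tree stored as parallel arrays (hypothetical layout)."""
    feature: List[int]      # feature index tested at each node, -1 marks a leaf
    threshold: List[float]  # split threshold (unused for leaves)
    left: List[int]         # child index when x[feature] <= threshold
    right: List[int]        # child index otherwise
    value: List[float]      # leaf contribution in log-odds (unused for splits)


def tree_value(tree: FlatTree, x: Sequence[float]) -> float:
    """Walk from the root (node 0) to a leaf with no per-call validation."""
    node = 0
    while tree.feature[node] != -1:
        if x[tree.feature[node]] <= tree.threshold[node]:
            node = tree.left[node]
        else:
            node = tree.right[node]
    return tree.value[node]


def score_one(trees: Sequence[FlatTree], baseline: float, x: Sequence[float]) -> float:
    """Sum the leaf values, then map log-odds to a probability."""
    raw = baseline + sum(tree_value(t, x) for t in trees)
    return 1.0 / (1.0 + math.exp(-raw))


# Toy ensemble (assumed values): two depth-1 trees over a 2-feature vector.
TREES = [
    FlatTree([0, -1, -1], [100.0, 0.0, 0.0], [1, 0, 0], [2, 0, 0], [0.0, -1.0, 1.5]),
    FlatTree([1, -1, -1], [0.5, 0.0, 0.0], [1, 0, 0], [2, 0, 0], [0.0, -0.5, 0.5]),
]

low_risk = score_one(TREES, 0.0, [50.0, 0.2])
high_risk = score_one(TREES, 0.0, [500.0, 0.9])
```

The loop does nothing but index into arrays and compare, which is why skipping the generic validation layer pays off so much at batch size 1.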
The LLM scorer is handled through simulation. This is the one area where running everything on a laptop required calibration rather than direct measurement. The simulator draws per-call latency from a log-normal distribution with a 540 ms median and σ = 0.35. The calibration is informed by three publicly available sources: NVIDIA Triton’s published time-to-first-token figures for Llama-3-8B q4 on an A10 GPU, vLLM benchmarks for Qwen2.5-7B on an RTX 4090, and the p50 and p99 latencies that OpenAI and Anthropic publish for their hosted APIs. The simulator also generates non-deterministic score outputs on identical inputs — a property needed for the determinism experiment.
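As a rough illustration of the latency model, the sketch below draws from a log-normal distribution with the stated 540 ms median and σ = 0.35 using only the standard library. The function names, the seed, and the sample count are assumptions for this example and are not taken from the repository.

```python
import math
import random
from statistics import NormalDist


def simulate_latencies_ms(n, median_ms=540.0, sigma=0.35, seed=0):
    """Draw n per-call latencies from a log-normal distribution.

    For a log-normal, the median equals exp(mu), so mu = ln(median).
    """
    rng = random.Random(seed)
    mu = math.log(median_ms)
    return [rng.lognormvariate(mu, sigma) for _ in range(n)]


def percentile(samples, q):
    """Nearest-rank percentile for q in (0, 100]."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q / 100.0 * len(ordered)))
    return ordered[rank - 1]


def analytic_quantile_ms(p, median_ms=540.0, sigma=0.35):
    """Closed-form quantile: median * exp(sigma * z_p)."""
    z = NormalDist().inv_cdf(p)
    return median_ms * math.exp(sigma * z)


latencies = simulate_latencies_ms(50_000)
p50 = percentile(latencies, 50)
p99 = percentile(latencies, 99)
p99_analytic = analytic_quantile_ms(0.99)
```

Under these parameters the closed-form p99 comes out near 1,219 ms, which is in line with the roughly 1,200 ms the simulator reports.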
With this foundation in place, three experiments follow.
Break #1: Inference Falls Outside the ISO 8583 Budget
The experiment makes five thousand single-transaction calls against the GBDT scorer on one CPU core at batch size 1, and takes four hundred draws from the calibrated LLM latency distribution.
The entire measured GBDT distribution lies to the left of the 100 ms ISO 8583 inference budget. The entire sampled LLM distribution lies to the right. There is zero overlap. The classical scorer’s p99 is 0.15 ms. The LLM-latency simulator’s p99 is 1,212 ms — roughly 8,000 times the classical p99 and 12 times the full authorization budget.
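A batch=1 latency harness of this kind can be sketched in a few lines. The function names, warm-up count, and nearest-rank percentile below are assumptions for illustration; the repository's harness may differ in its details.

```python
import math
import time


def measure_latency_ms(score_fn, payload, n_calls=5_000, warmup=100):
    """Time score_fn(payload) one call at a time and summarize in milliseconds."""
    for _ in range(warmup):          # let caches and lazy imports settle
        score_fn(payload)
    samples = []
    for _ in range(n_calls):
        start = time.perf_counter_ns()
        score_fn(payload)
        samples.append((time.perf_counter_ns() - start) / 1e6)
    samples.sort()

    def pct(q):
        # Nearest-rank percentile on the sorted samples.
        return samples[max(0, math.ceil(q / 100 * len(samples)) - 1)]

    return {"p50": pct(50), "p99": pct(99), "max": samples[-1]}


def within_budget(summary, budget_ms=100.0):
    """True when the p99 fits inside the authorization budget."""
    return summary["p99"] < budget_ms


# Stand-in scorer for the example: any callable taking one payload works.
summary = measure_latency_ms(lambda x: sum(x), [1.0] * 20, n_calls=500)
```

Tracking p99 rather than the mean matters here because an authorization timeout is triggered by the slow tail, not by the typical call.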
Once you sit with these numbers they stop feeling counterintuitive. A gradient-boosted tree ensemble performs a few hundred branching integer comparisons on a numeric feature vector. An autoregressive transformer executes a prefill pass over a prompt and then decodes output tokens one by one, with each token requiring a complete forward pass through billions of parameters. These occupy fundamentally different computational regimes. Quantization and distillation can narrow the divide by a meaningful amount, but doing so doesn’t eliminate the fundamental category distinction between walking a decision tree numerically and generating tokens one at a time through an autoregressive process.
ISO 8583 serves as the global standard for payment transaction messages originating from cards. Its workflow is synchronous by design. When a point-of-sale terminal sends an authorization request, it waits for a response within a matter of milliseconds, and the bulk of that tight window gets spent on operations that have nothing to do with inference.

Those operations include network transmission, message deserialization, feature-store queries, rule-engine evaluation, and response packaging. The only phase that changes with the choice of model is inference itself. Replace a gradient-boosted decision tree with a large language model, and the round trip leaps from 32 ms to 563 ms, surpassing the 100 ms budget by a factor of more than five.
The common rebuttal from the LLM community is “we’ll simply batch requests together.” That isn’t feasible here. Synchronous payment authorization means each transaction trickles in independently over the network at unpredictable times and must be scored the moment it arrives. Continuous batching, the method that gives modern GPU inference its high throughput, only works when many requests are simultaneously in flight and can be grouped together by the runtime. When every “batch” holds just a single request, the GPU spends most of its compute cycle doing nothing, and the cost justification falls apart right along with it.
This brings us to the second point where things fall apart.
Break #2: The Cost Divide Ranges From 200x to 6,500x
During a major retail peak event, a large card acquirer might reasonably process as many as 50,000 transactions per second. The cost model presented here was intentionally kept transparent and easy to verify. The LLM pricing tiers use publicly listed per-token rates multiplied by a fixed token allocation per request, meaning every dollar figure can be recomputed from scratch.
```text
requests/hour = TPS × 3600
cost/hour     = requests/hour × (prompt_tokens × input_price
                                 + response_tokens × output_price) / 1,000,000
```

The calculation assumes 50,000 TPS, a 400-token input prompt, and a concise 50-token approve-or-decline response per scoring call. The entry-level tier uses OpenAI’s gpt-4o-mini at $0.15 per million input tokens and $0.60 per million output tokens. The top-tier option is Anthropic’s Claude Sonnet 4.6, priced at $3 and $15 respectively. Both reflect the published pricing as of May 2026. Tree-based scorers are costed from amortized CPU infrastructure (a c7i.4xlarge spot instance) rather than per-token billing.

LightGBM on standard CPU instances costs roughly $54 per hour. XGBoost comes in at $72. The gpt-4o-mini tier jumps to $16,200 per hour. The Claude Sonnet 4.6 tier reaches $351,000 per hour. Even at the smallest LLM tier, the bill is about 225 times the cost of XGBoost and 300 times the cost of LightGBM. At the frontier tier, the ratio against LightGBM climbs to approximately 6,500 to one.
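The dollar figures above can be recomputed directly from the stated assumptions. The sketch below is a plain transcription of the cost formula; the function and variable names are mine, and the two tree-scorer costs are simply the hourly figures quoted in the text.

```python
def llm_cost_per_hour(tps, prompt_tokens, response_tokens,
                      input_price_per_m, output_price_per_m):
    """Hourly cost in USD given per-million-token prices."""
    requests_per_hour = tps * 3600
    per_request = (prompt_tokens * input_price_per_m
                   + response_tokens * output_price_per_m) / 1_000_000
    return requests_per_hour * per_request


TPS, PROMPT_TOKENS, RESPONSE_TOKENS = 50_000, 400, 50

mini = llm_cost_per_hour(TPS, PROMPT_TOKENS, RESPONSE_TOKENS, 0.15, 0.60)
frontier = llm_cost_per_hour(TPS, PROMPT_TOKENS, RESPONSE_TOKENS, 3.00, 15.00)

# Hourly tree-scorer costs as quoted in the article.
LIGHTGBM_PER_HOUR, XGBOOST_PER_HOUR = 54.0, 72.0

ratio_low = mini / XGBOOST_PER_HOUR        # cheapest LLM vs. pricier tree scorer
ratio_high = frontier / LIGHTGBM_PER_HOUR  # frontier LLM vs. cheaper tree scorer
```

Changing any single assumption (TPS, token counts, or prices) and re-running is the quickest way to see how sensitive the ratios are.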
These figures represent the optimistic end of the spectrum. Real-world agentic reasoning — involving tool invocations, chain-of-thought token generation, and multi-step deliberation — inflates the output token budget by a factor of 10 to 50, pushing the total cost right alongside it. Running a full agentic investigation per transaction could push the frontier tier into millions of dollars every hour.
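To put a number on that inflation, the sketch below scales only the output tokens by 10x and 50x at the frontier-tier prices. Treating the multiplier as output-only is a simplifying assumption of this example; tool results fed back into the prompt would add input tokens as well, so these figures are closer to a lower bound.

```python
def hourly_cost_usd(tps, prompt_tokens, response_tokens,
                    input_price_per_m, output_price_per_m,
                    output_multiplier=1):
    """Hourly bill when output tokens grow by `output_multiplier`."""
    output_tokens = response_tokens * output_multiplier
    per_request = (prompt_tokens * input_price_per_m
                   + output_tokens * output_price_per_m) / 1_000_000
    return tps * 3600 * per_request


# Frontier-tier pricing from the article: $3 input, $15 output per million tokens.
frontier_by_multiplier = {
    m: hourly_cost_usd(50_000, 400, 50, 3.00, 15.00, output_multiplier=m)
    for m in (1, 10, 50)
}
```

With these inputs the hourly bill moves from $351,000 at 1x to about $1.57 million at 10x and about $6.97 million at 50x.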
The calculation also presumes a batch size of one, which accurately reflects the reality of synchronous authorization. GPU economics only make sense with continuous batching over many concurrent requests. A shared hosting API spreads that amortization across all its users, but at the consumer tier you’re still paying the same per-token rate regardless.
This is the point where discussions with vendors stop being about engineering and start being about raw arithmetic. A major card issuer processing a billion transactions daily would watch its daily inference cost jump from a few hundred dollars to somewhere between tens of thousands and a couple of million — with no measurable accuracy benefit. Underneath it all, the data is tabular, numeric, and tightly structured — precisely the kind of data that gives no natural edge to a language model. Tree ensembles have outperformed on structured data benchmarks for years, and the reasons haven’t changed.
Break #3: Identical Inputs Yield Different Outputs
The third fault line is the one that ultimately determines whether a bank can actually deploy such a system in a speed-critical, customer-facing pipeline — independent of how the first two problems evolve over time.
The entire framework of banking model-risk regulation rests on reproducibility. The 2011 Federal Reserve and OCC model-risk guidelines known as SR 11-7 were replaced in April 2026 by the interagency Revised Guidance on Model Risk Management, designated SR 26-2. Under this guidance, any model that drives decisions affecting customers or subject to regulatory review — including transaction denials, holds, account restrictions, and alert escalations — must undergo independent validation. This means that objective third-party examiners must be able to inspect the model’s assumptions and reproduce its outputs on command. A model that gives different results from the same inputs simply cannot produce the kind of verifiable, reproducible validation evidence that regulators require.
```python
# src/fraud_benchmark/benchmark.py: determinism experiment
def determinism(scorer, n=500, seed=7):
    score_fn = getattr(scorer, "score_only", scorer.score_one)
    x = single_payload(seed=seed)
    outputs = np.array([float(score_fn(x)) for _ in range(n)])
    rounded = np.round(outputs, 6)
    return DeterminismSummary(
        distinct_count=int(np.unique(rounded).size),
        spread=float(outputs.max() - outputs.min()),
        std=float(outputs.std()),
        n=n, outputs=outputs,
    )
```

The experiment made 500 calls to each scorer using a feature vector that is precisely identical, bit for bit. The gradient-boosted tree returned the exact same float64 score all 500 times. The simulated LLM, however, produced 498 distinct values, with a spread of 0.51 and a standard deviation of 0.077.

This isn’t just about adjusting the temperature setting. Even with temperature set to zero, the seed fixed, and the model version pinned, you still get divergent answers in a typical high-throughput or hosted deployment. The root cause lies beneath the API layer. Floating-point addition is not associative, so results inside GPU kernels change with the order of reduction operations. Continuous batching rearranges the attention computation across different requests. Tensor-parallel communication relies on non-deterministic AllReduce operations across most cluster configurations. A September 2025 writeup from Thinking Machines Lab offers the clearest recent analysis of this phenomenon. It documents dozens of unique completions from identical greedy-decoded requests, and it also demonstrates that the drift can be corrected with batch-invariant kernels, though at a measurable cost in throughput. I revisit that idea near the end of this article.
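The reduction-order effect is easy to reproduce on a CPU. The sketch below is a toy illustration, not GPU code: it sums the same four numbers with two different groupings, the way a parallel reduction might split work differently from one run to the next. The function names and the input values are chosen for this example.

```python
def naive_sum(values):
    """Left-to-right accumulation, like a single sequential reduction.

    A manual loop is used on purpose: the built-in sum() may apply
    compensated summation on newer Python versions and hide the effect.
    """
    total = 0.0
    for v in values:
        total += v
    return total


def chunked_sum(values, chunk_size):
    """Sum fixed-size chunks first, then combine the partial sums.

    This mimics a parallel reduction whose grouping depends on how
    the work happens to be split.
    """
    partials = [naive_sum(values[i:i + chunk_size])
                for i in range(0, len(values), chunk_size)]
    return naive_sum(partials)


values = [1e16, 1.0, -1e16, 1.0]                # exact mathematical sum is 2.0
sequential = chunked_sum(values, chunk_size=4)  # one chunk  -> 1.0
split = chunked_sum(values, chunk_size=2)       # two chunks -> 0.0

# The textbook three-term case shows the same thing at small magnitudes.
left_first = (0.1 + 0.2) + 0.3
right_first = 0.1 + (0.2 + 0.3)
```

Neither grouping returns the exact answer, and they disagree with each other. Once the grouping depends on batch composition or device layout, identical inputs can produce different logits.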
For a fraud-scoring model operating under regulatory oversight, this is the core problem. When an examiner demands to know why a particular transaction was declined, the institution needs to deliver a reproducible and defensible record. A versioned tree model paired with a fixed feature vector gives validators a deterministic score, a traceable rule path, and TreeSHAP attribution values — a complete audit package that can be regenerated at any moment. A nondeterministic LLM output doesn’t offer anything comparable to hand back.
Where Agents Find Their Place: The Cold Path
If the hot, real-time pipeline is the natural home of deterministic tree models, then what about the cold path, the background processing that kicks in after a transaction gets flagged?
Evidence collection, case prioritization, narrative composition, Suspicious Activity Report submission, human review. Delays at this stage are measured in minutes or hours, not milliseconds. The requirements for deterministic behavior are less strict because a human gives final approval before any adverse action is taken. The financial considerations differ as well, because only one to five percent of all transactions ever reach this tier.
This is precisely the kind of work where agents excel.

The architecture I ultimately landed on consists of two physically distinct layers. The fast path is a streaming pipeline. It handles Kafka data ingestion, Flink feature enrichment from an online feature store, a gradient-boosted decision tree model that outputs a probability score along with TreeSHAP explanations, and a rules engine that translates the score and associated reason codes into one of three outcomes: approve, decline, or challenge. Every single transaction flows through this layer. Every decision it makes is deterministic, auditable, and mathematically reproducible.
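The last stage of that fast path, the rules engine, can be sketched as a pure function from score and reason codes to one of the three outcomes. The thresholds, the `CARD_REPORTED_STOLEN` hard-decline code, and the `GEO_MISMATCH` reason code below are hypothetical placeholders, not values from the benchmark.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Decision:
    outcome: str                   # "approve", "decline", or "challenge"
    reason_codes: Tuple[str, ...]  # sorted so the record is reproducible


# Hypothetical policy values; a real deployment would tune and version these.
DECLINE_AT = 0.90
CHALLENGE_AT = 0.40
HARD_DECLINE_CODES = frozenset({"CARD_REPORTED_STOLEN"})


def decide(score: float, reason_codes: List[str]) -> Decision:
    """Map a fraud probability and reason codes to an outcome."""
    codes = tuple(sorted(reason_codes))
    if HARD_DECLINE_CODES.intersection(codes) or score >= DECLINE_AT:
        return Decision("decline", codes)
    if score >= CHALLENGE_AT:
        return Decision("challenge", codes)
    return Decision("approve", codes)


approved = decide(0.05, [])
challenged = decide(0.55, ["GEO_MISMATCH"])
declined = decide(0.10, ["CARD_REPORTED_STOLEN"])
```

Because the function has no hidden state, replaying a stored score and its reason codes always yields the same outcome, which is the property the audit trail depends on.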
Transactions that fall into the challenge bucket are routed over to the cold path through a queue. That is where the agents come in. A supervisor agent picks up the alert and assigns specialized roles. A geographic analyst retrieves device and IP history through an MCP-typed tool. A temporal analyst retrieves the account’s velocity baseline. An external-intelligence analyst taps into a consortium risk data feed. A drafter assembles a SAR-compliant narrative structured around FinCEN’s Who-What-Where-When-Why-and-How format. An adversarial judge cross-checks every assertion in the draft against the raw evidence record before a human ever lays eyes on it. A human operator then gives final sign-off.
In a production system, each agent is an LLM invocation, each MCP tool is a typed JSON-RPC client connected to a real backend, and the judge pass generates its own independent audit trail. That trail serves as documented, verifiable proof of every assertion — exactly the type of validation artifact that model-risk guidance requires. The benchmark repository ships a sketch of this entire orchestration using only the standard library, roughly 200 lines of code, so the overall structure is understandable without needing a full LangGraph runtime.
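For orientation, the supervisor-with-specialists shape can be reduced to a few lines of standard-library Python. This is not the repository's orchestration code: the specialist functions are stubs standing in for MCP tool calls, and the stubbed values, the `baseline_txn_per_hour` key, and the alert shape are assumptions for the example.

```python
from typing import Callable, Dict, List, Tuple

Alert = Dict[str, object]
Evidence = Dict[str, Dict[str, object]]


def geo_analyst(alert: Alert) -> Dict[str, object]:
    """Stub for a device and IP history lookup via an MCP-typed tool."""
    return {"distance_km": 7843.0}


def temporal_analyst(alert: Alert) -> Dict[str, object]:
    """Stub for the account's velocity baseline."""
    return {"baseline_txn_per_hour": 1.2}


def external_analyst(alert: Alert) -> Dict[str, object]:
    """Stub for a consortium risk feed."""
    return {"consortium_risk": 0.91}


SPECIALISTS: Dict[str, Callable[[Alert], Dict[str, object]]] = {
    "geo": geo_analyst,
    "temporal": temporal_analyst,
    "external": external_analyst,
}


def supervise(alert: Alert) -> Tuple[Evidence, List[str]]:
    """Run each specialist and file its findings under its own namespace."""
    ledger: Evidence = {}
    log: List[str] = []
    for namespace, run in SPECIALISTS.items():
        ledger[namespace] = run(alert)
        log.append(f"{namespace}: collected {sorted(ledger[namespace])}")
    return ledger, log


ledger, log = supervise({"alert_id": "A-1"})
```

Namespacing the ledger is what lets later stages refer to evidence with dot-separated keys such as `geo.distance_km`.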
The Judge: Detecting Fabricated Content Before a Human Ever Sees It
The most critical agent in the cold path is not the drafter. It is the judge.
```python
# scripts/cold_path_demo.py (abridged)
def judge(draft, evidence, alert):
    issues = []
    evidence_dict = evidence.as_dict()
    for claim in draft.claims:
        resolved = _resolve(evidence_dict, claim.source_key)
        if resolved is None:
            issues.append(f"unresolved source_key {claim.source_key!r}")
            continue
        if not _claim_cites_value(claim.text, resolved):
            issues.append(f"claim does not cite {resolved!r} from {claim.source_key!r}")
    return JudgeVerdict(approved=len(issues) == 0, issues=issues)
```

The drafter generates a narrative alongside a collection of structured Claim objects. Each claim is tagged with a dot-separated source_key, such as geo.distance_km or external.consortium_risk, that maps back to a specific entry in the evidence ledger the supervisor compiled. The judge iterates through every claim, retrieves the corresponding value, and rejects the submission if either of two conditions holds. Either the source_key references evidence that was never actually gathered, or the narrative text fails to genuinely cite the value it purports to draw from it.
The benchmark’s test suite deliberately introduces two types of hallucination and confirms the judge catches both. The first is an unresolvable source key. A claim points to external.offshore_bank_flag when no such field exists anywhere in the evidence dictionary. The second is value drift. A claim’s source_key resolves correctly, but the text invents a figure — stating “99 km apart” when the actual resolved value is 7,843 km. Both get blocked. The deliberation log exchanged between the drafter and the judge itself becomes retrievable evidence of an independent review, which is precisely what model-risk examiners look for.
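The two helper checks behind those test cases might look roughly like the following. This is a guess at their shape rather than the repository's implementation: `resolve`, `claim_cites_value`, `review`, the number-matching regex, and the sample evidence are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass
from typing import Any, List, Mapping, Optional


@dataclass(frozen=True)
class Claim:
    text: str
    source_key: str


def resolve(evidence: Mapping[str, Any], source_key: str) -> Optional[Any]:
    """Follow a dot-separated key through nested mappings; None if absent."""
    node: Any = evidence
    for part in source_key.split("."):
        if not isinstance(node, Mapping) or part not in node:
            return None
        node = node[part]
    return node


_NUMBER = re.compile(r"\d[\d,]*(?:\.\d+)?")


def claim_cites_value(text: str, value: Any) -> bool:
    """Numeric values must appear as a number in the text; others as a substring."""
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        cited = [float(tok.replace(",", "")) for tok in _NUMBER.findall(text)]
        return any(abs(num - float(value)) < 1e-9 for num in cited)
    return str(value).lower() in text.lower()


def review(claims: List[Claim], evidence: Mapping[str, Any]) -> List[str]:
    """Return one issue per claim that fails either check."""
    issues = []
    for claim in claims:
        resolved = resolve(evidence, claim.source_key)
        if resolved is None:
            issues.append(f"unresolved source_key {claim.source_key!r}")
        elif not claim_cites_value(claim.text, resolved):
            issues.append(f"claim does not cite {resolved!r} from {claim.source_key!r}")
    return issues


evidence = {"geo": {"distance_km": 7843.0}, "external": {"consortium_risk": 0.91}}
grounded = Claim("The two logins were 7,843 km apart.", "geo.distance_km")
missing_key = Claim("An offshore bank flagged the account.", "external.offshore_bank_flag")
value_drift = Claim("The two logins were 99 km apart.", "geo.distance_km")
```

Matching parsed numbers instead of raw substrings avoids accepting "1990" as a citation of 99, though a production judge would also need units and tolerance rules.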
This is the agent-as-a-judge pattern adapted for a regulated environment. The approach is broadly applicable and functions for any cold-path agent that must produce structured output an auditor might later examine. It is especially critical here because the alternative is asking analysts to manually verify every line of every LLM-drafted SAR narrative. At present, SAR drafting consumes hours to days of analyst effort per case. A judge-validated agent pipeline reduces that dramatically, and the judge is the component that makes such compression trustworthy.
Where My Initial Thinking Was Off
When I first built the benchmark, I assumed the primary argument against using agents on the hot path would center on cost. Latency and reproducibility turned out to be the more fundamental obstacles, and they are more fundamental because they do not shift the way cost does.
Cost is a figure you can move. In 2022, a million input tokens on a leading-edge model ran about $30. By 2026, on a comparable frontier model, the price is roughly $4. Another two orders of magnitude of cost reduction is plausible before the decade ends. The gap in the benchmark would narrow. It would not vanish, because the batch=1 constraint eliminates most GPU economies of scale, but it would narrow.
Latency is tougher to shift but not unchangeable. Speculative decoding, Medusa-style parallel heads, mixture-of-experts pruning, and purpose-built inference accelerators all eat away at time-to-first-token. A dedicated chip running a compact distilled fraud-detection model in 30 milliseconds is conceivable within a couple of years.
Reproducibility is the hardest of the three to achieve in mainstream high-throughput inference settings. It is rooted in how GPU arithmetic operates at the hardware level and in the software layers built on top of it. Research from the Thinking Machines Lab demonstrates that it can be addressed through deterministic kernels, fixed batch ordering, and restricted collective operations. The fixes impose a real throughput penalty, no hosted API provider ships them as defaults, and running them on-premises yourself chips away a meaningful portion of the computational efficiency that justified the GPU purchase in the first place.
The regulatory landscape is more nuanced than a blanket prohibition. When the US interagency model-risk guidance was revised in April 2026 (SR 26-2), it explicitly placed generative and agentic AI outside its scope on the grounds that these technologies are “novel and developing rapidly.” That is not permission. It means there is no established supervisory framework for validating a non-deterministic model when it influences outcomes that affect customers. A financial institution that deploys an LLM on the authorization hot path is operating ahead of its own compliance examiners, yet still owes those examiners the same transparency a tree model provides and an LLM cannot. Explain why this transaction was declined. Reproduce this exact score. Produce the validation records. The EU AI Act pushes in a similar direction and classifies credit scoring as a high-risk use of AI, while carving out a specific exception for fraud detection. The common thread across both regulatory regimes is reproducible, independently verifiable model behavior.
My prediction, for what it is worth. Latency and cost will continue improving for LLM inference, and the case against using LLMs on the authorization path for those reasons will weaken over time. The reproducibility concern, however, is the enduring one. Placing a non-deterministic scorer in front of a customer-impacting decision within a regulated workflow is very difficult to justify. Not because a single rule forbids it, but because the entire regime — SR 26-2, the EU AI Act, and the auditing infrastructure now taking shape — is converging on a principle that makes irreproducible model behavior steadily harder to defend.
The regulatory guidance will continue expanding to address generative and agentic systems, but reproducibility and independent validation will remain the core litmus test, and that is precisely what a non-deterministic model cannot provide.
How to Approach This Decision
Retain your deterministic scorer on the critical path. An XGBoost, LightGBM, or CatBoost model trained on tabular features and deployed through an online feature store will serve you well. Track your p99 latency against a strict budget. If latency is a bottleneck, prioritize investing in ONNX Runtime or a C++ inference layer before exploring anything else.
Divert edge cases to a cold processing path. Treat the queuing mechanism as a core architectural component rather than a bolt-on afterthought. Plan on roughly one to five percent of authorizations landing there.
Design the cold path around agents from the start. A supervisor-with-specialists pattern using MCP-based tools gives you modular, composable evidence-gathering capabilities. Insert an agent-as-judge review stage before any case escalates to a human analyst.
Prioritize SAR narrative generation as your highest-value first deployment. It recovers hours of analyst effort per case, the output format is clearly defined, and regulators have spelled out explicit criteria for what constitutes an acceptable result.
Never connect cold-path agents directly to the hot-path decision engine. The challenge flag should function as a queue entry, not a synchronous callback. Maintain strict physical separation at the authorization layer.
Instrument the judge review stage thoroughly. The deliberation logs serve as auditable evidence of independent review, and the cost of storing them is negligible.
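The queue-entry advice above can be shown in miniature with the standard library. An in-process `queue.Queue` stands in for whatever durable queue a real deployment would use, and the function names and alert shape are hypothetical.

```python
import queue
from typing import Dict, List

# Stand-in for a durable queue between the two tiers (assumption for the sketch).
cold_path_queue: "queue.Queue[Dict[str, str]]" = queue.Queue()


def authorize(txn_id: str, outcome: str) -> str:
    """Hot path: return the decision immediately and never wait on the cold path."""
    if outcome == "challenge":
        cold_path_queue.put_nowait({"txn_id": txn_id, "outcome": outcome})
    return outcome


def drain_alerts(max_items: int = 100) -> List[Dict[str, str]]:
    """Cold-path worker: pull alerts on its own schedule."""
    alerts: List[Dict[str, str]] = []
    while len(alerts) < max_items:
        try:
            alerts.append(cold_path_queue.get_nowait())
        except queue.Empty:
            break
    return alerts


first = authorize("t1", "approve")
second = authorize("t2", "challenge")
pending = drain_alerts()
```

The hot path only ever writes to the queue and the agents only ever read from it, so a slow or failed investigation cannot stall an authorization.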
If you’d like to reproduce the analysis from this article, the code is available at github.com/sandeepmb/fraud-agents-benchmark. Two commands — python scripts/run_benchmark.py and python scripts/generate_diagrams.py — regenerate every chart on an ordinary laptop in under a minute. A working sketch of the cold-path orchestration can be found in scripts/cold_path_demo.py. Sixty-four tests cover the data generator, the fast scorer, the benchmark harness, the visualizations, and the judge. On your own machine, the performance gap should look comparable.
All figures are the author’s own. The four data plots are produced by scripts/generate_diagrams.py in the linked benchmark repository; the architecture diagram was created by the author in Figma.



