In my work with enterprise AI teams, I’ve noticed a recurring pattern: when things go wrong, the model gets blamed almost immediately. While this reaction is understandable, it’s often misguided—and it can end up costing organizations significant time and resources.
Here’s how it typically plays out. Outputs start looking inconsistent, and the moment someone flags it, the instinct is to point the finger at the model. The proposed solution? More training data, another round of fine-tuning, or switching to a different base model. Weeks of effort later, the problem persists—barely improved, if at all. Meanwhile, the actual culprit, often lurking in the retrieval layer, the context window, or the task routing logic, never gets investigated.
I’ve witnessed this pattern repeat so many times that I felt it was worth putting into words.
Fine-tuning is valuable, but it’s overused
To be clear, fine-tuning absolutely has its place. When you need domain adaptation, tone alignment, or safety calibration, it should be part of your workflow. I’m not arguing against using it altogether.
The issue is that it’s become the default response to every problem, even when it’s not the right tool for the job. Part of the reason is that it feels productive. You kick off a fine-tuning job, something tangible happens, and there’s a clear before-and-after. It gives the illusion that you’re making progress when, in reality, you might not be addressing the root cause at all.
I recall watching a team debug a contract analysis system. The outputs were unreliable on complex documents, and the initial assumption was that the model simply lacked legal reasoning ability. They went through multiple tuning iterations with no improvement. Eventually, someone discovered that the retrieval layer was pulling the same documents repeatedly and stuffing them into the context window. The model was trying to reason through a mountain of redundant, low-value text. Once they fixed the retrieval ranking and added context compression, performance improved dramatically.
The model itself was never the problem. And this kind of story is far more common than most people realize.
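To make the retrieval fix in that story concrete, here is a minimal sketch of deduplicating retrieved chunks and capping how many reach the context window. The `Chunk` type, the `normalize` helper, and the `max_chunks` parameter are all hypothetical names chosen for illustration; the team's actual fix is not documented here, and a real system would likely use fuzzier duplicate detection than exact text matching.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    doc_id: str
    text: str
    score: float  # relevance score from the ranker; higher is better (assumed)


def normalize(text: str) -> str:
    """Collapse case and whitespace so trivially different copies compare equal."""
    return " ".join(text.lower().split())


def dedupe_and_cap(chunks: list[Chunk], max_chunks: int) -> list[Chunk]:
    """Keep the best-scoring copy of each distinct text, best first, up to max_chunks."""
    seen: set[str] = set()
    kept: list[Chunk] = []
    for chunk in sorted(chunks, key=lambda c: c.score, reverse=True):
        if len(kept) >= max_chunks:
            break
        key = normalize(chunk.text)
        if key in seen:
            continue  # a higher-scoring copy of this text is already kept
        seen.add(key)
        kept.append(chunk)
    return kept
```

Even a filter this simple changes what the model sees: the context carries each passage once, in ranked order, instead of several copies of the same text.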
What’s happening at inference time
For a long time, inference was treated as a passive step—you simply ran the model and collected the output. All the meaningful decisions were made during training. That’s starting to change.
One driver is that some newer models spend more compute at generation time, often called test-time compute, rather than relying solely on what was baked in during training. Another is research showing that behaviors like self-checking or rewriting a response can be learned through reinforcement learning. Both developments point to inference itself as a lever for improving performance.
What I’m seeing now is engineering teams beginning to treat inference as something you can actively design around, rather than a fixed step you just accept. How much reasoning depth does this particular task require? How is memory being handled? How is retrieval being prioritized? These are becoming deliberate design questions instead of afterthoughts.
The resource allocation problem
One thing that’s consistently overlooked is that most AI systems apply the same uniform process to every query. A simple account status question goes through the exact same pipeline as a multi-step compliance workflow that requires reconciling information across multiple conflicting documents. Same cost, same process, same compute.
That doesn’t hold up under scrutiny. In any other engineering context, you’d allocate resources based on the complexity of the work. Some teams are starting to do exactly this with AI: routing simpler queries to cheaper, faster models and reserving heavier compute for tasks that genuinely need it. The economics improve, and the quality of the harder tasks gets better too, since they’re no longer starved of resources.
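A rough sketch of what that routing decision can look like follows. Everything in it is an assumption made for illustration: the `Query` fields, the two tiers, and the thresholds are hypothetical, and a production router would more likely use a trained classifier or signals from the application than hand-written rules.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    LIGHT = "light"  # small, fast model with a short context
    HEAVY = "heavy"  # larger model, more retrieval, more reasoning budget


@dataclass(frozen=True)
class Query:
    text: str
    num_documents: int          # how many source documents the task must consult
    needs_reconciliation: bool  # whether the sources may conflict


def choose_tier(query: Query, max_light_docs: int = 1, max_light_words: int = 40) -> Tier:
    """Send a query to the heavy tier if any complexity signal fires."""
    if query.needs_reconciliation:
        return Tier.HEAVY
    if query.num_documents > max_light_docs:
        return Tier.HEAVY
    if len(query.text.split()) > max_light_words:
        return Tier.HEAVY
    return Tier.LIGHT
```

With these rules, an account status question touching one record lands on the light tier, while a compliance workflow spanning conflicting documents goes to the heavy one.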
These systems are more layered than people realize
When you peel back the layers of a production AI system today, it’s rarely just a single model answering questions. There’s typically a retrieval step, a ranking step, possibly a verification step, and a summarization step—all working in sequence to produce the final output. Success depends not just on the underlying model’s capability, but on how well all those components work together.
If the retrieval ranker isn’t properly calibrated, the outputs will look like model errors. A context window that grows without limits will quietly degrade reasoning quality without anything obviously breaking. These are systems-level issues, not model-level issues, and they demand systems-level thinking to resolve.
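One way to practice that systems-level thinking is to attribute each failed answer to a pipeline stage before touching the model. The sketch below assumes you have a small labeled evaluation set recording which documents a correct answer depends on; the `EvalCase` fields and the stage labels are hypothetical names, not part of any particular framework.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    relevant_ids: frozenset  # documents a correct answer depends on (labeled by hand)
    retrieved_ids: tuple     # ranked output of the retrieval step
    answer_correct: bool     # graded result of the final answer


def attribute(case: EvalCase, k: int = 5) -> str:
    """Label a case 'ok', 'retrieval_miss', or 'downstream'."""
    if case.answer_correct:
        return "ok"
    top_k = set(case.retrieved_ids[:k])
    if not (case.relevant_ids & top_k):
        # No needed document reached the context, so the model never had a chance.
        return "retrieval_miss"
    # The evidence was present; look at ranking noise, context size, or the model.
    return "downstream"


def summarize(cases: list[EvalCase], k: int = 5) -> Counter:
    """Count how many cases fall into each label."""
    return Counter(attribute(case, k) for case in cases)
```

If most failures come back as `retrieval_miss`, another round of fine-tuning is unlikely to help, because the model is being asked to reason over the wrong evidence.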
A practical example of this approach is speculative decoding. The idea is straightforward: a smaller draft model proposes several tokens ahead, and a larger model verifies them, keeping the ones it agrees with. It started as a way to reduce latency, and in its standard form it does not change what the larger model would have produced, but it illustrates a broader point: work can be distributed across multiple components instead of expecting one model to handle everything. Two teams using the same base model but different inference architectures can end up with very different results in production.
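The draft-and-verify loop can be sketched in its simplest greedy form. This is a toy under stated assumptions: each "model" is a plain function from a token sequence to the next token, `target_model` and `draft_model` are made-up stand-ins, and verification is written as a loop where a real system would score all drafted positions in one batched forward pass. It also omits the sampling-based acceptance rule and the bonus token used in fuller implementations.

```python
from typing import Callable, Sequence

NextToken = Callable[[Sequence[int]], int]


def greedy(model: NextToken, prompt: list[int], num_new: int) -> list[int]:
    """Baseline: plain greedy decoding with a single model."""
    tokens = list(prompt)
    for _ in range(num_new):
        tokens.append(model(tokens))
    return tokens


def speculative_greedy(target: NextToken, draft: NextToken,
                       prompt: list[int], num_new: int, k: int = 4) -> list[int]:
    """Greedy speculative decoding: the draft proposes, the target verifies."""
    tokens = list(prompt)
    goal = len(prompt) + num_new
    while len(tokens) < goal:
        # 1. The draft model proposes up to k tokens autoregressively.
        proposal: list[int] = []
        for _ in range(min(k, goal - len(tokens))):
            proposal.append(draft(tokens + proposal))
        # 2. The target checks each position. Matching tokens are accepted; at the
        #    first mismatch the target's own token is kept and the rest is dropped.
        accepted: list[int] = []
        for i, guess in enumerate(proposal):
            expected = target(tokens + proposal[:i])
            accepted.append(expected)
            if guess != expected:
                break
        tokens = tokens + accepted
    return tokens


def target_model(seq: Sequence[int]) -> int:
    """Toy stand-in for the large model (purely illustrative)."""
    return (31 * sum(seq) + len(seq)) % 50


def draft_model(seq: Sequence[int]) -> int:
    """Toy draft that agrees with the target except at every third position."""
    guess = target_model(seq)
    return guess if len(seq) % 3 else (guess + 1) % 50
```

Because every kept token is one the target itself would have chosen, `speculative_greedy(target_model, draft_model, prompt, n)` returns the same sequence as `greedy(target_model, prompt, n)`. The gain is in how the work is scheduled, not in what is generated.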

Memory is becoming a real issue
Larger context windows have been helpful, but beyond a certain threshold, more context doesn’t improve reasoning; it hurts it. Retrieval gets noisier, the model loses track of what matters, and inference costs climb. Teams running AI at scale are investing real effort into techniques like paged attention, which manages the memory used by the attention cache during serving, and context compression, which reduces how much text reaches the model in the first place. These aren’t glamorous topics, but they matter enormously in practice.
The goal is to have the right amount of context, well-managed—not too little, not too much.
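As a minimal illustration of a managed context, the sketch below keeps a system prompt plus as many of the most recent turns as fit within a fixed budget. The `count_tokens` helper is a crude whitespace-based stand-in for a real tokenizer, and both it and `fit_to_budget` are hypothetical names. Dropping the oldest turns is only one policy; summarizing them is another.

```python
def count_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())


def fit_to_budget(system_prompt: str, turns: list[str], budget: int) -> list[str]:
    """Return the system prompt plus the most recent turns that fit in `budget` tokens."""
    remaining = budget - count_tokens(system_prompt)
    if remaining < 0:
        raise ValueError("system prompt alone exceeds the token budget")
    kept: list[str] = []
    # Walk backwards from the newest turn and stop at the first one that does not fit.
    for turn in reversed(turns):
        cost = count_tokens(turn)
        if cost > remaining:
            break
        kept.append(turn)
        remaining -= cost
    return [system_prompt] + kept[::-1]  # restore chronological order
```

The point of a hard budget is that context growth becomes an explicit decision rather than something that happens silently as a conversation or document set gets longer.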
Takeaway
Model selection matters less than it used to. Strong foundation models are now available from multiple providers, and capability gaps have narrowed for most use cases. What’s actually determining whether a deployment succeeds is the infrastructure surrounding the model—how retrieval is tuned, how compute is allocated, and how the system handles edge cases over time.
The teams that will be well-positioned in the coming years are the ones treating inference architecture as something worth engineering with care, rather than assuming a good-enough model will take care of everything else. In my experience, it usually doesn’t.



