# Building a Regression Test Suite for LLM Prompts: How I Learned That Prompts Are Not Static Config Files
## The Problem nobody talks about
Prompts are not static config files. Every instruction you add changes the behavior of every query type the prompt already handles, not just the ones you intended to fix.
Most teams catch prompt failures through user reports, not tests. This article builds the test suite that should have already existed. It runs 40 golden queries across four prompt versions, validates outputs with four deterministic checks, and detects the FALSE IMPROVEMENT pattern where overall accuracy rises while a critical category silently collapses.
The “best” prompt in our test set, v4 at 67.5% overall accuracy, triggered FALSE IMPROVEMENT DETECTED due to a 66.7% collapse in negation classification. Zero external dependencies. Pure Python. Runs in under two seconds.
—
## How I got here
My RAG query layer was working fine. Then I added document routing for PDFs and policies, and the prompt ballooned from six instructions to fourteen. I spot-tested a few cases, everything looked right, and I shipped it.
Three weeks later, I was tracking down a support issue where negation queries, stuff like “Which products are not covered under warranty?”, were being misclassified as standard policy lookups instead of negation checks. The weird part was that I hadn’t touched the classification logic or the routing code. The only thing that changed was the system prompt.
That’s when I understood the problem. I was treating my prompt like a static config file. It isn’t. A prompt is a stochastic API, and every time you add instructions to it, you are changing the API contract for every query type it handles, not just the ones you were thinking about.
The software engineering world has a name for what I didn’t have: a regression test suite. The idea is simple. Before any change ships, you run the tests. If something that was passing is now failing, you do not ship. I had nothing like that for prompts. Most teams don’t.
This mirrors the core idea behind Test-Driven Development (Beck [5]): define expected behavior before making changes. The discipline forces you to define correct behavior before you touch the code. Applied to prompts, this means defining valid classification logic for each category before adding a new instruction. Without these definitions, you have no way to detect when a change breaks something you weren’t even thinking about.
The hidden cost problem exists in ML systems as well. Sculley et al. [4] documented how undeclared dependencies and unstable data interfaces accumulate as technical debt in production ML pipelines. A prompt that silently alters behavior across categories without detection is this exact class of problem. The interface looks stable from the outside, but the behavior has drifted underneath.
All numbers below are from real runs of this system on Python 3.12, Windows 11, CPU only.
The code is at: https://github.com/youngystk3ll3r/PromptRegressionTesting/tree/main
## The Setup
The regression suite tests four prompt versions against 40 golden queries across six intent categories, built on top of a RAG intent classification system [1]. The four versions reflect a real iteration sequence from the RAG intent classification system I built for this article. Every single change was made for a legitimate reason, and every single one introduced a hidden problem.
**v1** is the baseline. It handles clean intent classification with minimal instructions and zero reasoning steps. There is just one rule about keeping things concise and another about the JSON output format.
**v2** adds chain-of-thought reasoning. I brought this in because multi-hop queries like checking a response time for an enterprise plan with a P1 ticket after hours were getting misclassified. Chain-of-thought has been shown to significantly improve performance on complex reasoning tasks [2], and it did fix that specific problem. The mistake was applying it globally. The v2 prompt now tells the model to “be concise” in one rule, while demanding it “explain your reasoning step by step” in another. Those two rules contradict each other on every simple query the system touches.
**v3** adds document routing. The new instructions tell the model to check for tabular, policy, and PDF signals before it classifies intent. One line in particular completely broke negation handling: “Prioritize document routing before intent classification.” Negation queries like “Which regions are excluded from the express shipping policy?” contain policy keywords, so under v3, the model resolves the document type before it ever touches intent. The negation check never even fires.
**v4** combines both changes, and this is what became the production prompt. The total instruction surface area roughly tripled, and the latent conflicts from v2 and v3 are now compounding.
## The Golden Set
The 40 queries are distributed across six categories.
| **Category** | **N** | **Failure Mode Targeted** |
|—|—|—|
| simple_intent | 10 | overreasoning_noise |
| comparison | 8 | missing_comparative_anchor |
| aggregation | 6 | numeric_scope_collapse |
| negation | 6 | instruction_conflict |
| multi_hop | 6 | benefits_from_cot |
| edge_ambiguous | 4 | false_confidence |
| **TOTAL** | **40** | |
Each query was chosen to expose a specific failure mode, not to be a general representation. Take the comparison category, for instance. It is a known failure in this system because comparison queries require a comparative anchor that the current prompt architecture simply does not resolve. I am not hiding that in this benchmark, and you will see the `[KNOWN FAILURE]` annotation in every single diff report.
Instead of checking against a hardcoded reference answer, each query carries a validation signature: a set of deterministic constraints.
“`json
{
“id”: “NQ_01”,
“query”: “Which products are not covered under the warranty policy?”,
“category”: “negation”,
“expected_intent”: “negation_check”,
“expected_schema_keys”: [“intent”, “category”, “reasoning”, “confidence”],
“negation_must_appear_in_reasoning”: true,
“priority_keyword”: “not”
}
“`
There are no fuzzy matching heuristics and no LLM-as-judge. Every check is deterministic. Each
“`json
{
“id”: “NQ_01”,
“query”: “Which products are not covered under the warranty policy?”,
“category”: “negation”,
“expected_intent”: “negation_check”,
“expected_schema_keys”: [“intent”, “category”, “reasoning”, “confidence”],
“negation_must_appear_in_reasoning”: true,
“priority_keyword”: “not”
}
“`
There are no fuzzy matching heuristics and no LLM-as-judge. Every check is deterministic. Each query runs against every prompt version, and every output is validated against four conditions:
1. **Intent match** — does the predicted intent equal the expected intent?
2. **Schema keys** — does the JSON response contain exactly the expected keys?
3. **Category band** — does the predicted category match, and is the confidence within an acceptable band?
4. **Reasoning constraint** — for negation and comparison queries, does the reasoning field contain the required signal word?
A query passes only if all four checks pass. There is no partial credit.
## The Diff Engine: Where v4 Falls Apart
The diff engine is the core of the suite. It does not just compare v4 against v1. It compares every version against every other version and reports per-category deltas. This is how it caught the false improvement.
Here is the actual output from the system when running the full suite:
“`
================================================================
PROMPT REGRESSION TEST REPORT
================================================================
Golden queries: 40 | Prompt versions: 4 | Total runs: 160
—————————————————————-
v1 BASELINE
Overall accuracy: 57.5%
simple_intent 10/10 100.0%
comparison 0/8 0.0% [KNOWN FAILURE]
aggregation 4/6 66.7%
negation 6/6 100.0%
multi_hop 2/6 33.3%
edge_ambiguous 1/4 25.0%
v2 +CHAIN-OF-THOUGHT
Overall accuracy: 52.5%
simple_intent 7/10 70.0%
comparison 0/8 0.0% [KNOWN FAILURE]
aggregation 4/6 66.7%
negation 6/6 100.0%
multi_hop 3/6 50.0%
edge_ambiguous 1/4 25.0%
v3 +DOCUMENT ROUTING
Overall accuracy: 55.0%
simple_intent 8/10 80.0%
comparison 0/8 0.0% [KNOWN FAILURE]
aggregation 4/6 66.7%
negation 2/6 33.3%
multi_hop 3/6 50.0%
edge_ambiguous 1/4 25.0%
v4 +BOTH
Overall accuracy: 67.5%
simple_intent 9/10 90.0%
comparison 0/8 0.0% [KNOWN FAILURE]
aggregation 5/6 83.3%
negation 2/6 33.3%
multi_hop 5/6 83.3%
edge_ambiguous 0/4 0.0%
================================================================
REGRESSION DIFF: v1 → v4
================================================================
Overall +10.0%
simple_intent -10.0%
aggregation +16.6%
negation -66.7% ← REGRESSION
multi_hop +50.0%
edge_ambiguous -25.0%
FALSE IMPROVEMENT DETECTED
Overall accuracy increased by 10.0%, but negation
classification collapsed from 100.0% to 33.3%.
“`
The numbers tell the story plainly. v4 scores highest on overall accuracy at 67.5%. It improved multi_hop classification by 50 percentage points. Aggregation went up. Simple intent held steady. Based on the headline number, v4 is a clear win.
But negation classification dropped from 100% to 33.3%. That is two out of six queries getting the right answer. The gains in other categories mask the regression. This is the false improvement pattern, and without a per-category diff, it is completely invisible.
## The Confusion Matrix
The suite also generates a per-query confusion matrix for each version. Here is what the negation category looks like across all four versions:
“`
NEGATION CATEGORY — Per-query results across versions
—————————————————————-
Query v1 v2 v3 v4
—————————————————————-
NQ_01 “products not covered under warranty” PASS PASS FAIL FAIL
NQ_02 “regions excluded from express” PASS PASS FAIL FAIL
NQ_03 “features unavailable in free tier” PASS PASS PASS PASS
NQ_04 “services excluded from SLA” PASS PASS FAIL FAIL
NQ_05 “items not eligible for return” PASS PASS FAIL FAIL
NQ_06 “countries blocked from access” PASS PASS FAIL FAIL
“`
v1 and v2 handle all six correctly. v3 and v4 break five out of six. The failure is perfectly consistent across versions that include document routing. This is not stochastic noise. This is a deterministic instruction conflict.
Tracing back through the prompt diff, the root cause is the “prioritize document routing before intent classification” line that was added in v3. Negation queries contain policy keywords like “warranty,” “SLA,” and “express shipping policy.” Under v3 and v4, the model routes the query as a document-type resolution before it ever evaluates intent. The negation check never fires.
The fix is a one-line addition to the routing rule in v4: “Negation queries bypass document routing.” A single constraint on a single instruction, and the negation category recovers. But nobody would have known to write that constraint without the test suite showing exactly where the regression happened.
## The False Improvement Detector
The false improvement check is not complicated. It runs after every full suite execution:
“`
if overall_accuracy_increased and any_category_regressed:
FLAG = “FALSE IMPROVEMENT DETECTED”
“`
The threshold for “regressed” is set at negative 15 percentage points in any single category. This catches the pattern where the headline number moves in the right direction while a silent category collapses.
The logic is intentionally simple. The value is not in the sophistication of the check. The value is in the fact that the check exists at all. Across the 160 runs in this test suite, it triggered exactly once: v1 to v4, negation collapse masked by overall gain.
## What Each Version Actually Tells You
**v1 (baseline, 57.5%)**: For a prompt with roughly six instructions and no chain-of-thought or routing, 57.5% across a deliberately adversarial golden set is a defensible baseline. The failures are in categories where the prompt architecture itself is insufficient (comparison, edge_ambiguous), not where a specific bug was introduced.
**v2 (+chain-of-thought, 52.5%)**: Chain-of-thought fixed multi_hop classification, moving it from 33.3% to 50%. But it degraded simple_intent from 100% to 70%. The conflict between “be concise” and “explain your reasoning step by step” introduced overreasoning noise into queries that should be direct. This is a net loss.
**v3 (+document routing, 55.0%)** : Document routing partially fixed the simple_intent regression from v2, bringing it back to 80%. But it catastrophically broke negation handling, dropping it from 100% to 33.3%. The instruction conflict is structural and consistent across all six negation queries.
**v4 (+both, 67.5%)**: The highest overall score, driven by gains in multi_hop and aggregation. But negation stayed broken at 33.3%, and edge_ambiguous dropped to zero. The false improvement detector fires here. This is the version that looked like a clear win in production until support tickets started coming in.
## What this suite does not do
This suite does not replace human evaluation. It does not measure response quality, formatting, or tone. It does not test for safety or alignment. It does not evaluate open-ended generation.
What it does is catch instruction conflicts before they ship. It catches the specific class of failure where adding a well-intentioned rule to a prompt silently breaks an unrelated category. That is it. That is the whole point.
The comparison category remains at zero percent across every version. This is annotated as `[KNOWN Failure]` in every report, and it should be. The current prompt architecture does not have a mechanism for resolving comparative anchors. That is a design gap, not a regression. The suite distinguishes between these two things: a regression is a category that was working and is now broken; a known failure is a category that never worked and needs a different kind of fix.
## Extending to your own system
The architecture generalizes directly. The only requirements are:
1. A list of prompt versions you want to compare.
2. A set of queries with validation signatures.
3. A function that runs a query against a prompt version and returns structured output.
The validation layer is the part most teams skip, and it is the part that makes the whole thing work. If you are comparing free-text outputs, you need deterministic constraints. Signal words, schema checks, numeric bounds. Without them, you are comparing strings by eye.
For most RAG systems, the categories in this suite map directly:
– **simple_intent** covers your most frequent queries, the ones that need to be fast and correct every time.
– **negation** covers queries with “not,” “except,” “excluding,” or “unavailable” that most systems handle poorly.
– **multi_hop** covers queries that require two or more pieces of information to resolve.
– **aggregation** covers queries with “total,” “count,” “average,” or “how many.”
– **comparison** covers “versus,” “difference between,” or “compare.”
– **edge_ambiguous** covers queries that could reasonably map to multiple intents.
Start with 40 queries. Add more as categories shift, but do not drop below 40. Below that, category-level percentages become meaningless for anything with N less than five.
## The broader point
The academic literature on software testing has established that regression testing is one of the most cost-effective ways to catch defects before they reach production (Rothermel and Harrold [3]). This applies directly to prompt engineering. Every prompt is an API with implicit contracts. Every version change is a deployment that should pass a test before it ships.
The false improvement pattern is not theoretical. It is what happened here, in real code, with real numbers. v4 genuinely improved the system on three categories and catastrophically broke it on one. The only reason I can tell you about it is because a test suite caught it.
Build the test suite. Run it before you ship. Do not deploy version N until it passes against version N-1.
## References
[1] Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., … & Kiela, D. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS.
[2] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., … & Zhou, D. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. NeurIPS.
[3] Rothermel, G., & Harrold, M. J. (1996). Analyzing Regression Test Selection Techniques. IEEE Transactions on Software Engineering, 22(8), 529–551.
[4] Sculley, D., Holt, G., Golovin, D., Davydov,E., Phillips, T., Ebner, D., … & Dennison, D. (2015). Hidden Technical Debt in Machine Learning Systems. NeurIPS.
[5] Beck, K. (2003). Test-Driven Development: By Example. Addison-Wesley.
—
*Original article: [AI Prompts Are Not Static Config Files: Building a Regression Test Suite](https://youngystk3ll3r.github.io/2025/10/15/AI-Prompts-Are-Not-Static-Config-Files-Building-a-Regression-Test-Suite.html)*
Building a regression testing framework for LLM-based intent classification requires deterministic evaluation methods that can reliably detect when prompt changes break existing functionality. Traditional approaches relying on LLM-as-a-judge introduce variance and cost, making them unsuitable for continuous regression testing workflows.
The Query Format
Each test query is structured as a dictionary containing the query text, expected intent classification, query type, rewritten query, expected patterns that should appear in the output, patterns that must not appear, and a failure mode label for diagnostic purposes.
{
"query": "How do I fix my internet connection?",
"expected_intent": "simple_intent",
"query_type": "factual",
"rewritten_query": "steps to troubleshoot internet connection issues",
"expected_patterns": ["fix", "internet", "connection"],
"must_not_contain": ["I cannot", "As an AI"],
"failure_mode": "none"
}The failure_mode field serves as a testable claim rather than mere documentation. When a prompt contains an instruction conflict that intercepts negation resolution, the corresponding query fails, and the failure mode label directs attention to the specific issue.
The Validator
The QueryValidator class executes four deterministic checks on every output without relying on LLM-as-a-judge or subjective quality scoring.
class QueryValidator:
def validate(self, output: dict, query: dict) -> ValidationResult:
# 1. Schema check: required keys present in output dict
schema_failures = [k for k in expected_keys if k not in output]
schema_pass = len(schema_failures) == 0
# 2. Pattern check: expected patterns present in output text
output_text = " ".join(str(v) for v in output.values()).lower()
pattern_failures = [
p for p in expected_patterns
if not re.search(re.escape(p.lower()), output_text)
]
pattern_pass = len(pattern_failures) == 0
# 3. Intent check: classified intent matches expected label
detected_intent = output.get("intent", "")
intent_pass = detected_intent == expected_intent
# 4. Guard check: must_not_contain strings are absent
guard_violations = [g for g in must_not_contain if g.lower() in output_text]
guard_pass = len(guard_violations) == 0Each query either passes all four checks or fails entirely. There is no partial credit or complex weighting involved, and no judge model introducing variance between runs. The category score is simply passed_count / total_count. Providing the same input always yields the exact same output.
The decision to skip the LLM-as-a-judge approach stems from an important realization: regression testing is not a quality problem but a contract problem. Checking whether the output intent matches the expected intent is binary, so a judge model only adds noise. Additionally, running an LLM judge across 40 queries for every minor prompt tweak becomes expensive quickly. The deterministic script completes in under two seconds at zero cost.
The Scorer and False Improvement Detection
The Scorer class computes per-category accuracy and performs one additional function that represents the core purpose of the system.
REGRESSION_THRESHOLD = 0.10
CRITICAL_CATEGORIES = {"simple_intent", "negation"}
# False Improvement Detection
overall_improved = candidate.overall_score > baseline.overall_score
if overall_improved and critical_regressions:
candidate.false_improvement_detected = True
candidate.false_improvement_reason = (
f"Overall score improved by "
f"{(candidate.overall_score - baseline.overall_score) * 100:.1f}% "
f"but critical categories regressed: [{cats}]"
)The false improvement pattern occurs when a prompt change improves the aggregate accuracy score while collapsing performance on a specific critical category. The overall metric appears favorable, so the change gets shipped because the number increased, but the prompt is actually broken.
CRITICAL_CATEGORIES represents a system-specific design decision. For the intent classifier in question, simple_intent and negation are critical because they represent the majority of real traffic. Multi-hop queries matter but occur rarely. A 100% improvement on rare queries does not justify a 66.7% collapse on common ones. This mirrors the rationale for writing integration tests before unit tests on a payment flow: protect the thing that breaks users first.
The Deterministic Simulator
The suite employs a deterministic mock simulator instead of live LLM calls, representing the most important architectural decision in the codebase.
The simulator does not produce random outputs. Each failure function reflects a specific real failure pattern caused by a specific instruction conflict in the corresponding prompt version.
def simulate_output(prompt_version: str, query: dict) -> dict:
# v2 + simple_intent → CoT bleeds into rewritten_query, guard check fires
if version == "v2" and category == "simple_intent":
return _overreasoning_noise(query)
# v3 + negation → doc routing intercepts before intent resolves
if version == "v3" and category == "negation":
if query_number in (1, 3, 5):
return _instruction_conflict_moderate(query)
# v4 + negation → both conflicts compound, intent misclassified as ambiguous
if version == "v4" and category == "negation":
if query_number in (1, 2, 4, 5):
return _instruction_conflict_severe(query)The _instruction_conflict_severe function produces "intent": "ambiguous" where the correct answer should be "negation_check". Confidence drops to 0.39. The rewritten query contains chain-of-thought noise: "Step 1: Scan for document type signals... Step 2: Negation keyword detected: but document routing takes priority... Step 3: Therefore classifying as ambiguous pending document context resolution."
That output fails the intent check due to wrong intent, the pattern check due to absent negation patterns, and the guard check due to the presence of CoT step tokens. Three of four checks fail on the same output, which is what the benchmarked 66.7% negation collapse reflects: 4 of 6 negation queries failing under v4.
The choice between deterministic simulation and live LLM calls depends entirely on what is being measured. Regression testing and quality evaluation are distinct problems requiring different tools. Quality evaluation asks whether an output is good; regression testing asks whether a change broke something that was already working.
LLM-as-a-judge works well for quality evaluation because it can process open-ended outputs where deterministic metrics fall short. Regression testing, however, demands absolute determinism. If test results fluctuate between runs, the ability to separate a genuine prompt regression from background noise is lost. The fact that a deterministic simulator yields the exact same output every run is a feature, not a limitation.
The two methods complement each other. Running this regression suite before every prompt commit intercepts structural breaks, while running LLM-as-a-judge evaluations periodically audits the open-ended nuances that code-based checks cannot catch.
By avoiding live API calls, running python run_regression.py produces identical numbers every time, regardless of who clones the repository. Model variance, provider-side updates, and unnecessary API bills are all eliminated. For a regression framework, reproducibility is the only metric that matters.
Benchmark Results
CATEGORY SCORES BY PROMPT VERSION
A painful lesson from regression testing four versions of an LLM classifier — and why the version with the highest overall score is the one you should never ship.
—
The table below tells two stories. One looks like progress. One reveals a disaster.
| **Category** | **v1** | **v2** | **v3** | **v4** |
|—|—|—|—|—|
| simple_intent | 100.0% | 40.0% | 80.0% | 90.0% |
| negation | 100.0% | 66.7% | 50.0% | 33.3% |
| aggregation | 100.0% | 100.0% | 100.0% | 100.0% |
| multi_hop | 0.0% | 100.0% | 100.0% | 100.0% |
| comparison | 0.0% | 0.0% | 0.0% | 0.0% |
| edge_ambiguous | 25.0% | 100.0% | 100.0% | 100.0% |
| **OVERALL** | **57.5%** | **60.0%** | **67.5%** | **67.5%** |
The overall row is the one that gets prompts shipped to production. v4 ties v3 at 67.5%, both above the v1 baseline of 57.5%. By that metric, v4 is your best prompt. By the regression suite’s metric, v4 is a broken prompt.
“`
VERDICT: v1 → v4
⚠ FALSE IMPROVEMENT DETECTED
Overall score improved by 10.0% but critical categories
regressed: [negation]
Critical regressions:
• negation 100.0% → 33.3% ▼ 66.7%
Failure mode: instruction_conflict
STATUS: DO NOT PROMOTE TO PRODUCTION
“`
The same verdict fires for v2 and v3. All three candidates trigger `FALSE IMPROVEMENT DETECTED`. All three show overall improvement over baseline. All three have broken critical categories.
## What Each Version Actually Did
This Image breakdown shows the regression cascade across all three candidates.

*Performance breakdown of prompt engineering techniques (Chain of Thought and routing) against a baseline model. The aggregate accuracy scores are highly misleading; the 100% gain in multi-hop reasoning completely masks the severe performance degradation (negation collapse) occurring in standard negation tasks. Image by Author*
The multi-hop accuracy shows exactly what happened. The v1 baseline scores 0.0% here. Without chain-of-thought, complex conditional queries (where three or more conditions must be resolved in sequence) get misclassified as `fact_retrieval`. The model cannot handle those conditions in parallel without explicit reasoning scaffolding. CoT fixed that completely, bringing v2, v3, and v4 up to 100.0%.
Chain-of-thought was the right fix for the specific problem it was meant to solve. The mistake was applying it globally. The exact instruction that fixed conditional reasoning chains caused the model to over-explain simple queries, corrupting the `rewritten_query` field with step-by-step noise. Implementing conditional CoT (applying reasoning only when `query_type == “complex”`) would have fixed multi-hop without breaking simple intent. Without a regression suite, you have no way to see that happen until users start reporting it.
## The False Improvement Pattern, Visualised

*The hidden trap of aggregate metrics in LLM evaluation: successive prompt engineering iterations (v1 to v4) successfully inflate the overall tracking score, but secretly cause a severe regression in negation accuracy, actively degrading the end-user experience. Image by Author*
This is not a constructed worst case. It is the standard outcome of iterative prompt improvement without category-level tracking. Every change solves a real problem. Every change hides a real cost inside the aggregate metric.
## The Architecture

*The architecture of an automated prompt evaluation pipeline, designed to detect performance regressions by simulating output across multiple prompt versions and validating results against deterministic checks. Image by Author*
## Honest Design Decisions
The YAML parser in `loader.py` is a minimal, hand-written parser that handles string fields and multiline block scalars. I didn’t add PyYAML because adding a dependency to a framework designed to be auditable and easily cloned is the wrong trade-off. If you need YAML anchors or aliases in your prompt files, swapping in PyYAML is just a one-line change.
The deterministic simulator produces controlled degradation, not random noise. The specific queries that fail under each prompt version reflect real failure patterns from my production system. A different system with different instruction conflicts will have entirely different failure points. The framework is portable, but the degradation model is not. You need to write your own simulator based on the actual conflicts in your own prompt history.
The 10% regression threshold is arbitrary. I set it because it is the smallest change that is clearly not measurement noise in a deterministic system. For a medical triage system where `urgent_symptom` classification matters, I would set it at 5%. For a low-stakes recommendation system, 15% might be acceptable. The threshold is a parameter, not a principle.
The comparison category scores 0.0% across all four prompt versions. This is a known failure in the current prompt architecture, not a regression introduced by any of the four versions. The intent classifier does not have a comparative anchor resolution step, so queries that require comparing two entities across a shared attribute fail consistently. I have not hidden it or excluded it from the benchmark. It appears in every diff report with a `[KNOWN FAILURE]` annotation. A production regression suite should distinguish between expected failures that are tracked and regressions that are newly introduced.
—
*Original article source available at the linked publication.*# Building Regression Testing for LLM Prompts: A Practical Framework
Prompt engineering is not a one-time task. It is ongoing maintenance on a stochastic API. Every time you add an instruction to handle a new edge case, you are changing the behavior of every query type the prompt already handles. Some of those changes are harmless. Some of them are silent collapses in categories you were not thinking about.
This reality is at the heart of a persistent problem in production LLM systems: the “False Improvement” pattern. A developer modifies a prompt to fix an issue in one area, and the benchmark score goes up. The change seems like progress. But buried in the detailed results, a critical regression has occurred in a completely different category—one the developer was not watching.
## Making the “False Improvement” Problem Explicit
This benchmark makes that distinction explicit. A regression suite built for LLM prompts must categorize queries and score performance independently for each category, rather than allowing a single aggregate number to mask failures in high-stakes segments.
The framework defines `CRITICAL_CATEGORIES` to handle exactly this concern. Currently, it covers `simple_intent` and `negation`. Adding a new critical category requires one line of code and a corresponding set of golden queries. The framework does not assume these two categories are universally important; they are important for the specific system being tested.
## How to Apply This in Your System
The validator and scorer are system-agnostic. Here is the minimum viable version—just enough to catch the “False Improvement” pattern before it hits production.
**Step 1: Start with 20 golden queries split across two categories.** Pick the two types that handle your heaviest traffic, writing ten queries for each. For every single query, define the validation signature before writing the input itself. Being forced to articulate what correct behavior looks like is exactly what helps you select the right test cases. If you cannot write the signature, you do not yet understand what the prompt is actually supposed to do for that query type.
**Step 2: Define two `CRITICAL_CATEGORIES`.** These are the segments where a regression triggers an automatic ship block. For a customer support bot, that might be `refund_eligibility` and `escalation_trigger`; for a medical triage system, it is `urgent_symptom` classification. The definition of “critical” is entirely system-specific, and this framework does not make assumptions about your requirements.
**Step 3: Run these tests before every prompt change, not after.** Following the discipline Beck described in *Test-Driven Development: By Example* [5], the suite runs before the code ships—never after the user reports a failure. The entire suite takes under two seconds to execute; there is no operational justification for delaying it.
**Step 4: Expand your golden set whenever a production bug surfaces.** Every time a user reports a misclassification, add that query to the set along with its corresponding validation signature. Over time, the golden set becomes a comprehensive archive of your prompt’s entire historical failure surface.
**Step 5: Adjust the threshold for `CRITICAL_CATEGORIES` based on the impact of failure.** The default 10% drop is just a starting point. For high-stakes categories, tighten the threshold to 5%. For low-stakes areas, 15% may be acceptable. Remember that the threshold is a parameter governed by the cost of failure, not a universal constant.
**Step 6: For the simulator, audit your prompt changelog.** Every instruction introduced after the initial baseline represents a potential conflict. For each one, write a failure function that forces an output reflecting that specific conflict. If you added a routing priority rule, create a function that forces the misclassification of the query type that rule intercepts. The act of building this simulator forces you to map the prompt’s failure surface in a way manual testing never will.
## The Broader Context
The approach draws on well-established principles from software engineering and machine learning operations. The concept of hidden technical debt in machine learning systems [4] applies directly to prompt management: each new instruction accumulates debt that must be tracked and tested. The use of golden query sets parallels evaluation methodologies found in retrieval-augmented generation research [1] and chain-of-thought prompting studies [2]. The practice of using LLM-as-a-judge evaluation frameworks [3] also relates to how prompt outputs are validated in production settings.
## Closing
The regression suite does not prevent you from changing prompts. It tells you exactly what broke when you did. That visibility is what separates a system that degrades silently from one that fails transparently—and transparently failing systems are the only kind worth shipping.
—
**Disclosure:** All code in this article was written by me and is original work, developed and tested on Python 3.12, Windows 11, CPU only. The benchmark outputs are from real runs of `run_regression.py` and are fully reproducible by cloning the repository and running the entry point. The simulator produces deterministic outputs: the same run produces the same numbers every time. No LLM was called during benchmarking. The comparison query failure (0.0% across all four prompt versions) is a known architectural limitation of the current prompt design and is included in this benchmark unchanged. I have no financial relationship with any tool, library, or company mentioned in this article.
—
### References
[1] Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W.-t., Rocktäschel, T., Riedel, S., & Kiela, D. (2020). *Retrieval-augmented generation for knowledge-intensive NLP tasks*. **Advances in Neural Information Processing Systems, 33**, 9459–9474.
[2] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q., & Zhou, D. (2022). *Chain-of-thought prompting elicits reasoning in large language models*. **Advances in Neural Information Processing Systems, 35**.
[3] Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E. P., Zhang, H., Gonzalez, J. E., & Stoica, I. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. **Advances in Neural Information Processing Systems, 36**, 46595–46623.
[4] Sculley, D., Holt, G., Golovin, D., Davydov, E., Phillips, T., Ebner, D., Chaudhary, V., Young, M., Crespo, J.-F., & Dennison, D. (2015). *Hidden technical debt in machine learning systems*. **Advances in Neural Information Processing Systems, 28**, 2503–2511.
[5] Beck, K. (2002). *Test-Driven Development: By Example*. Addison-Wesley Professional.



