LLM Evals Run On Vibes — So I Built The Missing Gatekeeper For What Actually Ships

TL;DR

A complete, working implementation in pure Python, backed by real benchmark data.

Most teams assess LLM outputs by reading them and making rough guesses. That approach falls apart the instant you try to scale.

The core issue isn’t that models hallucinate. It’s that nothing flags the confident ones — the responses scoring 0.525, clearing your threshold, yet being silently wrong.

I built a scoring layer that breaks faithfulness into two separate signals: attribution and specificity. High specificity paired with low attribution is the hallmark of a hallucination. A single score will miss it every single time.

This isn’t an evaluation script. It’s a decision engine positioned between your model and your end user.

I Changed One Line in My Prompt. Everything Fell Apart.

Four words wrecked my eval pipeline: “be specific and detailed.”

I dropped them into my system prompt on a Tuesday afternoon. A routine tweak — the sort of thing you do a dozen times while tuning a RAG pipeline. An hour later, my next test batch ran, and question three returned this:

“Context engineering was invented at MIT in 1987 and is primarily used for hardware cache optimization in CPUs. It has nothing to do with language models.”

My scorer rated it 0.525. Above my 0.5 passing threshold. Green light.

I nearly let it slip. I was scanning outputs the way you do after two hours of staring at test results — glancing at scores, not actually reading the text. The sole reason I noticed was that “1987” felt off. I reread it and checked the context document. The model had fabricated every precise detail in that sentence.

The score climbed because the response became more specific. The quality cratered because the model grew more confident about things it was making up. My eval layer had a single number trying to capture both dimensions, and it couldn’t separate them.

I caught it by hand that time. That’s not a process. That’s luck. And the entire purpose of an eval system is that it shouldn’t hinge on whether you happen to be paying close attention on a particular afternoon.

But the moment you attempt to genuinely fix it, things get messy. Like, how do you even pin down “good”? If you simply ask another LLM to evaluate the first one, you’re just shifting the problem up a level. The real threat isn’t a broken response — it’s the one that sounds like an expert while quietly deceiving you.

Most guides tell you to just call the model and see if the output “looks right.” But examine the numbers. What happens when your response scores 0.525 overall — technically acceptable — but its grounding score sits at 0.428 and its specificity at 0.701? That combination signals confident yet ungrounded. That’s not a borderline response. That’s a hallucination dressed in a business suit.

These aren’t rare edge cases. This is the default behavior in production LLM systems, and you won’t catch it with a gut feeling.

The solution is a layer most teams skip entirely. Between LLM output and user delivery, there’s a deliberate step: determining whether the response should be served, retried, or regenerated. I built that layer. Here’s the system, with real numbers and runnable code.

Who This Is For

This architecture is valuable when you’re building RAG systems [1], where incorrect answers can easily creep in, or multi-turn chatbots that need their responses validated over time. It’s also useful in any LLM pipeline requiring automated next-step decisions — like whether to display a response to the user, retry, or regenerate.

Skip it for single-turn demos with no production traffic. If every response undergoes human review anyway, the overhead isn’t justified. Same if your domain has one correct answer and exact matching works fine.

Why LLM Evaluation Is Broken

There are three ways most eval systems fail, and they typically happen before anyone catches on.

“Looks correct” isn’t always correct. A response can sound fluent, be well-structured, and project confidence, yet still be entirely wrong. Fluency doesn’t guarantee truth. When you’re reviewing outputs quickly, your brain tends to assess writing quality, not factual accuracy. You have to actively resist that instinct, and most people don’t.

The hallucinations that matter aren’t the obvious ones. Nobody ships a model that claims the Eiffel Tower is in Berlin. That gets caught on day one. The dangerous ones are the confident, domain-specific assertions that sound right to anyone who isn’t a specialist in that exact field [10]. They pass review undetected, reach production, and ultimately land in front of users.

The deeper issue is that a score isn’t a decision. You set a threshold at 0.5. One response scores 0.51 and passes. Another scores 0.95 and also passes. You treat them identically. But one of them probably needed human review. Most eval systems give you a number when what you actually need is a verdict: ship this, flag this, or reject this.

The score went up. The quality collapsed. One number can’t hold both directions simultaneously.

Traditional metrics like BLEU and ROUGE don’t work well here [2, 3]. They measure how many words overlap with a reference answer, which makes sense in machine translation where there’s typically one correct output. But LLM responses don’t have a single correct version. There are many valid ways to express the same idea. So applying BLEU to a conversation is misleading. It’s like grading an essay purely by counting how many words match a model answer, rather than judging whether the idea is actually correct and well articulated.

LLM-as-judge is what everyone’s gravitating toward now [4]. You use a model like GPT-4 to score the outputs of another GPT-4 model. It does improve over BLEU, but it comes with drawbacks. It’s expensive, it can yield slightly different results each time, and it creates a dependency on another model you don’t fully control. And it doesn’t scale when you’re scoring every response in a production system.

Frameworks like RAGAS [6] have advanced this space, but they still rely on an LLM judge for scoring and aren’t deterministic across runs. What you actually need is a scoring layer that runs locally, incurs no per-call cost, and delivers consistent results every time.

What a Real Eval System Needs

Before writing any code, I established five hard constraints. It had to run in milliseconds because an eval layer that slows down user responses isn’t deployable. No API calls on the critical path either. The LLM judge is a fallback, not the default, because paying per evaluation call doesn’t scale. And the same input had to produce the same score every time.

The remaining two criteria focused on transparency. Each rejection needed to include a clear, human-readable explanation, not just a numeric value, because seeing “score: 0.43” gives no guidance on what to actually improve. Additionally, introducing new scoring rules should never mean rewriting the core decision logic. That is a recipe for system decay over time.

The Architecture

Three distinct layers, each with a well-defined responsibility.

LLM Evaluation Architecture: A multi-tier pipeline showing how AI-generated responses are assessed for quality and directed through automated decision and action stages to ensure well-grounded outputs. Image by Author

The scoring layer generates numeric values. The decision layer translates those values into a clear verdict accompanied by a thorough explanation. That final step is what most frameworks overlook, and it becomes the most valuable piece when a response fails in production and you are left guessing why.

The Core Evaluation Dimensions

Faithfulness: Attribution and Specificity

This was the most critical scorer, and the one I nearly implemented incorrectly.

Initially, I relied on a single “faithfulness” metric. It blended aspects like semantic similarity and word overlap between the context and the response. It handled straightforward cases well, but it broke down in the scenarios that truly matter.

Here is the issue: certain responses sound authoritative and thorough, yet they have no real basis in the provided context.

So I broke faithfulness into two independent checks.

Attribution verifies whether the response is actually backed by the context. If the answer includes claims that cannot be traced to or inferred from the source material, attribution drops [8].

# Attribution: is it grounded?
semantic    = semantic_similarity(context, response)
overlap     = token_overlap(context, response)
attribution = 0.60 * semantic + 0.40 * overlap

Specificity evaluates how precise and substantive the response is. A specific answer provides concrete details and steers clear of vague language such as “it can be useful in many situations.”

# Specificity: is it concrete?
length_score  = min(1.0, len(tokens) / 80)
richness      = len(set(tokens)) / len(tokens)
hedge_penalty = min(0.60, hedge_count * 0.15)
specificity   = (0.40 * length_score + 0.60 * richness) - hedge_penalty

# Composite
faithfulness = 0.70 * attribution + 0.30 * specificity
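The specificity and composite formulas above can be sketched as runnable Python. This is a minimal sketch under assumptions, not the repository's code: the `HEDGES` word list and the regex tokenizer are hypothetical, `semantic` is passed in as a precomputed similarity in [0, 1], and the empty-input guard and the final clamp are additions that the snippets above do not show.

```python
import re

# Hypothetical hedge words; the article does not list the ones it counts.
HEDGES = {"might", "maybe", "possibly", "perhaps", "could", "generally"}

def tokenize(text: str) -> list[str]:
    """Lowercase word tokens (a simple stand-in tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def token_overlap(context: str, response: str) -> float:
    """Share of distinct response tokens that also appear in the context."""
    ctx, resp = set(tokenize(context)), set(tokenize(response))
    return len(resp & ctx) / len(resp) if resp else 0.0

def specificity(response: str) -> float:
    tokens = tokenize(response)
    if not tokens:                      # guard: avoid dividing by zero
        return 0.0
    length_score = min(1.0, len(tokens) / 80)
    richness = len(set(tokens)) / len(tokens)
    hedge_count = sum(1 for t in tokens if t in HEDGES)
    hedge_penalty = min(0.60, hedge_count * 0.15)
    raw = 0.40 * length_score + 0.60 * richness - hedge_penalty
    return max(0.0, min(1.0, raw))      # clamp: heavy hedging can push raw below 0

def faithfulness(context: str, response: str, semantic: float) -> float:
    """Composite score; `semantic` is a precomputed similarity in [0, 1]."""
    attribution = 0.60 * semantic + 0.40 * token_overlap(context, response)
    return 0.70 * attribution + 0.30 * specificity(response)
```

The clamp matters because a short, repetitive, heavily hedged response can make the unclamped formula go negative, which would then drag the composite down further than the 0.30 weight suggests.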

The key takeaway: high specificity combined with low attribution is a hallmark of hallucination.

A 2x2 matrix diagram evaluating AI responses based on High and Low Specificity versus High and Low Attribution. It categorizes outputs as Weak Answers, Hallucinations, Grounded but Thin, or Good Answers. — The AI Response Quality Matrix: Mapping the relationship between factual grounding (Attribution) and detail precision (Specificity) to decide whether to accept, reject, or review model outputs. Image by Author

This combination is particularly risky because wrong answers that sound confident and detailed are far easier to miss. Vague responses at least signal uncertainty. Confident but ungrounded ones do not.

Attribution carries the most weight because factual grounding is paramount. Specificity plays a supporting role, primarily helping to surface confident but incorrect answers.

Here is a real-world example. A response asserts that context engineering “was invented at MIT in 1987 and is primarily used for hardware cache optimization”:

Attribution: 0.428 (low, weakly grounded in the context)
Specificity: 0.701 (high, sounds detailed and authoritative)
Decision: REJECT
Reason: Confident hallucination detected

A single score checked against a 0.5 threshold would let this slip through, since the final score was 0.525. Separating attribution from specificity exposes the issue because it reveals not only the score, but the underlying reason the response is failing.

Answer Relevance

This measures how closely the response addresses the original question.

The scorer blends three signals: semantic similarity between the full response and the query, the highest-scoring individual sentence in the response, and basic token overlap [5, 6].

semantic  = semantic_similarity(query, response)
max_sent  = max_sentence_similarity(query, response)
overlap   = token_overlap(query, response)

relevance = 0.45 * semantic + 0.35 * max_sent + 0.20 * overlap

The sentence-level component rewards concise, on-target answers. Even when a response is lengthy or contains supplementary information, it can still achieve a strong score as long as at least one sentence directly addresses the question.
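To make the sentence-level signal concrete, here is a small sketch of how the three relevance terms could be combined. It is an illustration under assumptions: `jaccard` is a token-set stand-in for the embedding similarity the article uses, and `split_sentences` and `query_overlap` are hypothetical helpers, so the numbers it produces will not match the real scorer.

```python
import re
from typing import Callable

def _tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a: str, b: str) -> float:
    """Stand-in similarity: token-set Jaccard, NOT an embedding similarity."""
    ta, tb = _tokens(a), _tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

def split_sentences(text: str) -> list[str]:
    """Naive splitter on ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def max_sentence_similarity(query: str, response: str,
                            sim: Callable[[str, str], float] = jaccard) -> float:
    """Similarity of the single best-matching sentence in the response."""
    return max((sim(query, s) for s in split_sentences(response)), default=0.0)

def query_overlap(query: str, response: str) -> float:
    """Share of distinct query tokens that appear in the response."""
    q = _tokens(query)
    return len(q & _tokens(response)) / len(q) if q else 0.0

def relevance(query: str, response: str,
              sim: Callable[[str, str], float] = jaccard) -> float:
    semantic = sim(query, response)
    max_sent = max_sentence_similarity(query, response, sim)
    overlap = query_overlap(query, response)
    return 0.45 * semantic + 0.35 * max_sent + 0.20 * overlap
```

With a response that opens with an unrelated sentence and then answers the question, the best-sentence term scores higher than the whole-response term, which is the behaviour the paragraph above describes.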

Context Quality: Precision and Recall

Context Precision answers a straightforward question: is the model fabricating information, or is it sticking to what the context provides? [7] When precision is low, the response includes claims that the retrieved context never supported. The model has gone off-script.

Context Recall approaches it from the opposite direction. It measures how much of the retrieved content actually appears in the response. Low recall indicates that the retrieval step pulled in documents the model largely disregarded. You fetched a lot of irrelevant material.

prec = precision(context, response)   # response -> context grounding
rec  = recall(response, context)      # context -> response coverage
f1   = 2 * prec * rec / (prec + rec) if (prec + rec) > 0 else 0.0

context_quality = 0.50 * f1 + 0.50 * semantic_similarity(context, response)
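A token-level sketch of these two measures, following the prose definitions (precision: how much of the response is grounded in the context; recall: how much of the context shows up in the response). The function names and the set-based counting are assumptions for illustration, and `semantic` is again passed in as a precomputed similarity.

```python
import re

def _tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def context_precision(context: str, response: str) -> float:
    """Share of distinct response tokens that are found in the context."""
    ctx, resp = _tokens(context), _tokens(response)
    return len(resp & ctx) / len(resp) if resp else 0.0

def context_recall(context: str, response: str) -> float:
    """Share of distinct context tokens that appear in the response."""
    ctx, resp = _tokens(context), _tokens(response)
    return len(resp & ctx) / len(ctx) if ctx else 0.0

def f1_score(prec: float, rec: float) -> float:
    """Harmonic mean, defined as 0.0 when both inputs are 0."""
    return 2 * prec * rec / (prec + rec) if (prec + rec) > 0 else 0.0

def context_quality(context: str, response: str, semantic: float) -> float:
    prec = context_precision(context, response)
    rec = context_recall(context, response)
    return 0.50 * f1_score(prec, rec) + 0.50 * semantic
```

The zero guard in `f1_score` covers the case where the response shares nothing with the context, which would otherwise raise a `ZeroDivisionError`.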

Context quality is actionable, not just observational. When it dips below a certain threshold, the system does not merely raise a flag. It adjusts the next step in the workflow.

if context_quality < 0.40 and final_score < 0.65:
    action = "retrieve_more_documents"
    reason = "Root cause is retrieval, not the model"

A poor response stemming from weak retrieval calls for better source documents, not a revised prompt. Most evaluation frameworks fail to draw this distinction, and as a result you end up troubleshooting the wrong component for an hour.

Disagreement Signal

I began paying close attention to variance after tracking down a particularly nasty edge case. The logs revealed a faithfulness score of 0.68, relevance at 0.32, and context quality at 0.71.

If you simply compute a weighted average of those values, the overall result looks perfectly fine. It sails right through the pipeline. But beneath the surface, the raw numbers are each telling a completely different story about the same response. One metric claims the answer is accurate, another says it misses the point entirely, and the third suggests the context was reasonably solid.

Averaging those values completely buries the tension between them. What you really need to monitor is the signal of disagreement.

You can detect this immediately by computing the standard deviation across all your dimension scores:

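import math

def _disagreement(scores: list[float]) -> float:
    """Population standard deviation across the dimension scores."""
    n = len(scores)
    if n < 2:
        return 0.0  # a single score cannot disagree with itself
    mean = sum(scores) / n
    return round(math.sqrt(sum((s - mean) ** 2 for s in scores) / n), 4)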

Once the standard deviation exceeds 0.12, the system sends the response directly to a human review queue, completely bypassing the final average.

When your scorers are all pointing in different directions, the system is fundamentally unsure. That internal conflict is your clearest signal that automation has hit its ceiling and a person needs to take over.
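As a worked example, the three scores from the edge case above can be run through the same calculation. The sketch below restates the function so it is self-contained and cross-checks it against `statistics.pstdev` from the standard library; the plain mean is shown only for contrast, since the article does not give the weights behind the final score.

```python
import math
import statistics

def disagreement(scores: list[float]) -> float:
    """Population standard deviation, rounded to 4 places."""
    n = len(scores)
    if n < 2:
        return 0.0
    mean = sum(scores) / n
    return round(math.sqrt(sum((s - mean) ** 2 for s in scores) / n), 4)

# faithfulness, relevance, context quality from the edge case in the text
scores = [0.68, 0.32, 0.71]

plain_mean = sum(scores) / len(scores)   # about 0.57, looks unremarkable
std = disagreement(scores)               # 0.1772
needs_review = std > 0.12                # True: above the review cutoff

print(f"mean={plain_mean:.2f} std={std} review={needs_review}")
```

The mean sits at 0.57, while the spread of 0.1772 is well above the 0.12 cutoff, so this response would go to review even though no average flags it.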

This disagreement metric doesn’t just trigger reviews, though. It also feeds directly into the confidence calculation, which leads us to the next piece of the puzzle.

The Scoring Engine: Hybrid by Design

The complete pipeline operates in three stages.

Stage 1: Heuristic Scoring

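All four evaluation dimensions are calculated locally. The system makes zero external API calls. With sentence-transformers running directly on the CPU, the individual scorers add up to about 3ms in the latency benchmark later in this article, while the end-to-end evaluate() call averages around 291ms.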

Stage 2: Confidence Gating

When a score falls between 0.45 and 0.65, things get interesting. The system no longer trusts the heuristics on their own and escalates to the LLM judge. Outside that range, local scoring is reliable enough and no API call is triggered.
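The gate itself is a small piece of logic. A minimal sketch, assuming the window is inclusive at both ends and that the judge's score simply replaces the heuristic one (the article does not say how the two are combined); `needs_llm_judge`, `gated_score`, and the `judge` callable are hypothetical names.

```python
from typing import Callable, Optional

JUDGE_LOW, JUDGE_HIGH = 0.45, 0.65   # uncertainty window from the text

def needs_llm_judge(score: float,
                    low: float = JUDGE_LOW,
                    high: float = JUDGE_HIGH) -> bool:
    """True when the heuristic score falls inside the uncertainty window."""
    return low <= score <= high

def gated_score(heuristic_score: float,
                judge: Optional[Callable[[], float]] = None) -> tuple[float, bool]:
    """Return (score, used_llm_judge).

    The judge is only called inside the window, and only if one is configured.
    Assumption: the judge's score replaces the heuristic score outright.
    """
    if judge is not None and needs_llm_judge(heuristic_score):
        return judge(), True
    return heuristic_score, False
```

Passing the judge as a callable keeps the API call lazy: outside the window it is never invoked, so the common path stays free of network latency and cost.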

Stage 3: The Decision Layer

A vertical flowchart of an AI response evaluation pipeline. It displays a sequence from data input to a final rejection decision based on metrics for faithfulness, relevance, context, and specificity. — AI Evaluation Pipeline: A step-by-step logic flow showing how metric thresholds identify hallucinations and trigger automated rejection and regeneration. Image by Author

No raw floating-point value gets written straight to the logs. Instead, the pipeline returns a complete schema: ACCEPT, REVIEW, or REJECT, along with a failure type, an explanation, and a concrete next step. The LLM judge never runs by default. It only activates when the heuristics genuinely can’t reach a conclusion.

The Decision Layer: From Scores to Actions

Most evaluation tools try to answer a simple question: “Is this response good?”

This system reframes the question entirely: “What should we do with this response?”

The decision logic running underneath is a three-dimensional policy that operates directly on your grounding, specificity, and agreement metrics. Rather than leaning on a single average, it pinpoints failures using explicit programmatic rules:

# Vague response: attribution is critically low and specificity is low too
if attribution < 0.35 and specificity <= 0.50:
    return REVIEW, "vague response, retry with specific prompt"

# Confirmed hallucination: attribution is low but the response sounds confident
if attribution < 0.35 and specificity > 0.50:
    return REJECT, "confident hallucination"

# Confident hallucination: sounds authoritative but is poorly grounded
if attribution < 0.45 and specificity > 0.60:
    return REJECT, "confident hallucination detected"

# Poor retrieval: the context fetch itself is the root cause
if context_quality < 0.40:
    return REVIEW, "retrieve_more_documents"

# Hard guardrail: both attribution and context quality are weak
# Two weak signals together are worse than one strong failure
if attribution < 0.55 and context_quality < 0.50:
    return REJECT, "hallucination guardrail triggered"

# Weak grounding
if attribution < 0.55:
    return REVIEW, "weak grounding, retry with specific prompt"

# Off-topic: response does not address the query at all
if relevance_score < 0.30:
    return REVIEW, "off-topic, retry with clearer query"

# High disagreement
if disagreement > 0.12:
    return REVIEW, "uncertain scoring, human review recommended"

# Borderline quality
if final_score < 0.65:
    return REVIEW, "borderline, optional human review"

# All gates passed successfully
return ACCEPT, "serve_response"

Not every bad output deserves the same treatment. A vague response (low attribution, low specificity) simply needs a rewrite, so it gets routed to REVIEW with a prompt retry. A confident hallucination (low attribution, high specificity) is far riskier, so it receives an immediate REJECT and a forced regeneration. Each type of failure calls for a different downstream action.
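The rules above are written as loose snippets. Wrapped in a function, with the same thresholds and the same order, they can be checked against the example outputs shown later in this article. This is a sketch: the `Scores` dataclass and the `decide` name are assumptions, and the two checks at attribution below 0.35 are folded into one nested branch.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Scores:
    attribution: float
    specificity: float
    relevance: float
    context_quality: float
    disagreement: float
    final: float

def decide(s: Scores) -> tuple[str, str]:
    """Apply the policy rules in order; the first matching rule wins."""
    if s.attribution < 0.35:
        if s.specificity > 0.50:
            return "REJECT", "confident hallucination"
        return "REVIEW", "vague response, retry with specific prompt"
    if s.attribution < 0.45 and s.specificity > 0.60:
        return "REJECT", "confident hallucination detected"
    if s.context_quality < 0.40:
        return "REVIEW", "retrieve_more_documents"
    if s.attribution < 0.55 and s.context_quality < 0.50:
        return "REJECT", "hallucination guardrail triggered"
    if s.attribution < 0.55:
        return "REVIEW", "weak grounding, retry with specific prompt"
    if s.relevance < 0.30:
        return "REVIEW", "off-topic, retry with clearer query"
    if s.disagreement > 0.12:
        return "REVIEW", "uncertain scoring, human review recommended"
    if s.final < 0.65:
        return "REVIEW", "borderline, optional human review"
    return "ACCEPT", "serve_response"

# Scores copied from the "Confident hallucination" example output
hallucination = Scores(0.428, 0.701, 0.613, 0.424, 0.077, 0.525)
print(decide(hallucination))
```

Because the rules are ordered, the retrieval check at `context_quality < 0.40` runs before the combined guardrail, so the guardrail only fires when context quality is between 0.40 and 0.50.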

What the Output Looks Like

Here are the actual results from running main.py on four test cases.

Example 1: Well-grounded response

Final Score       : 0.680
Attribution       : 0.684   (grounding)
Specificity       : 0.713   (concreteness)
Relevance         : 0.657
Context Quality   : 0.688
Disagreement      : 0.016   (scorer std dev)
No hallucination
Decision          : ACCEPT  (confidence: 41%)
Reason            : All quality gates passed
Next Action       : serve_response
Latency           : 322ms

Example 2: Confident hallucination

Final Score       : 0.525
Attribution       : 0.428   (grounding)
Specificity       : 0.701   (concreteness)
Relevance         : 0.613
Context Quality   : 0.424
Disagreement      : 0.077   (scorer std dev)
Suspected weak grounding
Failure Type      : hallucination
Decision          : REJECT  (confidence: 22%)
Reason            : Confident hallucination detected, attribution=0.428
                    (low grounding) but specificity=0.701 (high confidence).
                    Response sounds authoritative but is not grounded in context.
Next Action       : regenerate_with_grounding_prompt
Why               : Confident but ungrounded response is more dangerous than a vague one
Low-confidence sentences:
  It has nothing to do with language models.

This case clearly illustrates why evaluating based solely on a final score is insufficient. A score of 0.525 might appear acceptable since it exceeds the typical 0.5 pass mark, and a basic metrics system would likely let it pass without issue. However, the decision layer identifies a problem: an attribution score of 0.428 paired with a specificity score of 0.701 is a classic indicator of a confident hallucination.

Example 3: Vague response

Final Score       : 0.295
Attribution       : 0.248   (grounding)
Specificity       : 0.332   (concreteness)
Decision          : REVIEW  (confidence: 32%)
Reason            : Uncertain / vague response, low grounding, low specificity.
                    Not a confirmed hallucination.
Next Action       : retry_with_specific_prompt

A vague or evasive answer is not the same as a hallucination. When both attribution and specificity are low, it usually means the model is being overly cautious and avoiding a direct answer. Simply regenerating the response will likely produce more of the same unhelpful content. The correct approach is to retry using a more focused and specific prompt.

Example 4: Off-topic response

Final Score       : 0.080
Attribution       : 0.017   (grounding)
Specificity       : 0.630   (concreteness)
Decision          : REJECT  (confidence: 42%)
Reason            : Confident hallucination, attribution=0.017,
                    specificity=0.630. Response sounds authoritative but is fabricated.
Low-confidence sentences:
  The French Revolution was a period of major political and societal change...
  Marie Antoinette was Queen of France at the time.

An attribution score of 0.017 alongside a specificity of 0.630 indicates the model produced a detailed response about the French Revolution when asked a question about context engineering. The system immediately detects this mismatch. Rather than issuing a blanket rejection, it identifies and highlights the specific sentences that raised the alarm.

Decision Distribution

ACCEPT      1/4  (25%)
REVIEW      1/4  (25%)
REJECT      2/4  (50%)

Monitoring this decision distribution over time in a live environment helps you quickly spot issues such as declining model performance, missing documents in your retrieval pipeline, or weakening prompt templates. This represents genuine system observability, far beyond simply logging raw text into a monitoring tool.
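One way to track this is to compute the distribution per time window and compare it with a baseline window. The sketch below is illustrative only: the function names and the 0.10 alert margin are assumptions, not values from the repository.

```python
from collections import Counter

LABELS = ("ACCEPT", "REVIEW", "REJECT")

def decision_distribution(decisions: list[str]) -> dict[str, float]:
    """Fraction of responses per decision label (0.0 for an empty window)."""
    counts = Counter(decisions)
    total = len(decisions)
    return {label: (counts[label] / total if total else 0.0) for label in LABELS}

def reject_rate_drifted(baseline: list[str], window: list[str],
                        max_increase: float = 0.10) -> bool:
    """True when the REJECT share rose by more than `max_increase`.

    Hypothetical alert rule: tune the margin against your own traffic.
    """
    before = decision_distribution(baseline)["REJECT"]
    after = decision_distribution(window)["REJECT"]
    return after - before > max_increase

# The four-case run above: 1 ACCEPT, 1 REVIEW, 2 REJECT
run = ["ACCEPT", "REJECT", "REVIEW", "REJECT"]
print(decision_distribution(run))
```

A rising REJECT share with stable prompts points toward the model or the retrieval index, while a rising REVIEW share after a prompt edit points back at the prompt.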

Real Benchmark Numbers

Results from running the full 5-case RAG evaluation set:

ID	Label	Attr	Relev	Ctx	Final	Hallucination	Decision
q_001	good_response	0.686	0.680	0.725	0.694	No	ACCEPT
q_002	hallucinated_response	0.445	0.621	0.459	0.547	Suspected	REJECT
q_003	good_response	0.528	0.456	0.535	0.534	Suspected	REVIEW
q_004	off_context_response	0.043	0.682	0.091	0.337	Confirmed	REJECT
q_005	good_response	0.625	0.341	0.628	0.536	No	REVIEW

Decisions, not scores, should be your primary measure of truth. These figures are for demonstration only — five cases do not constitute a statistically meaningful sample, and you should validate thresholds against your own labeled dataset before relying on them.

Accuracy benchmark

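In the five-case run above, the three good responses average a final score of 0.588, while the two poor ones average 0.442. That 0.146 gap is enough to tell the groups apart on average, but not to draw a clean boundary from the final score alone: q_002, a hallucination, scores 0.547, higher than two of the good responses. The decision layer still rejected both bad responses (2 of 2), although it also routed two of the three good responses to REVIEW rather than ACCEPT. All of this ran on local heuristics, with no added per-call cost.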
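The averages can be reproduced directly from the Final column of the table. In this sketch the "good" and "bad" groups follow the Label column (good_response versus hallucinated_response and off_context_response); the variable names are illustrative.

```python
from statistics import mean

# (group, final score) per case, copied from the benchmark table
final_scores = {
    "q_001": ("good", 0.694),
    "q_002": ("bad", 0.547),
    "q_003": ("good", 0.534),
    "q_004": ("bad", 0.337),
    "q_005": ("good", 0.536),
}

good = [score for group, score in final_scores.values() if group == "good"]
bad = [score for group, score in final_scores.values() if group == "bad"]

good_mean = round(mean(good), 3)          # 0.588
bad_mean = round(mean(bad), 3)            # 0.442
gap = round(good_mean - bad_mean, 3)      # 0.146

# The groups overlap: the lowest good score sits below the highest bad score
groups_overlap = min(good) < max(bad)     # True (0.534 < 0.547)

print(good_mean, bad_mean, gap, groups_overlap)
```

The overlap is the practical reason a single cutoff on the final score cannot separate these five cases, and why the decision rules look at attribution and specificity instead.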

Latency benchmark (10 runs, warm model)

Operation	Latency	Notes
Attribution scorer	~1.2ms	Embedding plus overlap
Relevance scorer	~1.1ms	Sentence-level scoring
Context scorer	~0.8ms	Precision plus recall
Decision layer	~0.1ms	Policy rules plus confidence
Full pipeline.evaluate()	~291ms mean	No LLM calls
With LLM judge	~340ms	Edge cases only, 0.45 to 0.65 zone

The first execution will experience an 800–1000ms delay as the sentence-transformers model initializes. After this initial load, performance improves significantly, averaging around 291ms per call. By pre-loading the model weights when your application starts up, you can run this entire evaluation layer in production while adding less than 300ms to your response time.

The Regression Test System

Most teams overlook this step, which is a critical mistake. Generating evaluation scores is meaningless if you do not act on them. If adjusting a prompt template causes accuracy to drop, you need to know immediately. If changing a retrieval strategy breaks edge cases that previously passed, you must catch that before merging to the main branch. The regression suite addresses this by storing historical baselines and comparing current scores against them during your CI build.

suite = RegressionSuite("data/baselines.json")

# Record baselines after validating your system
suite.record_baseline("q_001", query, context, response, result)

# After changing your prompt or model:
report = suite.run_regression(pipeline, test_cases)

# Treat failures like CI failures
if report.failed > 0:
    raise SystemExit("Quality regression detected. Deployment blocked.")

Below is the terminal output when a prompt change causes a performance regression:

Regression Report  --  CI/CD Quality Gate
3 REGRESSION(S) DETECTED -- DEPLOYMENT BLOCKED

Total cases   : 3
Passed        : 0
Failed        : 3
Mean delta    : -0.4586
Threshold     : +/- 0.05

Regressions -- score dropped beyond threshold:
  [q_001] 0.694 -> 0.137  (delta -0.556)
  [q_002] 0.547 -> 0.137  (delta -0.410)
  [q_003] 0.534 -> 0.124  (delta -0.410)

A minor prompt adjustment causes a strong response score to plummet from 0.694 to 0.137. The regression pipeline catches this immediately, blocking the deployment before any users are affected.
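The comparison at the heart of such a gate is simple. The sketch below is not the `RegressionSuite` implementation; `find_regressions` is a hypothetical standalone function that assumes baselines and current results are plain dictionaries of final scores, and that a case regresses when its score drops by more than the 0.05 threshold shown in the report.

```python
def find_regressions(baseline: dict[str, float],
                     current: dict[str, float],
                     threshold: float = 0.05) -> list[tuple[str, float, float, float]]:
    """Return (case_id, old, new, delta) for every score that dropped past the threshold."""
    regressions = []
    for case_id, old in baseline.items():
        new = current[case_id]
        delta = new - old
        if delta < -threshold:
            regressions.append((case_id, old, new, round(delta, 3)))
    return regressions

# Rounded scores from the report above
baseline = {"q_001": 0.694, "q_002": 0.547, "q_003": 0.534}
current = {"q_001": 0.137, "q_002": 0.137, "q_003": 0.124}

failed = find_regressions(baseline, current)
for case_id, old, new, delta in failed:
    print(f"[{case_id}] {old:.3f} -> {new:.3f}  (delta {delta:+.3f})")
```

Deltas computed from the rounded scores can differ from the report in the last digit, since the report presumably works from unrounded values. In CI, a non-empty result would be the condition for failing the build.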

This brings standard CI/CD practices to generative AI. No more manual spot-checks. If quality falls below your defined threshold, the build fails. It treats prompt engineering just like code coverage or unit testing [11].

From Metrics to Decisions to Actions

Here’s the full transformation this system enables.

Old thinking:

score = 0.68
# ship it? probably fine

This system:

signals -> reasoning -> decision -> action

Every output is mapped into a consistent schema. You receive a clear decision (ACCEPT, REVIEW, or REJECT), a reason logged for audit, a failure category, a routing instruction, and a confidence score. This structured output is what makes the system truly debuggable when issues arise.

The to_dict() method on every result ensures it can be serialized to JSON for logging, dashboards, and APIs:

result.to_dict()
# {
#   "decision": "REJECT",
#   "confidence_pct": 22,
#   "failure_type": "hallucination",
#   "hallucination_status": "suspected",
#   "next_action": "regenerate_with_grounding_prompt",
#   "action_why": "A confident but ungrounded response is riskier than a vague one",
#   "scores": {
#     "final": 0.525,
#     "attribution": 0.428,
#     "specificity": 0.701,
#     "relevance": 0.613,
#     "context_quality": 0.424,
#     "disagreement": 0.077
#   },
#   "explanations": {
#     "reason": "Confident hallucination detected...",
#     "low_confidence_sentences": ["It has nothing to do with language models."]
#   },
#   "meta": {
#     "passed": false,
#     "used_llm_judge": false,
#     "latency_ms": 301.0
#   }
# }

Integrate this with any logging system, and you’ll have a complete quality audit trail for every response your system has ever generated.

Honest Design Decisions

A score separation of 0.146 is entirely typical for a local heuristic system. High-quality and low-quality responses will inevitably overlap in the middle range. The decision layer addresses this by analyzing how attribution and specificity interact, instead of relying on a single averaged score. Attempting to artificially widen the separation by adjusting weights only distorts benchmarks without improving real-world performance.

The 0.70/0.30 and 0.60/0.40 weight ratios aren’t derived from any universal principle. I simply tested different values until they aligned with the data in my own knowledge base. If you apply this same configuration to legal contracts, medical journals, or raw source code, these ratios will not work. That’s why I placed them in a configs directory. You can fine-tune the parameters for your specific use case without modifying the core pipeline logic.

The 0.35 hallucination threshold triggers only when attribution is very low. If your domain involves extensive paraphrasing without exact word matches, this strict cutoff will generate false positives, because token overlap makes up 40% of the attribution score. Sentence-transformers embeddings [5, 9] capture semantic meaning far more effectively than basic TF-IDF matching. If you disable them and fall back to the local TF-IDF mode, the pipeline automatically adopts a more conservative stance.

The 0.45 to 0.65 LLM judge range is directly linked to the default thresholds. If you adjust REJECT_THRESHOLD or REVIEW_THRESHOLD, you must also remap the judge window accordingly. The architecture follows a strict pattern: invoke the costly LLM judge only when local heuristics encounter uncertainty—never as the default gatekeeper.

Low confidence scores—such as 22% or 42% on borderline outputs—are not bugs. Those responses are inherently unstable. An overconfidence-prone evaluation pipeline processing unreliable inputs is a serious production risk; you want a system that accurately quantifies its own uncertainty.

Also, don’t be concerned about the embeddings.position_ids warning that appears when sentence-transformers initializes. It’s purely cosmetic and has no effect on runtime performance.

What This Does Not Solve

The most challenging case is implicit hallucination. If a response reuses vocabulary from your context but subtly alters the meaning, the local code is misled because the surface words still align. Heuristics cannot detect this kind of semantic drift. That’s precisely why the LLM judge fallback exists.

Cross-document consistency is also beyond the scope of this system. The scorer evaluates each response against its own context independently. If two related responses contradict each other, this setup won’t catch it. And calibration is genuinely domain-specific—treat configs/thresholds.yaml as a starting point, test it against your own labeled examples, and tune before relying on any values listed here. A medical QA system, for instance, demands hallucination thresholds far stricter than those used in this implementation.

What You Have Actually Built

What you end up with after building all of this is not just an evaluation script.

It accepts three inputs: query, context, and response. The output is a structured payload containing a decision, a log reason, a failure type, a next action, a confidence score, and a detailed breakdown of the underlying metrics.

Every response passing through your system is scored, classified, and routed. High-quality ones go directly to the user. Vague ones are retried with a more precise prompt. Hallucinations are blocked before they reach anyone. And when you modify a prompt and three cases that previously scored between 0.53 and 0.69 suddenly drop to around 0.13, the regression suite catches it before you merge to main, not after a user reports the issue.

This is the missing layer in the flood of LlamaIndex demos, LangChain examples, and basic RAG tutorials available online. Everyone demonstrates how to connect a vector database, but no one shows you how to reliably validate the model’s output.

RAG retrieves the right documents. Prompt engineering crafts the right instructions. This layer ensures you make the right decision about what to do with the output.

The full source code, benchmark data, and local implementation scripts are available in the accompanying repository.

References

[1] Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W.-T., Rocktäschel, T., Riedel, S., and Kiela, D. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. Advances in Neural Information Processing Systems, 33, 9459-9474.

[2] Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. (2002). BLEU: a method for automatic evaluation of machine translation. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 311-318.

[3] Lin, C.-Y. (2004). ROUGE: A package for automatic evaluation of summaries. Text Summarization Branches Out, 74-81.

[4] Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E., Zhang, H., Gonzalez, J. E., and Stoica, I. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. arXiv preprint arXiv:2306.05685.

[5] Reimers, N., and Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, 3982–3992.

[6] Es, S., James, J., Espinosa Anke, L., and Schockaert, S. (2023). RAGAS: Automated Evaluation of Retrieval Augmented Generation. arXiv preprint arXiv:2309.15217.

[7] Manning, C. D., Raghavan, P., and Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press.

[8] Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. Proceedings of NAACL-HLT 2019, 4171–4186.

[9] Wang, W., Wei, F., Dong, L., Bao, H., Yang, N., and Zhou, M. (2020). MiniLM: Deep self-attention distillation for task-agnostic compression of pre-trained transformers. Advances in Neural Information Processing Systems, 33, 5776–5788.

[10] Tonmoy, S. M., Zaman, S. M., Jain, V., Rani, A., Rawte, V., Chadha, A., and Das, A. (2024). A comprehensive survey of hallucination mitigation techniques in large language models. arXiv preprint arXiv:2401.01313.

[11] Breck, E., Cai, S., Nielsen, E., Salib, M., and Sculley, D. (2017). The ML test score: A rubric for ML production readiness and technical debt reduction. IEEE BigData 2017, 1123–1132.

Disclosure

All code in this article was written by me and is original work, developed and tested on Python 3.12.6. Benchmark numbers are from actual runs on my local machine (Windows 11, CPU only) and are reproducible by cloning the repository and running main.py, experiments/rag_eval_demo.py, and experiments/benchmarks.py. The sentence-transformers library is used as an optional dependency for semantic embedding in the attribution and relevance scorers. Without it, the system falls back to TF-IDF vectors with a warning, and all functionality remains operational. The scoring formulas, decision logic, hallucination detection rules, and regression system are independent implementations not derived from any cited codebase. I have no financial relationship with any tool, library, or company mentioned in this article.
