In this tutorial, we build an uncertainty-aware large language model system that not only generates answers but also estimates the confidence in those answers. We implement a three-stage reasoning pipeline in which the model first produces an answer along with a self-reported confidence score and a justification. We then introduce a self-evaluation step that allows the model to critique and refine its own response, simulating a meta-cognitive check. If the model determines that its confidence is low, we automatically trigger a web research phase that retrieves relevant information from live sources and synthesizes a more reliable answer. By combining confidence estimation, self-reflection, and automated research, we create a practical framework for building more trustworthy and transparent AI systems that can acknowledge uncertainty and actively seek better information.
import os, json, re, textwrap, getpass, sys, warnings
from dataclasses import dataclass, field
from typing import Optional
from openai import OpenAI
from ddgs import DDGS
from rich.console import Console
from rich.table import Table
from rich.panel import Panel
from rich import box
warnings.filterwarnings("ignore", category=DeprecationWarning)
def _get_api_key() -> str:
    key = os.environ.get("OPENAI_API_KEY", "").strip()
    if key:
        return key
    try:
        from google.colab import userdata
        key = userdata.get("OPENAI_API_KEY") or ""
        if key.strip():
            return key.strip()
    except Exception:
        pass
    console = Console()
    console.print(
        "\n[bold cyan]OpenAI API Key required[/bold cyan]\n"
        "[dim]Your key will not be echoed and is never stored to disk.\n"
        "To skip this prompt in future runs, set the environment variable:\n"
        "  export OPENAI_API_KEY=sk-...[/dim]\n"
    )
    key = getpass.getpass(" Enter your OpenAI API key: ").strip()
    if not key:
        Console().print("[bold red]No API key provided — exiting.[/bold red]")
        sys.exit(1)
    return key
OPENAI_API_KEY = _get_api_key()
MODEL = "gpt-4o-mini"
CONFIDENCE_LOW = 0.55
CONFIDENCE_MED = 0.80
client = OpenAI(api_key=OPENAI_API_KEY)
console = Console()
@dataclass
class LLMResponse:
    question: str
    answer: str
    confidence: float
    reasoning: str
    sources: list[str] = field(default_factory=list)
    researched: bool = False
    raw_json: dict = field(default_factory=dict)

We import all required libraries and configure the runtime environment for the uncertainty-aware LLM pipeline. We securely retrieve the OpenAI API key using environment variables, Colab secrets, or a hidden terminal prompt. We also define the LLMResponse data structure that stores the question, answer, confidence score, reasoning, and research metadata used throughout the system.
SYSTEM_UNCERTAINTY = """
You are an expert AI assistant that is HONEST about what it knows and does not know.
For every question you MUST answer with valid JSON only (no markdown, no prose outside JSON):
{
 "answer": "<your answer>",
 "confidence": <float between 0.0 and 1.0>,
 "reasoning": "<brief explanation of your confidence level>"
}
Confidence scale:
 0.90-1.00 → very high: well-established fact, you are certain
 0.75-0.89 → high: strong knowledge, minor uncertainty
 0.55-0.74 → medium: plausible but you may be wrong, could be outdated
 0.30-0.54 → low: significant uncertainty, answer is a best guess
 0.00-0.29 → very low: mostly guessing, minimal reliable knowledge
Be CALIBRATED — do not always give high confidence. Genuinely reflect uncertainty
about recent events (after your knowledge cutoff), niche topics, numerical claims,
and anything that changes over time.
""".strip()
SYSTEM_SYNTHESIS = """
You are a research synthesizer. Given a question, a preliminary answer,
and web-search snippets, produce an improved final answer grounded in the evidence.
Reply in JSON only:
{
 "answer": "<improved answer>",
 "confidence": <float between 0.0 and 1.0>,
 "reasoning": "<how the evidence supports the answer>"
}
""".strip()
def query_llm_with_confidence(question: str) -> LLMResponse:
    completion = client.chat.completions.create(
        model=MODEL,
        temperature=0.2,
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": SYSTEM_UNCERTAINTY},
            {"role": "user", "content": question},
        ],
    )
    raw = json.loads(completion.choices[0].message.content)
    return LLMResponse(
        question=question,
        answer=raw.get("answer", ""),
        confidence=float(raw.get("confidence", 0.5)),
        reasoning=raw.get("reasoning", ""),
        raw_json=raw,
    )

We define the system prompts that instruct the model to report answers along with calibrated confidence and reasoning. We then implement the query_llm_with_confidence function that performs the first stage of the pipeline. This stage generates the model's answer while forcing the output to be structured JSON containing the answer, confidence score, and explanation.
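Even with `response_format={"type": "json_object"}`, a call can occasionally fail to parse or return an unexpected payload. A hypothetical retry wrapper (not part of the original code) around any stage function could look like this:

```python
def call_with_retry(stage_fn, question: str, retries: int = 2):
    """Call stage_fn(question), retrying on parse/validation errors. Hypothetical helper."""
    last_err = None
    for _ in range(retries + 1):
        try:
            return stage_fn(question)
        except (ValueError, KeyError) as err:  # json.JSONDecodeError is a ValueError
            last_err = err
    raise last_err
```

One could then wrap the first stage as `call_with_retry(query_llm_with_confidence, question)` so a single malformed reply does not abort the whole pipeline.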
def self_evaluate(response: LLMResponse) -> LLMResponse:
    critique_prompt = f"""
Review this answer and its stated confidence. Check for:
1. Logical consistency
2. Whether the confidence matches the actual quality of the answer
3. Any factual errors you can spot

Question: {response.question}
Proposed answer: {response.answer}
Stated confidence: {response.confidence}
Stated reasoning: {response.reasoning}

Reply in JSON:
{{
 "revised_confidence": <float between 0.0 and 1.0>,
 "critique": "<your critique>",
 "revised_answer": "<corrected answer, or the original if it holds up>"
}}
""".strip()
    completion = client.chat.completions.create(
        model=MODEL,
        temperature=0.1,
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": "You are a rigorous self-critic. Respond in JSON only."},
            {"role": "user", "content": critique_prompt},
        ],
    )
    ev = json.loads(completion.choices[0].message.content)
    response.confidence = float(ev.get("revised_confidence", response.confidence))
    response.answer = ev.get("revised_answer", response.answer)
    response.reasoning += f"\n\n[Self-Eval Critique]: {ev.get('critique', '')}"
    return response

def web_search(query: str, max_results: int = 5) -> list[dict]:
    results = DDGS().text(query, max_results=max_results)
    return list(results) if results else []

def research_and_synthesize(response: LLMResponse) -> LLMResponse:
    console.print(f" [yellow]🔍 Confidence {response.confidence:.0%} is low — triggering auto-research...[/yellow]")
    snippets = web_search(response.question)
    if not snippets:
        console.print(" [red]No search results found.[/red]")
        return response
    formatted = "\n\n".join(
        f"[{i+1}] {s.get('title','')}\n{s.get('body','')}\nURL: {s.get('href','')}"
        for i, s in enumerate(snippets)
    )
    synthesis_prompt = f"""
Question: {response.question}
Preliminary answer (low confidence): {response.answer}

Web search snippets:
{formatted}

Synthesize an improved answer using the evidence above.
""".strip()
    completion = client.chat.completions.create(
        model=MODEL,
        temperature=0.2,
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": SYSTEM_SYNTHESIS},
            {"role": "user", "content": synthesis_prompt},
        ],
    )
    syn = json.loads(completion.choices[0].message.content)
    response.answer = syn.get("answer", response.answer)
    response.confidence = float(syn.get("confidence", response.confidence))
    response.reasoning += f"\n\n[Post-Research]: {syn.get('reasoning', '')}"
    response.sources = [s.get("href", "") for s in snippets if s.get("href")]
    response.researched = True
    return response

We implement a self-evaluation stage in which the model critiques its own answer and revises its confidence as needed. We also introduce the web search capability that retrieves live information using DuckDuckGo. If the model's confidence is low, we synthesize the search results with the preliminary answer to produce an improved response grounded in external evidence.
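When many snippets come back, the synthesis prompt can grow large. A small hypothetical variant of the snippet-formatting step (not in the original code) that truncates each result body to keep the prompt compact might look like:

```python
import textwrap

def format_snippets(snippets: list[dict], max_chars: int = 300) -> str:
    """Format search results, shortening each body so the synthesis prompt stays compact."""
    return "\n\n".join(
        f"[{i+1}] {s.get('title', '')}\n"
        f"{textwrap.shorten(s.get('body', ''), max_chars)}\n"
        f"URL: {s.get('href', '')}"
        for i, s in enumerate(snippets)
    )
```

This keeps token usage bounded regardless of how verbose the search results are.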
We assemble the main reasoning pipeline that orchestrates answer generation, self-evaluation, and optional research. We compute visual confidence indicators and implement helper functions to label confidence levels. We also build a formatted display system that presents the final answer, reasoning, confidence meter, and sources in a clean console interface.
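The helper functions referenced here and used below (uncertainty_aware_query, confidence_label, display_response) are not shown in this excerpt. A minimal sketch consistent with the thresholds defined earlier might look like the following; the exact meter and panel styling are assumptions:

```python
from __future__ import annotations

CONFIDENCE_LOW = 0.55   # same thresholds as defined earlier in the script
CONFIDENCE_MED = 0.80

def confidence_label(conf: float) -> tuple[str, str]:
    # Map a confidence score to an emoji and a short label using the global thresholds.
    if conf >= CONFIDENCE_MED:
        return "🟢", "High"
    if conf >= CONFIDENCE_LOW:
        return "🟡", "Medium"
    return "🔴", "Low"

def confidence_bar(conf: float, width: int = 20) -> str:
    # Render a simple text meter, filled proportionally to the confidence score.
    filled = round(conf * width)
    return "█" * filled + "─" * (width - filled) + f" {conf:.0%}"

def uncertainty_aware_query(question: str) -> LLMResponse:
    # Stage 1: answer with self-reported confidence.
    response = query_llm_with_confidence(question)
    # Stage 2: self-critique and revise.
    response = self_evaluate(response)
    # Stage 3: research only when confidence stays low.
    if response.confidence < CONFIDENCE_LOW:
        response = research_and_synthesize(response)
    return response

def display_response(response: LLMResponse) -> None:
    # Present answer, confidence meter, reasoning, and any sources in a panel.
    emoji, label = confidence_label(response.confidence)
    body = (
        f"[bold]{response.answer}[/bold]\n\n"
        f"{emoji} {label} confidence: {confidence_bar(response.confidence)}\n"
        f"[dim]{response.reasoning}[/dim]"
    )
    if response.sources:
        body += "\n\nSources:\n" + "\n".join(f" • {u}" for u in response.sources)
    console.print(Panel(body, border_style="cyan"))
```

The pure helpers (confidence_label, confidence_bar) can be tested standalone, while uncertainty_aware_query simply chains the three stages defined above.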
DEMO_QUESTIONS = [
    "What is the speed of light in a vacuum?",
    "What were the main causes of the 2008 global financial crisis?",
    "What is the latest version of Python released in 2025?",
    "What is the current population of Tokyo as of 2025?",
]
def run_comparison_table(questions: list[str]) -> None:
    console.rule("[bold cyan]UNCERTAINTY-AWARE LLM — BATCH RUN[/bold cyan]")
    results = []
    for i, q in enumerate(questions, 1):
        console.print(f"\n[bold]Question {i}/{len(questions)}:[/bold] {q}")
        r = uncertainty_aware_query(q)
        display_response(r)
        results.append(r)
    console.rule("[bold cyan]SUMMARY TABLE[/bold cyan]")
    tbl = Table(box=box.ROUNDED, show_lines=True, highlight=True)
    tbl.add_column("#", style="dim", width=3)
    tbl.add_column("Question", max_width=40)
    tbl.add_column("Confidence", justify="center", width=12)
    tbl.add_column("Level", justify="center", width=10)
    tbl.add_column("Researched", justify="center", width=10)
    for i, r in enumerate(results, 1):
        emoji, label = confidence_label(r.confidence)
        col = "green" if r.confidence >= 0.75 else "yellow" if r.confidence >= 0.55 else "red"
        tbl.add_row(
            str(i),
            textwrap.shorten(r.question, 55),
            f"[{col}]{r.confidence:.0%}[/{col}]",
            f"{emoji} {label}",
            "✅ Yes" if r.researched else "—",
        )
    console.print(tbl)

def interactive_mode() -> None:
    console.rule("[bold cyan]INTERACTIVE MODE[/bold cyan]")
    console.print(" Type any question. Type [bold]quit[/bold] to exit.\n")
    while True:
        q = console.input("[bold cyan]You ▶[/bold cyan] ").strip()
        if q.lower() in ("quit", "exit", "q"):
            console.print("Goodbye!")
            break
        if not q:
            continue
        resp = uncertainty_aware_query(q)
        display_response(resp)

if __name__ == "__main__":
    console.print(Panel(
        "[bold white]Uncertainty-Aware LLM Tutorial[/bold white]\n"
        "[dim]Confidence Estimation · Self-Evaluation · Auto-Research[/dim]",
        border_style="cyan",
        expand=False,
    ))
    run_comparison_table(DEMO_QUESTIONS)
    console.print("\n")
    interactive_mode()

We define demonstration questions and implement a batch pipeline that evaluates the uncertainty-aware system across multiple queries. We generate a summary table that compares confidence levels and whether research was triggered. Finally, we implement an interactive mode that continuously accepts user questions and runs the full uncertainty-aware reasoning workflow.
In conclusion, we designed and implemented a complete uncertainty-aware reasoning pipeline for large language models using Python and the OpenAI API. We demonstrated how models can verbalize confidence, perform internal self-evaluation, and automatically conduct research when uncertainty is detected. This approach improves reliability by enabling the system to recognize knowledge gaps and augment its answers with external evidence when needed. By integrating these components into a unified workflow, we showed how developers can build AI systems that are intelligent, calibrated, transparent, and adaptive, making them far more suitable for real-world decision-support applications.
Check out the FULL Notebook here.
Jean-marc is a successful AI business executive. He leads and accelerates growth for AI-powered solutions and started a computer vision company in 2006. He is a recognized speaker at AI conferences and has an MBA from Stanford.



