
TL;DR: Reliability techniques (approaches that improve an LLM’s accuracy by using additional inference, such as feedback-driven retries, ensembling, generator/critic refinement loops, verification passes, and difficulty-aware routing) are spread across academic papers, each locked inside its own custom codebase. We consolidated 28 reliability techniques (21 communication-theoretic methods spanning 6 families, plus 7 established baselines: Self-Consistency, Self-Refine, CoVe, BoN, Weighted BoN, CISC, MoA), all benchmarked against a plain single-pass baseline, under one unified API, with 3 adaptive routers (SemKNN plus two local ACM routers) layered on top. We then demonstrated that dynamically routing the technique per prompt lets you navigate a quality-vs-cost tradeoff curve. In our paper’s benchmark using one specific model lineup — Nemotron and Devstral as the two generators and GLM-5.1 as the judge — the adaptive router achieved roughly 56% cost savings at equivalent quality, or about a 7% quality improvement at equivalent cost, compared to the best fixed method we tested in that same lineup. A single parameter (λ) controls the tradeoff. The general finding (adaptive routing beats any single fixed method) should hold broadly, but the exact numbers depend on the model lineup, and we haven’t yet run the full sweep across other model combinations.

Getting started is as simple as changing one import:

```diff
- from openai import OpenAI
+ from agentcodec.openai import OpenAI
```

Specify reliability="harq_ir" (or any of the 28 available techniques), and your existing client.chat.completions.create(...) calls retain their native OpenAI response format. Equivalent drop-in shims are available for Anthropic and Ollama.

GitHub:
Working paper:

After spending considerable time studying reliability methods from the research literature, we kept running into the same frustration: every paper comes with its own bespoke codebase, its own prompt format, its own scoring rubric, its own model wrapper. Trying to answer a question like "should I use self-refine or best-of-N for this task?" turned into a week of integration work per comparison.

The communication-theory perspective is what made everything click: an LLM acts as a noisy channel Y = A(X) + N, and every reliability technique from wireless communications has a direct counterpart in the agent world:

| Wireless | Agent-land |
|---|---|
| ARQ / HARQ | retry-with-feedback loops |
| Diversity combining (MRC/SC/EGC) | ensemble multiple models |
| Turbo decoding | iterative generator/critic mutual refinement |
| Fountain codes | rateless sampling, stop when the judge is confident |
| FEC | answer + structured parity passes (re-derivation, verification, alternative), decode by cross-check |
| ACM (adaptive coding-modulation) | route by difficulty |
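To make the first row of the mapping concrete, here is a minimal sketch of a retry-with-feedback loop in plain Python. It is not the library's `harq_ir` implementation: the `generate` and `judge` callables, the `accept_score` threshold, and the toy stand-ins are all hypothetical, chosen only to show how critiques accumulate across rounds the way incremental redundancy accumulates across retransmissions.

```python
from typing import Callable, List, Tuple


def retry_with_feedback(
    prompt: str,
    generate: Callable[[str, List[str]], str],
    judge: Callable[[str, str], Tuple[float, str]],
    max_rounds: int = 4,
    accept_score: float = 0.8,
) -> Tuple[str, int]:
    """Return (best answer seen, number of rounds used)."""
    feedback: List[str] = []  # critiques accumulated so far
    best_answer, best_score = "", float("-inf")
    for round_no in range(1, max_rounds + 1):
        answer = generate(prompt, feedback)      # new attempt sees all critiques
        score, critique = judge(prompt, answer)  # judge scores and explains
        if score > best_score:
            best_answer, best_score = answer, score
        if score >= accept_score:                # good enough: stop early
            return best_answer, round_no
        feedback.append(critique)                # otherwise retry with feedback
    return best_answer, max_rounds


# Toy stand-ins (hypothetical): each round's draft scores a bit higher.
def toy_generate(prompt: str, feedback: List[str]) -> str:
    return f"draft-{len(feedback)}"


def toy_judge(prompt: str, answer: str) -> Tuple[float, str]:
    n = int(answer.split("-")[1])
    return 0.3 * (n + 1), f"round {n}: needs more detail"


answer, rounds = retry_with_feedback("toy prompt", toy_generate, toy_judge)
```

With real models, `generate` and `judge` would wrap chat-completion calls; the loop structure stays the same.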

We packaged everything into a single library: 28 reliability techniques (the 7 established baselines are included within that count of 28, not in addition to it), plus the uncoded single-pass baseline they’re all measured against, plus 3 adaptive routers (SemKNN and two local ACM routers) that pick the best technique for each prompt. The full breakdown is in the README.

The minimal version

```python
from agentcodec import ReliabilityModule

# <BASE_URL> is a placeholder: set it to your own server's endpoint.
mod = ReliabilityModule.from_dict({
    "models": [
        # Spatial diversity: two different model families = uncorrelated errors
        {"model": "qwen3:8b", "base_url": "<BASE_URL>", "api_key": "ollama"},
        {"model": "llama3.1:8b", "base_url": "<BASE_URL>", "api_key": "ollama"},
    ],
    "judge": {"model": "gemma3:12b", "base_url": "<BASE_URL>", "api_key": "ollama"},
    "critic": {"same": True},
    "strategy": {"type": "fixed", "technique": "harq_ir", "params": {"max_rounds": 4}},
})

result = mod.run("Prove the sum of the first n odd integers is n^2.", category="reasoning")
print(result.text, result.cost_usd, result.cost_source, result.technique_used)
```

Swap "harq_ir" for "diversity_mrc", "turbo", "fountain", or any other technique. Same API, same ReliabilityResult structure, same cost-source tier on every output. For production use, switch strategy to routed and the library automatically selects the right technique per prompt (cheap baseline for easy prompts, diversity_mrc for hard ones).
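The post does not spell out how λ enters the router's objective, so the following is only one common way to express a quality-vs-cost tradeoff: score each technique as predicted quality minus λ times predicted cost, and pick the maximum. The `Estimate` class, the `pick_technique` helper, and every number below are hypothetical illustrations, not benchmark results or library internals.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Estimate:
    technique: str
    quality: float   # predicted quality for this prompt, in [0, 1]
    cost_usd: float  # predicted cost for this prompt


def pick_technique(estimates: Iterable[Estimate], lam: float) -> str:
    """Pick the technique maximizing quality - lam * cost (assumed objective)."""
    return max(estimates, key=lambda e: e.quality - lam * e.cost_usd).technique


# Made-up per-prompt estimates, for illustration only.
candidates = [
    Estimate("baseline", quality=0.70, cost_usd=0.001),
    Estimate("harq_ir", quality=0.80, cost_usd=0.004),
    Estimate("diversity_mrc", quality=0.85, cost_usd=0.010),
]

quality_first = pick_technique(candidates, lam=0.0)   # cost ignored
balanced = pick_technique(candidates, lam=20.0)       # moderate cost penalty
cost_first = pick_technique(candidates, lam=50.0)     # heavy cost penalty
```

Sweeping `lam` from zero upward moves the choice from the most expensive technique toward the cheapest, which is the tradeoff curve in miniature.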

Three things worth highlighting

Beyond the technique catalog, three aspects of the implementation required significant effort:

1. Native async streaming for all but 2 techniques (acm_soft, acm_learned), with role-tagged events. mod.astream() drives AsyncOpenAI / AsyncAnthropic / httpx.AsyncClient end-to-end (no worker-thread bridge) and emits TokenEvents tagged with a role: "answer", "thinking", "draft", "critique", "verification", "candidate", "synthesis". So when you stream a HARQ-IR run, you can render the round-by-round drafts and critiques live, not just the final answer:

```python
import asyncio

# Assumes `mod` from the minimal example, with TokenEvent and FinalEvent
# imported from agentcodec.
async def main():
    async for ev in mod.astream("Explain QUIC vs TCP."):
        if isinstance(ev, TokenEvent):
            if ev.role == "answer":
                print(ev.text, end="", flush=True)
            elif ev.role == "draft":
                print(f"\n[draft] {ev.text}")
            elif ev.role == "critique":
                print(f"\n[CRITIC] {ev.text}")
            elif ev.role == "thinking":
                pass  # captured to result.thinking_text
        elif isinstance(ev, FinalEvent):
            print(f"\ndone — {ev.result.technique_used}, "
                  f"thinking_cost=${ev.result.thinking_cost_usd:.4f}")

asyncio.run(main())
```

Parallel-branch techniques fan out concurrently via asyncio.gather. diversity_mrc with two models actually runs them in parallel, and you see per-branch ProgressEvents as each one completes.
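As a rough picture of that fan-out, the sketch below runs stub branches concurrently with `asyncio.gather` and combines them by a confidence-weighted vote, loosely in the spirit of maximal-ratio combining. The `fake_branch` coroutine, the confidence values, and the voting rule are assumptions made for illustration; the library's actual combiner may differ.

```python
import asyncio
from collections import defaultdict
from typing import Dict, List, Tuple

Branch = Tuple[str, str, float, float]  # (name, answer, confidence, delay_s)


async def fake_branch(name: str, answer: str, confidence: float, delay: float):
    """Stand-in for one model call; a real branch would await an API request."""
    await asyncio.sleep(delay)
    return name, answer, confidence


async def fan_out_and_combine(branches: List[Branch]) -> str:
    # All branches run concurrently; gather returns results in input order.
    results = await asyncio.gather(*(fake_branch(*b) for b in branches))
    weights: Dict[str, float] = defaultdict(float)
    for _name, answer, confidence in results:
        weights[answer] += confidence  # sum confidence per distinct answer
    return max(weights, key=weights.get)


winner = asyncio.run(fan_out_and_combine([
    ("model_a", "n^2", 0.9, 0.02),
    ("model_b", "n^2", 0.6, 0.01),
    ("model_c", "2n", 0.8, 0.01),
]))
```

Total wall time is roughly that of the slowest branch rather than the sum, which is the point of fanning out.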

2. Thinking-text capture across all backends. Anthropic ThinkingBlock, OpenAI reasoning_content (+ exact reasoning_tokens from usage.completion_tokens_details), Ollama msg.thinking, and inline <think>...</think> tag stripping (DeepSeek-R1, Qwen3, GLM-4.5+, Nemotron) all populate result.thinking_text and split result.cost_usd into thinking_cost_usd + answer_cost_usd. So you can finally see what the o-series / Claude / DeepSeek is actually charging you for.
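For the inline-tag case, the idea can be sketched with a regular expression: pull out every `<think>...</think>` span, keep the remainder as the answer, and price the two token counts separately. The helper names and the flat per-token rate are assumptions for illustration; real providers may bill reasoning tokens differently, and this is not the library's own parser.

```python
import re
from typing import Tuple

# Non-greedy match so several <think> blocks are captured separately.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(raw: str) -> Tuple[str, str]:
    """Return (thinking_text, answer_text) from a raw completion."""
    thinking = "\n".join(m.strip() for m in THINK_RE.findall(raw))
    answer = THINK_RE.sub("", raw).strip()
    return thinking, answer


def split_cost(
    thinking_tokens: int, answer_tokens: int, usd_per_output_token: float
) -> Tuple[float, float]:
    """Return (thinking_cost_usd, answer_cost_usd) at one flat output rate."""
    return (
        thinking_tokens * usd_per_output_token,
        answer_tokens * usd_per_output_token,
    )


thinking, answer = split_thinking(
    "<think>Add 1 + 3 + 5 and look for a pattern.</think>The sum is n^2."
)
thinking_cost, answer_cost = split_cost(300, 100, 2e-6)
```

Here the hidden reasoning accounts for three quarters of the output tokens, which is exactly the kind of split the separate cost fields make visible.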

3. Drop-in compat shims with expose_reliability_stream=True. By default, the shim looks identical to the native SDK: delta.content for the answer, delta.reasoning_content for thinking. Drafts and critiques are hidden, so existing code keeps working unchanged. Set the flag and the shim surfaces internal roles via sentinel fields (delta.agentcodec_role, delta.agentcodec_call_id) that existing consumers ignore harmlessly:

```python
from agentcodec.openai import AsyncOpenAI

client = AsyncOpenAI(api_key=KEY, reliability="harq_ir", expose_reliability_stream=True)
# Now drafts/critiques flow through the native OpenAI stream with sentinels.
```

Same flag and same semantics on agentcodec.anthropic.AsyncAnthropic and agentcodec.ollama.AsyncClient.

Other useful bits

Cost transparency built in: every result carries a cost_source tier marking how the price was obtained, from exact_user_rate (you supplied the rate) through openrouter_rate / exact_table_rate / inferred_table_rate down to default_fallback, plus token-estimation flags when only character counts were available. Live pricing fetched from OpenRouter, cached locally for 7 days. No more “I think this run cost $40, maybe?”
Works against whatever you have: OpenAI, Anthropic (native SDK), Ollama (native + python lib + OpenAI-compat), vLLM, OpenRouter, LM Studio, Together. No Docker, no separate inference server, no LangChain.
Strict config schema: typos in YAML / dict configs raise at load time, not on first .run().
195 tests, 25 runnable examples under examples/: async streaming, thinking capture, drop-in compat for all three backends, plus a fully-annotated YAML config.
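The cost-source tiers can be read as a fallback chain: use the most trustworthy rate that is available and record which tier supplied it. The sketch below uses the tier names from the list above in the order given there; the `resolve_rate` function and its dictionary input are hypothetical, and the library's real lookup logic may differ.

```python
from typing import Dict, Optional, Tuple

# Tier names as listed in the post, most trustworthy first.
TIERS = [
    "exact_user_rate",
    "openrouter_rate",
    "exact_table_rate",
    "inferred_table_rate",
    "default_fallback",
]


def resolve_rate(
    available: Dict[str, Optional[float]], default_rate: float
) -> Tuple[float, str]:
    """Return (usd_per_token, cost_source) from the best available tier."""
    for tier in TIERS[:-1]:
        rate = available.get(tier)
        if rate is not None:
            return rate, tier
    # Nothing better was found, so fall back to the default.
    return default_rate, "default_fallback"


rate, source = resolve_rate(
    {"openrouter_rate": None, "exact_table_rate": 2e-6}, default_rate=1e-6
)
```

Carrying the tier name alongside the number lets downstream reports separate exact costs from estimates.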

Caveats

The headline numbers are for a specific model lineup. The ~56% cost / ~7% quality figures come from a single benchmark run with Nemotron + Devstral as the two generators and GLM-5.1 as the judge. We expect the qualitative pattern (adaptive routing dominates fixed) to hold for other model combinations, since that’s the whole point of the framework, but the absolute numbers will move with the lineup, and we haven’t done the cross-lineup sweep yet. If you swap in different generators, expect different absolute savings; the right comparison is your adaptive router vs. your best fixed baseline on your own lineup.
License is PolyForm Noncommercial 1.0.0: free for research, teaching, personal/internal eval. Commercial use needs a separate license.
The trained SemKNN routing artifacts (learned router mapping prompt embeddings → best technique, the thing that delivers the headline cost number) are not redistributed; the client talks to a remote SemKNN service. All other routers (fixed, acm_table, acm_linear) run fully locally, though the last one needs you to train it.
2 techniques (acm_soft, acm_learned) still fall back to sync dispatch in an executor on the async streaming path. They produce correct FinalEvents but no mid-stream tokens. Native streaming for both is on the roadmap.
This is research code. Expect rough edges on the less-traveled paths (soft-output diversity variants, the learned ACM router).

Feel free to ask about specific techniques, the routing approach, how to add a new one, or the streaming / thinking / compat work. Suggestions on what to ship next are welcome.
