TL;DR: Reliability techniques (approaches that improve an LLM’s accuracy by spending additional inference, such as feedback-driven retries, ensembling, generator/critic refinement loops, verification passes, and difficulty-aware routing) are spread across academic papers, each locked inside its own custom codebase. We consolidated 28 reliability techniques under one unified API: 21 communication-theoretic methods spanning 6 families, plus 7 established baselines (Self-Consistency, Self-Refine, CoVe, BoN, Weighted BoN, CISC, MoA). All of them are benchmarked against a plain single-pass baseline, with 3 adaptive routers (SemKNN plus two local ACM routers) layered on top. We then demonstrated that routing the technique dynamically per prompt lets you move along a quality-vs-cost tradeoff curve. In our paper’s benchmark, which used one specific model lineup (Nemotron and Devstral as the two generators, GLM-5.1 as the judge), the adaptive router achieved roughly 56% cost savings at equivalent quality, or about a 7% quality improvement at equivalent cost, compared to the best fixed method we tested in that same lineup. A single parameter ( […] Getting started is as simple as […]
Specify […]
After spending considerable time studying reliability methods from the research literature, we kept running into the same frustration: every paper comes with its own bespoke codebase, its own prompt format, its own scoring rubric, its own model wrapper. Answering a question like "should I use self-refine or best-of-N for this task?" turned into a week of integration work per comparison. The communication-theory perspective is what made everything click: an LLM acts as a noisy channel.
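To make the noisy-channel framing concrete, here is a minimal sketch of the textbook repetition-code analogy: sample the model several times and take a majority vote. It is an illustration only, not code from the library, and it assumes a binary task where each sample is correct independently with the same probability (`p_correct` and `majority_vote_accuracy` are names made up for this example).

```python
from math import comb


def majority_vote_accuracy(p_correct: float, n_samples: int) -> float:
    """Probability that a strict majority of n independent samples is correct.

    Assumes a binary task where every sample is right with probability
    p_correct, independently of the others. n_samples must be odd so that
    ties cannot occur.
    """
    if n_samples % 2 == 0:
        raise ValueError("use an odd number of samples to avoid ties")
    k_min = n_samples // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_samples, k) * p_correct**k * (1 - p_correct) ** (n_samples - k)
        for k in range(k_min, n_samples + 1)
    )


# Accuracy as a function of how much extra inference is spent
for n in (1, 3, 5, 9):
    print(n, round(majority_vote_accuracy(0.7, n), 4))
```

With a 70% per-sample accuracy, three votes give 78.4% and five give about 83.7%, at three and five times the inference cost. The same formula shows the limits of the analogy: below 50% per-sample accuracy, voting makes things worse, and correlated errors break the independence assumption entirely.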
We packaged everything into a single library: 28 reliability techniques (the 7 established baselines are included in that count of 28, not added on top of it), plus the uncoded single-pass baseline they’re all measured against, plus 3 adaptive routers (SemKNN and two local ACM routers) that pick the best technique for each prompt. The full breakdown is in the README.

The minimal version

from agentcodec import ReliabilityModule

# base_url values are elided in the source; supply your own endpoint
mod = ReliabilityModule.from_dict({
    "models": [
        # Spatial diversity: two different model families = uncorrelated errors
        {"model": "qwen3:8b", "base_url": "…", "api_key": "ollama"},
        {"model": "llama3.1:8b", "base_url": "…", "api_key": "ollama"},
    ],
    "judge": {"model": "gemma3:12b", "base_url": "…", "api_key": "ollama"},
    "critic": {"same": True},
    "strategy": {"type": "fixed", "technique": "harq_ir", "params": {"max_rounds": 4}},
})

result = mod.run("Prove the sum of the first n odd integers is n^2.", category="reasoning")
print(result.text, result.cost_usd, result.cost_source, result.technique_used)

Swap […]

Three things worth highlighting

Beyond the technique catalog, three aspects of the implementation required significant effort:

1. Native async streaming for all but 2 techniques ( […] Parallel-branch techniques fan out concurrently via […]
2. Thinking-text capture across all backends. Anthropic […]
3. Drop-in compat shims with […] for thinking. Drafts/critiques are hidden so existing code keeps working unchanged. Set the flag and the shim surfaces internal roles via sentinel fields (delta.agentcodec_role, delta.agentcodec_call_id) that existing consumers ignore harmlessly:
from agentcodec.openai import AsyncOpenAI
client = AsyncOpenAI(api_key=KEY, reliability="harq_ir", expose_reliability_stream=True)
# Now drafts/critiques flow through the native OpenAI stream with sentinels.
Same flag and same semantics on agentcodec.anthropic.AsyncAnthropic and agentcodec.ollama.AsyncClient.
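A rough sketch of how a consumer might separate internal roles from the final answer once the flag is set. It only relies on the `delta.agentcodec_role` and `delta.agentcodec_call_id` field names mentioned above. The role values (`"draft"`, `"critique"`), the `split_by_role` helper, and the `fake_chunk` stand-in are assumptions made for illustration, and the chunks are mocked so that the example runs without the library or a model.

```python
from types import SimpleNamespace


def split_by_role(chunks):
    """Group streamed text by the sentinel role attached to each delta.

    Chunks without `agentcodec_role` are treated as ordinary answer text,
    which is what a consumer unaware of the sentinels would see anyway.
    """
    by_role = {}
    for chunk in chunks:
        delta = chunk.choices[0].delta
        text = getattr(delta, "content", None)
        if not text:
            continue  # skip empty deltas (e.g. role-only or stop chunks)
        role = getattr(delta, "agentcodec_role", None) or "answer"
        by_role.setdefault(role, []).append(text)
    return {role: "".join(parts) for role, parts in by_role.items()}


def fake_chunk(content, role=None, call_id=None):
    # Stand-in for an OpenAI-style stream chunk; not produced by the library
    delta = SimpleNamespace(content=content)
    if role is not None:
        delta.agentcodec_role = role        # hypothetical role value
        delta.agentcodec_call_id = call_id  # hypothetical call id
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


stream = [
    fake_chunk("1 + 3 + ... = n^2", role="draft", call_id="c1"),
    fake_chunk("check the base case", role="critique", call_id="c2"),
    fake_chunk("The sum is "),
    fake_chunk("n^2."),
]
grouped = split_by_role(stream)
print(grouped["answer"])
```

Reading the sentinels with `getattr` and a default keeps the same consumer working whether or not the flag is enabled, which matches the "ignore harmlessly" behaviour described above.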
Other useful bits
- Cost transparency built in: every result carries a cost_source tier marking how the price was obtained, from exact_user_rate (you supplied the rate) through openrouter_rate / exact_table_rate / inferred_table_rate down to default_fallback, plus token-estimation flags when only character counts were available. Live pricing is fetched from OpenRouter and cached locally for 7 days. No more “I think this run cost $40, maybe?”
- Works against whatever you have: OpenAI, Anthropic (native SDK), Ollama (native + Python lib + OpenAI-compat), vLLM, OpenRouter, LM Studio, Together. No Docker, no separate inference server, no LangChain.
- Strict config schema: typos in YAML / dict configs raise at load time, not on first .run().
- 195 tests, 25 runnable examples under examples/: async streaming, thinking capture, drop-in compat for all three backends, plus a fully-annotated YAML config.
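One way the cost_source tiers could be used downstream is to split a batch's spend into "priced with confidence" and "priced by guesswork". The sketch below assumes you have collected `(cost_usd, cost_source)` pairs from your results; the tier names and their order come from the list above, while `summarize_costs` and the `weakest_trusted` cutoff are hypothetical helpers, not part of the library.

```python
from collections import defaultdict

# Tier names as listed in the post, ordered from most to least trustworthy
COST_TIERS = [
    "exact_user_rate",
    "openrouter_rate",
    "exact_table_rate",
    "inferred_table_rate",
    "default_fallback",
]


def summarize_costs(results, weakest_trusted="exact_table_rate"):
    """Total the spend per tier and split it at a trust cutoff.

    `results` is an iterable of (cost_usd, cost_source) pairs. An unknown
    tier name raises ValueError rather than being silently counted.
    """
    cutoff = COST_TIERS.index(weakest_trusted)
    totals = defaultdict(float)
    for cost_usd, cost_source in results:
        if cost_source not in COST_TIERS:
            raise ValueError(f"unknown cost_source: {cost_source!r}")
        totals[cost_source] += cost_usd
    trusted = sum(v for t, v in totals.items() if COST_TIERS.index(t) <= cutoff)
    uncertain = sum(v for t, v in totals.items() if COST_TIERS.index(t) > cutoff)
    return {"trusted_usd": trusted, "uncertain_usd": uncertain, "by_tier": dict(totals)}


# Placeholder figures for illustration only, not measurements
runs = [
    (0.012, "exact_user_rate"),
    (0.030, "openrouter_rate"),
    (0.005, "default_fallback"),
]
summary = summarize_costs(runs)
print(summary["trusted_usd"], summary["uncertain_usd"])
```

If the uncertain share is large, the total for that batch is better read as an estimate than as a bill.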
Caveats
- The headline numbers are for a specific model lineup. The ~56% cost / ~7% quality figures come from a single benchmark run with Nemotron + Devstral as the two generators and GLM-5.1 as the judge. We expect the qualitative pattern (adaptive routing dominates fixed) to hold for other model combinations, since that’s the whole point of the framework, but the absolute numbers will move with the lineup, and we haven’t done the cross-lineup sweep yet. If you swap in different generators, expect different absolute savings; the right comparison is your adaptive router vs your best fixed baseline on your own lineup.
- License is PolyForm Noncommercial 1.0.0: free for research, teaching, personal/internal eval. Commercial use needs a separate license.
- The trained SemKNN routing artifacts (the learned router that maps prompt embeddings → best technique, which is what delivers the headline cost number) are not redistributed; the client talks to a remote SemKNN service. All other routers (fixed, acm_table, acm_linear) run fully locally, though the last one needs you to train it.
- 2 techniques (acm_soft, acm_learned) still fall back to sync dispatch in an executor on the async streaming path. They produce correct FinalEvents but no mid-stream tokens. A fix is on the roadmap.
- This is research code. Expect rough edges on the less-traveled paths (soft-output diversity variants, the learned ACM router).
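Since the SemKNN artifacts are not shipped, here is a toy sketch of the general idea of nearest-neighbour routing over prompt embeddings, as the name suggests. It is not the remote service's implementation: the 2-D vectors stand in for real embeddings, the `single_pass` label is made up, and `knn_route` is a hypothetical helper.

```python
from collections import Counter
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def knn_route(query_vec, labeled, k=3):
    """Return the technique most common among the k most similar prompts.

    `labeled` is a list of (embedding, best_technique) pairs collected
    offline, e.g. from a benchmark run on your own model lineup.
    """
    ranked = sorted(labeled, key=lambda item: cosine(query_vec, item[0]), reverse=True)
    votes = Counter(technique for _, technique in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy training set: first axis ~ "easy lookup", second axis ~ "multi-step reasoning"
LABELED = [
    ((0.9, 0.1), "single_pass"),
    ((0.8, 0.2), "single_pass"),
    ((0.7, 0.3), "single_pass"),
    ((0.1, 0.9), "harq_ir"),
    ((0.2, 0.8), "harq_ir"),
    ((0.3, 0.7), "harq_ir"),
]

print(knn_route((0.85, 0.15), LABELED))  # nearest neighbours are all single_pass
print(knn_route((0.15, 0.85), LABELED))  # nearest neighbours are all harq_ir
```

A router like this only saves money if cheap techniques are picked where they suffice, so the labels have to come from measurements on your own lineup, in line with the first caveat.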
Feel free to ask about specific techniques, the routing approach, how to add a new one, or the streaming / thinking / compat work. Suggestions on what to ship next are welcome.
submitted by /u/Intellerce



![Boost Your LLM Efficiency with a Source-Available Reliability Library: Halve Inference Costs at No Quality Loss—Adopt with One Simple Import Change We built a source-available LLM reliability library (free for research / personal / internal eval) that can cut inference cost by half at matched quality, and you adopt it by changing one import [P] [R]](https://technologiesdigest.com/wp-content/uploads/2026/06/We-built-a-source-available-LLM-reliability-library-free-for-research.png)