Hermes Agent Ships Tool Search For MCP: Anthropic Evals Show 49% To 74% Leap In Accuracy On Opus 4

Nous Research’s open-source Hermes Agent now includes a Tool Search capability. It tackles a key challenge in AI agent systems: an overwhelming number of MCP tools consuming available context window space. This guide explains what Tool Search does, how it functions, and when to apply it.

The Challenge: MCP Tools Consuming Context Space

When multiple MCP (Model Context Protocol) servers are linked to an AI agent, the complete JSON schema for every tool is transmitted to the model during each interaction. This occurs regardless of how many tools are actually required for the current task.

This issue becomes apparent quickly in practical applications. A Hermes setup with five MCP servers and 34 tools results in average prompts of approximately 45,000 tokens per interaction. Nearly 22,000 of those tokens – about 50% – are solely for tool schema data.

Anthropic’s own technical reports indicate that tool definitions can occupy up to 134,000 tokens before optimization. A separate research paper, arXiv 2604.21816 (“Tool Attention”), quantifies the “MCP Tools Tax” at 15,000–60,000 tokens per turn for typical deployments using multiple servers.

This situation creates two significant issues:

Cost: Initial cache-miss responses can cost $0.07–$0.10 per interaction.
Reduced accuracy: The model can become overwhelmed when presented with hundreds of irrelevant tool choices at once.

Source: hermes-agent.nousresearch.com/docs · Nous Research 2026

Tool Search functions as Hermes Agent’s optional layer for managing MCP and non-core plugin access. Rather than pre-loading all tool schemas, the model retrieves only what’s necessary – on demand, when needed.

When Tool Search is active, MCP and plugin tools are swapped in the model’s tool list with three interface tools:

tool_search(query, limit?)   — search the available tool catalog
tool_describe(name)          — access the full schema for one tool
tool_call(name, arguments)   — execute a deferred tool

A standard interaction would proceed as follows:

Model: tool_search("create a github issue")
  → { matches: [{ name: "mcp_github_create_issue", ... }] }
Model: tool_describe("mcp_github_create_issue")
  → { parameters: { type: "object", properties: { ... } } }
Model: tool_call("mcp_github_create_issue", { title: "...", body: "..." })
  → { ok: true, issue_number: 42 }

The model locates the required tool, accesses its schema, then performs the action. All existing hooks, safety checks, and approval prompts function with the actual underlying tool name – not the interface layer.
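
The flow above can be sketched in a few lines of Python. This is a minimal illustration, not Hermes Agent's actual code: the CATALOG structure, the create_issue stand-in, the approval_hook function, and the word-overlap matching are all hypothetical, chosen only to show how a bridge call can be unwrapped so that hooks see the real tool name.

```python
from typing import Any, Callable

# Hypothetical registry: real tool name -> (description, JSON schema, handler).
Tool = tuple[str, dict[str, Any], Callable[..., Any]]


def create_issue(title: str, body: str = "") -> dict[str, Any]:
    # Stand-in for a real MCP call; returns a canned result.
    return {"ok": True, "issue_number": 42}


CATALOG: dict[str, Tool] = {
    "mcp_github_create_issue": (
        "Create a GitHub issue",
        {
            "type": "object",
            "properties": {"title": {"type": "string"}, "body": {"type": "string"}},
            "required": ["title"],
        },
        create_issue,
    ),
}

audit_log: list[str] = []


def approval_hook(real_name: str, arguments: dict[str, Any]) -> bool:
    # Hypothetical hook: it receives the underlying tool name, never "tool_call".
    audit_log.append(real_name)
    return True


def tool_search(query: str, limit: int = 5) -> dict[str, Any]:
    # Naive word-overlap match, used here only to keep the sketch short.
    words = set(query.lower().split())
    matches = [
        {"name": name, "description": desc}
        for name, (desc, _, _) in CATALOG.items()
        if words & set((name.replace("_", " ") + " " + desc).lower().split())
    ]
    return {"matches": matches[:limit]}


def tool_describe(name: str) -> dict[str, Any]:
    # Return the full schema for exactly one tool.
    _, schema, _ = CATALOG[name]
    return {"parameters": schema}


def tool_call(name: str, arguments: dict[str, Any]) -> dict[str, Any]:
    # Unwrap the bridge: run checks against the real name, then dispatch.
    if name not in CATALOG:
        return {"ok": False, "error": f"unknown tool: {name}"}
    if not approval_hook(name, arguments):
        return {"ok": False, "error": "denied"}
    _, _, handler = CATALOG[name]
    return handler(**arguments)
```

Calling tool_call("mcp_github_create_issue", {"title": "..."}) in this sketch records "mcp_github_create_issue" in audit_log rather than the bridge name, which is the property the paragraph above describes.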

The Performance Data

This feature is not merely about reducing tokens. Tool Search also improves model accuracy on MCP evaluation benchmarks.

Based on Anthropic’s internal MCP testing:

Claude Opus 4: accuracy increased from 49% → 74% with Tool Search enabled
Claude Opus 4.5: accuracy improved from 79.5% → 88.1% with Tool Search enabled

Overwhelming tool libraries cause “choice overload” — the AI gets lost sorting through dozens of irrelevant options. Cutting those options from the model’s working memory minimizes incorrect tool selections. Anthropic’s benchmarks confirm an 85% drop in tool-definition token consumption while keeping the entire tool library accessible.
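
To put the reported accuracy figures in perspective, the short calculation below derives three views of each gain: percentage points, relative improvement, and the share of prior errors removed. Only the before/after percentages come from the article; the derived numbers are plain arithmetic, and the gains helper is a name invented for this sketch.

```python
def gains(before: float, after: float) -> tuple[float, float, float]:
    """Return (percentage points gained, relative gain %, share of errors removed %)."""
    points = after - before
    relative = points / before * 100
    errors_removed = points / (100 - before) * 100
    return round(points, 1), round(relative, 1), round(errors_removed, 1)


reported = {
    "Claude Opus 4": (49.0, 74.0),
    "Claude Opus 4.5": (79.5, 88.1),
}

for model, (before, after) in reported.items():
    print(model, gains(before, after))
# Claude Opus 4 (25.0, 51.0, 49.0)
# Claude Opus 4.5 (8.6, 10.8, 42.0)
```

On these numbers, the stronger model gains fewer points in absolute terms, but in both cases a bit under half of the earlier errors disappear.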

How Retrieval Works: BM25 and Fallback Logic

At its core, Hermes relies on BM25 — a proven search-ranking algorithm — to match the model’s request against a registry of tool names, descriptions, and parameter fields.

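When BM25 produces no results with a positive score, the system switches to a basic substring lookup on the tool name. This safety net handles edge cases like searching for “github” when nearly every tool name in the catalog already contains “github”: a term that appears in almost every catalog entry carries little or no BM25 weight, so it cannot separate one tool from another.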

The catalog is rebuilt fresh each turn from the current list of tool definitions. This design eliminates stale-catalog bugs where a cached copy falls out of alignment with the live tool registry.
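
The retrieval logic described in this section can be approximated with a small, dependency-free sketch. It is an illustration under stated assumptions, not the Hermes implementation: it uses the classic Okapi BM25 formula with the Robertson IDF (which can drop to zero or below for very common terms), the k1 and b values are conventional defaults, and the tool dictionaries with name, description, and parameters keys are a hypothetical shape.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Lowercase and split on anything that is not a letter or digit.
    return re.findall(r"[a-z0-9]+", text.lower())


def build_catalog(tools: list[dict]) -> list[tuple[str, list[str]]]:
    # One searchable document per tool: name + description + parameter names.
    docs = []
    for tool in tools:
        text = " ".join([tool["name"], tool.get("description", ""), *tool.get("parameters", [])])
        docs.append((tool["name"], tokenize(text)))
    return docs


def bm25_scores(query: str, docs: list[tuple[str, list[str]]],
                k1: float = 1.5, b: float = 0.75) -> dict[str, float]:
    n_docs = len(docs)
    avg_len = sum(len(tokens) for _, tokens in docs) / n_docs
    doc_freq = Counter(term for _, tokens in docs for term in set(tokens))
    scores = {}
    for name, tokens in docs:
        term_freq = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in term_freq:
                continue
            # Robertson IDF: negative once a term appears in more than half the docs.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = term_freq[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / norm
        scores[name] = score
    return scores


def tool_search(query: str, tools: list[dict], limit: int = 5) -> list[str]:
    docs = build_catalog(tools)  # rebuilt on every call, so it cannot go stale
    if not docs:
        return []
    scores = bm25_scores(query, docs)
    ranked = sorted((n for n, s in scores.items() if s > 0), key=lambda n: -scores[n])
    if not ranked:
        # Fallback: plain substring match on the tool name.
        needle = query.lower().strip()
        ranked = [name for name, _ in docs if needle in name.lower()]
    return ranked[:limit]


TOOLS = [
    {"name": "mcp_github_create_issue", "description": "Create a new issue", "parameters": ["title", "body"]},
    {"name": "mcp_github_list_pulls", "description": "List pull requests", "parameters": ["state"]},
    {"name": "mcp_github_merge_pull", "description": "Merge a pull request", "parameters": ["pull_number"]},
]
```

With this sample catalog, the query "create issue" is resolved by BM25 alone, while "github" scores non-positive for every entry and falls through to the substring match, returning all three tools.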

By default, Tool Search operates in auto mode. It engages only when the deferrable tool schemas would take up at least 10% of the active model’s context window.

Under that threshold, tool-array assembly passes through as-is. There’s zero added cost.

This check runs again on every turn:

A session using just a handful of MCP tools with a long-context model may never trigger Tool Search.
A session with several MCP servers connected (typically 15 or more tools) begins activating it.
Disconnecting servers mid-session smoothly reverts to direct tool exposure during the next assembly.
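
The per-turn check can be expressed as a single comparison. The sketch below is an assumption-laden illustration: the should_defer function, its parameter names, and the context-window sizes are hypothetical, and how Hermes actually counts schema tokens is not covered here. Only the 10% default, the three modes, and the 22,000-token example come from the article.

```python
def should_defer(mode: str, schema_tokens: int, context_window: int,
                 threshold_pct: int = 10, deferrable_tools: int = 1) -> bool:
    """Return True when bridge tools should replace the deferrable schemas."""
    if mode == "off" or deferrable_tools == 0:
        return False
    if mode == "on":
        # Forced on whenever at least one deferrable tool exists.
        return True
    # auto: integer comparison avoids float rounding right at the boundary.
    return schema_tokens * 100 >= threshold_pct * context_window


# 22,000 schema tokens in a hypothetical 200,000-token window is 11%.
print(should_defer("auto", 22_000, 200_000))    # True
# The same schemas in a hypothetical 1,000,000-token window are 2.2%.
print(should_defer("auto", 22_000, 1_000_000))  # False
```

Because the function takes the current schema size as input and keeps no state, calling it on every turn naturally covers servers being connected or disconnected mid-session.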

Configuration Reference

Include this block in your hermes.yaml to fine-tune the behavior:

tools:
  tool_search:
    enabled: auto        # auto (default), on, or off
    threshold_pct: 10    # % of context at which auto mode activates
    search_default_limit: 5
    max_search_limit: 20

Key	Default	Description
`enabled`	`auto`	`auto` turns on above the threshold; `on` forces it whenever at least one deferrable tool exists; `off` disables it completely
`threshold_pct`	`10`	Context-window percentage at which `auto` engages. Range: 0–100
`search_default_limit`	`5`	Number of matches returned when the model invokes `tool_search` without specifying a `limit`
`max_search_limit`	`20`	Maximum number of matches the model can request via `limit`. Range: 1–50

You can also use a simple boolean as shorthand:

tools:
  tool_search: true   # equivalent to {enabled: auto}
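
For readers wiring this into their own tooling, the sketch below shows one way to normalize an already-parsed tool_search value into a validated settings object. It is not the Hermes loader: ToolSearchConfig, normalize, and effective_limit are hypothetical names, treating false as off is an assumption, and clamping an oversized limit (rather than rejecting it) is a design choice made for this example. The boolean handling for enabled reflects the fact that YAML 1.1 parsers read a bare on or off as a boolean.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class ToolSearchConfig:
    enabled: str = "auto"
    threshold_pct: int = 10
    search_default_limit: int = 5
    max_search_limit: int = 20


def normalize(raw: Any) -> ToolSearchConfig:
    """Turn the parsed tools.tool_search value into a validated config."""
    if isinstance(raw, bool):
        # Shorthand: true means {enabled: auto}; false -> off is an assumption.
        return ToolSearchConfig(enabled="auto" if raw else "off")
    cfg = dict(raw or {})
    enabled = cfg.get("enabled", "auto")
    if isinstance(enabled, bool):
        # YAML 1.1 parsers load a bare on/off as True/False.
        enabled = "on" if enabled else "off"
    if enabled not in ("auto", "on", "off"):
        raise ValueError(f"enabled must be auto, on, or off, got {enabled!r}")
    threshold = int(cfg.get("threshold_pct", 10))
    if not 0 <= threshold <= 100:
        raise ValueError("threshold_pct must be within 0-100")
    max_limit = int(cfg.get("max_search_limit", 20))
    if not 1 <= max_limit <= 50:
        raise ValueError("max_search_limit must be within 1-50")
    default_limit = int(cfg.get("search_default_limit", 5))
    return ToolSearchConfig(
        enabled=enabled,
        threshold_pct=threshold,
        search_default_limit=default_limit,
        max_search_limit=max_limit,
    )


def effective_limit(cfg: ToolSearchConfig, requested: Optional[int] = None) -> int:
    # Use the default when the model omits limit; clamp to the configured ceiling.
    limit = cfg.search_default_limit if requested is None else requested
    return max(1, min(limit, cfg.max_search_limit))
```

With the defaults, a tool_search call without a limit returns up to 5 matches, and a request for 100 matches is capped at 20.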

Marktechpost’s Visual Walkthrough

Nous Research — Hermes Agent

Tool Search: Fixing the MCP Context Window Problem

When multiple MCP servers plug into an agent, every tool’s JSON schema floods into the model’s context on each turn — even if only a single tool is relevant. Hermes Agent’s Tool Search addresses this through progressive schema disclosure.

~22K tokens/turn overhead in a 5-server, 34-tool setup
85% drop in tool-definition token consumption (Anthropic data)
134K tokens consumed by tool defs before optimization (Anthropic)

The Problem

The MCP Tools Tax

Every connected MCP server pushes its complete JSON schema into context ahead of time. With several servers, this crowds out the real conversation and forces the model to pick from hundreds of irrelevant options, leading to choice overload.

Research paper arXiv 2604.21816 (“Tool Attention”) quantifies the MCP Tools Tax at 15,000–60,000 tokens per turn. Sessions without cache hits can cost $0.07–$0.10 per turn in API charges.

GitHub: 35 tools — ~26K tokens
Slack: 11 tools — ~21K tokens
Jira: ~17K tokens on its own

A five-server arrangement can push overhead toward 100K tokens or more before any conversation even begins.

What It Is

Tool Search: A Progressive-Disclosure Layer

Tool Search is Hermes Agent’s opt-in feature that swaps out every MCP tool schema in the model-visible tools array for just three lightweight bridge tools. The model fetches each tool’s schema on demand — only when it actually needs it.

tool_search(query, limit?)
tool_describe(name)
tool_call(name, arguments)

All hooks, guardrails, and approval prompts continue to run — mapped to the real underlying tool name, not the bridge. The CLI activity feed also resolves to show the actual tool, not the bridge alias.

How It Works

The Three-Step Retrieval Flow

tool_search: BM25 lookup against tool name, description, and parameters
tool_describe: loads the full JSON schema for the matched tool into context
tool_call: the bridge unwraps and the real tool runs with full guardrails

Model: tool_search("create a github issue")
  → { matches: [{ name: "mcp_github_create_issue" }] }
Model: tool_describe("mcp_github_create_issue")
  → { parameters: { type: "object", properties: {…} } }
Model: tool_call("mcp_github_create_issue", { title: "…" })
  → { ok: true, issue_number: 42 }

Accuracy Results

Anthropic MCP Evals Deliver Major Accuracy Improvements

When facing a massive collection of tools, models often struggle to choose correctly. Stripping out irrelevant tool definitions from the prompt cuts down on wrong selections. According to Anthropic’s own internal MCP benchmarks, turning on Tool Search leads to substantial jumps in accuracy.

Claude Opus 4: 49% → 74% accuracy on MCP evals
Claude Opus 4.5: 79.5% → 88.1% accuracy on MCP evals

Note: On Opus 4, around 26 percentage points of inaccuracy remain even with Tool Search enabled, attributed to retrieval misses. Weaker models struggle to craft effective search queries, so Tool Search works best when the model can compose a decent search term.

Configuration

Configuring Tool Search in hermes.yaml

tools:
  tool_search:
    enabled: auto        # auto (default), on, or off
    threshold_pct: 10    # % of context, auto mode only
    search_default_limit: 5
    max_search_limit: 20

# Short form:
tools:
  tool_search: true   # same as {enabled: auto}

Key	Default	Purpose
enabled	auto	auto kicks in past a threshold; on forces it always active; off turns it off
threshold_pct	10	percentage of context window where auto mode activates. Range: 0–100
search_default_limit	5	results returned when the model omits a limit parameter
max_search_limit	20	ceiling the model can ask for via limit. Range: 1–50

Key Takeaways

When to Enable It — and When to Skip It

✓ 15+ tools loaded
✓ Only a handful of tools needed per turn
✓ Several MCP servers in play
⚠ Tiny tool collections — adds overhead for no gain
⚠ Every tool gets used each turn

Bridge tools burn ~300 tokens plus at least one added network round trip per cold tool
Deferred schemas miss out on the system-prompt cache prefix advantage
The catalog is rebuilt from scratch every turn — preventing state drift issues
Security-scoped: the bridge can only reach tools within the session’s allowed set
Built-in Hermes tools (terminal, read_file, web_search, send_message…) are never deferred

Source: hermes-agent.nousresearch.com/docs — Anthropic engineering blog — Nous Research 2026

Key Takeaways

Tool Search holds off on loading MCP tool schemas until the model specifically requests one — through a tool_search / tool_describe / tool_call bridge.
Anthropic’s benchmarks show accuracy climbing from 49% → 74% on Claude Opus 4 when dealing with large tool collections.
BM25 retrieval across tool names, descriptions, and parameter labels drives the search, with substring matching as a fallback for tricky edge cases.
auto mode (the default) adapts on its own: it engages only when deferrable tool definitions would take up at least 10% of the active model’s context window.
Core Hermes tools are never deferred; only MCP-sourced and non-core plugin tools qualify.

Explore the Hermes Agent Tool Search Docs and Anthropic Advanced Tool Use.
