# Overview
Picture this scenario: a multi-agent system that reads source files, generates patches, executes tests, and cycles through four different services — racking up 400 API calls in a single afternoon. Then the alert pops up: you’ve hit your soft usage cap yet again. Every token carries a cost, every request transmits your proprietary code to an external server, and rate limits keep disrupting long sessions. The only way forward? Pay up.
Gemma 4 26B MoE engages just 3.8 billion out of its total 26 billion parameters during each forward pass. It achieves 77.1% on LiveCodeBench v6 and 86.4% on τ2-bench for agentic tool usage — a benchmark specifically designed to evaluate how well a model calls tools, carries out multi-step procedures, and recovers from errors across complex workflows. Its predecessor, Gemma 3 27B, managed only 6.6% on that same benchmark. This isn’t a marginal improvement. It’s the gap between a model that can’t reliably invoke tools and one that can sustain a Claude Code agentic loop without constantly mangling its function call arguments.
This guide walks you through the complete setup: Ollama hosting Gemma 4 on your own machine, the Modelfile configuration that prevents context window overflows during agentic sessions, the settings.json file that connects Claude Code to your local endpoint, a validation script to confirm everything works before you point it at real projects, and a candid breakdown of what tends to go wrong and how to address it. This is written for engineers who already grasp how large language models function and what agentic loops cost. No beginner-level explanations here.
# Why Choose Gemma 4?
Launched on April 2, 2026 under the Apache 2.0 license, Gemma 4 represents Google DeepMind’s most powerful open-weight model lineup so far. Four versions were released: E2B (2B effective), E4B (4B effective), 26B MoE, and 31B Dense. The 26B MoE variant employs 128 small experts but activates only 8 per token along with one shared expert, delivering performance close to the 31B Dense at a fraction of the compute cost.
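To make the sparse-activation idea concrete, here is a toy sketch of top-k expert routing in plain Python. It only mirrors the counts quoted above (128 experts, 8 routed per token). The random router scores, the softmax over the selected experts, and the function name `route_token` are illustrative assumptions, not Gemma 4's actual implementation.

```python
import math
import random

NUM_EXPERTS = 128  # routed experts, per the figures quoted above
TOP_K = 8          # experts activated per token


def route_token(router_logits, k=TOP_K):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    # Indices of the k largest router scores.
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    # Softmax over the selected scores only, so the weights sum to 1.
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
selected = route_token(logits)
print(f"{len(selected)} of {NUM_EXPERTS} experts active for this token")
```

Only the selected experts would run their feed-forward computation for that token, which is why the active parameter count stays far below the total.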
Earlier Gemma models used a custom Google license with commercial-use terms vague enough that corporate legal teams consistently raised red flags. Gemma 4 adopts Apache 2.0 — a first for the Gemma series. If your organization wants to integrate this into internal tools, build products on top of it, or deploy it in production pipelines without lengthy legal reviews, this licensing shift is a significant practical advantage.
// Key Benchmarks for Coding Agents
| Benchmark | Gemma 3 27B | Gemma 4 26B MoE | Gemma 4 31B Dense |
|---|---|---|---|
| τ2-bench (agentic tool use) | 6.6% | ~79% | 86.4% |
| LiveCodeBench v6 | 29.1% | 77.1% | 80.0% |
| GPQA Diamond | 42.4% | 82.3% | 84.3% |
| AIME 2026 (math) | 20.8% | 88.3% | 89.2% |
| Arena AI ELO | 1365 | 1441 | 1452 |
// Hardware Requirements
Before downloading an 18 GB model, make sure you understand your hardware constraints. The Gemma 4 family was built to cover everything from edge devices to high-end workstations, and the four variants span that full spectrum.
| Variant | Ollama tag | Active params | VRAM at Q4 | Context window |
|---|---|---|---|---|
| Edge 4B | gemma4:e4b | 4B | ~6 GB | 128K |
| 26B MoE | gemma4:26b | 3.8B | ~16–18 GB | 256K |
| 31B Dense | gemma4:31b | 31B | ~24–32 GB | 256K |
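As a quick way to read the table, the sketch below picks the variants that fit a given amount of VRAM. The figures are the upper bounds from the table above. The `headroom_gb` margin for the KV cache and the function name `variants_that_fit` are assumptions for illustration.

```python
# Upper-bound VRAM figures at Q4 in GB, copied from the table above.
VRAM_AT_Q4_GB = {
    "gemma4:e4b": 6,
    "gemma4:26b": 18,
    "gemma4:31b": 32,
}


def variants_that_fit(available_vram_gb, headroom_gb=2.0):
    """List Ollama tags whose Q4 footprint plus headroom fits in the given VRAM."""
    return [
        tag
        for tag, needed in VRAM_AT_Q4_GB.items()
        if needed + headroom_gb <= available_vram_gb
    ]


print(variants_that_fit(24))  # ['gemma4:e4b', 'gemma4:26b']
```

The headroom value is a placeholder; actual KV cache size depends on the context window you configure.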
// Setting Up Ollama, Gemma 4, and Claude Code
Step 1: Get Ollama Installed
# macOS and Linux -- single-command install
curl -fsSL https://ollama.com/install.sh | sh
# Check your version -- you need 0.14.0+ for Anthropic Messages API support
# The Anthropic-compatible endpoint was introduced in January 2026
ollama --version
# Expected output: ollama version is 0.22.x or later (as of May 2026)
# Windows: grab the native installer from
# For GPU passthrough on Windows, WSL2 is the recommended approach
Once installed, Ollama runs as a background service listening on port 11434. Confirm it’s active:
curl http://localhost:11434
# Expected response: Ollama is running
Step 2: Download Gemma 4
# Download the 26B MoE variant -- recommended for this setup (~18 GB)
ollama pull gemma4:26b
# ollama pull prints its own progress while the download runs.
# To see which models are currently loaded in memory, use:
ollama ps
# Optional: also download the 31B variant for benchmarking on hardware with enough resources
ollama pull gemma4:31b
# Verify the download finished successfully
ollama list
# Should display gemma4:26b along with its file size and last modified timestamp
Step 3: Install Claude Code
# Prerequisite: Node.js version 18 or newer
node --version # Confirm your Node.js version is 18 or above
# Install Claude Code CLI globally via npm
npm install -g @anthropic-ai/claude-code
# Confirm the installation was successful
claude --version
With Ollama up and running and Gemma 4 downloaded, it is tempting to set the environment variables and fire up Claude Code right away. One piece of configuration has to come first.
# The Modelfile
By default, Ollama assigns Gemma 4 a context window of just 4K tokens. Gemma 4's true context capacity is 128K–256K tokens. That 4K default isn't a guideline — it is what Ollama will enforce unless you explicitly override it. During a Claude Code agentic session that reads source files, tracks conversation history, and accumulates tool call results over many turns, 4K tokens runs out almost instantly.
Without the context override, Claude Code loses track of file contents partway through an edit, drops earlier instructions, and generates incomplete modifications. For example: when the agent attempts to refactor a 200-line service class, it silently forgets the second half of the file exists. No error is raised — the agent simply operates on a partial view of the file and produces output that breaks downstream.
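A back-of-the-envelope budget shows how quickly the default fills up. The sketch below assumes roughly 4 characters per token, which is a crude heuristic and not Gemma 4's real tokenizer, and reserves 4096 tokens for the model's reply. The 200-line file is synthetic.

```python
CHARS_PER_TOKEN = 4  # rough heuristic (assumption); real tokenizers vary


def estimate_tokens(text):
    """Very rough token estimate from character count."""
    return max(1, len(text) // CHARS_PER_TOKEN)


def fits_in_context(texts, num_ctx, reserved_for_output=4096):
    """Return (fits, used_tokens) for prompt pieces under a context limit."""
    used = sum(estimate_tokens(t) for t in texts)
    return used + reserved_for_output <= num_ctx, used


# A hypothetical 200-line file at 60 characters per line (newline included).
source_file = ("x" * 59 + "\n") * 200
fits_default, used = fits_in_context([source_file], num_ctx=4096)
fits_64k, _ = fits_in_context([source_file], num_ctx=65536)
print(used, fits_default, fits_64k)  # 3000 False True
```

Under these assumptions a single file plus the reply budget already exceeds 4K, before any conversation history or tool results are counted.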
The solution is a Modelfile that embeds the correct context size and other inference settings directly into a named model variant. Create the following file:
# ~/.ollama/Modelfiles/gemma4-claude
# Gemma 4 26B MoE variant optimized for Claude Code agentic sessions.
# Embeds context window, temperature, and system prompt into the model
# so every Claude Code session starts with the right configuration.
#
# Build with:
# mkdir -p ~/.ollama/Modelfiles
# ollama create gemma4-claude -f ~/.ollama/Modelfiles/gemma4-claude
FROM gemma4:26b
# Context window -- 65536 tokens (64K) is the tested-safe minimum for real-world
# codebases without triggering swap on systems with 16-18 GB VRAM.
# Bump to 131072 (128K) if you have headroom on 24 GB+ systems.
# Avoid exceeding 131072 unless you have profiled your memory usage
# under load -- Ollama pre-allocates the entire KV cache at startup.
PARAMETER num_ctx 65536
# Temperature -- 0.2 is intentionally low for agentic coding tasks.
# Higher temperatures introduce inconsistency in tool call parameter
# formatting, which causes Claude Code's tool validator to reject calls.
# For creative work you would raise this. For agentic loops: keep it low.
PARAMETER temperature 0.2
# top_p -- nucleus sampling threshold. 0.9 keeps output focused
# while preventing the repetition loops that top_p=1.0 can cause
# during long agentic sessions.
PARAMETER top_p 0.9
# repeat_penalty -- penalizes the model for repeating tokens.
# 1.15 helps prevent tool call loops where Gemma 4 retries the same
# failed tool call with nearly identical parameters endlessly.
PARAMETER repeat_penalty 1.15
# num_predict -- maximum tokens per response. 4096 is enough for
# most code patches. Raise to 8192 if you regularly generate
# large files in a single generation pass.
PARAMETER num_predict 4096
# System prompt -- reinforces coding agent behavior and explicit
# tool use discipline. Gemma 4 benefits from being reminded to
# commit to tool calls rather than merely describing what it would do.
SYSTEM """You are a senior software engineer operating as a coding agent.
When working with code:
- Read files before editing them. Never assume file contents.
- Make one focused change at a time and verify it before proceeding.
- When a tool call fails, examine the error carefully before retrying.
Do not retry with identical parameters. Diagnose first.
- Prefer surgical edits over full file rewrites.
- Run tests after each meaningful change, not after a batch of changes.
- If you are uncertain about the codebase structure, read more files
rather than guessing.
Be precise and methodical. Avoid explaining what you are about to do
when you could simply do it."""
Build the variant:
# Create the Modelfiles directory if it does not already exist
mkdir -p ~/.ollama/Modelfiles
# Save the Modelfile content above to this path, then build:
ollama create gemma4-claude -f ~/.ollama/Modelfiles/gemma4-claude
# Confirm the variant was created successfully
ollama list
# Should display gemma4-claude alongside gemma4:26b
# Quick smoke test -- verify the model loads and responds correctly
ollama run gemma4-claude "What is the time complexity of binary search and why?"
# Expect a clear, concise technical answer within a few seconds
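Before building, it can help to confirm that the Modelfile text really contains the overrides you expect. The sketch below parses `PARAMETER` lines from Modelfile text and flags values outside the ranges this guide recommends. The helper names and the thresholds passed as defaults are assumptions for illustration, and a missing parameter is deliberately treated as a problem.

```python
def parse_parameters(modelfile_text):
    """Collect PARAMETER lines from Modelfile text into a dict of strings."""
    params = {}
    for raw in modelfile_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split(None, 2)
        if len(parts) == 3 and parts[0].upper() == "PARAMETER":
            params[parts[1]] = parts[2]
    return params


def check_agentic_settings(params, min_ctx=65536, max_temperature=0.2):
    """Return a list of problems; an empty list means the settings look fine."""
    problems = []
    # Missing values fall back to defaults that fail the check on purpose.
    if int(params.get("num_ctx", 0)) < min_ctx:
        problems.append(f"num_ctx below {min_ctx}")
    if float(params.get("temperature", 1.0)) > max_temperature:
        problems.append(f"temperature above {max_temperature}")
    return problems


sample = """
FROM gemma4:26b
# Context window
PARAMETER num_ctx 65536
PARAMETER temperature 0.2
PARAMETER top_p 0.9
"""
print(check_agentic_settings(parse_parameters(sample)))  # []
```

To check the real file, pass it the contents of ~/.ollama/Modelfiles/gemma4-claude instead of the inline sample.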
# Wiring Claude Code to the Local Model
With the model variant built, the configuration layer connects Claude Code to Ollama. Two environment variables form the core of this setup, but three additional variables prevent the most common failure modes.
Ollama's Anthropic-compatible endpoint sits at http://localhost:11434, not http://localhost:11434/v1. The /v1 path is Ollama's OpenAI-compatible layer. Claude Code uses the Anthropic Messages API protocol and appends /v1/messages to the base URL itself, so the base URL must be the bare host and port. Using the /v1 path will trigger authentication errors or unpredictable behavior.
// Global Settings — ~/.claude/settings.json
This configuration applies to every Claude Code session across all projects. It is the right choice unless you frequently switch between local and cloud models on a per-project basis.
{
"env": {
"ANTHROPIC_BASE_URL": "http://localhost:11434",
"ANTHROPIC_AUTH_TOKEN": "ollama",
"ANTHROPIC_API_KEY": "",
"ANTHROPIC_MODEL": "gemma4-claude",
"ANTHROPIC_DEFAULT_SONNET_MODEL": "gemma4-claude",
"ANTHROPIC_DEFAULT_HAIKU_MODEL": "gemma4-claude",
"ANTHROPIC_DEFAULT_OPUS_MODEL": "gemma4-claude",
"CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS": "1"
}
}
Why each variable matters:
- ANTHROPIC_BASE_URL redirects all Claude Code API calls from Anthropic's cloud servers to your local Ollama instance.
- ANTHROPIC_AUTH_TOKEN must be set to any non-empty string; Ollama ignores the value but Claude Code requires the header to be present.
- ANTHROPIC_API_KEY: "" explicitly clears the key so Claude Code cannot accidentally fall back to a real Anthropic API key if one happens to exist in your shell environment. Without this, a misconfigured ANTHROPIC_BASE_URL could silently route requests to Anthropic's servers instead of your local setup.
- ANTHROPIC_MODEL is the primary model name Claude Code sends in requests. Set this to your custom Modelfile variant, gemma4-claude, not gemma4:26b. The raw model tag does not carry the context window override.
- ANTHROPIC_DEFAULT_SONNET_MODEL, ANTHROPIC_DEFAULT_HAIKU_MODEL, and ANTHROPIC_DEFAULT_OPUS_MODEL: Claude Code internally routes different task types to different model tiers. Setting all three to the same local model ensures every request lands at your Ollama instance regardless of which tier Claude Code selects.
- CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS: "1" strips the Anthropic-specific beta headers that Claude Code adds to requests. Local inference servers do not recognize these headers and reject requests that include them. Setting this variable prevents that error without affecting any core Claude Code functionality.
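The rules above are easy to get subtly wrong, so a small linter can catch mistakes before a session starts. This is a minimal sketch that encodes only the constraints described in this section. The function names `lint_settings` and `lint_settings_file` are hypothetical and not part of Claude Code.

```python
import json
from pathlib import Path

REQUIRED_KEYS = [
    "ANTHROPIC_BASE_URL",
    "ANTHROPIC_AUTH_TOKEN",
    "ANTHROPIC_MODEL",
    "ANTHROPIC_DEFAULT_SONNET_MODEL",
    "ANTHROPIC_DEFAULT_HAIKU_MODEL",
    "ANTHROPIC_DEFAULT_OPUS_MODEL",
    "CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS",
]


def lint_settings(settings):
    """Return a list of problems found in a parsed settings.json dict."""
    env = settings.get("env", {})
    problems = [f"missing {key}" for key in REQUIRED_KEYS if not env.get(key)]
    if env.get("ANTHROPIC_BASE_URL", "").rstrip("/").endswith("/v1"):
        problems.append("ANTHROPIC_BASE_URL should not end in /v1")
    if env.get("ANTHROPIC_API_KEY"):
        problems.append("ANTHROPIC_API_KEY should be empty")
    # All four model variables should name the same local model.
    models = {env.get(key) for key in REQUIRED_KEYS if key.endswith("_MODEL")}
    if len(models) > 1:
        problems.append("model variables point at different models")
    return problems


def lint_settings_file(path=Path.home() / ".claude" / "settings.json"):
    """Load a settings file from disk and lint it."""
    return lint_settings(json.loads(Path(path).read_text()))


good = {"env": {
    "ANTHROPIC_BASE_URL": "http://localhost:11434",
    "ANTHROPIC_AUTH_TOKEN": "ollama",
    "ANTHROPIC_API_KEY": "",
    "ANTHROPIC_MODEL": "gemma4-claude",
    "ANTHROPIC_DEFAULT_SONNET_MODEL": "gemma4-claude",
    "ANTHROPIC_DEFAULT_HAIKU_MODEL": "gemma4-claude",
    "ANTHROPIC_DEFAULT_OPUS_MODEL": "gemma4-claude",
    "CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS": "1",
}}
print(lint_settings(good))  # []
```

Calling `lint_settings_file()` with no arguments checks the global file; pass a project path to check a per-project file instead.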
// Per-Project Configuration — .claude/settings.json
For projects where you want local inference isolated from your global setup — private repositories, sensitive codebases, or projects with specific model requirements — use a project-level settings file instead:
# In your project root
mkdir -p .claude
cat > .claude/settings.json << 'EOF'
{
"env": {
"ANTHROPIC_BASE_URL": "http://localhost:11434",
"ANTHROPIC_AUTH_TOKEN": "ollama",
"ANTHROPIC_API_KEY": "",
"ANTHROPIC_MODEL": "gemma4-claude",
"ANTHROPIC_DEFAULT_SONNET_MODEL": "gemma4-claude",
"ANTHROPIC_DEFAULT_HAIKU_MODEL": "gemma4-claude",
"ANTHROPIC_DEFAULT_OPUS_MODEL": "gemma4-claude",
"CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS": "1"
}
}
EOF
Claude Code reads the project-level .claude/settings.json when it exists, overriding global settings for that project. Add .claude/settings.json to your .gitignore if the settings contain anything environment-specific, or commit it if you want the entire team running local inference on that project.
// Verifying the Setup
Before running Claude Code against a real codebase, verify three things: Ollama is serving correctly, the model responds to API calls in the Anthropic Messages format, and tool calling specifically works. The third point is non-negotiable: tool calling is how Claude Code reads files, writes patches, and executes commands. A model that cannot format tool calls correctly will loop and fail on basic agentic tasks.
Prerequisites:
pip install httpx # Async HTTP client for the verification script
The full verification script:
#!/usr/bin/env python3
"""
verify_local_setup.py
Verifies the full Claude Code + Ollama + Gemma 4 stack before use.
Runs four checks in sequence:
1. Ollama health
2. Model availability
3. Basic Anthropic Messages API call
4. Tool calling round-trip
Prerequisites:
pip install httpx
How to run:
python verify_local_setup.py
Expected output on a working setup:
[PASS] Ollama is running on localhost:11434
[PASS] Model 'gemma4-claude' is available
[PASS] Anthropic Messages API call successful
[PASS] Tool calling: model produced a valid tool_use block
All checks passed -- Claude Code + Ollama + Gemma 4 is ready.
"""
import httpx
import json
import sys
# ── Configuration ─────────────────────────────────────────────────────────────
OLLAMA_BASE_URL = "http://localhost:11434"
MODEL_NAME = "gemma4-claude" # Must match your Modelfile variant name
TIMEOUT = 120.0 # Seconds -- generation can be slow on first call
def check_ollama_health() -> bool:
    """
    Check 1: Verify Ollama is running and responding.
    Hits the root endpoint which returns 'Ollama is running' when healthy.
    """
    print("\nCheck 1: Ollama health")
    try:
        response = httpx.get(OLLAMA_BASE_URL, timeout=5.0)
        if "Ollama is running" in response.text:
            print(f" [PASS] Ollama is running on {OLLAMA_BASE_URL}")
            return True
        else:
            print(f" [FAIL] Unexpected response: {response.text[:100]}")
            return False
    except httpx.ConnectError:
        print(f" [FAIL] Cannot connect to {OLLAMA_BASE_URL}")
        print(" Is Ollama running? Try: ollama serve")
        return False
def check_model_available() -> bool:
    """
    Check 2: Verify the specific model variant is available in Ollama.
    Uses the /api/tags endpoint which lists all pulled models.
    """
    print("\nCheck 2: Model availability")
    try:
        response = httpx.get(f"{OLLAMA_BASE_URL}/api/tags", timeout=5.0)
        data = response.json()
        models = [m["name"] for m in data.get("models", [])]
        # Normalize: Ollama may add ":latest" if not specified
        normalized = [m.split(":")[0] for m in models]
        if MODEL_NAME in models or MODEL_NAME in normalized:
            print(f" [PASS] Model '{MODEL_NAME}' is available")
            return True
        else:
            print(f" [FAIL] Model '{MODEL_NAME}' not found")
            print(f" Available models: {', '.join(models) or 'none'}")
            print(f" Run: ollama create {MODEL_NAME} -f ~/.ollama/Modelfiles/gemma4-claude")
            return False
    except Exception as e:
        print(f" [FAIL] Error checking model list: {e}")
        return False
def check_messages_api() -> bool:
    """
    Check 3: Send a basic Anthropic Messages API call to the local endpoint.
    Verifies the request format, model routing, and basic generation work.
    Uses the same /v1/messages path and request schema that Claude Code uses.
    Note: ANTHROPIC_BASE_URL is the bare host and port; the client appends
    /v1/messages to it when sending requests.
    """
    print("\nCheck 3: Anthropic Messages API call")
    payload = {
        "model": MODEL_NAME,
        "max_tokens": 100,
        "messages": [
            {
                "role": "user",
                "content": "Reply with exactly: VERIFICATION_OK"
            }
        ]
    }
    headers = {
        "Content-Type": "application/json",
        "x-api-key": "ollama",  # Required by the API spec; value ignored locally
        "anthropic-version": "2023-06-01"  # Required version header
    }
    try:
        response = httpx.post(
            f"{OLLAMA_BASE_URL}/v1/messages",
            json=payload,
            headers=headers,
            timeout=TIMEOUT
        )
        if response.status_code != 200:
            print(f" [FAIL] HTTP {response.status_code}: {response.text[:200]}")
            return False
        # Assumption: any reply containing at least one text block counts as a pass.
        data = response.json()
        text_blocks = [b for b in data.get("content", []) if b.get("type") == "text"]
        if not text_blocks:
            print(" [FAIL] Response contained no text block")
            print(f" Full response: {json.dumps(data, indent=2)}")
            return False
        print(" [PASS] Anthropic Messages API call successful")
        return True
    except Exception as e:
        print(f" [FAIL] Request failed: {e}")
        return False


def check_tool_calling() -> bool:
    """
    Check 4: Verify the model emits a structured tool_use block.
    Offers a single tool and sends a prompt that should trigger it.
    """
    print("\nCheck 4: Tool calling round-trip")
    # Assumption: one read_file tool with a required 'path' parameter,
    # chosen to match the sanity check at the end of this function.
    payload = {
        "model": MODEL_NAME,
        "max_tokens": 512,
        "tools": [
            {
                "name": "read_file",
                "description": "Read the contents of a file at the given path.",
                "input_schema": {
                    "type": "object",
                    "properties": {
                        "path": {"type": "string", "description": "Path of the file to read"}
                    },
                    "required": ["path"]
                }
            }
        ],
        "messages": [
            {
                "role": "user",
                "content": "Read the file src/main.py and summarize it."
            }
        ]
    }
    headers = {
        "Content-Type": "application/json",
        "x-api-key": "ollama",
        "anthropic-version": "2023-06-01"
    }
    try:
        response = httpx.post(
            f"{OLLAMA_BASE_URL}/v1/messages",
            json=payload,
            headers=headers,
            timeout=TIMEOUT
        )
        if response.status_code != 200:
            print(f" [FAIL] HTTP {response.status_code}: {response.text[:200]}")
            return False
        data = response.json()
        tool_blocks = [b for b in data.get("content", []) if b.get("type") == "tool_use"]
        if not tool_blocks:
            print(" [FAIL] Model replied without a tool_use block.")
            print(" A Claude Code session will fail on file operations.")
            print(f" Full response: {json.dumps(data, indent=2)}")
            return False
        tool_call = tool_blocks[0]
        tool_name = tool_call.get("name", "")
        tool_input = tool_call.get("input", {})
        print(" [PASS] Tool calling: model produced a valid tool_use block")
        print(f" Tool called: {tool_name}")
        print(f" Parameters: {json.dumps(tool_input)}")
        # Sanity check: did it call the right tool with the right parameter?
        if tool_name == "read_file" and "path" in tool_input:
            print(" Tool name and parameter are correct.")
        else:
            print(" WARNING: Unexpected tool name or missing 'path' parameter.")
            print(" The model called a tool but not the expected one.")
        return True
    except Exception as e:
        print(f" [FAIL] Request failed: {e}")
        return False
def main():
    print("=" * 60)
    print("Claude Code + Ollama + Gemma 4 Setup Verification")
    print("=" * 60)
    checks = [
        check_ollama_health,
        check_model_available,
        check_messages_api,
        check_tool_calling,
    ]
    results = [check() for check in checks]
    print("\n" + "=" * 60)
    passed = sum(results)
    total = len(results)
    if all(results):
        print(f"All {total} checks passed.")
        print("Claude Code + Ollama + Gemma 4 is ready.")
        print("\nLaunch with: claude")
        sys.exit(0)
    else:
        failed_checks = [i + 1 for i, r in enumerate(results) if not r]
        print(f"{passed}/{total} checks passed. Failed: {failed_checks}")
        print("Resolve the failures above before using Claude Code locally.")
        sys.exit(1)


if __name__ == "__main__":
    main()
How to run:
pip install httpx
python verify_local_setup.py
---
## Agentic Task Walkthrough
With all checks passing, here's what a real agentic session looks like. The task: take an existing Python module that has no tests, analyze it, write a full test suite, run the tests, and fix any failures.
# Navigate to your project directory
cd ~/projects/my-service
# Confirm Claude Code detects the local configuration
claude --version
# Make sure it does not ask for an Anthropic API key — if it does,
# the settings.json file is not being read correctly
# Start an agentic session
claude
# Inside Claude Code, give the agent a clear task:
# > Analyze the UserService class in src/user_service.py.
# > Write a pytest test suite covering all public methods.
# > Run the tests and fix any failures.
# > The goal is a clean pytest run with no skips.
Here's what the Claude Code tool call trace looks like during this session:
→ read_file("src/user_service.py")
Reading 247 lines...
→ list_files("src/")
Found: user_service.py, models.py, db.py, exceptions.py
→ read_file("src/models.py")
Reading 89 lines...
→ write_file("tests/test_user_service.py", [test content])
Written: 312 lines
→ bash("python -m pytest tests/test_user_service.py -v 2>&1")
Running 14 tests...
FAILED tests/test_user_service.py::test_update_email_invalid
AssertionError: Expected ValidationError, got None
→ read_file("src/user_service.py")
Re-reading to locate the bug...
→ edit_file("src/user_service.py", old_string=..., new_string=...)
Applied edit: added email validation check
→ bash("python -m pytest tests/test_user_service.py -v 2>&1")
Running 14 tests...
All 14 passed.
This is the core loop of agentic coding: **read → understand → write → execute → fix → repeat**. Every cycle depends on the model correctly producing structured tool calls rather than plain text descriptions.
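The dependence on structured tool calls can be shown with a minimal dispatcher. The sketch below takes the content blocks of one model reply in Anthropic Messages format, runs each `tool_use` block against a table of Python handlers, and returns `tool_result` blocks for the next turn. It is an illustration under simplified assumptions, not Claude Code's internal implementation; the in-memory `FAKE_FILES` dict and the tool id are stand-ins.

```python
def dispatch_tool_calls(content_blocks, tools):
    """Run every tool_use block and return tool_result blocks for the next turn."""
    results = []
    for block in content_blocks:
        if block.get("type") != "tool_use":
            continue  # plain text blocks describe intent but trigger nothing
        handler = tools.get(block["name"])
        if handler is None:
            output, is_error = f"unknown tool: {block['name']}", True
        else:
            try:
                output, is_error = handler(**block["input"]), False
            except Exception as exc:  # report the failure back to the model
                output, is_error = str(exc), True
        results.append({
            "type": "tool_result",
            "tool_use_id": block["id"],
            "content": output,
            "is_error": is_error,
        })
    return results


# Stand-in for the file system, so the sketch runs anywhere.
FAKE_FILES = {"src/user_service.py": "class UserService: ..."}
tools = {"read_file": lambda path: FAKE_FILES[path]}

reply = [
    {"type": "text", "text": "Reading the service first."},
    {"type": "tool_use", "id": "toolu_01", "name": "read_file",
     "input": {"path": "src/user_service.py"}},
]
results = dispatch_tool_calls(reply, tools)
print(results[0]["content"])  # class UserService: ...
```

A reply that only contains the text block produces an empty result list, which is exactly the failure mode where the agent describes an action without performing it.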
---
## Performance Benchmarks
Local inference is slower than Anthropic's data centers. Here are real-world numbers from a machine with an NVIDIA RTX 4090 (24 GB VRAM) running Gemma 4 26B MoE.
| Metric | Value |
|---|---|
| Tokens per second | 25–40 |
| Time per API call | 2–8 seconds |
| API calls per session | 15–40 (depends on task complexity) |
| Tool calls per session | 30–120 |
| Total session time | 3–10 minutes |
| Cost per session | $0 (local hardware) |
| Monthly cost (50 sessions) | $0 |
**Key takeaway:** Each call takes longer, but the total cost is zero. For developers who run 50+ sessions a month, the hardware pays for itself within weeks compared to API costs.
---
## Common Pitfalls and Fixes
### 1. Claude Code still asks for an Anthropic API key
**Symptom:** Running `claude` prompts for an API key even after setting environment variables.
# Fix: Double-check that the environment variables are exported in your shell profile
echo $ANTHROPIC_BASE_URL # Should print http://localhost:11434
echo $ANTHROPIC_MODEL # Should print gemma4-claude
# If empty, add them to your shell profile:
echo 'export ANTHROPIC_BASE_URL=http://localhost:11434' >> ~/.bashrc
echo 'export ANTHROPIC_MODEL=gemma4-claude' >> ~/.bashrc
source ~/.bashrc
### 2. Ollama returns 404 on `/v1/messages`
**Symptom:** The Anthropic compatibility endpoint is not responding.
# Fix: Make sure you are running a recent Ollama version that supports
# the Anthropic Messages API shim
ollama --version
# Upgrade if needed:
curl -fsSL https://ollama.com/install.sh | sh
### 3. Tool calling returns text instead of `tool_use` blocks
**Symptom:** The model replies with a description like "I would call read_file" instead of producing a structured tool call.
# Fix: Some models handle tool calling better than others.
# Gemma 4 26B MoE generally works well. If you see this issue:
# - Make sure the tool schema is valid JSON
# - Try adding "tool_choice": {"type": "any"} to force tool use
# - Consider switching to a model with stronger function-calling support
### 4. Out-of-memory errors on smaller GPUs
**Symptom:** Ollama crashes or the system freezes during inference.
# Fix: Use a variant that fits your VRAM
ollama pull gemma4:26b # 26B MoE, ~16-18 GB VRAM at Q4
# Or use a smaller model:
ollama pull gemma4:e4b # Fits in ~6 GB VRAM at Q4
After making the targeted fix to the update_email method, the agent re-reads the relevant code to confirm the change took effect.
→ write_file("tests/test_user_service.py", [corrected test])
The assertion in test_update_email_invalid has been corrected.
→ bash("python -m pytest tests/test_user_service.py -v 2>&1")
All 14 tests passed in 1.23 seconds.
Gemma 4 handles this workflow reliably: it reads files before making edits, runs tests after each change, and diagnoses failures from error output rather than blindly retrying. For complex architectural decisions spanning many files, cloud-based models still hold the advantage. For the tasks described above (code analysis, test generation, and targeted fixes), the local setup is more than capable.
What to watch for: If the agent repeatedly throws "Invalid tool parameters" errors and retries with the same parameters each time, the temperature setting is likely too high, or the model isn't using the gemma4-claude Modelfile variant. Both the temperature setting and the context window override are built into that variant; the base gemma4:26b tag does not include them.
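If you log the agent's tool calls, the retry loop described above is straightforward to detect mechanically. The sketch below flags a session when the last few calls are identical in name and input. The `is_stuck` helper, the log format, and the threshold of three repeats are assumptions for illustration.

```python
import json


def is_stuck(tool_calls, max_repeats=3):
    """True if the last max_repeats tool calls share the same name and input."""
    if len(tool_calls) < max_repeats:
        return False
    # Canonical JSON so dicts with different key order still compare equal.
    recent = {
        (call["name"], json.dumps(call["input"], sort_keys=True))
        for call in tool_calls[-max_repeats:]
    }
    return len(recent) == 1


history = [
    {"name": "read_file", "input": {"path": "src/user_service.py"}},
    {"name": "edit_file", "input": {"path": "src/user_service.py", "old": "a", "new": "b"}},
    {"name": "edit_file", "input": {"new": "b", "old": "a", "path": "src/user_service.py"}},
    {"name": "edit_file", "input": {"path": "src/user_service.py", "old": "a", "new": "b"}},
]
print(is_stuck(history))  # True
```

A positive result is a cue to stop the session and check which model variant and temperature are actually in use.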
// Common Issues and How to Resolve Them
- Tool Parameter Formatting Errors
  - Symptom: Claude Code repeatedly reports Invalid tool parameters. The agent apologizes and retries with identical or nearly identical parameters, then gets stuck in a loop.
  - Cause: This is a known issue documented in the Ollama GitHub issues. The model generates tool call JSON that doesn't match the schema Claude Code expects. The most common problems are incorrect field names, missing required fields, or nested objects where scalar values are expected.
  - Fix: Make sure you're running gemma4-claude (the Modelfile variant) and not gemma4:26b directly. The temperature 0.2 setting and the system prompt in the Modelfile significantly reduce these errors. If the problem continues, lower the temperature to 0.1 in the Modelfile and rebuild.
- Context Window Swapping to Disk
  - Symptom: Generation slows to a crawl after several turns. ollama ps shows GPU utilization dropping. The operating system is paging the KV cache to disk.
  - Fix:
# Option 1: Reduce the context window in the Modelfile
# Edit ~/.ollama/Modelfiles/gemma4-claude
# Change: PARAMETER num_ctx 65536
# To: PARAMETER num_ctx 32768
# Then rebuild: ollama create gemma4-claude -f ~/.ollama/Modelfiles/gemma4-claude
# Option 2: Enable KV cache quantization to reduce memory usage
export OLLAMA_KV_CACHE_TYPE=q8_0
# This quantizes the KV cache, reducing memory usage with a minor quality trade-off
# Restart Ollama after setting this: pkill ollama && ollama serve
- Model Unloading Between Agent Turns
  - Symptom: A noticeable cold-start delay at the beginning of each Claude Code message. Ollama unloads the model after an inactivity timeout and reloads it for each new request.
  - Fix:
# Keep the model loaded for the entire work session
export OLLAMA_KEEP_ALIVE=-1
# Or add it to your shell profile for a permanent setting
echo 'export OLLAMA_KEEP_ALIVE=-1' >> ~/.zshrc
# Alternatively, use the Ollama API to pin the model in memory
curl http://localhost:11434/api/generate -d '{"model": "gemma4-claude", "keep_alive": -1}'
# This keeps the model loaded until you explicitly unload it or restart Ollama
- Beta Header Rejection Errors
  - Symptom: Claude Code throws Unexpected value(s) for the anthropic-beta header errors at launch or mid-session.
  - Fix: Confirm that CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS: "1" is set in your settings.json. If you set it as a shell export instead of in settings.json, verify it's exported in the same shell session where claude is running:
echo $CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS
# Must print: 1
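The keep-alive request from the model-unloading fix above can also be sent from Python, for example at the start of a work session script. This sketch uses only the standard library and targets the same /api/generate endpoint as the curl command. The helper names `build_keep_alive_request` and `pin_model` are hypothetical, and `pin_model` needs a running Ollama instance.

```python
import json
import urllib.request

OLLAMA_BASE_URL = "http://localhost:11434"


def build_keep_alive_request(model, keep_alive=-1, base_url=OLLAMA_BASE_URL):
    """Build the POST request that asks Ollama to keep a model loaded."""
    body = json.dumps({"model": model, "keep_alive": keep_alive}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def pin_model(model):
    """Send the request; returns True on HTTP 200. Requires Ollama to be running."""
    request = build_keep_alive_request(model)
    with urllib.request.urlopen(request, timeout=120) as response:
        return response.status == 200


request = build_keep_alive_request("gemma4-claude")
print(request.get_method(), request.full_url)  # POST http://localhost:11434/api/generate
```

Call `pin_model("gemma4-claude")` once Ollama is up; building the request separately keeps the payload easy to inspect without touching the network.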
# Wrapping Up
The stack described in this article is not a proof of concept — it's a working production configuration that engineers have been running daily since Ollama added Anthropic Messages API support in January 2026. The Modelfile is not optional; it's the difference between a tool that works and one that silently produces incomplete outputs on multi-file tasks. The verification script catches configuration issues before they surface mid-session as confusing agent failures.
The setup built in this article is a private, zero-per-token-cost coding agent that handles the majority of daily engineering tasks — code analysis, test generation, targeted refactoring, and debugging — at generation speeds that are usable on modern hardware.
This setup is not a replacement for cloud inference on complex architectural reasoning across large codebases or SWE-bench class tasks that require deep repository understanding at scale.
Shittu Olumide is a software engineer and technical writer passionate about leveraging cutting-edge technologies to craft compelling narratives, with a keen eye for detail and a knack for simplifying complex concepts. You can also find Shittu on Twitter.



