You picked up a Mac Mini to run OpenClaw. Great choice.
Unfortunately, Anthropic has since steered OpenClaw users toward its pay-per-token API1, transforming what started as a one-time hardware buy into a (hefty) recurring cost2. Even if you go with OpenAI, you’ll still be shelling out a fair amount each month.
💵💵 Running a local model wipes out the monthly fee for your OpenClaw agents, completely. 💵💵
That said, getting everything installed and configured can feel overwhelming, particularly if you’re just starting out with local LLMs.
In this guide, I’ll walk you through setting up a local LLM on your Mac Mini — the smoothest way possible — so it can power your agent at no cost.
Even if you’re a complete beginner, you can follow along.
🤨 “I’ve heard local LLMs aren’t as good — is that actually true?”
A local LLM, when configured correctly, will deliver results that are nearly on par for everyday tasks like handling emails, managing your calendar, setting reminders, controlling smart home devices, and doing basic web research — the kinds of things you’d typically use OpenClaw for.
For more demanding work — say, using OpenClaw for software development — there’s a link at the end that walks you through setting up a fallback model.
⚠️ Note: This isn’t a comprehensive OpenClaw tutorial.
It’s designed to help you get your local LLM up and running alongside your agent(s) as quickly as possible.
Hardware
This guide was tested on a Mac Mini with the following configuration:
| OS | macOS Tahoe |
| Version | 26.3.1 |
| Processor | M2 |
| Cores | 8 |
| Unified Memory | 24GB |
If you’re considering buying a Mac Mini, I’d suggest going with at least an M2+ chip and a minimum of 24GB of RAM. You can manage with 16GB, but it’ll be tight, and you may run into issues with larger context windows.
Getting everything ready
Start by installing OpenClaw using the official instructions. If that’s already done, feel free to skip ahead.
1. Install llama.cpp
We’re going to skip Ollama (the recommended local provider) and go with llama.cpp instead. By pairing a quantized model with llama.cpp, we can accelerate inference by up to 70%.
We need to compile llama.cpp from source with metal flags enabled and CUDA disabled. This applies the optimizations necessary to run the model at full speed on your Mac. Just follow the steps below.
1️⃣ First, from your home directory, install a couple of prerequisites via Homebrew.
# paste this into your terminal
$ brew install cmake curl2️⃣ Next, compile llama.cpp with the correct flags.
# Clone llama.cpp
git clone
# Configure the build with Metal acceleration
cmake llama.cpp -B llama.cpp/build
-DBUILD_SHARED_LIBS=OFF
-DGGML_METAL=ON
-DGGML_CUDA=OFF
# Compile
cmake --build llama.cpp/build
--config Release
-j$(sysctl -n hw.ncpu)
--clean-first
--target llama-cli llama-mtmd-cli llama-server llama-gguf-splitAt this point, llama.cpp is built and ready to go.
2. Download the local LLM
As mentioned earlier, the secret to strong performance from a local model is quantization.
Quantization lets us take a larger, more powerful model and “compress” it intelligently so it fits on more modest hardware. The quantized version retains the vast majority of the original model’s capabilities.
Unless you have a beefy GPU or a Mac loaded with the maximum unified memory (80GB+), quantization is essential.
Blindly following the OpenClaw docs while attempting to use a quantized model will only lead to confusion and frustration.
There’s simply no clear, step-by-step resource explaining how to make quantized models work smoothly with agents.
Below is a tested recipe that will get your agent up and running.
Model Choice: Qwen 3.5-9B
Here we’re using Qwen 3.5 (the 9B parameter variant).
As of June 2026, it ranks among the top local models, outperforming Gemma 4-12B. It fits comfortably on both 16GB and 24GB Macs, requiring roughly 6–8GB of RAM. Users also rate it highly for OpenClaw.
Keep in mind that agents need longer context windows, which rules out running the larger 27B version — even with quantization.
1️⃣ Let’s grab the model.
# download the model
curl -L -o models/Qwen3.5-9B-UD-Q4_K_KL.gguf
"2️⃣ Download the template and save it to the templates folder.
mkdir templates &&
curl -o templates/qwen35.jinja
"Important: you must use an agent-compatible template for OpenClaw. Without this, nothing will function properly.
3. Launch llama-server
Llama-server will act as our backend API. OpenClaw will connect to this local service instead of reaching out to OpenAI or Anthropic’s API directly.
We’ve already built llama-server and downloaded our model. Let’s do a quick test run.
1️⃣ Run a quick test.
./llama.cpp/llama-server
-m models/Qwen3.5-9B-UD-Q4_K_XL.gguf
--chat-template-file templates/qwen35.jinja
--temp 0.7
--top-p 0.9
--top-k 20
-c 64000
-ngl 20
--host 127.0.0.1
--port 8080You should see output similar to this (with no errors):
srv llama_server: /think_off It looks like the output was cut off. Let me continue paraphrasing from where it stopped:
srv llama_server: waiting for requests on http://127.0.0.1:8080
If you see the server start up without errors, that means everything is working correctly.
4. Connect OpenClaw to your local model
Now that your local LLM is running, the final step is to point OpenClaw at your llama-server instance instead of an external API provider.
In your OpenClaw configuration, set the API base URL to http://127.0.0.1:8080 and select the appropriate model. This way, all your agent's requests will be handled locally — no API keys, no monthly bills.
Wrapping up
That's all there is to it. You now have a fully functional local LLM powering your OpenClaw agents, running entirely on your Mac Mini with zero ongoing costs.
For most everyday tasks — email, scheduling, reminders, smart home control, and light research — the Qwen 3.5-9B quantized model will serve you well. And if you ever need extra horsepower for more complex work, you can always configure a fallback to a cloud-based model.
Enjoy your free, private, locally-run AI agent.
2️⃣ Now, lets write a launchd daemon, so your local LLM server starts automatically and stays available after reboot. If you're familiar with Linux, launchd is essentially systemd for macOS
Save the following as /Library/LaunchDaemons/com.openclaw.llama-server.plist. You will need to use sudo for this.
Expand this for the plist file
❗Ensure that you replace YOUR_USERNAME with your actual username in the xml.
Label
com.openclaw.llama-server
UserName
YOUR_USERNAME
ProgramArguments
/Users/YOUR_USERNAME/llama.cpp/llama-server
-m
/Users/YOUR_USERNAME/models/Qwen3.5-9B-UD-Q4_K_XL.gguf
--chat-template-file
/Users/YOUR_USERNAME/templates/qwen35.jinja
--temp
0.7
--top-p
0.9
--top-k
20
-c
64000
-ngl
20
--host
127.0.0.1
--port
8080
WorkingDirectory
/Users/YOUR_USERNAME
RunAtLoad
KeepAlive
StandardOutPath
/tmp/llama-server.log
StandardErrorPath
/tmp/llama-server.err
Now, enable it.
sudo chown root:wheel /Library/LaunchDaemons/com.openclaw.llama-server.plist &&
sudo chmod 644 /Library/LaunchDaemons/com.openclaw.llama-server.plist &&
sudo launchctl bootstrap system /Library/LaunchDaemons/com.openclaw.llama-server.plistWe can check to make sure the service is running properly by monitoring our log file.
tail -f /tmp/llama-server.errAt this point, our local LLM is loaded and running as a background service. The next step is to reconfigure OpenClaw to work with it.
4. Reconfigure OpenClaw to use the local model
We now need to register this local model in the OpenClaw configuration so it can be used by the gateway.
1️⃣ Add the following to the "models" section in .openclaw/openclaw.json:
{
"models": {
"providers": {
"local": {
"baseUrl": "/v1",
"apiKey": "sk-local",
"api": "openai-completions",
"models": [
{
"id": "qwen3-9b",
"name": "Qwen3.5 9B Local",
"contextWindow": 64000,
"maxTokens": 8192
}
]
}
/* REMOVE THIS COMMENT */
/* you may add additional providers, like anthropic here */
}
}
}
Note: the values for
contextWindowandmaxTokensmight need to be tweaked depending on your specific use case.
You'll also want to designate this model as the default for your agents:
"agents": {
"defaults": {
"model": {
"primary": "local/qwen3-9b"
},
"models": {
"local/qwen3-9b": {}
}
}It's a good idea to double-check that the config file is syntactically correct. Run the command below to validate it:
openclaw config validate2️⃣ Restart the gateway to make the local model available:
openclaw gateway restart3️⃣ Confirm that OpenClaw has recognized the local model:
openclaw models list --provider localYou can also run a quick test inference:
openclaw infer model run
--model local/qwen3-9b
--prompt "Reply with exactly: pong"
--jsonYou should get a JSON response back. Important: check that there are no leaked tags in the output. You shouldn't see any, but it's worth verifying for security reasons.
{
"ok": true,
"capability": "model.run",
"transport": "local",
"provider": "local",
"model": "qwen3-9b",
"attempts": [],
"outputs": [
{
"text": "pong",
"mediaUrl": null
}
]
}The entire pipeline is now confirmed to be working. To be completely thorough—especially if this is your first agent—let's set up a test skill and make sure the model can reason through problems and execute tool calls correctly.
5. Verify functionality with a test skill
Let's build a simple 'python-calc' skill to confirm that our local model can reason and fire off tool calls as expected.
1️⃣ Run the following to create the skill. This will make the tool available across all of your OpenClaw agents:
mkdir -p ~/.openclaw/workspace/skills/python-calc
cat << 'EOF' > ~/.openclaw/workspace/skills/python-calc/SKILL.md
---
name: python-calc
description: A tool that evaluates mathematical expressions by executing a Python one-liner.
version: 1.0.0
---
## Instructions
1. Extract the precise mathematical expression from the user's request.
2. Use your built-in shell tool to run this command, substituting `` with the expression: `python3 -c "print()"`
3. Wait for the shell tool to return the stdout result.
4. Respond to the user with the exact numeric result produced by the script.
EOF Once more, restart the gateway.
2️⃣ Now, let's fire off a quick agent call to confirm the tool works as intended:
openclaw agent --local --agent main --verbose on --thinking high --message
"Use the python-calc skill to calculate 8664 multiplied by 222.
Do not use skill_workshop. Tell me the final answer."After a brief moment, if everything is wired up correctly, you should see something like:
The final answer is 1,923,408.Fantastic!
In practice, you can expect speeds ranging from 20 to 70 tokens per second*. While that falls short of Claude-level performance (130+ tps), it's perfectly usable for an OpenClaw agent running on modest hardware.
Keep in mind that the thinking mode is set to high, so don't worry if responses take a little longer.
If you want to confirm that OpenClaw is actually hitting your local model, open a separate terminal and watch the llama-server log with
tail -f /tmp/llama-server.err.
*Your mileage may vary
Wrapping up
Getting a local LLM up and running—especially when you're dealing with custom templates and quantization—can be a real headache. It took two full days of back-and-forth to get it working on a friend's Mac the first time! Thanks to Jacob W. for the inspiration.
That's all there is to it! Hopefully this saves you a lot of 💸.
If it did, or if it saved you some headaches, feel free to buy me a coffee.
☕Cheers!
1 Tweet by Boris Cherny, discussing the "ban" of OpenClaw
2 User spends $420 a month on API fees
3 Using multiple providers with with OpenClaw



