# Introduction
The moment you type ollama run llama3.2 into your terminal and see a 3-billion-parameter model load directly onto your own hardware (no API key needed, no billing page to worry about, no data ever leaving your system), something clicks. It’s not just that the technology is remarkable, though it truly is. It’s that the experience is quick, powerful, and completely under your control. You own every interaction. No one is tracking it. No one is billing you per token. The model doesn’t even know whether you’re connected to the internet.
I’ve been integrating local models into my everyday workflow for some time now, and the biggest surprise has been how frequently the local option turned out to be the superior choice — not a fallback, but the better tool for the job. Here are five real things I accomplished with local language models that I either wouldn’t have attempted or simply couldn’t have done using a cloud-based service. Where relevant, I’ve included working code.
“Local” means the model executes entirely on your own hardware. The setup relies on Ollama, a tool that makes downloading and running open-source models about as straightforward as installing any other piece of software. Most of what I describe works on a machine with 8 GB of RAM for smaller models, and 16 GB for a more comfortable experience. Apple Silicon Macs (M1 and onward) handle this workload impressively well thanks to their unified memory architecture. A dedicated NVIDIA GPU will speed things up considerably, but it’s not a prerequisite to get going.
# Project 1: Building a Private Document Brain
I deal with a growing pile of research papers, contracts, and project notes that accumulate faster than I can properly organize them. At one point, I had three years’ worth of PDFs, a scattering of Word documents, and a folder full of plain-text notes — all sitting on my disk, theoretically valuable but practically impossible to search in any meaningful way.
The natural instinct is to feed everything into an AI and start asking questions. The obvious issue is that uploading contracts and personal research to a cloud service means those files now live on someone else’s server, are processed by someone else’s infrastructure, and are retained under someone else’s data policy. For anything confidential — legal agreements, medical records, internal business documents, personal journals — that trade-off is difficult to accept.
So I configured AnythingLLM to run locally, connected to Llama 3.2 through Ollama. AnythingLLM is an open-source application that manages the complete retrieval-augmented generation (RAG) pipeline — document ingestion, chunking, embedding, vector storage, and retrieval — with zero cloud dependency. It has over 54,000 GitHub stars and operates entirely on your own machine. You drop documents in, it processes them locally, and you begin asking questions.
Getting it up and running takes a single command:
```bash
# Pull and run AnythingLLM via Docker
# Everything stays on your machine -- no data leaves
docker run -d \
  --name anythingllm \
  -p 3001:3001 \
  -v anythingllm_storage:/app/server/storage \
  mintplexlabs/anythingllm

# Then open http://localhost:3001 in your browser
# Connect it to Ollama (already running at localhost:11434)
# Note: from inside the container, Ollama on the host is typically reached
# at http://host.docker.internal:11434 rather than localhost
# and pull the model you want to use for document chat
ollama pull llama3.2:3b
```
I loaded a collection of research papers and posed questions that required synthesizing information across multiple documents. This is the prompt I used:
“What are the key differences in how the 2023 and 2025 papers approach retrieval augmentation? Do they agree on chunking strategy or is there disagreement?”
The model extracted the relevant passages from each paper, cited which document each insight came from, and surfaced a genuine methodological disagreement I had missed when reading them individually. Not a single byte of those papers ever left my machine.
The models that worked best for this: Llama 3.2 3B for speed on modest hardware, and Mistral 7B if you have 8 GB of VRAM and want stronger synthesis across longer documents. Even for straightforward document Q&A on a machine with 16 GB of RAM, the difference between the two is noticeable. Mistral reads with greater care.
Why this matters: This is the use case where local RAG is genuinely superior to cloud — not merely on par. The documents stay put. The AI comes to them. Everything that makes cloud AI appealing — the reasoning, the synthesis, the ability to answer questions spanning multiple sources — is fully present. Everything that makes it problematic for sensitive material — the data transfer, the server-side logging, the reliance on a third party — is eliminated.
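To make the retrieval half of that pipeline concrete, here is a toy sketch of chunking, embedding, and retrieval in plain Python. It is not how AnythingLLM is implemented: the word-count "embedding" is a stand-in for a real embedding model, and the function names and sample documents are made up for illustration.

```python
import math
import re
from collections import Counter


def chunk_text(text: str, max_words: int = 40) -> list[str]:
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


def embed(text: str) -> Counter:
    """Stand-in embedding: lowercase word counts (real pipelines use a neural model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, docs: dict[str, str], top_k: int = 2) -> list[tuple[str, str]]:
    """Return the top_k (source, chunk) pairs most similar to the question."""
    q = embed(question)
    scored = [
        (cosine(q, embed(chunk)), source, chunk)
        for source, text in docs.items()
        for chunk in chunk_text(text)
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(source, chunk) for _, source, chunk in scored[:top_k]]


# Hypothetical documents, standing in for files on disk
docs = {
    "paper_2023.txt": "Our chunking strategy splits documents into fixed size windows of 512 tokens.",
    "paper_2025.txt": "Semantic chunking follows paragraph boundaries instead of a fixed size window.",
    "notes.txt": "Remember to renew the office lease before March.",
}

for source, chunk in retrieve("Which chunking strategy does each paper use?", docs):
    print(f"[{source}] {chunk}")
```

The retrieved chunks, each tagged with its source file, are what gets pasted into the model's prompt. That is how the answer can cite which document an insight came from.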
# Project 2: Running a Code Reviewer That Never Judges You
There’s a particular kind of code review anxiety that most developers know well: you’ve written something that works, but you’re not exactly proud of it. It’s clever in ways that your future self will find frustrating. You suspect there’s an edge case you’ve overlooked. You want candid feedback before another person lays eyes on it.
The cloud AI path has an obvious drawback. Pasting production code into ChatGPT or Claude means shipping your company’s intellectual property to a third-party server. Most employer non-disclosure agreements (NDAs) cover this scenario, regardless of whether anyone is actively enforcing it. It’s a legitimate concern, especially for proprietary algorithms, internal business logic, or anything that touches customer data.
I set up Qwen2.5-Coder 7B locally through Ollama. This model was purpose-built for code; it consistently outperforms general-purpose models of the same size on coding benchmarks. At 7B parameters, it runs comfortably within 8 GB of VRAM. I gave it real functions from an active project and asked for three things: security vulnerabilities, unhandled edge cases, and any spots where I was being unnecessarily clever.
```bash
# Pull the model
ollama pull qwen2.5-coder:7b

# Run an interactive session
ollama run qwen2.5-coder:7b
```
The system prompt I used for every review session:
```
You are a senior software engineer doing a code review.
Your job is to find problems, not to be encouraging.

Review for:
1. Security vulnerabilities (injection, auth issues, data exposure)
2. Edge cases that are not handled
3. Anywhere the code is more complex than it needs to be
4. Any assumptions that will break under real conditions

Be direct. Do not summarize what the code does.
Start immediately with what you found.
```
I fed it this function:
```python
def get_user_data(user_id):
    query = f"SELECT * FROM users WHERE id = {user_id}"
    result = db.execute(query)
    return result.fetchone()
```
The model caught the SQL injection vulnerability right away, flagged the wildcard SELECT * as a data exposure risk, and noted that the function silently returns None if the user doesn’t exist, which would produce a confusing error three calls downstream wherever the result gets used. All three were genuine issues. Two of them I was already aware of and had planned to address eventually. The third one I had completely overlooked.
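For reference, here is one way the function could be rewritten to address all three findings. This is a sketch under assumptions: it uses the standard-library sqlite3 module in place of the unspecified db object, and the column names and the UserNotFoundError exception are hypothetical.

```python
import sqlite3


class UserNotFoundError(LookupError):
    """Raised when no user matches the requested id (hypothetical name)."""


def get_user_data(conn: sqlite3.Connection, user_id: int) -> dict:
    """Fetch one user with a parameterized query and an explicit column list."""
    row = conn.execute(
        # The ? placeholder makes the driver bind user_id as data, not as SQL
        "SELECT id, name, email FROM users WHERE id = ?",
        (user_id,),
    ).fetchone()
    if row is None:
        # Fail loudly here instead of handing None to callers
        raise UserNotFoundError(f"No user with id {user_id!r}")
    return {"id": row[0], "name": row[1], "email": row[2]}


# Demo against an in-memory database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'Ada', 'ada@example.com')")

print(get_user_data(conn, 1))

try:
    # An injection attempt is compared as a literal value and matches nothing
    get_user_data(conn, "1 OR 1=1")
except UserNotFoundError as exc:
    print(f"Rejected: {exc}")
```

Raising a specific exception is only one way to handle the missing-user case. Returning None is also defensible if the return type documents it and callers check for it.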
For developers looking to integrate this into their editor, the Continue plugin for VS Code and JetBrains links directly to a local Ollama instance. Add this to .continue/config.json to point Continue at your local model:
```json
{
  "models": [
    {
      "title": "Qwen2.5-Coder Local",
      "provider": "ollama",
      "model": "qwen2.5-coder:7b",
      "apiBase": "http://localhost:11434"
    }
  ]
}
```
Once configured, you get inline code suggestions and a chat panel — everything runs on your own machine, everything stays private, no monthly fee.
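The review can also be scripted instead of run interactively, which helps if you want to review a file before every commit. The sketch below posts to Ollama's local /api/chat endpoint using only the standard library. The function names are hypothetical, and the system prompt is a condensed version of the one shown earlier.

```python
import json
import urllib.request

OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"  # Ollama's default local port

REVIEW_PROMPT = (
    "You are a senior software engineer doing a code review. "
    "Be direct. Do not summarize what the code does. "
    "Start immediately with what you found."
)


def build_review_request(code: str, model: str = "qwen2.5-coder:7b") -> dict:
    """Build the JSON payload for a single, non-streaming chat request."""
    return {
        "model": model,
        "stream": False,  # ask for one complete JSON response
        "messages": [
            {"role": "system", "content": REVIEW_PROMPT},
            {"role": "user", "content": f"Review this code:\n\n{code}"},
        ],
    }


def review_code(code: str) -> str:
    """Send the code to the local Ollama server and return the review text."""
    request = urllib.request.Request(
        OLLAMA_CHAT_URL,
        data=json.dumps(build_review_request(code)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["message"]["content"]


# Usage (requires the Ollama server to be running and the model pulled):
#   with open("my_module.py", encoding="utf-8") as source:
#       print(review_code(source.read()))
```

Keeping the payload builder separate from the network call makes the request easy to inspect or adjust without a running server.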
# Project 3: Running a Fully Offline AI Assistant
This one sounds straightforward, but it completely shifted my perspective on what AI tools are really meant to do. I had a 10-hour flight with unreliable Wi-Fi and a genuine pile of deep thinking work I’d been putting off. I wanted an AI assistant available for the entire flight — not just when the connection happened to work, but continuously, without shelling out for in-flight internet, without stressing about what data was passing through the airline’s network.
Before getting on the plane, I downloaded a model:
```bash
# Grab this before your flight -- it's a 4.1 GB download at Q4 quantization
ollama pull mistral:7b

# Confirm it's stored on your machine
ollama list
# Should display mistral:7b along with its size and modification date
```
That’s all the setup there is. Once it’s downloaded, Ollama runs the model entirely from files on your disk. Switch your laptop to airplane mode. Open a terminal. Type ollama run mistral:7b. The model loads in roughly 8 seconds on an M2 MacBook Pro and begins responding right away. No network call needed. The model has no idea — or any reason to care — that you’re cruising at 35,000 feet.
Here’s how I put it to use during that flight:
- Composing email drafts to polish later. I explained the scenario and the result I was aiming for. The model produced a draft. I refined it. Quicker than starting from a blank page, and nothing ever left my machine.
- Thinking through a technical architecture challenge. I laid out a system design problem I’d been turning over in my mind. Having something to challenge my thinking — even something that doesn’t fully grasp my codebase — is genuinely helpful. The model posed clarifying questions. I responded. By the time we landed, I had a much sharper perspective than when I started.
- Structuring this very article. Seriously. I outlined the five use cases I wanted to include, asked it to help me organize them, and worked through the sequence and focus during the descent.
Straight talk on performance: on an M2 MacBook Pro with 16 GB of unified memory, Mistral 7B at Q4_K_M quantization runs at about 25–35 tokens per second. That’s quick enough to feel like a natural back-and-forth. On older hardware or without GPU acceleration, it slows down — more like reading a slow feed than having a conversation — but it’s still perfectly usable for drafting and reflective work. What you can’t do offline: anything needing up-to-the-minute information (breaking news, live pricing, the latest research). That’s not a shortcoming unique to local models; it’s just how things work.
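Two bits of back-of-the-envelope arithmetic help when planning for offline use: how big the download will be, and how long a reply takes at a given generation speed. The parameter count (about 7.2 billion) and the average of about 4.5 bits per weight are my own assumed round numbers, not measurements from this setup.

```python
def weights_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in gigabytes (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


def generation_seconds(n_tokens: int, tokens_per_second: float) -> float:
    """Time to generate n_tokens at a steady decoding rate."""
    return n_tokens / tokens_per_second


# Assumed figures: ~7.2 billion parameters at ~4.5 bits per weight
print(f"Weights: roughly {weights_size_gb(7.2e9, 4.5):.2f} GB")

# A 500-token reply at the slow and fast ends of the 25-35 tokens/second range
for rate in (25, 35):
    print(f"{rate} tok/s: about {generation_seconds(500, rate):.0f} s for 500 tokens")
```

Under those assumptions the weights come to roughly 4 GB, in line with the download size quoted above, and a 500-token reply takes somewhere between 14 and 20 seconds.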
# Project 4: Building a Personal Thinking Partner That Understands Your Context
Every time you start a new conversation with Claude, ChatGPT, or any cloud-based AI, you’re starting from scratch. The model knows nothing about you, your work, your active projects, what you’ve already attempted, or how you like to work through problems. The first few minutes of any meaningful session are spent re-establishing the same context you had to set up in the previous one. It gets tiresome fast.
Local models handle this through something called a Modelfile — a compact configuration file that embeds a persistent system prompt directly into a named model. You write it once, and every session with that model begins with full context already in place. No re-explaining. No lengthy preamble.
Here’s the Modelfile I put together:
```
# Save this as Modelfile (no file extension) in any folder
# Then run: ollama create myassistant -f Modelfile
FROM llama3.2:3b

# This SYSTEM block gets injected at the beginning of every conversation
SYSTEM """
You are my personal thinking partner. Here is the context you always carry:

ABOUT ME:
I am a technical writer and developer focused on AI tooling and developer
education. I think most clearly by writing and talking through problems
out loud before settling on a direction.

CURRENT PROJECTS:
- A series of technical articles on agentic AI and LLM tooling
- A Python library for structured prompt management
- Researching retrieval-augmented generation for personal knowledge bases

HOW I WORK BEST:
- Challenge my assumptions. I value disagreement over easy agreement.
- Ask one clarifying question if my prompt is ambiguous before diving in.
- When I'm walking through a decision, test whether my reasoning holds up.
- Don't parrot what I said back to me. Jump straight into your actual response.

WHAT YOU SHOULD KNOW:
- I've been in software for 8 years. Don't over-explain the basics.
- When I ask for writing help, I want structure and directness, not filler.
- I have a strong aversion to bullet points that should have been full sentences.
"""

# Configure sensible generation parameters
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER num_ctx 4096
```
Building and launching it:
```bash
# Build the model from your Modelfile
ollama create myassistant -f Modelfile

# Confirm it was created successfully
ollama list

# Launch it -- every session opens with your full context already loaded
ollama run myassistant
```
The practical difference is instantly noticeable. Compare these two answers to the same question — “Should I restructure this article to lead with the code examples?” — from a generic model versus the personalized one:
Generic model:
“That depends on your audience. If your readers are more technical, leading with code examples can be effective because it immediately shows them what they’ll learn. On the other hand, if you want to build context first…”
Personalized model:
“What’s the current structure? Leading with code only works if the code is self-explanatory enough that a reader can follow without background. Given that you said you’re writing for developers who are learning the tooling, not seasoned practitioners, I’d push back on leading with code. Give me the first section, and I’ll tell you if it stands on its own.”
The second response begins from where you actually are. It doesn’t burn time on the “it depends” hedging. It doesn’t know you dislike bullet points because it studied your preferences — it knows because you told it once, and it never forgets.
Whenever your projects evolve, just update the Modelfile and run ollama create myassistant -f Modelfile again to overwrite the previous version.
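If the project list changes often, the Modelfile can be generated instead of edited by hand. This is a minimal sketch with a shortened system prompt. The template and the render_modelfile helper are hypothetical and only use the FROM, SYSTEM, and PARAMETER instructions shown above.

```python
MODELFILE_TEMPLATE = '''FROM {base_model}
SYSTEM """
You are my personal thinking partner. Here is the context you always carry:

CURRENT PROJECTS:
{projects}
"""
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER num_ctx 4096
'''


def render_modelfile(base_model: str, projects: list[str]) -> str:
    """Fill the template with the base model and one bullet per project."""
    bullets = "\n".join(f"- {project}" for project in projects)
    return MODELFILE_TEMPLATE.format(base_model=base_model, projects=bullets)


text = render_modelfile(
    "llama3.2:3b",
    [
        "A series of technical articles on agentic AI and LLM tooling",
        "A Python library for structured prompt management",
    ],
)
print(text)

# To apply it, save the text to a file named Modelfile and rebuild:
#   ollama create myassistant -f Modelfile
```

Keeping the project list in code (or in a small data file the script reads) means the only manual step left is re-running ollama create.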
# Project 5: Creating a Local AI Agent That Actually Uses Tools
The first four projects on this list are impressive, but they essentially treat the model as a highly capable text generator. This project is different. Here, the model acts as the decision-making core within a system that plans, takes action, observes outcomes, and produces a final result — all without making a single API call to an external AI service.
I wanted to test how well a local model could handle an agentic task without relying on cloud-based fallbacks. I built a lightweight Python agent that runs Llama 3.2 Instruct through Ollama’s OpenAI-compatible API, equips it with two tools — a web search and a file writer — and executes the ReAct loop until the task is complete. Total external cost: $0.
First, ensure Ollama is running and serving the model:
```bash
ollama serve              # starts the Ollama API server
ollama pull llama3.2:3b   # pulls the instruct model if not already cached
```
The Ollama API is OpenAI-compatible, meaning you can plug it into any framework designed for the OpenAI API by changing just one line. Below is the complete local agent implementation:
```python
# local_agent.py
# Install: pip install openai duckduckgo-search
# Requires: Ollama running locally at localhost:11434
import json

from duckduckgo_search import DDGS
from openai import OpenAI

# Point the OpenAI client at your local Ollama instance
# This single-line change enables any OpenAI-compatible tool to work locally
client = OpenAI(
    base_url="http://localhost:11434/v1",  # Ollama's OpenAI-compatible endpoint
    api_key="ollama",  # Ollama doesn't need a real key -- any string works
)

MODEL = "llama3.2:3b"  # Swap this for any model you've pulled via Ollama

# Define the tools available to the agent
tools = [
    {
        "type": "function",
        "function": {
            "name": "web_search",
            "description": (
                "Search the web for up-to-date information on a topic. "
                "Use this when you need facts or data that may have changed recently. "
                "Do NOT use for information already present in the conversation."
            ),
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {
                        "type": "string",
                        "description": "A focused search query, 3-8 words."
                    }
                },
                "required": ["query"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "write_file",
            "description": "Save text content to a local file. Use this when the task is finished.",
            "parameters": {
                "type": "object",
                "properties": {
                    "filename": {
                        "type": "string",
                        "description": "The output filename, e.g. 'summary.md'"
                    },
                    "content": {
                        "type": "string",
                        "description": "The complete text content to save."
                    }
                },
                "required": ["filename", "content"]
            }
        }
    }
]

# System prompt sent at the start of every request
SYSTEM_PROMPT = """You are a research agent. When given a goal:
1. Use web_search to find accurate, current information -- search multiple times for different aspects
2. When you have enough information, use write_file to save a structured summary
3. The file should include: key findings, why they matter, and sources
Think carefully before each action. When the file is written, your task is complete."""


def web_search(query: str) -> str:
    """Perform a real web search using DuckDuckGo -- no API key needed."""
    with DDGS() as ddgs:
        results = list(ddgs.text(query, max_results=4))
    if not results:
        return "No results found."
    # Format results in a clean, readable way for the model
    return "\n\n".join(
        f"Title: {r['title']}\nURL: {r['href']}\nSnippet: {r['body']}"
        for r in results
    )


def write_file(filename: str, content: str) -> str:
    """Save content to a file in the current directory."""
    with open(filename, "w", encoding="utf-8") as f:
        f.write(content)
    return f"File '{filename}' written successfully ({len(content)} characters)."


def run_tool(name: str, arguments: dict) -> str:
    """Direct tool calls to the appropriate function."""
    if name == "web_search":
        return web_search(arguments["query"])
    elif name == "write_file":
        return write_file(arguments["filename"], arguments["content"])
    return f"Unknown tool: {name}"


def run_agent(goal: str, max_turns: int = 10) -> None:
    """
    The agent loop:
    1. Send the goal and current conversation to the local model
    2. If the model calls a tool, execute it and append the result to the conversation
    3. If the model is done, print the final message and exit
    4. Repeat until done or max_turns is reached
    """
    messages = [{"role": "user", "content": goal}]

    for turn in range(max_turns):
        print(f"\n--- Turn {turn + 1} ---")

        # Send conversation to the local model
        response = client.chat.completions.create(
            model=MODEL,
            messages=[{"role": "system", "content": SYSTEM_PROMPT}] + messages,
            tools=tools,
            tool_choice="auto"
        )
        message = response.choices[0].message

        # No tool calls means the model is done -- print and exit.
        # Checking tool_calls directly avoids relying on finish_reason alone.
        if not message.tool_calls:
            print(f"\nAgent finished: {message.content}")
            return

        # Add the model's message (with tool calls) to conversation history
        messages.append({
            "role": "assistant",
            "content": message.content or "",
            "tool_calls": [
                {
                    "id": tc.id,
                    "type": "function",
                    "function": {
                        "name": tc.function.name,
                        "arguments": tc.function.arguments
                    }
                }
                for tc in message.tool_calls
            ]
        })

        # Execute each tool call and append results to conversation
        for tool_call in message.tool_calls:
            name = tool_call.function.name
            args = json.loads(tool_call.function.arguments)
            print(f"Tool: {name}({args})")
            result = run_tool(name, args)
            print(f"Result preview: {result[:120]}...")
            # Tool results must reference the tool_call_id they are responding to
            messages.append({
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": result
            })

    print("Max turns reached.")


if __name__ == "__main__":
    goal = (
        "Find the three most actively discussed open-source RAG frameworks "
        "in 2026 and write a summary to rag-summary.md explaining what each "
        "one does and who it is best for."
    )
    print(f"Goal: {goal}\n")
    run_agent(goal)
```
What this code does: The OpenAI client is pointed at localhost:11434 instead of OpenAI’s servers. That single change is the entire difference between a cloud agent and a local one. DuckDuckGo search requires no API key. The agent runs the full ReAct loop — reason, act, observe, reason again — until it writes the output file. Every step runs on your machine.
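One detail worth tightening if you adapt the agent: the filename passed to write_file comes from the model, so it could point anywhere on disk. Here is a hedged sketch of a drop-in variant that confines writes to one directory. The safe_write_file name and the output_dir parameter are my additions, not part of the script above, and Path.is_relative_to needs Python 3.9 or later.

```python
import tempfile
from pathlib import Path


def safe_write_file(filename: str, content: str, output_dir: Path) -> str:
    """Write content inside output_dir, refusing paths that escape it."""
    root = output_dir.resolve()
    target = (root / filename).resolve()
    # Reject absolute paths and '../' tricks that resolve outside the sandbox
    if not target.is_relative_to(root):
        return f"Refused: '{filename}' is outside the output directory."
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(content, encoding="utf-8")
    return f"File '{filename}' written successfully ({len(content)} characters)."


# Demo in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    sandbox = Path(tmp)
    print(safe_write_file("summary.md", "# Findings", sandbox))
    print(safe_write_file("../escape.md", "nope", sandbox))
```

The function returns the refusal as a string rather than raising, so the message goes back to the model as an ordinary tool result and it can try again with a valid filename.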
Honest note on model capability: local models at 3–7B parameters are noticeably slower and less precise at multi-step reasoning than frontier cloud models. Llama 3.2 handles this task well when the goal is clear and focused. For more complex agentic tasks, Qwen3.5-4B or Mistral 7B Instruct produce more reliable tool-calling behavior. Keep the tasks focused and the tool set small. The same rule that applies to cloud agents applies here, just more so.
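Because smaller models are less reliable at tool calling, it can help to validate a tool call before executing it and feed any error back to the model as the tool result. The sketch below is one way to do that. The parse_tool_call helper and the REQUIRED_ARGS table are hypothetical and simply mirror the two tools defined earlier.

```python
import json
from typing import Optional

# Required argument names per tool, mirroring the tool schemas
REQUIRED_ARGS = {
    "web_search": ["query"],
    "write_file": ["filename", "content"],
}


def parse_tool_call(name: str, raw_arguments: str) -> tuple[Optional[dict], Optional[str]]:
    """Return (arguments, None) on success or (None, error_message) on failure."""
    if name not in REQUIRED_ARGS:
        return None, f"Unknown tool: {name}"
    try:
        arguments = json.loads(raw_arguments)
    except json.JSONDecodeError as exc:
        return None, f"Arguments were not valid JSON: {exc.msg}"
    if not isinstance(arguments, dict):
        return None, "Arguments must be a JSON object."
    missing = [key for key in REQUIRED_ARGS[name] if key not in arguments]
    if missing:
        return None, f"Missing required argument(s): {', '.join(missing)}"
    return arguments, None


# A valid call, a call with a missing argument, and malformed JSON
print(parse_tool_call("web_search", '{"query": "open-source RAG frameworks"}'))
print(parse_tool_call("write_file", '{"filename": "rag-summary.md"}'))
print(parse_tool_call("web_search", "{query: rag}"))
```

In the agent loop, the error string would be appended as the content of the tool message in place of a real result, giving the model a chance to correct the call on the next turn.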
# Wrapping Up
None of these five things is possible in quite the same way with cloud AI. Not because cloud AI is less capable in raw benchmark terms — frontier models like Claude Opus and GPT-5 outperform anything running locally on a laptop. But benchmarks are not use cases.
The document brain works better locally because the documents are sensitive. The code reviewer is more useful locally because the code is proprietary. The offline assistant is only possible locally because the cloud is not available. The personalized model only remembers you locally because cloud sessions are stateless by design. The local agent costs nothing to run because there is no API meter ticking.
These are not compromises. They are genuine advantages in cases where running the model yourself is the right call for the right reasons. The setup is one command. The models are free. The ceiling, as it turns out, is higher than most people expect.
Shittu Olumide is a software engineer and technical writer passionate about leveraging cutting-edge technologies to craft compelling narratives, with a keen eye for detail and a knack for simplifying complex concepts. You can also find Shittu on Twitter.



