# Introduction
Language models continue to reshape how developers and machine learning practitioners build applications. The emergence of powerful yet compact small language models adds an exciting dimension to this landscape. Running models locally avoids third-party APIs, which keeps your data on your own hardware, removes per-token API fees, and allows for offline functionality. Ollama has become a go-to tool for local inference thanks to its lightweight Go-based engine, straightforward CLI, and efficient Docker-like model management system.
Yet, simply downloading a model and using its default settings is seldom the best approach. Default configurations are designed for a general audience, often favoring safe, conversational outputs over performance, deterministic reasoning, or specialized needs. If you’re developing a coding assistant, an automated ETL pipeline, or a multi-agent system, default settings may result in slow responses, context window issues, or inconsistent and unpredictable outputs.
To get the most out of your local AI applications, it’s essential to master both model-level hyperparameters and server-level runtime settings. This article will dive deep into Ollama’s configuration engine, showing you how to adjust local language model parameters using the Ollama Modelfile, boost hardware efficiency with server environment variables, and craft precise prompts with Go template syntax.
# 1. The Ollama Modelfile: Your Local Model Blueprint
Much like a Dockerfile outlines how a container is built, an Ollama Modelfile is a declarative configuration file that specifies how a local language model should behave. It allows you to set system instructions, tweak model parameters, and bundle these configurations into a new, reusable model variant that you can launch with a single command.
A basic Modelfile includes a base model reference (using the FROM directive), system-level guidelines (using SYSTEM), and parameter adjustments (using the PARAMETER directive):
// Example: A Custom Developer Modelfile
# Use Llama 3.1 8B as the base model
FROM llama3.1:8b
# Set model-level parameters
PARAMETER temperature 0.2
PARAMETER num_ctx 8192
PARAMETER min_p 0.05
# Define system persona and behavioral guidelines
SYSTEM """You are an elite, highly precise software engineer.
Provide concise, modular, and optimized code solutions.
Do not include conversational filler unless explicitly asked."""
To build and run your custom model, use the ollama create command in your terminal:
# Create the model named 'dev-llama' from the Modelfile
ollama create dev-llama -f ./Modelfile
# Run the newly created model
ollama run dev-llama
By embedding these parameters directly into the model definition, you guarantee that every application or API request targeting dev-llama automatically inherits these optimizations, eliminating the need to pass raw JSON parameter payloads with each API call.
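As a rough illustration of what this saves on the client side, the sketch below calls the custom model through Ollama's REST API using only the Python standard library. It assumes a server listening on the default address 127.0.0.1:11434 and that dev-llama has already been created; the helper names build_request and generate are illustrative, not part of any Ollama client.

```python
import json
import urllib.request

OLLAMA_URL = "http://127.0.0.1:11434/api/generate"  # assumes the default host and port


def build_request(model: str, prompt: str) -> dict:
    """Build a minimal /api/generate payload with no "options" block."""
    # temperature, num_ctx, and min_p come from the Modelfile baked into the model.
    return {"model": model, "prompt": prompt, "stream": False}


def generate(model: str, prompt: str, url: str = OLLAMA_URL) -> str:
    """Send the request to a running Ollama server and return the generated text."""
    body = json.dumps(build_request(model, prompt)).encode("utf-8")
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

With the server running, generate("dev-llama", "Write a function that reverses a string.") returns the completion as a string. A per-request "options" object can still override individual parameters when a single call needs different settings.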
# 2. Fine-Tuning the Sampling Parameters
When generating text, a model doesn’t “know” words; instead, it calculates probabilities for each possible next token in its vocabulary. Sampling parameters control how the engine selects the next token from this probability distribution. Adjusting these parameters is the most effective way to align the model’s output style—whether creative or precise—with your specific application.
// Temperature: The Randomness Dial
The temperature parameter adjusts the scaling of the token probability distribution. Mathematically, it divides the raw logits (pre-softmax scores) produced by the model before they are converted into probabilities:
- Low temperature (e.g., 0.1 to 0.2): Amplifies high-probability options and suppresses low-probability ones, resulting in highly deterministic, consistent, and logical outputs. This is ideal for code generation, math reasoning, structured data extraction (JSON/YAML), and factual summarization.
- High temperature (e.g., 0.8 to 1.2): Reduces differences between token probabilities, making less likely tokens more competitive. This introduces diversity, randomness, and “creativity” into responses, making it suitable for creative writing and brainstorming.
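The effect of dividing logits by the temperature can be sketched in a few lines of Python. The logits below are made-up scores for three candidate tokens, and the function name is illustrative; this is a minimal model of the idea, not Ollama's actual sampler code.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw logits to probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for t in (0.2, 1.0, 1.2):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At 0.2 nearly all of the probability mass lands on the first token, while at 1.2 the three candidates sit closer together than they do at 1.0.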
# Configure for highly deterministic, structured tasks
PARAMETER temperature 0.1
// Top-K, Top-P, and Min-P: Narrowing the Token Pool
Without constraints, even at low temperatures, models can sometimes pick inappropriate tokens from the tail of the probability distribution. To avoid this, model engines filter the pool of candidate tokens before making a final selection.
- Top-K (e.g. 40): Limits the pool to the K most probable next tokens. Any token ranked below 40 is discarded, regardless of its actual probability. This is a simple but effective way to eliminate highly erratic options.
- Top-P / Nucleus Sampling (e.g. 0.90): Limits the pool to a dynamic set of tokens whose cumulative probability exceeds the threshold P. For instance, at 0.90, Ollama sorts tokens from highest to lowest probability and retains only the top group that constitutes the first 90% of the distribution. If the model is highly confident, the pool might shrink to just 2 or 3 tokens; if uncertain, it expands.
- Min-P (e.g. 0.05 to 0.10): A modern, more effective alternative to Top-P. Instead of taking a fixed cumulative slice, min_p filters out tokens whose probability falls below a dynamic threshold relative to the leading token’s probability. For example, if the top token has a probability of 0.80 and min_p is set to 0.05, the minimum threshold for any other token is 0.80 * 0.05 = 0.04. If the top token is highly certain (e.g., 0.99), other tokens are aggressively filtered out. If the top token is uncertain (e.g., 0.15), the threshold drops to 0.0075, keeping a broad pool of creative options available.
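The three filters can be compared side by side with a small Python sketch. The token distribution is made up, the function names are illustrative, and the renormalization that a real sampler performs after filtering is omitted for brevity.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:k])


def top_p_filter(probs, p):
    """Keep the smallest high-probability set whose cumulative mass reaches p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = {}, 0.0
    for token, prob in ranked:
        kept[token] = prob
        cumulative += prob
        if cumulative >= p:
            break
    return kept


def min_p_filter(probs, min_p):
    """Keep tokens whose probability is at least min_p times the top probability."""
    threshold = max(probs.values()) * min_p
    return {t: pr for t, pr in probs.items() if pr >= threshold}


# Made-up distribution in which the model is confident about one token.
confident = {"return": 0.80, "print": 0.12, "yield": 0.05, "pass": 0.02, "del": 0.01}
print(sorted(top_k_filter(confident, 2)))
print(sorted(top_p_filter(confident, 0.90)))
print(sorted(min_p_filter(confident, 0.05)))
```

With this distribution, min_p at 0.05 uses the 0.04 threshold from the example above, so "yield" (0.05) survives while "pass" (0.02) is dropped.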
# Set robust sampling limits in the Modelfile
PARAMETER top_k 40
PARAMETER top_p 0.90
PARAMETER min_p 0.05
⚠️ When applying min_p, it’s best to set top_p to 1.0 (which effectively disables it) or to a high value (0.95+), ensuring min_p can perform its dynamic adjustments without interference.
# 3. Ending Repetitive Loops and Preventing Redundancy
A frequent issue when deploying models locally is getting stuck in a repetition cycle. This occurs when the model keeps outputting the same phrase, sentence, or block of code repeatedly. It typically happens with smaller models (around 1.5B to 3B parameters) if penalty limits aren’t set correctly.
Ollama offers three primary tools to stop or prevent these repetitive loops.
// Repetition, Presence, and Frequency Adjustments
- Repetition penalty (repeat_penalty): This rescales the internal scores (logits) of tokens already used, making them less likely to be picked again. Typically, setting it between 1.1 and 1.2 prevents looping without hindering common grammatical words like “the” or “and”.
- Presence penalty (presence_penalty): Adds a single, fixed penalty for any token that appears in the text, pushing the model to explore new topics or broader vocabulary.
- Frequency penalty (frequency_penalty): Applies a penalty based on how frequently a token appears, progressively reducing the chance of using certain words too often.
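The difference between the three penalties is easier to see with numbers. The sketch below is a simplified model under stated assumptions: the repetition penalty follows the common convention of dividing positive logits and multiplying negative ones, and the presence and frequency penalties are plain subtractions. The exact formulas inside Ollama's engine are not confirmed here, and the logits are made up.

```python
from collections import Counter


def apply_penalties(logits, history, repeat_penalty=1.0,
                    presence_penalty=0.0, frequency_penalty=0.0):
    """Return a copy of logits with penalties applied to tokens seen in history."""
    counts = Counter(history)
    adjusted = dict(logits)
    for token, count in counts.items():
        if token not in adjusted:
            continue
        score = adjusted[token]
        # Repetition penalty: shrink positive scores, push negative ones lower.
        score = score / repeat_penalty if score > 0 else score * repeat_penalty
        # Presence penalty: one fixed deduction if the token appeared at all.
        score -= presence_penalty
        # Frequency penalty: deduction that grows with each occurrence.
        score -= frequency_penalty * count
        adjusted[token] = score
    return adjusted


logits = {"the": 3.0, "cache": 2.3, "model": 2.0}  # made-up scores
history = ["cache", "cache", "cache", "the"]
print(apply_penalties(logits, history, repeat_penalty=1.15,
                      presence_penalty=0.05, frequency_penalty=0.05))
```

Here "model" is untouched because it has not appeared yet, while "cache" loses the most because it was used three times.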
# Counteract loops and boost word variety
PARAMETER repeat_penalty 1.15
PARAMETER presence_penalty 0.05
PARAMETER frequency_penalty 0.05
// Stopping the Process Using Stop Sequences
At times, a model does not loop but fails to recognize where its turn ends, and goes on to write fabricated user messages. You can control this by defining precise stop sequences (stop tokens). As soon as the model produces one of these sequences, the engine immediately stops and returns the current output.
Popular stop indicators include chat markers like <|im_end|>, markdown headings, or custom delimiters:
# Halt output upon encountering ChatML tags or user lines
PARAMETER stop "<|im_end|>"
PARAMETER stop "<|im_start|>"
PARAMETER stop "User:"
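The behavior of stop sequences can be mimicked after the fact in Python. This is only an illustration of the cut-off rule: the real engine checks for stop sequences while tokens are streaming, and the function name and sample text are made up.

```python
def truncate_at_stop(text, stop_sequences):
    """Cut text at the earliest occurrence of any stop sequence."""
    cut = len(text)
    for stop in stop_sequences:
        index = text.find(stop)
        if index != -1:
            cut = min(cut, index)
    return text[:cut]


raw = "Here is the fix.<|im_end|>\nUser: thanks, now rewrite it"
print(truncate_at_stop(raw, ["<|im_end|>", "<|im_start|>", "User:"]))
# prints: Here is the fix.
```

Everything from the first matching marker onward is discarded, including the fabricated user turn.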
# 4. Managing Memory and Context Windows
When working locally, you have limited hardware—mainly the GPU’s video memory (VRAM). Learning how to manage memory usage is key to building reliable applications.
// Adjusting Context Length (num_ctx)
The context length (num_ctx) sets the maximum number of tokens the model can attend to at once. It covers both the previous conversation history and the newly generated content.
Typically, Ollama starts with smaller windows like 2048 or 4096 tokens to avoid overloading lower-end setups. Yet models such as Llama 3.1 and some Mistral variants can handle windows of up to 128,000 tokens. A window that is too small silently cuts off older input, which leads to poor responses.
To change this in your Modelfile:
# Increase context window to 16384 tokens
PARAMETER num_ctx 16384
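⚠️ Attention computation grows quadratically ($O(N^2)$) with context length, while the KV cache grows linearly with it. Doubling num_ctx roughly doubles the cache’s VRAM footprint on top of the fixed cost of the model weights, and long prompts take disproportionately longer to process. Make sure your hardware can support larger windows.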
// KV Cache Compression (OLLAMA_KV_CACHE_TYPE)
For long conversations, models maintain an active key-value (KV) cache in VRAM that stores the attention keys and values for all previous tokens. For larger sequences, such as 32k or 128k tokens, this cache can even surpass the size of the model weights themselves, causing out-of-memory failures.
Ollama lets you apply quantization to KV caches. Similar to how you can shrink weights from 16-bit floating-point values to 4-bit integers, you can compress internal memory caches with minimal impact:
- f16: Standard 16-bit floating-point cache (default)
- q8_0: Quantizes the cache to 8-bit integers, roughly halving the VRAM it needs, with little noticeable change in output quality
- q4_0: Quantizes the cache further to 4 bits, cutting its memory demands by around 75%; this allows huge context sizes on regular devices but slightly increases model perplexity
This is set using the OLLAMA_KV_CACHE_TYPE environment variable (covered next). Quantized cache types take effect only when Flash Attention is enabled.
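A back-of-the-envelope estimate shows how the cache scales with num_ctx and cache type. The sketch assumes the Llama 3.1 8B shape (32 layers, 8 KV heads, head dimension 128) and the GGML block layouts for q8_0 and q4_0; real allocations include extra buffers, so treat the output as an approximation. The function name is illustrative.

```python
# Approximate bytes per cached element. The quantized figures assume the GGML
# block layouts (32 values plus one 16-bit scale per block).
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def kv_cache_bytes(num_ctx, n_layers, n_kv_heads, head_dim, cache_type="f16"):
    """Estimate KV cache size: keys and values for every layer and position."""
    per_token = 2 * n_layers * n_kv_heads * head_dim  # 2 = one key + one value
    return num_ctx * per_token * BYTES_PER_ELEMENT[cache_type]


# Assumed Llama 3.1 8B shape: 32 layers, 8 KV heads, head dimension 128.
for cache_type in ("f16", "q8_0", "q4_0"):
    for num_ctx in (8192, 32768, 131072):
        gib = kv_cache_bytes(num_ctx, 32, 8, 128, cache_type) / 2**30
        print(f"{cache_type:>5} num_ctx={num_ctx:>6}: {gib:.2f} GiB")
```

Under these assumptions an f16 cache needs about 1 GiB at 8192 tokens and about 16 GiB at 131072 tokens. The per-block scale factors are why the quantized savings come out slightly under the nominal 50% and 75%.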
# 5. Server-Level Adjustments: Environment Settings
While the Modelfile changes how a specific model behaves, server environment variables customize the Ollama background process itself. These influence how your OS manages resources, handles parallel operations, and uses acceleration hardware.
Your setup method depends on platform:
- macOS: Add terminal exports or edit app launch config files (launchctl)
- Linux (Systemd): Update using systemctl edit ollama.service to apply custom options
- Windows (WSL2/System): Enter settings into standard Environment Variables or WSL terminal profile
// Core Server Settings
| Setting Name | Default | Usage & Tips |
|---|---|---|
| OLLAMA_HOST | 127.0.0.1:11434 | Sets the network address and listening port for the server. Use 0.0.0.0:11434 to allow other devices on your local network to access the API. |
| OLLAMA_MODELS | System-dependent default | Specifies where models are stored. It's advisable to direct this to an external NVMe SSD if your primary drive has limited storage capacity. |
| OLLAMA_KEEP_ALIVE | 5m (5 minutes) | Determines how long models remain in GPU VRAM after the final request. Configure it to 1h to avoid reloading delays during active sessions, or -1 for permanent retention. |
| OLLAMA_NUM_PARALLEL | 1 | Allows processing multiple requests simultaneously. Values like 2 or 4 let a loaded model serve concurrent queries; each parallel slot gets its own context, so context memory grows proportionally. |
| OLLAMA_KV_CACHE_TYPE | f16 | Sets the KV cache precision, reducing VRAM consumption for extended context windows. Choose q8_0 for everyday use, or q4_0 for very large contexts on consumer-grade GPUs. |
| OLLAMA_FLASH_ATTENTION | 0 (inactive) | Enable by setting to 1. This greatly speeds up prompt pre-fill execution and decreases memory demands on compatible hardware (recent NVIDIA/Apple GPUs). |
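For quick experiments before committing settings to a service definition, the variables from the table above can be passed to a server started from a script. This is a minimal sketch that assumes the ollama binary is on your PATH and that no other server is already bound to the port; the helper names are illustrative.

```python
import os
import subprocess

# Values taken from the table above; adjust them to your hardware.
SERVER_SETTINGS = {
    "OLLAMA_HOST": "127.0.0.1:11434",
    "OLLAMA_KEEP_ALIVE": "1h",
    "OLLAMA_NUM_PARALLEL": "2",
    "OLLAMA_FLASH_ATTENTION": "1",
    "OLLAMA_KV_CACHE_TYPE": "q8_0",
}


def build_server_env(overrides, base=None):
    """Merge Ollama settings into a copy of the current environment."""
    env = dict(os.environ if base is None else base)
    env.update(overrides)
    return env


def start_server(overrides):
    """Launch `ollama serve` as a child process with the given settings."""
    return subprocess.Popen(["ollama", "serve"], env=build_server_env(overrides))
```

Calling start_server(SERVER_SETTINGS) returns a process handle; call its terminate() method to stop the server when the experiment is done.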
// Example: Setting Up Configurations on Linux (Systemd)
For users operating production systems on Ubuntu/Debian, modify the service file to incorporate these environment variables:
# Launch the systemd configuration editor for Ollama
sudo systemctl edit ollama.service
In the editor that opens, insert the following settings:
[Service]
Environment="OLLAMA_NUM_PARALLEL=4"
Environment="OLLAMA_KEEP_ALIVE=24h"
Environment="OLLAMA_KV_CACHE_TYPE=q8_0"
Environment="OLLAMA_FLASH_ATTENTION=1"
Save your changes and restart the service to activate hardware optimizations:
# Refresh systemd definitions and restart the service
sudo systemctl daemon-reload
sudo systemctl restart ollama
# 6. Prompt Templating: Go Template Syntax
Language models don't inherently comprehend chat interactions, user questions, or role definitions. They require a single, uninterrupted text sequence with special tokens that distinguish the system instructions, user input, and assistant replies.
Ollama utilizes the Go text template engine to transform structured chat data (such as OpenAI-compatible JSON role arrays) into the precise text format required by the model.
If your template is misconfigured, the system prompt won't take effect, the model may overlook your instructions, and output quality will suffer significantly.
// Go Template Structure Explained
The TEMPLATE directive within an Ollama Modelfile employs structured tags to process instructions. Below is an example that maps to the ChatML format (commonly used by Qwen and Hermes models):
# Specify the message stream format
TEMPLATE """{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}{{ if .Prompt }}<|im_start|>user
{{ .Prompt }}<|im_end|>
{{ end }}<|im_start|>assistant
{{ .Response }}<|im_end|>"""
Here's how the Go template logic works:
- {{ if .System }} ... {{ end }}: Verifies if a system prompt exists. If present, it outputs the start tag <|im_start|>system, inserts the system prompt content {{ .System }}, and ends with <|im_end|>.
- {{ if .Prompt }} ... {{ end }}: Processes the user's question ({{ .Prompt }}) and encloses it within user tokens <|im_start|>user and <|im_end|>.
- <|im_start|>assistant {{ .Response }}<|im_end|>: Signals the model to start generating the assistant's response. The engine writes the generated output into {{ .Response }} and adds the closing token.
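To see the text a model actually receives, the same logic can be mirrored in plain Python. This is an illustrative re-implementation of the ChatML template above, not Ollama's Go template engine; in particular, leaving the prompt open after the assistant header when no response exists yet is an assumption of this sketch.

```python
def render_chatml(system, prompt, response=""):
    """Mirror the ChatML TEMPLATE: optional system and user blocks, then the assistant turn."""
    parts = []
    if system:  # {{ if .System }} ... {{ end }}
        parts.append(f"<|im_start|>system\n{system}<|im_end|>\n")
    if prompt:  # {{ if .Prompt }} ... {{ end }}
        parts.append(f"<|im_start|>user\n{prompt}<|im_end|>\n")
    # The assistant header is always emitted; generation continues from here.
    parts.append("<|im_start|>assistant\n")
    if response:
        parts.append(f"{response}<|im_end|>")
    return "".join(parts)


print(render_chatml("You are a precise engineer.", "Reverse a string in Go."))
```

Printing the rendered prompt like this is a quick way to spot a missing newline or a misplaced tag before blaming the model for ignoring instructions.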
When building a custom model, always review the base model's documentation to determine its specific template format (for instance, Llama 3 uses distinctive headers like <|start_header_id|>system<|end_header_id|>, while Mistral uses bracket notation such as [INST] and [/INST]). Using the correct template ensures the model follows instructions accurately.
# 7. Reference Configurations for Practitioners
To assist you in applying these settings immediately, here are three ready-to-use Modelfiles designed for typical deployment scenarios:
// 1. The Precise JSON Parser (Structured Extraction / Coding)
Optimized for ETL workflows, JSON output, and precise coding tasks. Uses minimal temperature and dynamic filtering to eliminate unpredictable tokens.
FROM llama3.1:8b
# Strict and predictable parameters
PARAMETER temperature 0.0
PARAMETER min_p 0.05
PARAMETER top_p 0.95
PARAMETER top_k 10
# Reduce repetition
PARAMETER repeat_penalty 1.1
# Clear stop conditions
PARAMETER stop "<|im_end|>"
PARAMETER stop "User:"
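Even with a temperature of 0.0, nothing in this configuration guarantees syntactically valid JSON, so an ETL pipeline should validate each reply before using it. The helper below is a hypothetical sketch of that check: it parses the reply and tolerates a surrounding Markdown code fence, which some models add. The sample reply is made up.

```python
import json

FENCE = "`" * 3  # Markdown code-fence marker


def parse_json_reply(text):
    """Parse a model reply as JSON, tolerating a surrounding Markdown code fence."""
    cleaned = text.strip()
    if cleaned.startswith(FENCE):
        # Drop the opening fence line (with any language tag) and the closing fence.
        cleaned = cleaned.split("\n", 1)[1] if "\n" in cleaned else ""
        cleaned = cleaned.rsplit(FENCE, 1)[0]
    return json.loads(cleaned)


reply = '{"name": "orders.csv", "rows": 1200}'  # made-up model output
record = parse_json_reply(reply)
print(record["rows"])
```

If the reply is not valid JSON, json.loads raises json.JSONDecodeError, which the calling code can catch to retry the request or log the failure.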
// 2. The Creative Writer (Brainstorming / Interactive Agent)
Tailored for chat interfaces, flexible agent operations, and narrative generation. Increases temperature while maintaining vocabulary variety.
FROM llama3.1:8b
# Expressive and varied parameters
PARAMETER temperature 0.8
PARAMETER top_p 0.9
PARAMETER min_p 0.05
PARAMETER top_k 50
# Maintain vocabulary diversity
PARAMETER repeat_penalty 1.15
# Explicit stop markers
PARAMETER stop "<|im_end|>"
PARAMETER stop "User:"
// 3. The High-Throughput Production Endpoint
Maximizes hardware utilization for scalable deployments. Enables parallel processing and Flash Attention for optimal throughput on modern GPU infrastructure.
FROM llama3.1:8b
# Balanced speed parameters
PARAMETER temperature 0.3
PARAMETER top_p 0.95
# Extended context and efficient caching
PARAMETER num_ctx 8192
# System configuration modifications for production
# These are applied through systemd environment variables:
# Environment="OLLAMA_NUM_PARALLEL=4"
# Environment="OLLAMA_KEEP_ALIVE=24h"
# Environment="OLLAMA_KV_CACHE_TYPE=q8_0"
# Environment="OLLAMA_FLASH_ATTENTION=1"
By mastering the Ollama Modelfile and fine-tuning server environment settings, you shift from simply using AI tools to engineering high-performance, private, and finely optimized local intelligent pipelines. Keep your parameters sharp, your memory use efficient, and let your local agents build.
Matthew Mayo (@mattmayo13) holds a master's in computer science and a graduate diploma in data mining. As managing editor of KDnuggets & Statology, and contributing editor at Machine Learning Mastery, he works to simplify complex data science concepts. His interests span natural language processing, language models, machine learning algorithms, and emerging AI. He’s driven by a mission to make data science knowledge accessible to all. Matthew began coding at just 6 years old.



