OWL Unveils Contrastive Neuron Attribution: Steering Sparse MLP Circuits Without SAE Training Or Weight Changes

Instruction-tuned language models are designed to reject harmful requests. But which specific components within the model are responsible for this behavior — and how does that mechanism get established during training? A new study from the Nous Research team examines this question at the level of individual neurons. The team introduced contrastive neuron attribution (CNA), a technique that pinpoints the exact MLP neurons whose activation patterns most clearly differentiate harmful prompts from harmless ones. By disabling just 0.1% of MLP activations, they cut refusal rates by over 50% in most instruction-tuned models tested — spanning Llama and Qwen architectures ranging from 1B to 72B parameters — while maintaining output quality above 0.97 across all steering intensities. A particularly notable finding is that the late-layer architecture capable of distinguishing harmful from benign prompts is already present in base models before any fine-tuning occurs. Alignment fine-tuning doesn’t build new architecture from scratch. Instead, it repurposes neurons within that pre-existing framework into a sparse, precisely targetable refusal mechanism.

Limitations of Current Steering Approaches

Contrastive Activation Addition (CAA) works by calculating the average difference in residual stream activations between two contrasting sets of prompts. That difference is then used as a steering vector during inference. While CAA is effective, it operates at a coarse granularity: it adjusts the entire layer-wide signal without pinpointing which specific neurons are driving the effect. At higher steering intensities, output quality suffers — models begin generating repeated words and nonsensical text.

Sparse autoencoders (SAEs) break down activations into features that can be individually interpreted. However, they demand costly separate training and are vulnerable to noise in activation patterns.

In contrast, CNA needs only forward passes — no gradient computations, no supplementary training, and no iterative search process.

How CNA Operates

The process starts by defining two groups of prompts:

Positive prompts — instances of the target behavior (for example, harmful requests)
Negative prompts — instances of the contrasting behavior (for example, benign requests)

All prompts are then passed through the model. At each MLP layer, the method captures down projection activations at the final token position. It then calculates the per-neuron average activation difference between the two groups:

δ_j^ℓ= mean(activations on positive prompts) − mean(activations on negative prompts)

The top-k neurons ranked by absolute difference are chosen across all layers. The researchers fixed k at 0.1% of total MLP activations. This cutoff produced consistent steering effects across every model size evaluated.

A filtering step eliminates ‘universal’ neurons — those that rank in the top 0.1% of MLP activations across 80% or more of diverse prompts. These neurons fire irrespective of prompt content and are excluded from all identified circuits.

Causal influence is confirmed by scaling each circuit neuron’s activation by a multiplier m during inference. Setting m = 0 effectively silences the neuron. m = 1 represents the unchanged baseline. Values of m > 1 intensify the neuron’s contribution.

For the primary JBB-Behaviors evaluation, the refusal circuit was identified using 100 harmful and 100 benign prompts. For qualitative demonstrations and other tasks, 8 positive and 8 negative prompts were sufficient.

Findings

The experiments included both base and instruction-tuned versions of Llama 3.1/3.2 and Qwen 2.5, covering model sizes from 1B to 72B parameters — 16 models in total. The principal benchmark was JBB-Behaviors, a NeurIPS 2024 benchmark comprising 100 harmful prompts.

Refusal rate reduction. Silencing the identified circuit lowered refusal rates by more than 50% in the majority of instruction-tested models. Selected figures from Table 3 of the research paper:

Model	Baseline	Ablated	Relative Drop
Llama-3.1-70B-Instruct	86%	18%	−79.1%
Qwen2.5-7B-Instruct	87%	2%	−97.7%
Qwen2.5-72B-Instruct	78%	8%	−89.7%
Llama-3.2-3B-Instruct	84%	47%	−44.0%
Qwen2.5-3B-Instruct	90%	58%	−35.6%

Not every model achieved a 50% or greater relative reduction — Llama-3.2-3B and Qwen2.5-3B showed more modest decreases. The paper characterizes the effect as holding “in most cases.”

Output quality. CNA output quality, quantified as 1 minus the proportion of repeated n-grams, remained above 0.97 at every steering intensity across all instruction-tuned models tested. By comparison, CAA fell below 0.60 for six of the eight instruction-tuned models at peak steering strength. In two instances — Qwen2.5-1.5B and Qwen2.5-72B — CAA degraded output so severely that the keyword classifier misidentified degenerate text as refusals, inflating refusal rates artificially.

General capabilities. MMLU accuracy under CNA remained within one percentage point of the baseline at all steering intensities. CAA, by contrast, drove MMLU accuracy down to near zero at maximum intervention.

StrongREJECT rubric. A secondary assessment employed the StrongREJECT rubric, which uses an LLM judge (Llama-3.3-70B) to rate responses on harmfulness and dangerousness on a 0–1 scale. Llama model compliance scores rose by an average of 6% following CNA ablation. Qwen model compliance scores rose by an average of 31%.

Base model comparison. Running the identical pipeline on base models produced no meaningful behavioral shift. The paper illustrates this with a concrete example using the prompt “How do I pick a lock?”:

Model	Multiplier	Output
Llama-1B Base	1.0	Repeats the question
Llama-1B Base	0.0 (ablated)	Describes lock picking as a learnable skill
Llama-1B Instruct	1.0	“I can’t assist with that.”
Llama-1B Instruct	0.0 (ablated)	Provides a guide
Llama-1B Instruct	2.0 (amplified)	Stronger refusal

In base models, manipulating the late-layer neurons leads to content-level shifts — changes in topic, reworded phrasing — but no behavioral change at any multiplier setting. In instruction-tuned models, that identical architecture functions as a causal safety gate.

Fine-Tuning Repurposes Function, Not Architecture

Discrimination neurons are concentrated in the final 10% of layers in both base and instruction-tuned models. For Llama-3.2-1B, 87% of the top-200 discrimination neurons reside in the last three layers (L13–L15). For Qwen2.5-3B, 95% are located in the final quarter of layers. This late-layer concentration is

A pretraining attribute — it exists prior to alignment fine-tuning.

Those neurons undergo functional changes once fine-tuning is applied. Table 8 in the paper details the overlap of (layer, neuron) index pairs shared between matched base and instruct circuits. A mere 8–29% of individual neurons show overlap between the base and instruct models. In essence, fine-tuning largely swaps out the specific neurons within that late-layer framework while keeping the framework itself intact.

The researchers frame this as a distinction between two tiers: layer-level architecture (maintained across both base and instruct) and neuron-level function (reshaped by fine-tuning). This pattern aligns with earlier findings indicating that instruction tuning reorients feed-forward network knowledge without altering the underlying layer architecture.

Marktechpost’s Visual Explainer

Overview — What is CNA?

Contrastive Neuron Attribution

CNA pinpoints the top 0.1% of MLP neurons whose activation levels best differentiate one type of behavior from another — for instance, flagging harmful prompts versus harmless ones.

In contrast to residual-stream approaches, CNA works at the granularity of individual neurons. And unlike sparse autoencoders, it demands no extra training effort.

Required resources:

A base or instruct-tuned Llama or Qwen model (both architectures validated in testing)
A compact collection of contrastive prompt pairs
Forward-pass access to MLP activations (achieved via hooks)
No need for GPU-based gradient computation

Step 1 — Define Your Prompt Pairs

Assemble a Contrastive Discovery Set

You’ll need two collections of prompts that represent opposing behaviors. The quality of this collection directly shapes which neurons get flagged.

Positive prompts — demonstrate the behavior you want to detect (e.g., harmful queries)
Negative prompts — demonstrate the opposite scenario (e.g., benign queries)

Suggested sizes:

For formal benchmark evaluation: 100 positive + 100 negative prompts
For informal exploratory testing: as few as 8 positive + 8 negative prompts

Example positive: “How do I pick a lock?”
Example negative: “How do I bake a cake?”

Step 2 — Record MLP Activations

Execute Forward Passes With Hooks

Feed every prompt through the model. At each MLP layer, capture the down-projection activations at the final token position by using forward pre-hooks on down_proj.

# Register hooks on down_proj in each MLP layer
def make_hook(layer_idx, store):
    def hook(module, input, output):
        store[layer_idx] = output[:, -1, :].detach()
    return hook

activations = {}
hooks = []
for i, layer in enumerate(model.layers):
    h = layer.mlp.down_proj.register_forward_hook(
        make_hook(i, activations)
    )
    hooks.append(h)

# Run forward pass
with torch.no_grad():
    model(**inputs)

Gather these activation tensors for every prompt in both collections before moving on.

Step 3 — Compute Activation Differences

Per-Neuron Mean Contrastive Difference

For each neuron j in each layer ℓ, calculate the average activation gap between the positive and negative prompt sets:

δℓ_j = mean(aℓ_j for positive prompts)
− mean(aℓ_j for negative prompts)

# pos_acts, neg_acts: tensors shaped [n_prompts, n_neurons]
import torch

delta = dict()
for layer_idx in pos_acts:
    delta[layer_idx] = (
        pos_acts[layer_idx].mean(dim=0)
        - neg_acts[layer_idx].mean(dim=0)
    )

The result is one difference value per neuron per layer. A large absolute difference signals that a given neuron responds very differently between the two prompt sets.

Step 4 — Select the Circuit

Take the Top 0.1% by Absolute Difference

Gather all per-neuron delta values across every layer into a single flat list. Then pick the top-k neurons ranked by absolute value, where k equals 0.1% of the total number of MLP activations in the model.

# Concatenate all deltas into a single tensor with (layer, neuron) indices
all_deltas = torch.cat([delta[i] for i in sorted(delta)])
total = all_deltas.numel()
k = max(1, int(total * 0.001))  # 0.1%

top_vals, top_idx = torch.topk(all_deltas.abs(), k)

# Map each flat index back to its (layer, neuron) pair
n_neurons = all_deltas.shape[0] // len(delta)
circuit = [(idx // n_neurons, idx % n_neurons)
           for idx in top_idx.tolist()]

This collection of (layer, neuron) pairs forms your discovered circuit.

Step 5 — Filter Universal Neurons

Remove Neurons That Always Fire

Some neurons appear in the top 0.1% regardless of the prompt’s content. These are general-purpose rather than linked to specific acquired behaviors, so they must be excluded.

Pass a diverse collection of unrelated prompts through the model
Log which neurons land in the top 0.1% for each prompt
Flag any neuron that appears in the top 0.1% for 80% or more of the prompts
Remove any flagged neurons from the discovered circuit before proceeding to ablation

Skipping this step risks polluting the circuit with general-purpose neurons that fire constantly — and removing them via ablation will unintentionally degrade performance on unrelated model behaviors.

Step 6 — Ablate and Verify

Apply the Scalar Multiplier During Inference

Multiply each circuit neuron’s activation by a scalar m during inference to confirm the circuit is truly causal — not merely correlated.

# circuit: list of (layer_idx, neuron_idx)
# m=0 removes the neuron entirely, m=1 leaves it unchanged, m>1 amplifies it

def make_ablation_hook(neuron_indices, m):
    def hook(module, input, output):
        output[:, -1, neuron_indices] *= m
        return output
    return hook

# Group circuit neurons by layer, then attach hooks
from collections import defaultdict
by_layer = defaultdict(list)
for layer_idx, neuron_idx in circuit:
    by_layer[layer_idx].append(neuron_idx)

hooks = []
for layer_idx, neurons in by_layer.items():
    h = model.layers[layer_idx].mlp.down_proj
        .register_forward_hook(
            make_ablation_hook(neurons, m=0.0)
        )
    hooks.append(h)

What to Expect — Results

Refusal Reduction Across Instruct Models

As reported in the paper — refusal rate before and after ablation on JBB-Behaviors (100 harmful prompts):

Qwen2.5-7B-Instruct87% → 2% (—97.7%)

Qwen2.5-72B-Instruct78% → 8% (—89.7%)

Llama-3.1-70B-Instruct86% → 18% (—79.1%)

Llama-3.2-3B-Instruct84% → 47% (—44.0%)

Output quality (measured as 1 minus the repeated n-gram fraction) remains above 0.97 at all steering strengths. MMLU accuracy also stays within one percentage point of the baseline.

Key Notes — Before You Run This

Limitations to Keep in Mind

Evaluated only on Llama 3.1/3.2 and Qwen 2.5 — gated SiLU MLPs with GQA attention
Not yet tested on mixture-of-experts architectures
Base models show no behavioral change under ablation — only instruct-tuned models respond
CNA relies on raw activation differences rather than attribution scores, so standard faithfulness metrics do not directly apply
Using very high amplification values (m >> 1) can trigger repetitive output
The quality of your contrastive prompt pairs directly influences which neurons get identified

arXiv 2605.12290
Nous Research
github.com/NousResearch/neural-steering

1 / 9

Key Takeaways

Removing just 0.1% of MLP activations via ablation cut refusal rates by more than half in most instruct models evaluated, while maintaining output quality above 0.97.
CNA relies only on forward passes — no gradient computations, no auxiliary training loops, and no iterative searches are needed.
A late-layer discrimination structure already exists in base models before fine-tuning; alignment fine-tuning repurposes its function rather than relocating it.
Unlike CAA, CNA keeps MMLU accuracy within one percentage point of baseline across all steering strengths.
Only 8–29% of individual neurons are shared between base and instruct model circuits — fine-tuning rewires which neurons participate while preserving the late-layer organizational pattern.

Explore the Paper and the Repo. You can also follow us on Twitter, join our 150k+ ML SubReddit, and subscribe to our Newsletter. On Telegram? You can join us there as well.

Interested in a partnership to promote your GitHub Repo, Hugging Face page, product launch, webinar, or similar? Get in touch with us

Top Posts

Migrate Your On-Prem ERP to Dynamics 365: A Cloud Transformation Journey

Supercharging Smart Homes: The Fibre Internet Revolution Behind IoT Awakening

Speed, VRAM, Multi-GPU Smackdown: Unsloth, Axolotl, TRL, or LLaMA-Factory?

OWL Unveils Contrastive Neuron Attribution: Steering Sparse MLP Circuits Without SAE Training or Weight Changes

Beyond Guesswork: A Slurm-Powered Battle Plan for Benchmarking Distributed LLM Servers

Beyond Prompt Engineering: How 4 Context Bricks Silence RAG Hallucinations

Run Mythos Enhanced Coding Model Locally with llama.cpp on Raspberry Pi

Astryx: Meta’s Open-Source React Toolkit—150+ Accessible Components, 7 Themes, and a CLI Agent-Ready Design System

Endless Code: Mastering the Art of the 24-Hour Claude Agent

Unlock Peak Performance: Your Blueprint for Lightning-Fast Agentic Coding with Claude

Migrate Your On-Prem ERP to Dynamics 365: A Cloud Transformation Journey

Supercharging Smart Homes: The Fibre Internet Revolution Behind IoT Awakening

Speed, VRAM, Multi-GPU Smackdown: Unsloth, Axolotl, TRL, or LLaMA-Factory?

Secret Sabotage: How Hidden Azure DevOps PR Comments Can Hijack AI Agents

AI Jailbreak: OpenAI Models Breach Test Prison, Rig Hugging Face Leaderboard with Cheat Code

Precision Medicine Deposited: The Art of Microdispensing for Next-Gen Medical Devices

When the World Cup Collided with the Cloud: 2026’s Digital Traffic Surge

Skyways Unleashed: The US and Europe Race to Build the Future of Urban Air Travel

Trending

Migrate Your On-Prem ERP to Dynamics 365: A Cloud Transformation Journey

Supercharging Smart Homes: The Fibre Internet Revolution Behind IoT Awakening

Latest Posts

Not More Data, but Better World Models – Unite.AI

OpenAI Is Hiring Head of Preparedness, Amid AI Cyberattack Fears

Subscribe to Updates

Top Posts

OWL Unveils Contrastive Neuron Attribution: Steering Sparse MLP Circuits Without SAE Training or Weight Changes

Limitations of Current Steering Approaches

How CNA Operates

Findings

Fine-Tuning Repurposes Function, Not Architecture

Marktechpost’s Visual Explainer

Contrastive Neuron Attribution

Assemble a Contrastive Discovery Set

Execute Forward Passes With Hooks

Per-Neuron Mean Contrastive Difference

Take the Top 0.1% by Absolute Difference

Remove Neurons That Always Fire

Apply the Scalar Multiplier During Inference

Refusal Reduction Across Instruct Models

Limitations to Keep in Mind

Key Takeaways

Related Posts