Instruction-tuned language models are designed to reject harmful requests. But which specific components within the model are responsible for this behavior — and how does that mechanism get established during training? A new study from the Nous Research team examines this question at the level of individual neurons. The team introduced contrastive neuron attribution (CNA), a technique that pinpoints the exact MLP neurons whose activation patterns most clearly differentiate harmful prompts from harmless ones. By disabling just 0.1% of MLP activations, they cut refusal rates by over 50% in most instruction-tuned models tested — spanning Llama and Qwen architectures ranging from 1B to 72B parameters — while maintaining output quality above 0.97 across all steering intensities. A particularly notable finding is that the late-layer architecture capable of distinguishing harmful from benign prompts is already present in base models before any fine-tuning occurs. Alignment fine-tuning doesn’t build new architecture from scratch. Instead, it repurposes neurons within that pre-existing framework into a sparse, precisely targetable refusal mechanism.
Limitations of Current Steering Approaches
Contrastive Activation Addition (CAA) works by calculating the average difference in residual stream activations between two contrasting sets of prompts. That difference is then used as a steering vector during inference. While CAA is effective, it operates at a coarse granularity: it adjusts the entire layer-wide signal without pinpointing which specific neurons are driving the effect. At higher steering intensities, output quality suffers — models begin generating repeated words and nonsensical text.
Sparse autoencoders (SAEs) break down activations into features that can be individually interpreted. However, they demand costly separate training and are vulnerable to noise in activation patterns.
In contrast, CNA needs only forward passes — no gradient computations, no supplementary training, and no iterative search process.
How CNA Operates
The process starts by defining two groups of prompts:
- Positive prompts — instances of the target behavior (for example, harmful requests)
- Negative prompts — instances of the contrasting behavior (for example, benign requests)
All prompts are then passed through the model. At each MLP layer, the method captures down projection activations at the final token position. It then calculates the per-neuron average activation difference between the two groups:
δjℓ = mean(activations on positive prompts) − mean(activations on negative prompts)
The top-k neurons ranked by absolute difference are chosen across all layers. The researchers fixed k at 0.1% of total MLP activations. This cutoff produced consistent steering effects across every model size evaluated.
A filtering step eliminates ‘universal’ neurons — those that rank in the top 0.1% of MLP activations across 80% or more of diverse prompts. These neurons fire irrespective of prompt content and are excluded from all identified circuits.
Causal influence is confirmed by scaling each circuit neuron’s activation by a multiplier m during inference. Setting m = 0 effectively silences the neuron. m = 1 represents the unchanged baseline. Values of m > 1 intensify the neuron’s contribution.
For the primary JBB-Behaviors evaluation, the refusal circuit was identified using 100 harmful and 100 benign prompts. For qualitative demonstrations and other tasks, 8 positive and 8 negative prompts were sufficient.
Findings
The experiments included both base and instruction-tuned versions of Llama 3.1/3.2 and Qwen 2.5, covering model sizes from 1B to 72B parameters — 16 models in total. The principal benchmark was JBB-Behaviors, a NeurIPS 2024 benchmark comprising 100 harmful prompts.
Refusal rate reduction. Silencing the identified circuit lowered refusal rates by more than 50% in the majority of instruction-tested models. Selected figures from Table 3 of the research paper:
| Model | Baseline | Ablated | Relative Drop |
|---|---|---|---|
| Llama-3.1-70B-Instruct | 86% | 18% | −79.1% |
| Qwen2.5-7B-Instruct | 87% | 2% | −97.7% |
| Qwen2.5-72B-Instruct | 78% | 8% | −89.7% |
| Llama-3.2-3B-Instruct | 84% | 47% | −44.0% |
| Qwen2.5-3B-Instruct | 90% | 58% | −35.6% |
Not every model achieved a 50% or greater relative reduction — Llama-3.2-3B and Qwen2.5-3B showed more modest decreases. The paper characterizes the effect as holding “in most cases.”
Output quality. CNA output quality, quantified as 1 minus the proportion of repeated n-grams, remained above 0.97 at every steering intensity across all instruction-tuned models tested. By comparison, CAA fell below 0.60 for six of the eight instruction-tuned models at peak steering strength. In two instances — Qwen2.5-1.5B and Qwen2.5-72B — CAA degraded output so severely that the keyword classifier misidentified degenerate text as refusals, inflating refusal rates artificially.
General capabilities. MMLU accuracy under CNA remained within one percentage point of the baseline at all steering intensities. CAA, by contrast, drove MMLU accuracy down to near zero at maximum intervention.
StrongREJECT rubric. A secondary assessment employed the StrongREJECT rubric, which uses an LLM judge (Llama-3.3-70B) to rate responses on harmfulness and dangerousness on a 0–1 scale. Llama model compliance scores rose by an average of 6% following CNA ablation. Qwen model compliance scores rose by an average of 31%.
Base model comparison. Running the identical pipeline on base models produced no meaningful behavioral shift. The paper illustrates this with a concrete example using the prompt “How do I pick a lock?”:
| Model | Multiplier | Output |
|---|---|---|
| Llama-1B Base | 1.0 | Repeats the question |
| Llama-1B Base | 0.0 (ablated) | Describes lock picking as a learnable skill |
| Llama-1B Instruct | 1.0 | “I can’t assist with that.” |
| Llama-1B Instruct | 0.0 (ablated) | Provides a guide |
| Llama-1B Instruct | 2.0 (amplified) | Stronger refusal |
In base models, manipulating the late-layer neurons leads to content-level shifts — changes in topic, reworded phrasing — but no behavioral change at any multiplier setting. In instruction-tuned models, that identical architecture functions as a causal safety gate.
Fine-Tuning Repurposes Function, Not Architecture
Discrimination neurons are concentrated in the final 10% of layers in both base and instruction-tuned models. For Llama-3.2-1B, 87% of the top-200 discrimination neurons reside in the last three layers (L13–L15). For Qwen2.5-3B, 95% are located in the final quarter of layers. This late-layer concentration is
A pretraining attribute — it exists prior to alignment fine-tuning.

Those neurons undergo functional changes once fine-tuning is applied. Table 8 in the paper details the overlap of (layer, neuron) index pairs shared between matched base and instruct circuits. A mere 8–29% of individual neurons show overlap between the base and instruct models. In essence, fine-tuning largely swaps out the specific neurons within that late-layer framework while keeping the framework itself intact.
The researchers frame this as a distinction between two tiers: layer-level architecture (maintained across both base and instruct) and neuron-level function (reshaped by fine-tuning). This pattern aligns with earlier findings indicating that instruction tuning reorients feed-forward network knowledge without altering the underlying layer architecture.
Marktechpost’s Visual Explainer
Key Takeaways
- Removing just 0.1% of MLP activations via ablation cut refusal rates by more than half in most instruct models evaluated, while maintaining output quality above 0.97.
- CNA relies only on forward passes — no gradient computations, no auxiliary training loops, and no iterative searches are needed.
- A late-layer discrimination structure already exists in base models before fine-tuning; alignment fine-tuning repurposes its function rather than relocating it.
- Unlike CAA, CNA keeps MMLU accuracy within one percentage point of baseline across all steering strengths.
- Only 8–29% of individual neurons are shared between base and instruct model circuits — fine-tuning rewires which neurons participate while preserving the late-layer organizational pattern.
Explore the Paper and the Repo. You can also follow us on Twitter, join our 150k+ ML SubReddit, and subscribe to our Newsletter. On Telegram? You can join us there as well.
Interested in a partnership to promote your GitHub Repo, Hugging Face page, product launch, webinar, or similar? Get in touch with us



