LLMs have consistent response styles even without a system prompt. I measure these "behavioral fingerprints" by projecting hidden states onto contrastive axes, and find that instruct fine-tuning is associated with reduced steerability on specific axes. ("Personality" here means a stable response style, not human-like internal states.) Contributions:
Findings:
Code: github.com/yunoshev/mood-axis | Which models should I test next? Currently limited to 7-9B. Details below. Extended discussion on r/LocalLLaMA: original post.

**Key Results**

**1. Distinct fingerprints**

Every model's default profile across 7 axes. No system prompt. Values = hidden-state projections normalized by the calibration IQR.
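The axis construction and IQR normalization can be sketched roughly as follows. This is a minimal sketch under stated assumptions: the function names, the difference-of-means axis, and the median-centering choice are mine, not necessarily the repo's exact scheme.

```python
import numpy as np

def contrastive_axis(pos_states: np.ndarray, neg_states: np.ndarray) -> np.ndarray:
    """Unit vector pointing from the mean hidden state of one calibration
    pole to the mean hidden state of the other."""
    axis = pos_states.mean(axis=0) - neg_states.mean(axis=0)
    return axis / np.linalg.norm(axis)

def project_iqr(hidden: np.ndarray, axis: np.ndarray, calib_proj: np.ndarray) -> float:
    """Dot-product projection, centered on the calibration median and
    scaled by the calibration interquartile range."""
    q1, q3 = np.percentile(calib_proj, [25, 75])
    return float((hidden @ axis - np.median(calib_proj)) / (q3 - q1))
```

IQR scaling makes projections comparable across models whose hidden-state norms differ wildly, which matters when comparing fingerprints across 6 architectures.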
**2. Instruct models show reduced behavioral dimensionality**

Observation. PCA on the baseline projection matrices reveals a spectrum of behavioral dimensionality. Gemma 2 9B IT shows the highest concentration (PC1 = 87.9%), possibly driven by variable response length rather than behavioral collapse. Axis vectors are geometrically near-orthogonal (low |cos|), but projections are behaviorally correlated (higher |r|).

Interpretation. This gap is consistent with fine-tuning constraining how models use their representational capacity, but other explanations exist: inherent semantic correlations between axes, the SFT data distribution, chat-template effects, or decoding strategy could all contribute. We observe the pattern across 6 models from 5 organizations, but cannot isolate which component of the instruct pipeline drives it.

Length confound control. Response length could drive spurious axis correlations. I computed per-model Pearson r between n_tokens and each axis projection across 30 baseline questions. Result: 6/7 axes are clean (mean |r| < 0.3 across models). Only verbose/concise is partially confounded (mean r = 0.50), which is expected: longer responses really are more verbose. Cross-axis correlations drop by only 7.7% after regressing out length, confirming the behavioral bundling is not a length artifact.
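Regressing length out of the projections, as in the confound control above, can be done with plain OLS residualization. A sketch; the post does not specify the exact regression setup, so the with-intercept linear model here is an assumption.

```python
import numpy as np

def residualize(projections: np.ndarray, n_tokens: np.ndarray) -> np.ndarray:
    """Residuals of axis projections after OLS regression on token count
    (with intercept); residuals are uncorrelated with length by construction."""
    X = np.column_stack([np.ones(len(n_tokens)), n_tokens.astype(float)])
    beta, *_ = np.linalg.lstsq(X, projections, rcond=None)
    return projections - X @ beta
```

Cross-axis Pearson correlations can then be recomputed on the residuals to see how much of the behavioral bundling survives once length is removed.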
Base versions of 5 models (Llama, Yi, Qwen, Mistral, Gemma) show higher variability on most axes than their instruct counterparts. Most extreme: verbose/concise std ratio = 0.13 (87% lower in the instruct model). All 5 organizations show the same direction, though this is observational: base and instruct models differ in many ways beyond alignment. Gemma base cannot distinguish empathetic/analytical or formal/casual at all (50% accuracy = chance), while the instruct version can, suggesting these particular axes may reflect distinctions introduced during fine-tuning rather than suppressed by it.

[IMAGE: pca_calibration_contrast: PCA scatter, Qwen vs Yi] PCA of calibration hidden states. Left: Qwen 2.5 7B (d' = 5.0-12.0): diverse axis directions, poles clearly separated. Right: Yi 1.5 9B (d' = 2.2-5.4): lower separability, but all axes still discriminate.

**3. Dead zones and the ICC dissociation**

I introduce a composite Dead Zone Severity metric (0 = healthy, 1 = dead) combining calibration accuracy (30%), d' (30%), stability cosine (20%), and baseline SNR (20%). The weights are heuristic; I chose them to balance discrimination, stability, and effect size, but other weightings could shift individual model rankings. Three dead-zone types: hard (fine-tuning suppresses differentiation), soft (unstable across calibration sets), and asymmetric (the model follows instructions in only one direction, e.g. Llama achieves 100% for "be concise" but 0% for "be verbose"). An interesting pattern is the dissociation between reliability and validity: mean ICC (test-retest, 5 seeds) is 0.91-0.99 across models, and all 42 model-axis pairs exceed 0.75, yet Llama's benchmark pass rate is 60%.
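A minimal sketch of the composite severity score using the weights stated above. The per-component normalizations (accuracy rescaled from chance, d' and SNR capped before averaging) are my assumptions; only the 30/30/20/20 weighting comes from the post.

```python
def dead_zone_severity(cal_acc: float, d_prime: float,
                       stability_cos: float, snr: float,
                       d_cap: float = 5.0, snr_cap: float = 3.0) -> float:
    """0 = healthy, 1 = dead. Weights from the post: calibration accuracy
    30%, d' 30%, stability cosine 20%, baseline SNR 20%."""
    acc_h = max(0.0, (cal_acc - 0.5) / 0.5)     # 50% accuracy (chance) -> 0 health
    d_h = min(max(d_prime, 0.0) / d_cap, 1.0)   # cap d' so one axis can't dominate
    stab_h = max(0.0, stability_cos)            # negative cosine earns no credit
    snr_h = min(max(snr, 0.0) / snr_cap, 1.0)
    return 1.0 - (0.3 * acc_h + 0.3 * d_h + 0.2 * stab_h + 0.2 * snr_h)
```

With capped components, a fully healthy axis scores 0 and a chance-level, unstable axis scores 1; different caps would shift individual rankings, as the post itself cautions.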
This is partly expected (a model that always outputs neutral text will have high ICC and low benchmark scores), but the degree of dissociation varies across models, suggesting it captures something beyond trivial low-variance cases.

Text-level validation. I computed text-level compliance metrics (token count, hedging markers, emotion words) between opposite calibration poles across all 6 models × 7 axes. Spearman correlation between calibration accuracy and text-level effect size (Cohen's d): r = 0.47, p = 0.002 (n = 42). Caveat: text metrics and hidden states are not fully independent; both are derived from the same generated text, so this correlation partly reflects consistency between two views of the same data rather than independent validation. Still, it confirms that dead zones manifest in observable text, not just in internal representations.

External validation (Claude Opus 4.6 as an independent judge). To address the circularity concern above, I had Claude Opus rate 48 baseline responses (8 per model, no system prompt) on all 7 axes using a −2 to +2 scale, based solely on text: no access to hidden states or knowledge of our measurement method. Per-axis Spearman correlations with hidden-state projections:
3/7 axes reach p < 0.05, with 2 robust under bootstrap (warm/cold and formal/casual: 95% CI excludes 0). Pooled r = 0.38 [0.29, 0.47, bootstrap 95% CI]. Leave-one-model-out: pooled r ranges from +0.30 to +0.58, so no single model drives the result. The negative correlation on proactive_reluctant is informative: it is driven by Llama (a dead zone: hidden states say "reluctant" while the text is structured and proactive) and DeepSeek (a ceiling: projections saturate at +1.00 while Claude sees neutral text). This is exactly the dead-zone phenomenon: hidden-state projections and observable text diverge on constrained axes. verbose_concise shows no correlation: Claude rates "verbosity" qualitatively, while our projection tracks length-correlated hidden-state variation. A prompt robustness test (5 formulations × 3 models × 3 axes) confirms that dead zones persist across phrasings.

**Methodology (4 steps)**
Config chosen for cross-model robustness via a 150+ configuration ablation (layer selection × token aggregation × weighting). It is not optimal per model, but it is the only config that works 85-100% on all 5 ablated models.
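One of the token-aggregation options in such a grid, decay weighting (also referenced later in the thinking-mode caveats, where roughly the last ~50 tokens dominate the projection), might look like this. The exponential form and the half-life value are illustrative assumptions, not the repo's exact implementation.

```python
import numpy as np

def decay_weighted_mean(hidden_states: np.ndarray, half_life: float = 16.0) -> np.ndarray:
    """Aggregate per-token hidden states (shape [T, d]) with exponential
    decay toward the end of the sequence, so late tokens dominate."""
    T = hidden_states.shape[0]
    ages = np.arange(T - 1, -1, -1, dtype=float)  # age 0 = final token
    w = 0.5 ** (ages / half_life)                 # halve the weight every half_life tokens
    w /= w.sum()
    return w @ hidden_states
```

With a half-life of 16 tokens, almost 90% of the weight sits on the last ~50 tokens, which is one way an aggregation choice can make projections insensitive to early-response content.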
**Limitations**
More details in the repo README: conflict drift (20 scenarios × 12 turns), cross-axis correlations, full methodology.

**Follow-up: Phi-4, Qwen3, and Thinking Mode**

After posting this work on r/LocalLLaMA, several people asked about newer models. I ran a shortened pipeline (calibration + baseline + benchmark, no drift/stability) on two more models in ~30 min on 2×H100 (~$6):

Phi-4 (Microsoft, 14B), the first model outside the 7-9B range. The most extreme cautious/reluctant profile in the entire set: cold (−0.51), highly cautious (−0.85), strongly reluctant (−0.93). Polar opposite of DeepSeek on the confidence and proactivity axes. Verbose/concise is in a dead zone (+0.01). Benchmark: 3/9. Phi-4 can only decrease along axes (be cold, be cautious, be concise) but fails to shift in the positive direction, suggesting a strong "conservative" alignment prior.

Qwen3-8B vs Qwen 2.5 7B: a generational fingerprint shift. Same family, one generation apart. Two axes invert: confident/cautious flips from −0.36 to +0.38 (Δ = +0.74), formal/casual flips from +0.42 to −0.26 (Δ = −0.67). Proactive/reluctant stays essentially identical (+0.47 → +0.45). Qwen3 achieves the highest benchmark pass rate in the full set (7/9). Behavioral fingerprints are not stable across model generations, but some axes are more persistent than others within a family.

Thinking vs non-thinking mode (Qwen3-8B). Same weights, same calibration axes; the only difference is whether thinking mode is enabled. Control experiment (max_new_tokens=4096, n=10, 100% visible responses): comparing the visible response after thinking vs the non-thinking response on the same questions.
The original confidence drop reverses sign when properly controlled: thinking mode makes the model more confident, not less. The largest genuine shifts are on proactivity (less proactive) and verbosity (less verbose after thinking). This demonstrates the importance of separating the visible response from the thinking segment when measuring. Caveats: n=10 (PoC subset), single model, and decay-weighted aggregation means only the last ~50 tokens of each segment contribute to the projections.

**Reproducing**

Pre-computed axes are included, so you can measure any model's fingerprint without re-running calibration.

What I'd love feedback on:
P.S. I have a full paper version (LaTeX, ~20 pages with methodology, ablations, and reproducibility details). Do you think this is worth putting on arXiv? If so, I'd be grateful for an endorsement for cs.CL or cs.LG; happy to share the draft via DM.

submitted by /u/yunoshev
![[R] I probed 6 open-weight LLMs (7B-9B) for "personality" using hidden states — instruct fine-tuning is associated with measurable behavioral constraints](https://technologiesdigest.com/wp-content/uploads/2026/02/R-I-probed-6-open-weight-LLMs-7B-9B-for-quotpersonalityquot-using.png)