LLMs have consistent response styles even without a system prompt. I measure these "behavioral fingerprints" by projecting hidden states onto contrastive axes, and find that instruct fine-tuning is associated with reduced steerability on specific axes. ("Personality" here means a stable response style, not human-like internal states.) Contributions:
Findings:
Code: github.com/yunoshev/mood-axis | Which models should I test next? Currently limited to 7-9B. Details below. Extended discussion on r/LocalLLaMA: original post.

**Key Results**

**1. Distinct fingerprints**

Every model's default profile across 7 axes. No system prompt. Values = hidden-state projections normalized by the calibration IQR.
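The axis construction and IQR normalization can be sketched roughly as follows. This is a minimal sketch under stated assumptions: the function names, the difference-of-means axis, and the median-centering choice are mine, not necessarily the repo's exact scheme.

```python
import numpy as np

def contrastive_axis(pos_states: np.ndarray, neg_states: np.ndarray) -> np.ndarray:
    """Unit vector pointing from the mean hidden state of one calibration
    pole to the mean hidden state of the other."""
    axis = pos_states.mean(axis=0) - neg_states.mean(axis=0)
    return axis / np.linalg.norm(axis)

def project_iqr(hidden: np.ndarray, axis: np.ndarray, calib_proj: np.ndarray) -> float:
    """Dot-product projection, centered on the calibration median and
    scaled by the calibration interquartile range."""
    q1, q3 = np.percentile(calib_proj, [25, 75])
    return float((hidden @ axis - np.median(calib_proj)) / (q3 - q1))
```

IQR scaling makes projections comparable across models whose hidden-state norms differ wildly, which matters when comparing fingerprints across 6 architectures.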
**2. Instruct models show reduced behavioral dimensionality**

Observation. PCA on the baseline projection matrices reveals a spectrum of behavioral dimensionality. Gemma 2 9B IT shows the highest concentration (PC1 = 87.9%), possibly driven by variable response length rather than behavioral collapse. Axis vectors are geometrically near-orthogonal (low |cos|), but projections are behaviorally correlated (higher |r|).

Interpretation. This gap is consistent with fine-tuning constraining how models use their representational capacity, but other explanations exist: inherent semantic correlations between axes, the SFT data distribution, chat-template effects, or decoding strategy could all contribute. We observe the pattern across 6 models from 5 organizations, but cannot isolate which component of the instruct pipeline drives it.

Length confound control. Response length could drive spurious axis correlations. I computed per-model Pearson r between n_tokens and each axis projection across 30 baseline questions. Result: 6/7 axes are clean (mean |r| < 0.3 across models). Only verbose/concise is partially confounded (mean r = 0.50), which is expected: longer responses really are more verbose. Cross-axis correlations drop by only 7.7% after regressing out length, confirming the behavioral bundling is not a length artifact.
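Regressing length out of the projections, as in the confound control above, can be done with plain OLS residualization. A sketch; the post does not specify the exact regression setup, so the with-intercept linear model here is an assumption.

```python
import numpy as np

def residualize(projections: np.ndarray, n_tokens: np.ndarray) -> np.ndarray:
    """Residuals of axis projections after OLS regression on token count
    (with intercept); residuals are uncorrelated with length by construction."""
    X = np.column_stack([np.ones(len(n_tokens)), n_tokens.astype(float)])
    beta, *_ = np.linalg.lstsq(X, projections, rcond=None)
    return projections - X @ beta
```

Cross-axis Pearson correlations can then be recomputed on the residuals to see how much of the behavioral bundling survives once length is removed.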
Base versions of 5 models (Llama, Yi, Qwen, Mistral, Gemma) show higher variability on most axes than their instruct counterparts. Most extreme: verbose/concise std ratio = 0.13 (87% lower in the instruct model). All 5 organizations show the same direction, though this is observational: base and instruct models differ in many ways beyond alignment. Gemma base cannot distinguish empathetic/analytical or formal/casual at all (50% accuracy = chance), while the instruct version can, suggesting these particular axes may reflect distinctions introduced during fine-tuning rather than suppressed by it.

[IMAGE: pca_calibration_contrast: PCA scatter, Qwen vs Yi] PCA of calibration hidden states. Left: Qwen 2.5 7B (d' = 5.0-12.0): diverse axis directions, poles clearly separated. Right: Yi 1.5 9B (d' = 2.2-5.4): lower separability, but all axes still discriminate.

**3. Dead zones and the ICC dissociation**

I introduce a composite Dead Zone Severity metric (0 = healthy, 1 = dead) combining calibration accuracy (30%), d' (30%), stability cosine (20%), and baseline SNR (20%). The weights are heuristic; I chose them to balance discrimination, stability, and effect size, but other weightings could shift individual model rankings. Three dead-zone types: hard (fine-tuning suppresses differentiation), soft (unstable across calibration sets), and asymmetric (the model follows instructions in only one direction, e.g. Llama achieves 100% for "be concise" but 0% for "be verbose"). An interesting pattern is the dissociation between reliability and validity: mean ICC (test-retest, 5 seeds) is 0.91-0.99 across models, and all 42 model-axis pairs exceed 0.75, yet Llama's benchmark pass rate is 60%.
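A minimal sketch of the composite severity score using the weights stated above. The per-component normalizations (accuracy rescaled from chance, d' and SNR capped before averaging) are my assumptions; only the 30/30/20/20 weighting comes from the post.

```python
def dead_zone_severity(cal_acc: float, d_prime: float,
                       stability_cos: float, snr: float,
                       d_cap: float = 5.0, snr_cap: float = 3.0) -> float:
    """0 = healthy, 1 = dead. Weights from the post: calibration accuracy
    30%, d' 30%, stability cosine 20%, baseline SNR 20%."""
    acc_h = max(0.0, (cal_acc - 0.5) / 0.5)     # 50% accuracy (chance) -> 0 health
    d_h = min(max(d_prime, 0.0) / d_cap, 1.0)   # cap d' so one axis can't dominate
    stab_h = max(0.0, stability_cos)            # negative cosine earns no credit
    snr_h = min(max(snr, 0.0) / snr_cap, 1.0)
    return 1.0 - (0.3 * acc_h + 0.3 * d_h + 0.2 * stab_h + 0.2 * snr_h)
```

With capped components, a fully healthy axis scores 0 and a chance-level, unstable axis scores 1; different caps would shift individual rankings, as the post itself cautions.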
This is partly expected (a model that always outputs neutral text will have high ICC and low benchmark scores), but the degree of dissociation varies across models, suggesting it captures something beyond trivial low-variance cases.

Text-level validation. I computed text-level compliance metrics (token count, hedging markers, emotion words) between opposite calibration poles across all 6 models × 7 axes. Spearman correlation between calibration accuracy and text-level effect size (Cohen's d): r = 0.47, p = 0.002 (n = 42). Caveat: text metrics and hidden states are not fully independent; both are derived from the same generated text, so this correlation partly reflects consistency between two views of the same data rather than independent validation. Still, it confirms that dead zones manifest in observable text, not just in internal representations.

External validation (Claude Opus 4.6 as an independent judge). To address the circularity concern above, I had Claude Opus rate 48 baseline responses (8 per model, no system prompt) on all 7 axes using a −2 to +2 scale, based solely on text: no access to hidden states or knowledge of our measurement method. Per-axis Spearman correlations with hidden-state projections:
3/7 axes reach p < 0.05, with 2 robust under bootstrap (warm/cold and formal/casual: 95% CI excludes 0). Pooled r = 0.38 [0.29, 0.47, bootstrap 95% CI]. Leave-one-model-out: pooled r ranges from +0.30 to +0.58, so no single model drives the result. The negative correlation on proactive_reluctant is informative: it is driven by Llama (a dead zone: hidden states say "reluctant" while the text is structured and proactive) and DeepSeek (a ceiling: projections saturate at +1.00 while Claude sees neutral text). This is exactly the dead-zone phenomenon: hidden-state projections and observable text diverge on constrained axes. verbose_concise shows no correlation: Claude rates "verbosity" qualitatively, while our projection tracks length-correlated hidden-state variation. A prompt robustness test (5 formulations × 3 models × 3 axes) confirms that dead zones persist across phrasings.

**Methodology (4 steps)**
Config chosen for cross-model robustness via a 150+ configuration ablation (layer selection × token aggregation × weighting). It is not optimal per model, but it is the only config that works 85-100% on all 5 ablated models.
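One of the token-aggregation options in such a grid, decay weighting (also referenced later in the thinking-mode caveats, where roughly the last ~50 tokens dominate the projection), might look like this. The exponential form and the half-life value are illustrative assumptions, not the repo's exact implementation.

```python
import numpy as np

def decay_weighted_mean(hidden_states: np.ndarray, half_life: float = 16.0) -> np.ndarray:
    """Aggregate per-token hidden states (shape [T, d]) with exponential
    decay toward the end of the sequence, so late tokens dominate."""
    T = hidden_states.shape[0]
    ages = np.arange(T - 1, -1, -1, dtype=float)  # age 0 = final token
    w = 0.5 ** (ages / half_life)                 # halve the weight every half_life tokens
    w /= w.sum()
    return w @ hidden_states
```

With a half-life of 16 tokens, almost 90% of the weight sits on the last ~50 tokens, which is one way an aggregation choice can make projections insensitive to early-response content.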
**Limitations**
More details in the repo README: conflict drift (20 scenarios × 12 turns), cross-axis correlations, full methodology.

**Follow-up: Phi-4, Qwen3, and Thinking Mode**

After posting this work on r/LocalLLaMA, several people asked about newer models. I ran a shortened pipeline (calibration + baseline + benchmark, no drift/stability) on two more models in ~30 min on 2×H100 (~$6):

Phi-4 (Microsoft, 14B), the first model outside the 7-9B range. The most extreme cautious/reluctant profile in the entire set: cold (−0.51), highly cautious (−0.85), strongly reluctant (−0.93). Polar opposite of DeepSeek on the confidence and proactivity axes. Verbose/concise is in a dead zone (+0.01). Benchmark: 3/9. Phi-4 can only decrease along axes (be cold, be cautious, be concise) but fails to shift in the positive direction, suggesting a strong "conservative" alignment prior.

Qwen3-8B vs Qwen 2.5 7B: a generational fingerprint shift. Same family, one generation apart. Two axes invert: confident/cautious flips from −0.36 to +0.38 (Δ = +0.74), formal/casual flips from +0.42 to −0.26 (Δ = −0.67). Proactive/reluctant stays essentially identical (+0.47 → +0.45). Qwen3 achieves the highest benchmark pass rate in the full set (7/9). Behavioral fingerprints are not stable across model generations, but some axes are more persistent than others within a family.

Thinking vs non-thinking mode (Qwen3-8B). Same weights, same calibration axes; the only difference is whether thinking mode is enabled. Control experiment (max_new_tokens=4096, n=10, 100% visible responses): comparing the visible response after thinking vs the non-thinking response on the same questions.
The original confidence drop reverses sign when properly controlled: thinking mode makes the model more confident, not less. The largest genuine shifts are on proactivity (less proactive) and verbosity (less verbose after thinking). This demonstrates the importance of separating the visible response from the thinking segment when measuring. Caveats: n=10 (PoC subset), single model, and decay-weighted aggregation means only the last ~50 tokens of each segment contribute to the projections.

**Reproducing**

Pre-computed axes are included, so you can measure any model's fingerprint without re-running calibration.

What I'd love feedback on:
P.S. I have a full paper version (LaTeX, ~20 pages with methodology, ablations, and reproducibility details). Do you think this is worth putting on arXiv? If so, I'd be grateful for an endorsement for cs.CL or cs.LG; happy to share the draft via DM.

submitted by /u/yunoshev
![[R] I probed 6 open-weight LLMs (7B-9B) for "personality" using hidden states — instruct fine-tuning is associated with measurable behavioral constraints](https://technologiesdigest.com/wp-content/uploads/2026/02/R-I-probed-6-open-weight-LLMs-7B-9B-for-quotpersonalityquot-using.png)