Nous Research Unveils Token Superposition Training: Accelerating LLM Pre-Training By Up To 2.5x For Models Ranging From 270M To 10B Parameters

Training large language models from scratch is so costly that even small efficiency gains can lead to significant savings in both time and expense. Nous Research has introduced Token Superposition Training (TST), a technique that dramatically cuts pre-training wall-clock time without altering the model architecture, optimizer, tokenizer, parallelism strategy, or training data.

At the 10B-A1B mixture-of-experts scale, TST achieves a lower final training loss than a matched-FLOPs baseline while using only 4,768 B200-GPU-hours compared to the baseline’s 12,311 — roughly a 2.5× reduction in total pre-training time.

The Problem TST Addresses

Modern LLM pre-training is overwhelmingly data-hungry. Current training practices routinely push well past compute-optimal estimates, and raw text throughput — how much data a model can absorb per FLOP — has become a critical bottleneck. Subword tokenizers like BPE already boost throughput by compressing sequences, and research indicates that much of the BPE advantage over byte-level models stems simply from shorter sequences, meaning the model processes more text per unit of compute.

TST explores whether this throughput lever can be pushed even further during training, independently of the tokenizer and without permanently modifying the model.

How TST Works: Two Phases

TST modifies the standard pre-training loop in two sequential phases:

Phase 1 — Superposition: For the first r fraction of total training steps (the paper finds r ∈ [0.2, 0.4] to be near-optimal across tested scales), the model does not receive individual tokens. Instead, the input sequence of length L is divided into non-overlapping groups of s consecutive tokens. In the embedding layer, each group is collapsed into a single latent “s-token” by averaging the s token embeddings. The transformer then processes a sequence of length L/s.
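As a minimal, framework-free sketch of the input side, the folding and averaging step could look like the following. The helper names `fold_into_bags` and `average_bag_embeddings` and the toy embedding table are illustrative assumptions, not taken from the paper's code.

```python
from typing import Dict, List


def fold_into_bags(tokens: List[int], s: int) -> List[List[int]]:
    """Split a token sequence into non-overlapping bags of s consecutive tokens."""
    if s < 1 or len(tokens) % s != 0:
        raise ValueError("sequence length must be a positive multiple of s")
    return [tokens[i:i + s] for i in range(0, len(tokens), s)]


def average_bag_embeddings(
    bags: List[List[int]], table: Dict[int, List[float]]
) -> List[List[float]]:
    """Collapse each bag into one latent vector by averaging its token embeddings."""
    latents = []
    for bag in bags:
        dim = len(table[bag[0]])
        mean = [sum(table[tok][d] for tok in bag) / len(bag) for d in range(dim)]
        latents.append(mean)
    return latents


# Toy 2-dimensional embedding table (hypothetical values, for illustration only).
toy_table = {0: [0.0, 0.0], 1: [1.0, 0.0], 2: [0.0, 1.0], 3: [1.0, 1.0]}
toy_tokens = [0, 1, 2, 3, 3, 3]            # data sequence of length 6
toy_bags = fold_into_bags(toy_tokens, s=3)  # two bags of three tokens
toy_latents = average_bag_embeddings(toy_bags, toy_table)
```

With s = 3, six data tokens become two latent positions, so the transformer sees a sequence one third as long. Because the mean ignores order, shuffling the tokens inside a bag leaves its latent vector unchanged.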

Crucially, each TST step is kept equal-FLOPs to a standard training step by increasing the data sequence length by a factor of s during the superposition phase. Because each latent position corresponds to s source tokens, the model ingests s times as much text per unit of compute — this is what drives the throughput gain.
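To make the bookkeeping concrete, here is a rough sketch of how much data a TST run reads relative to a standard run with the same number of equal-FLOPs steps. It assumes every Phase 1 step reads s times the tokens of a standard step and every Phase 2 step reads the standard amount. The function name `data_multiplier` is illustrative, and the values below are plain arithmetic on hyperparameters quoted later in this article, not figures reported by the paper.

```python
def data_multiplier(s: int, r: float) -> float:
    """Tokens read by a TST run divided by tokens read by a standard run
    with the same number of equal-FLOPs steps.

    Assumes Phase 1 (fraction r of steps) reads s times as many tokens per
    step and Phase 2 (fraction 1 - r) reads the standard amount.
    """
    if not 0.0 <= r <= 1.0:
        raise ValueError("r must lie in [0, 1]")
    if s < 1:
        raise ValueError("s must be at least 1")
    return r * s + (1.0 - r)


# Settings quoted for the 3B run (s = 6, r = 0.3) and, approximately,
# for the 10B-A1B run (s = 16, r of about 0.25).
mult_3b = data_multiplier(s=6, r=0.3)
mult_moe = data_multiplier(s=16, r=0.25)
```

Under these assumptions the 3B setting reads about 2.5 times the data of a same-length standard run, which illustrates why the method presupposes plentiful training data.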

On the output side, each latent position predicts the next group of s tokens rather than a single next token. The standard cross-entropy loss is replaced with a multi-hot cross-entropy (MCE) loss, which assigns equal probability mass 1/s to each token in the target group. The MCE loss reduces to a simple mean of standard cross-entropy terms over the s targets — it can be implemented using the existing fused CE kernels already present in any major pre-training library, without writing a new kernel or adding an auxiliary head.
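The equivalence between the multi-hot target and a mean of ordinary cross-entropy terms can be checked numerically. The sketch below uses only the standard library; the function names and the toy logits are illustrative assumptions rather than the paper's code.

```python
import math
from typing import List


def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def cross_entropy(logits: List[float], target: int) -> float:
    """Standard cross-entropy for a single target token id."""
    return -log_softmax(logits)[target]


def mce_as_mean_of_ce(logits: List[float], target_bag: List[int]) -> float:
    """MCE computed as the mean of s ordinary cross-entropy terms."""
    return sum(cross_entropy(logits, t) for t in target_bag) / len(target_bag)


def mce_soft_target(logits: List[float], target_bag: List[int]) -> float:
    """MCE computed against a target that puts 1/s mass on each bag token."""
    s = len(target_bag)
    soft = [0.0] * len(logits)
    for t in target_bag:
        soft[t] += 1.0 / s
    log_probs = log_softmax(logits)
    return -sum(p * lp for p, lp in zip(soft, log_probs))


toy_logits = [2.0, -1.0, 0.5, 0.0, 1.5]  # vocabulary of 5 tokens (toy values)
toy_bag = [0, 2, 4]                       # next bag of s = 3 target tokens
loss_mean = mce_as_mean_of_ce(toy_logits, toy_bag)
loss_soft = mce_soft_target(toy_logits, toy_bag)
```

Both functions return the same value up to floating-point error, which is why an existing fused cross-entropy kernel can be called s times and averaged instead of writing a dedicated multi-hot kernel.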

Phase 2 — Recovery: After the superposition phase, training resumes from the saved checkpoint with standard next-token prediction for the remaining 1 - r steps. The TST code is fully removed at this boundary to avoid any experimental contamination. A transient loss spike occurs at the transition, typically between 1 and 2 nats, which resolves within a few thousand steps. After that, the recovered model crosses below the equal-FLOPs baseline and remains there.
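The phase boundary itself is simple to express. The following sketch assumes 0-indexed steps and rounds r × total_steps to the nearest integer; both conventions and the name `tst_phase` are assumptions made for illustration.

```python
def tst_phase(step: int, total_steps: int, r: float) -> str:
    """Return the phase that a 0-indexed training step falls in."""
    if not 0 <= step < total_steps:
        raise ValueError("step out of range")
    boundary = round(r * total_steps)  # number of superposition steps (assumed rounding)
    return "superposition" if step < boundary else "recovery"


# Arithmetic on the 3B setting quoted in this article: r = 0.3 over 20,000 steps.
boundary_3b = round(0.3 * 20_000)
```

For that setting the first 6,000 steps would run in superposition mode and the remaining 14,000 as standard next-token prediction.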

The model produced at the end of Phase 2 is architecturally identical to one produced by conventional pre-training, with the same next-token prediction inference behavior.

What the Experiments Reveal

TST was validated at four scales: 270M and 600M dense (SmolLM2 shapes adapted to the Llama3 modeling code, with the Llama3-8B tokenizer and untied input/output embeddings, which makes the 270M model equivalent in size to SmolLM2-135M and the 600M to SmolLM2-360M), 3B dense (SmolLM3 shape), and a 10B-A1B mixture-of-experts (MoE) model in the Qwen3 series. For training, the smaller models used the DCLM dataset, while the MoE model was trained on an even split of DCLM and FineWeb-Edu. All experiments used the AdamW optimizer with a Warmup-Stable-Decay learning rate schedule and were executed in TorchTitan with FSDP parallelism. Larger models ran on 64 NVIDIA B200 GPUs and smaller ones on 8 B200 GPUs.

At the 3B parameter scale with a bag size of s = 6 and a step ratio of r = 0.3, TST reaches a final loss of 2.676 after 20,000 steps—virtually tied with the 36,000-step baseline at 2.677—while consuming 247 B200-GPU-hours compared to 443. On downstream benchmarks, the 20k-step TST run scores 62.4 on HellaSwag and 66.3 on ARC-Easy, compared to 62.3 and 65.9 for the 36k baseline.

At the 10B-A1B MoE scale with s = 16 and r ≈ 0.25, the TST run processes 2 trillion data tokens and achieves a final loss of 2.236—lower than the baseline’s 2.252 after 1.05 trillion tokens—while outperforming it across all four reported benchmarks: HellaSwag (71.2 vs. 70.1), ARC-Easy (74.2 vs. 73.8), ARC-Challenge (47.3 vs. 46.3), and MMLU (39.0 vs. 37.4).

The researchers evaluate TST against the baseline from three angles: equal compute (FLOPs), equal loss, and equal data. TST consistently comes out ahead under equal-FLOPs and equal-loss comparisons. Under equal total token consumption, however, the baseline wins because TST’s effective compute budget per data token is lower. This is a key boundary condition that defines where TST is most applicable.

Two Distinct Mechanisms

An ablation study separates the input-side and output-side components. Each one independently surpasses the baseline, and combining them yields additional gains with no signs of interference. The authors take this as evidence that TST comprises two independent mechanisms rather than a single technique.

The output-side mechanism, predicting the next bag of tokens, is conceptually similar to multi-token prediction (MTP). Unlike MTP, which introduces k separate prediction heads and additional parameters, TST retains a single output head and only changes the target. This makes it the most lightweight member of an emerging family of future-signal auxiliary objectives. It also delivers consistent improvements across all tested scales, including small models where MTP has been known to hurt performance.

The input-side mechanism has no direct counterpart in recent pre-training research. The team proposes two possible explanations: it may implicitly regularize the embedding space (since many random s-grams of tokens must remain linearly separable after averaging), or it may function as a kind of pre-pre-training, giving the model exposure to a coarser version of the actual data before full-resolution language modeling begins.

A focused ablation directly tests what happens when representation continuity is disrupted. The team runs a 3B TST experiment where the input embedding layer and output LM head are randomly re-initialized at the start of Phase 2. The result: the final loss climbs to 2.938—worse than both the TST run (2.676) and the standard baseline (2.808). The Phase 1 TST steps contributed nothing to the final model. This confirms that shared representations across both phases are not a side effect of TST’s success—they are the reason it works.

Marktechpost’s Visual Explainer

Token Superposition Training — Practical Guide
arXiv 2605.06546

01 / Overview

What Is Token Superposition Training?

Token Superposition Training (TST) is a two-phase pre-training approach from Nous Research that boosts the number of tokens processed per FLOP without altering the model architecture, optimizer, tokenizer, parallelism strategy, or training dataset.

The core concept: Rather than feeding one token at a time, average the embeddings of s consecutive tokens into a single “s-token,” train on that representation for the first r fraction of total steps, then revert to standard next-token prediction. The finished model is architecturally indistinguishable from one trained conventionally.

Phase 1 (Superposition) — the model processes groups of s tokens and predicts the next group
Phase 2 (Recovery) — standard next-token prediction resumes from the saved checkpoint
Inference — entirely unchanged; no extra heads, no additional parameters
Tested at 270M, 600M, 3B dense and 10B–A1B MoE scales

TST trades data efficiency for compute efficiency: it reads more tokens per FLOP. It is best suited to compute-limited pre-training, not data-limited pre-training.

02 / Phase 1

Phase 1 — The Superposition Phase

During the first r fraction of total training steps, the input sequence of length L is divided into non-overlapping groups of s consecutive tokens. Their embeddings are averaged to form a single latent s-token. The transformer then processes a sequence of length L/s—but each position represents s real tokens, so throughput is s× higher for the same FLOPs.

Equal-FLOPs trick: To ensure each step costs the same compute as the baseline, the data sequence length is scaled up by s×—not the batch size. Every TST step requires the same compute as a standard training step.

On the output side, the loss target shifts from a single next token to the next group of s tokens. The multi-hot cross-entropy (MCE) loss distributes equal probability mass of 1/s across each token in the target group:

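# L_MCE = mean of s standard CE terms
# pred: logits flattened to (bs * sp_seq, vocab); labels: (bs, sp_seq, s)
loss = 0.0
for i in range(superposition_bag_size):
    target = labels[..., i].flatten(0, 1)  # i-th token of every target bag
    loss = loss + torch.nn.functional.cross_entropy(pred, target)
loss = loss / superposition_bag_size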

No custom kernel is required—the existing fused cross-entropy kernel in your pre-training library is reused as-is.

03 / Phase 2

Phase 2 — The Recovery Phase

After r × total_steps steps of superposition training, resume from the checkpoint with the TST code completely removed. Standard next-token prediction runs for the remaining (1 − r) × total_steps steps.

What happens at the transition: A loss spike of 1–2 nats appears at the phase boundary. It subsides within a few thousand steps. After that, the model drops below the equal-FLOPs baseline and remains there.

Remove all TST code—do not retain it as an auxiliary loss during Phase 2
Do not re-initialize the input embedding or LM head at the transition point
Shared representations across both phases are what make TST effective

Re-initializing the embedding or LM head

Re-initializing these layers at the phase transition causes TST to fail entirely. In a 3B ablation study, it raised the final loss from 2.676 to 2.938, worse than the 2.808 baseline. The Phase 1 training steps provided no benefit.

04 / Implementation

PyTorch Implementation

Three modifications to the standard training loop — input folding, averaged embedding lookup, and MCE loss.

# 1. Input folding (inside train loop)
if superposition_bag_size is not None and superposition_bag_size > 1:
    bs, seq = inputs.shape
    inputs = inputs.reshape(
        bs, seq // superposition_bag_size, superposition_bag_size
    )

# 2. Averaged embedding lookup (inside model forward)
if tokens.dim() == 3:
    bs, sp_seq, superposition_bag_size = tokens.shape
    first = self.tok_embeddings(tokens[..., 0])
    h_dtype = first.dtype  # training dtype, restored after averaging
    h = first.float()      # accumulate the sum in float32
    for i in range(1, superposition_bag_size):
        h = h + self.tok_embeddings(tokens[..., i]).float()
    h = (h / superposition_bag_size).to(h_dtype)
else:
    h = self.tok_embeddings(tokens)

Note: Accumulate the sum in float32 to maintain numerical accuracy, then convert back to the training dtype. The embedding layer is the sole change needed in the forward pass.
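The snippets above do not show how the per-bag labels are built. One plausible construction, assuming that the target for each input bag is the immediately following non-overlapping bag (consistent with "predicts the next group of s tokens"), is sketched below in plain Python. The helper name `make_bag_pairs` is hypothetical, and the paper's actual label layout may differ.

```python
from typing import List, Tuple


def make_bag_pairs(tokens: List[int], s: int) -> List[Tuple[List[int], List[int]]]:
    """Pair each input bag with the bag that follows it.

    Bag j holds tokens[j*s:(j+1)*s] and its target is bag j+1. The final bag
    has no successor, so it appears only as a target.
    """
    if s < 1 or len(tokens) % s != 0:
        raise ValueError("sequence length must be a positive multiple of s")
    bags = [tokens[i:i + s] for i in range(0, len(tokens), s)]
    return list(zip(bags[:-1], bags[1:]))


# Nine toy token ids folded with s = 3 give two (input bag, target bag) pairs.
pairs = make_bag_pairs([10, 11, 12, 13, 14, 15, 16, 17, 18], s=3)
```

In a tensor implementation the same pairing would correspond to a labels tensor of shape (bs, seq // s, s), which is the shape the `labels[..., i]` indexing in the MCE loop expects.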

05 / Hyperparameters

Tuning Bag Size `s` and Step Ratio `r`

Two hyperparameters govern TST. Both have clearly defined practical ranges that have been validated across different model scales.

Step Ratio r
0.2 — 0.4
The proportion of total training steps spent in superposition mode. This range proves robust across all tested scales. Below 0.2, the throughput improvement is negligible. Above 0.5, Phase 2 is unable to fully recover.

Bag Size s
3 — 16
A U-shaped optimal range that shifts depending on model size. Begin within the flat region; choosing too large a bag makes the target too noisy to recover from.

Model Size	Recommended s	Recommended r
270M	3 — 8	0.2 — 0.4
600M	6 — 10	0.2 — 0.4
3B	6 (tested)	0.3 (tested)
10B–A1B MoE	16 (tested)	∼0.25 (tested)

For large bag sizes (s ≥ 8): Switch from uniform MCE loss weighting to power-law weighting (1/i per position). This is motivated by the observation that mutual information between token pairs decays as a power law with distance (fitted exponent k ≈ −1.25 on the DCLM dataset).
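A rough sketch of such a weighting follows. It assumes weights proportional to 1/i for target positions i = 1..s, normalized to sum to 1. The normalization, the configurable `exponent` parameter, and the function names are assumptions for illustration and are not confirmed by the paper.

```python
from typing import List


def power_law_weights(s: int, exponent: float = 1.0) -> List[float]:
    """Weights proportional to 1 / i**exponent for positions i = 1..s,
    normalized to sum to 1 (the normalization is an assumption)."""
    if s < 1:
        raise ValueError("s must be at least 1")
    raw = [1.0 / (i ** exponent) for i in range(1, s + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def weighted_mce(ce_terms: List[float], weights: List[float]) -> float:
    """Weighted sum of per-position cross-entropy terms."""
    if len(ce_terms) != len(weights):
        raise ValueError("need one weight per target position")
    return sum(w * ce for w, ce in zip(weights, ce_terms))


weights_8 = power_law_weights(8)  # bag size s = 8, weights proportional to 1/i
```

Compared with the uniform 1/s weighting, this places most of the loss on the nearest target tokens and progressively less on distant ones.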

06 / Negative Results

What Doesn’t Work

The paper documents several alternative approaches that were tested and found to be ineffective. Save yourself the compute.

Positional encodings before averaging — applying RoPE or sinusoidal encodings to tokens before computing the mean consistently degraded performance. Within-bag permutation invariance seems to be a beneficial property, not a limitation.
RoPE rescaling at the phase transition — this sped up early Phase 2 recovery but occasionally increased final loss. Keep RoPE unchanged across the boundary.
s independent prediction heads — replacing the single MCE head with s separate heads, each predicting a different position, offered no consistent improvement while adding parameter cost and implementation complexity.
Binary cross-entropy / hinge loss — both significantly underperformed the MCE formulation and even fell below the baseline.
Keeping the TST head active in Phase 2 — not yet benchmarked but flagged as future work; do not assume it provides any benefit.

Bottom line: The simplest approach works best — average embeddings in, mean CE loss out, make a clean switch at the phase boundary, and add no extra parameters.

07 / Results

Key Results & When to Use TST

At equal wall-clock — same compute, better loss:

Scale	B200-hrs	TST Loss	Baseline Loss
3B dense	247	2.676	2.808
10B–A1B MoE	4,768	2.236	2.252 (@ 12,311 hrs)

At equal final loss — wall-clock saved:

Scale	TST (B200-hrs)	Baseline (B200-hrs)	Speedup
3B dense	247	443	∼1.8×
10B–A1B MoE	4,768	12,311	∼2.5×
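The speedup column is simply the ratio of baseline to TST GPU-hours. A short check of that arithmetic, using the hours from the table (the function name `speedup` is illustrative):

```python
def speedup(baseline_hours: float, tst_hours: float) -> float:
    """Wall-clock speedup as the ratio of baseline to TST GPU-hours."""
    if tst_hours <= 0:
        raise ValueError("tst_hours must be positive")
    return baseline_hours / tst_hours


speedup_3b = speedup(443, 247)         # 3B dense row
speedup_moe = speedup(12_311, 4_768)   # 10B-A1B MoE row
```

The ratios come out at about 1.79 and 2.58, matching the rounded 1.8× and 2.5× figures above.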

Use TST when
✓ You are compute-constrained
✓ You have abundant training data
✓ You want lower loss for the same FLOPs
✓ You need the same model at inference time

Avoid TST when
✕ Data is the limiting factor (TST consumes s× more tokens in Phase 1)
✕ You are comparing at equal token budgets
✕ Under equal-data conditions, the baseline outperforms TST

Paper: arXiv 2605.06546 • nousresearch.com/token-superposition

Key Takeaways

Nous Research’s Token Superposition Training (TST) reduces LLM pre-training wall-clock time by up to 2.5× at equal or lower final loss, with no changes to architecture, tokenizer, or optimizer.
Phase 1 averages contiguous token embeddings into bags and predicts the next bag using multi-hot cross-entropy; Phase 2 switches back to standard next-token prediction from the same checkpoint.
Validated at 270M, 600M, 3B dense, and 10B–A1B MoE scales — TST outperforms the baseline on loss and downstream evaluations (HellaSwag, ARC, MMLU) at every scale tested.
Optimal hyperparameters: bag size s of roughly 3 to 10 for the 270M and 600M models, step ratio r ∈ [0.2, 0.4]. Shared embeddings across both phases are essential: re-initializing them makes TST perform worse than the baseline.
Trade-off: TST consumes more raw data tokens per unit of compute — making it best suited for compute-bound training; the output-only variant is the recommended alternative for data-bound scenarios.
