Training large language models from scratch is so costly that even small efficiency gains can lead to significant savings in both time and expense. Nous Research has introduced Token Superposition Training (TST), a technique that dramatically cuts pre-training wall-clock time without altering the model architecture, optimizer, tokenizer, parallelism strategy, or training data.
At the 10B-A1B mixture-of-experts scale, TST achieves a lower final training loss than a matched-FLOPs baseline while using only 4,768 B200-GPU-hours compared to the baseline’s 12,311 — roughly a 2.5× reduction in total pre-training time.

The Problem TST Addresses
Modern LLM pre-training is overwhelmingly data-hungry. Current training practices routinely push well past compute-optimal estimates, and raw text throughput — how much data a model can absorb per FLOP — has become a critical bottleneck. Subword tokenizers like BPE already boost throughput by compressing sequences, and research indicates that much of the BPE advantage over byte-level models stems simply from shorter sequences, meaning the model processes more text per unit of compute.
TST explores whether this throughput lever can be pushed even further during training, independently of the tokenizer and without permanently modifying the model.
How TST Works: Two Phases
TST modifies the standard pre-training loop in two sequential phases:
Phase 1 — Superposition: For the first r fraction of total training steps (the paper finds r ∈ [0.2, 0.4] to be near-optimal across tested scales), the model does not receive individual tokens. Instead, the input sequence of length L is divided into non-overlapping groups of s consecutive tokens. In the embedding layer, each group is collapsed into a single latent “s-token” by averaging the s token embeddings. The transformer then processes a sequence of length L/s.
Crucially, each TST step is kept equal-FLOPs to a standard training step by increasing the data sequence length by a factor of s during the superposition phase. Because each latent position corresponds to s source tokens, the model ingests s times as much text per unit of compute — this is what drives the throughput gain.
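The input-side step can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' code: the function name `superpose_embeddings`, the toy vocabulary size, and the array shapes are all assumptions made for the example. It shows a raw sequence of 12 tokens collapsing to 4 latent positions when s = 3.

```python
import numpy as np

def superpose_embeddings(token_ids, embedding_table, s):
    """Average each non-overlapping group of s token embeddings into one latent vector.

    token_ids: int array of shape (batch, L), with L divisible by s.
    embedding_table: float array of shape (vocab_size, d_model).
    Returns an array of shape (batch, L // s, d_model).
    """
    batch, length = token_ids.shape
    if length % s != 0:
        raise ValueError("sequence length must be divisible by the bag size s")
    embedded = embedding_table[token_ids]               # (batch, L, d_model)
    bags = embedded.reshape(batch, length // s, s, -1)  # (batch, L/s, s, d_model)
    return bags.mean(axis=2)                            # (batch, L/s, d_model)

# Toy setup (hypothetical sizes): 100-token vocabulary, d_model = 8.
rng = np.random.default_rng(0)
table = rng.normal(size=(100, 8))
ids = rng.integers(0, 100, size=(2, 12))  # batch of 2 raw sequences, 12 tokens each
latents = superpose_embeddings(ids, table, s=3)
print(latents.shape)  # (2, 4, 8)
```

Because the transformer only ever sees the shorter latent sequence, the raw data sequence can be made s times longer without changing the per-step cost of the transformer stack.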
On the output side, each latent position predicts the next group of s tokens rather than a single next token. The standard cross-entropy loss is replaced with a multi-hot cross-entropy (MCE) loss, which assigns equal probability mass 1/s to each token in the target group. The MCE loss reduces to a simple mean of standard cross-entropy terms over the s targets — it can be implemented using the existing fused CE kernels already present in any major pre-training library, without writing a new kernel or adding an auxiliary head.
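The equivalence between the multi-hot loss and a mean of ordinary cross-entropy terms can be checked numerically. The sketch below is written under assumptions: the function names are hypothetical, plain NumPy stands in for a fused CE kernel, and a token that appears twice in a bag is simply counted once per occurrence, which the article does not specify.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def cross_entropy(logits, targets):
    """Standard CE: logits (positions, vocab), targets int (positions,)."""
    log_probs = log_softmax(logits)
    return -log_probs[np.arange(len(targets)), targets].mean()

def multi_hot_cross_entropy(logits, target_bags):
    """MCE with mass 1/s on each of the s targets; target_bags is int (positions, s)."""
    log_probs = log_softmax(logits)
    picked = np.take_along_axis(log_probs, target_bags, axis=-1)  # (positions, s)
    return -picked.mean(axis=-1).mean()

# Toy check (hypothetical sizes): 5 latent positions, vocabulary of 50, bag size 4.
rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 50))
bags = rng.integers(0, 50, size=(5, 4))

mce = multi_hot_cross_entropy(logits, bags)
# Same value obtained by calling the ordinary CE once per slot in the bag.
mean_ce = np.mean([cross_entropy(logits, bags[:, k]) for k in range(bags.shape[1])])
print(np.isclose(mce, mean_ce))
```

The second computation is the form that maps onto an existing CE kernel: s calls (or one batched call) with the same logits and different targets, averaged.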
Phase 2 — Recovery: After the superposition phase, training resumes from the saved checkpoint with standard next-token prediction for the remaining 1 - r steps. The TST code is fully removed at this boundary to avoid any experimental contamination. A transient loss spike occurs at the transition, typically between 1 and 2 nats, which resolves within a few thousand steps. After that, the recovered model crosses below the equal-FLOPs baseline and remains there.
The model produced at the end of Phase 2 is architecturally identical to one produced by conventional pre-training, with the same next-token prediction inference behavior.
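As a rough illustration of the two-phase schedule, the sketch below decides which phase a given step belongs to and how many raw tokens each sequence carries. The function names are hypothetical, and the context length of 4096 is an assumption chosen for the example; the step count, r, and s are the 3B settings reported later in the article.

```python
def tst_phase(step, total_steps, r):
    """Return 'superposition' for the first r fraction of steps, else 'recovery'."""
    switch_step = int(round(r * total_steps))
    return "superposition" if step < switch_step else "recovery"

def raw_tokens_per_sequence(step, total_steps, r, s, context_length):
    """Raw text tokens consumed per sequence at a given step.

    During superposition the data sequence is s times longer, so the
    transformer still processes context_length positions after bagging.
    """
    if tst_phase(step, total_steps, r) == "superposition":
        return s * context_length
    return context_length

TOTAL_STEPS, R, S = 20_000, 0.3, 6  # 3B run settings quoted in the article
CONTEXT_LENGTH = 4096               # hypothetical; not stated in the article

print(tst_phase(5_999, TOTAL_STEPS, R))  # superposition
print(tst_phase(6_000, TOTAL_STEPS, R))  # recovery
print(raw_tokens_per_sequence(0, TOTAL_STEPS, R, S, CONTEXT_LENGTH))       # 24576
print(raw_tokens_per_sequence(6_000, TOTAL_STEPS, R, S, CONTEXT_LENGTH))   # 4096
```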
What the Experiments Reveal
TST was validated at four scales: 270M and 600M dense (SmolLM2 shapes adapted to the Llama3 modeling code, with the Llama3-8B tokenizer and untied input/output embeddings, which makes the 270M model equivalent in size to SmolLM2-135M and the 600M to SmolLM2-360M), 3B dense (SmolLM3 shape), and a 10B-A1B mixture-of-experts (MoE) model in the Qwen3 series. For training, the smaller models used the DCLM dataset, while the MoE model was trained on an even split of DCLM and FineWeb-Edu. All experiments used the AdamW optimizer with a Warmup-Stable-Decay learning rate schedule and were executed in TorchTitan with FSDP parallelism; larger models ran on 64 NVIDIA B200 GPUs and smaller ones on 8 B200 GPUs.
At the 3B parameter scale with a bag size of s = 6 and a step ratio of r = 0.3, TST reaches a final loss of 2.676 after 20,000 steps—virtually tied with the 36,000-step baseline at 2.677—while consuming 247 B200-GPU-hours compared to 443. On downstream benchmarks, the 20k-step TST run scores 62.4 on HellaSwag and 66.3 on ARC-Easy, compared to 62.3 and 65.9 for the 36k baseline.
At the 10B-A1B MoE scale with s = 16 and r ≈ 0.25, the TST run processes 2 trillion data tokens and achieves a final loss of 2.236—lower than the baseline’s 2.252 after 1.05 trillion tokens—while outperforming it across all four reported benchmarks: HellaSwag (71.2 vs. 70.1), ARC-Easy (74.2 vs. 73.8), ARC-Challenge (47.3 vs. 46.3), and MMLU (39.0 vs. 37.4).
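The headline ratios follow directly from the figures quoted in the article. The short calculation below only restates that arithmetic; the helper name `speedup` is made up for the example.

```python
def speedup(baseline, tst):
    """Ratio of baseline cost to TST cost (higher means TST is cheaper)."""
    return baseline / tst

gpu_hours_3b = speedup(443, 247)        # 3B dense: B200-GPU-hours
steps_3b = speedup(36_000, 20_000)      # 3B dense: training steps
gpu_hours_moe = speedup(12_311, 4_768)  # 10B-A1B MoE: B200-GPU-hours

print(f"{gpu_hours_3b:.2f} {steps_3b:.2f} {gpu_hours_moe:.2f}")  # 1.79 1.80 2.58
```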
The researchers evaluate TST against the baseline from three angles: equal compute (FLOPs), equal loss, and equal data. TST consistently comes out ahead under equal-FLOPs and equal-loss comparisons. Under equal total token consumption, however, the baseline wins because TST’s effective compute budget per data token is lower. This is a key boundary condition that defines where TST is most applicable.
Two Distinct Mechanisms
An ablation study separates the input-side and output-side components. Each one independently surpasses the baseline, and combining them yields additional gains with no signs of interference. The authors take this as evidence that TST comprises two independent mechanisms rather than a single technique.
The output-side mechanism, predicting the next bag of tokens, is conceptually similar to multi-token prediction (MTP). Unlike MTP, which introduces k separate prediction heads and additional parameters, TST retains a single output head and only changes the target. This makes it the most lightweight member of an emerging family of future-signal auxiliary objectives. It also delivers consistent improvements across all tested scales, including small models where MTP has been known to hurt performance.
The input-side mechanism has no direct counterpart in recent pre-training research. The team proposes two possible explanations: it may implicitly regularize the embedding space (since many random s-grams of tokens must remain linearly separable after averaging), or it may function as a kind of pre-pre-training, giving the model exposure to a coarser version of the actual data before full-resolution language modeling begins.
A focused ablation directly tests what happens when representation continuity is disrupted. The team runs a 3B TST experiment where the input embedding layer and output LM head are randomly re-initialized at the start of Phase 2. The result: the final loss climbs to 2.938—worse than both the TST run (2.676) and the standard baseline (2.808). The Phase 1 TST steps contributed nothing to the final model. This confirms that shared representations across both phases are not a side effect of TST’s success—they are the reason it works.
Key Takeaways
- Nous Research’s Token Superposition Training (TST) cuts pre-training GPU-hours by roughly 2.5× at the 10B-A1B scale while reaching a lower final loss, with no changes to architecture, tokenizer, or optimizer.
- Phase 1 averages contiguous token embeddings into bags and predicts the next bag using multi-hot cross-entropy; Phase 2 switches back to standard next-token prediction from the same checkpoint.
- Validated at 270M, 600M, and 3B dense scales and at the 10B-A1B MoE scale: TST outperforms the baseline on loss and downstream evaluations (HellaSwag, ARC, MMLU) at every scale tested.
- Near-optimal hyperparameters: bag size s ∈ [3, 8] for smaller models, step ratio r ∈ [0.2, 0.4]. Carrying the Phase 1 input embeddings and LM head into Phase 2 is essential; re-initializing them makes TST perform worse than the baseline.
- Trade-off: TST consumes more raw data tokens per unit of compute — making it best suited for compute-bound training; the output-only variant is the recommended alternative for data-bound scenarios.



