Why Decade-Old Residual Connections Still Dominate AI—And Why That’s Holding Us Back

1.

Over the last ten years, the field of deep learning has expanded dramatically, driven by both advances in hardware computing power and creative breakthroughs in model architecture. Yet, if you pause to consider it, the core architecture has stayed largely the same in several important respects. While there has been a major transition from convolutional networks to the Transformer architectures that now underpin today’s large language models, the fundamental way these networks pass information from one layer to the next has seen relatively little change.

Recently, a team of researchers at DeepSeek-AI published a paper called “mHC: Manifold-Constrained Hyper-Connections” (Xie et al., 2025)¹, which proposes a redesign of this information-routing mechanism. To grasp the significance of their solution, it helps to examine how signal propagation has developed across recent generations of models, and why existing methods are running into fundamental limitations.

2. The Foundation: Standard Residual Connections

To understand the challenge the authors are addressing, we need to go back to where it all began: the standard Residual Connection (He et al., 2016)². Introduced in 2015 alongside ResNets, the residual connection is arguably one of the most important architectural decisions in deep learning, and it appears in virtually every large model in use today.

(Source: Author)
Visual depiction of the Residual Connection

In mathematical terms, it can be expressed as follows:

x_{l+1} = x_l + F(x_l)

(Source: Author)
x_{l+1}: Final output activation of the layer
x_l: Input activations to the layer
F(·): Transformations applied by the layer

Put simply, a layer’s final output equals the sum of its computed output and the original input it received. The critical element here is the raw x_l term in the residual stream, known as the identity mapping. Its importance lies in providing an unobstructed pathway for gradient signals to travel through the entire network from end to end. This property is largely what keeps gradients from vanishing as depth grows, enabling us to train models with hundreds of layers while each layer still learns and updates its parameters effectively.
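To make the gradient argument concrete, here is a toy scalar sketch (not taken from the paper). It assumes every layer's transformation F has the same small local slope, a hypothetical value chosen purely for illustration, and compares the end-to-end gradient of a plain stack with that of a residual stack.

```python
# Toy scalar model: each layer's transformation F has local slope `slope`.
# Plain stack:    x_{l+1} = F(x_l)        -> per-layer gradient factor = slope
# Residual stack: x_{l+1} = x_l + F(x_l)  -> per-layer gradient factor = 1 + slope

def end_to_end_gradient(per_layer_factor: float, num_layers: int) -> float:
    """Chain rule: multiply the per-layer gradient factors together."""
    grad = 1.0
    for _ in range(num_layers):
        grad *= per_layer_factor
    return grad

slope = 0.01       # hypothetical: F barely changes its input
num_layers = 50    # hypothetical depth

plain = end_to_end_gradient(slope, num_layers)
residual = end_to_end_gradient(1.0 + slope, num_layers)

print(f"plain stack gradient:    {plain:.3e}")
print(f"residual stack gradient: {residual:.3f}")
```

The identity term keeps each per-layer factor near 1, so the gradient survives the trip through the stack. With a large slope the same product could still grow, so this illustrates the mechanism rather than proving stability.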

2.1 The Problem with Standard Residual Connections

However, as models have grown ever larger, the limitations of this simple approach have become increasingly apparent.

In a typical transformer model, we can think of the residual stream as having a fixed width, referred to as dimension C. Every piece of context, memory, and feature representation must be compressed into this single C-dimensional vector as it travels upward through the network. As the model’s layers progressively transform the information into more abstract and expressive forms, the x_l term from the residual stream increasingly becomes an information bottleneck.

Generally, if you want to boost the model’s representational capacity, you need to either enlarge the computational layers or stack on more layers. But doing so also dramatically increases the computational resources required to run the model.

2.2 The Improvement: Hyper-Connections (HC)

To address the limitation described above, researchers at ByteDance proposed an alternative to the conventional residual stream called Hyper-Connections (Zhu et al., 2024)³.

(Source: Author)
A visual diagram of information flow in unconstrained Hyper-Connections.

If standard residual streams are too “narrow,” HC broadens them. Rather than depending on a single stream of width C, the concept is to multiply the residual stream’s width by a chosen factor, say n. The result is a wider vector made up of n parallel streams, yielding a combined width of n×C.

However, since the model’s actual computational layers — such as the Attention and MLP blocks — still require a standard input of only C dimensions, HC incorporates a set of learnable weight matrices to transform the vector between the wide and narrow streams:

A Pre-Mapping Matrix: This reads from the wide stream and compresses it down to size C.
A Post-Mapping Matrix: This takes the layer’s narrow output and expands it back into the wide stream.
A Residual Mapping Matrix: This is placed directly on the residual pathway, and its role is to blend information across the n parallel streams as the signal advances.

At its core, HC increases the network’s capacity and makes the residual stream more expressive. The residual mapping matrix lets the residual stream do more than pass an unaltered signal through: it also enables interactions between the n parallel streams. This lets the model maintain a far richer internal representation across multiple streams without raising the computational cost of the main layers.
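The three mappings can be sketched in a few lines of plain Python. This is a minimal illustration under assumed shapes, not the paper's parameterization: the wide state is an n x C list of lists, the pre- and post-mappings are length-n weight vectors, the residual mapping is an n x n matrix, and `toy_layer` is a hypothetical stand-in for an Attention or MLP block.

```python
# Sketch of one Hyper-Connections layer on an n-stream residual state.
# Assumed shapes: x_wide is n x C, h_pre and h_post have length n,
# h_res is n x n, and layer_fn maps a length-C vector to a length-C vector.

def hc_layer(x_wide, h_pre, h_post, h_res, layer_fn):
    n, c = len(x_wide), len(x_wide[0])
    # Pre-mapping: weighted sum of the n streams -> one narrow C-dim input.
    narrow_in = [sum(h_pre[i] * x_wide[i][k] for i in range(n)) for k in range(c)]
    narrow_out = layer_fn(narrow_in)
    # Residual mapping: mix the n streams with each other.
    mixed = [[sum(h_res[i][j] * x_wide[j][k] for j in range(n)) for k in range(c)]
             for i in range(n)]
    # Post-mapping: write the narrow output back into the wide stream.
    return [[mixed[i][k] + h_post[i] * narrow_out[k] for k in range(c)]
            for i in range(n)]

def toy_layer(v):
    """Hypothetical stand-in for an Attention or MLP block."""
    return [2.0 * a for a in v]

x_wide = [[1.0, 2.0, 3.0], [1.0, 2.0, 3.0]]   # n = 2 streams, C = 3
h_pre = [1.0, 0.0]                             # read only from stream 0
h_post = [1.0, 0.0]                            # write only to stream 0
h_res = [[1.0, 0.0], [0.0, 1.0]]               # identity: no mixing

out = hc_layer(x_wide, h_pre, h_post, h_res, toy_layer)
print(out)
```

With these particular weights, stream 0 behaves exactly like an ordinary residual connection (input plus layer output) and stream 1 is passed through untouched. Learned weights would instead blend the streams.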

2.3 The Flaws in Hyper-Connections

The reality, though, is that while HC appears highly promising in theory, it introduces several critical flaws when you attempt to scale it to the size of today’s LLMs:

Mathematical Instability: The Residual Mapping matrix, despite being expressive, undermines the essential identity mapping property. Since it can learn arbitrary values, it no longer perfectly preserves the original signal. A minor amplification of feature values in one layer compounds exponentially when multiplied across fifty layers. DeepSeek’s team discovered that the signal could be amplified by an astonishing factor of 3,000, leading to wildly unstable gradients and severe spikes in training loss.
The Hardware Bottleneck: Expanding the stream width by a factor of n forces the memory hardware to read and write substantially more data at every step. Since memory access — rather than the actual computation — is frequently the primary bottleneck in modern AI training, this added overhead severely degrades training throughput and inflates GPU memory consumption by a considerable margin.
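The compounding effect described under Mathematical Instability is easy to check with a little arithmetic. The sketch below assumes the same gain at every layer and uses the illustrative depth of fifty layers from the text; real networks have layer-dependent gains, so treat the numbers as intuition only.

```python
# A per-layer gain g compounds to g ** L over L layers.
num_layers = 50  # the illustrative depth used in the text above

for gain in (1.00, 1.05, 1.10, 1.20):
    print(f"per-layer gain {gain:.2f} -> total gain {gain ** num_layers:10.1f}")

# Inverse question: what per-layer gain yields a 3,000x total over 50 layers?
target = 3000.0
required = target ** (1.0 / num_layers)
print(f"per-layer gain needed for {target:.0f}x: {required:.3f}")
```

Under this simplification, an average amplification of well under 20% per layer is already enough to reach a 3,000-fold blow-up.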

So, the researchers at DeepSeek were left with a very difficult problem.

The core challenge is this: how do you maintain the expressive, high-bandwidth data flows of the Hyper-Connections (HC) approach without undermining the network’s mathematical stability, and without overwhelming the GPU’s memory and input/output operations?

Let’s examine their solution.

3. The Solution: Manifold-Constrained Hyper-Connections (mHC)

To address these two major drawbacks of standard HC, the DeepSeek team developed a refined architecture they term Manifold-Constrained Hyper-Connections, or mHC.

Their approach tackles the problem in two phases. First, they resolved the mathematical foundation to prevent signal instability. Second, they performed intensive systems-level engineering to ensure the solution could execute efficiently on contemporary GPUs. Here’s a detailed breakdown of both components.

3.1 Resolving the Math: The Birkhoff Polytope

The key mathematical innovation was to take the unconstrained Residual Mapping matrix and enforce specific behavioral constraints. They achieved this by projecting the matrix onto a particular mathematical domain called the Birkhoff polytope.

Essentially, they forced the matrix to adopt the structure of a doubly stochastic matrix.

For those unfamiliar, a doubly stochastic matrix is one where all entries are non-negative, and the sum of every row and every column is precisely 1.

(Source: Author)
Illustration of a doubly-stochastic matrix

By restricting the residual matrix to this format, the researchers ensured several advantageous mathematical properties:

Norm Preservation (Preventing Instability): Mathematically, the spectral norm of a doubly stochastic matrix is exactly 1, so the mapping is non-expansive: it can never amplify the signal passing through it, which removes the risk of exploding signals. And because every row and column sums to 1, the mean across the n streams is carried through unchanged rather than decaying.
Compositional Closure (Sustained Stability): Multiplying two doubly stochastic matrices together produces another doubly stochastic matrix. This property ensures signal integrity remains intact even when these matrices are stacked across dozens or hundreds of layers.
Optimal Information Mixing: Geometrically, every doubly stochastic matrix is a weighted average (a convex combination) of permutation matrices, a result known as the Birkhoff-von Neumann theorem. It can therefore mix information thoroughly across the n parallel data streams without boosting the signal’s overall magnitude.
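These properties can be checked numerically on a small example. The sketch below builds two 3 x 3 doubly stochastic matrices as weighted averages of permutation matrices, then tests closure under multiplication and non-amplification of an input vector. The weights, permutations, and helper names are arbitrary choices made for illustration.

```python
import math

def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def is_doubly_stochastic(m, tol=1e-9):
    n = len(m)
    nonneg = all(v >= -tol for row in m for v in row)
    rows_ok = all(abs(sum(row) - 1.0) < tol for row in m)
    cols_ok = all(abs(sum(m[i][j] for i in range(n)) - 1.0) < tol for j in range(n))
    return nonneg and rows_ok and cols_ok

def mix_permutations(weights, perms):
    """Convex combination of permutation matrices (weights must sum to 1)."""
    n = len(perms[0])
    m = [[0.0] * n for _ in range(n)]
    for w, perm in zip(weights, perms):
        for i, j in enumerate(perm):   # row i sends weight w to column perm[i]
            m[i][j] += w
    return m

def l2_norm(v):
    return math.sqrt(sum(a * a for a in v))

identity, shift, swap = (0, 1, 2), (1, 2, 0), (1, 0, 2)
a = mix_permutations([0.5, 0.3, 0.2], [identity, shift, swap])
b = mix_permutations([0.1, 0.6, 0.3], [identity, shift, swap])

x = [3.0, -1.0, 2.0]
ax = [sum(a[i][j] * x[j] for j in range(3)) for i in range(3)]

print(is_doubly_stochastic(a), is_doubly_stochastic(b))
print(is_doubly_stochastic(matmul(a, b)))   # closure under multiplication
print(l2_norm(ax) <= l2_norm(x))            # the input is not amplified
```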

To transform a standard matrix into a doubly stochastic one during the training process, the team employed the Sinkhorn-Knopp algorithm (Sinkhorn & Knopp, 1967)⁵. During the forward pass, this algorithm first ensures all matrix values are positive, then iteratively adjusts the rows and columns until they each sum to 1.
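The procedure just described can be sketched in plain Python. This is a minimal illustration, not the paper's fused-kernel implementation: it assumes exponentiation is used to make the entries positive and runs a fixed number of alternating row and column normalizations. The function name `sinkhorn_knopp` and the input matrix are hypothetical.

```python
import math

def sinkhorn_knopp(logits, num_iters=20):
    """Approximately project a square matrix onto the doubly stochastic set."""
    n = len(logits)
    # Step 1: exponentiate so every entry is strictly positive (assumed choice).
    m = [[math.exp(v) for v in row] for row in logits]
    for _ in range(num_iters):
        # Step 2: scale each row so that it sums to 1.
        for row in m:
            total = sum(row)
            for j in range(n):
                row[j] /= total
        # Step 3: scale each column so that it sums to 1.
        for j in range(n):
            total = sum(m[i][j] for i in range(n))
            for i in range(n):
                m[i][j] /= total
    return m

raw = [[0.2, -1.0, 0.5],
       [1.5, 0.0, -0.3],
       [-0.7, 0.8, 0.1]]

ds = sinkhorn_knopp(raw)
row_sums = [sum(row) for row in ds]
col_sums = [sum(ds[i][j] for i in range(3)) for j in range(3)]
print("row sums:", [round(s, 6) for s in row_sums])
print("col sums:", [round(s, 6) for s in col_sums])
```

Because each pass ends with a column normalization, the column sums are exact up to floating-point error, while the row sums only approach 1 as the number of iterations grows.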

3.2 Optimizing for Hardware: Advanced Systems Engineering

While the mathematical solution is sound, executing these wide data streams and iterative Sinkhorn-Knopp calculations poses a significant challenge for GPU memory. To overcome this, the DeepSeek team implemented several aggressive infrastructure enhancements:

Kernel Fusion: Rather than executing mathematical operations one at a time (which demands constant data transfers to and from GPU memory), they used a framework called TileLang (Wang et al., 2025)⁶ to write custom fused GPU kernels. This let them combine matrix multiplications, normalization steps, and Sinkhorn-Knopp iterations into a single operation, significantly reducing memory traffic.
Selective Recomputing: Widening the residual stream would normally require storing vast amounts of intermediate data for the backward pass during training, which would quickly exhaust GPU memory. Their solution was to discard this intermediate data after the forward pass, retaining only the essential inputs, and then recompute the lightweight mHC operations with the fused kernels as needed during the backward pass.
Overlapping Communication: In distributed training environments (using multiple GPUs), wider streams can introduce communication delays. To mitigate this, they adjusted their scheduling system to overlap the communication of wide streams with the intensive computations of attention layers, ensuring that mHC operations do not become a bottleneck during training.

These engineering efforts pay off. Despite the added mathematical complexity and wider data paths, mHC adds only 6.7% to training time compared to a standard baseline model.

4. The Results: Was it Effective?

To validate the effectiveness of their mathematical and engineering work, the DeepSeek team conducted rigorous testing of mHC. They trained multiple language models based on the DeepSeek-V3 architecture (Liu et al., 2024)⁴, scaling up to a 27-billion parameter model. They benchmarked their new mHC framework against both a standard residual baseline and the original, unstable HC paradigm. Here is an analysis of the experimental outcomes.

4.1 Re-establishing Training Stability

The primary goal of mHC was to correct the erratic training behavior seen in standard HC, caused by unconstrained mapping matrices. As illustrated below, the gradient norm of the standard HC model (graph b) begins to fluctuate wildly around the 12,000-step mark, coinciding with the point where the loss trajectories of HC and mHC diverge (graph a). Thanks to the more consistent and stable gradient norms provided by mHC, the model ultimately reaches a lower final training loss than the original HC version.

(Source: Adapted from Xie et al., 2025, Figure 5)
Graph a: Training loss versus training steps for three model variants. It shows that the mHC-enabled model achieved the lowest training loss.
Graph b: Gradient norm versus training steps. It contrasts the erratic gradient norms of standard HC with the smooth, predictable norms achieved by mHC.

4.2 Enhancing Downstream Performance

A stable model is only valuable if it also performs well on downstream tasks, and mHC does. The researchers assessed the 27B version across several downstream benchmarks, such as MATH, MMLU, and reasoning tasks including BBH and DROP. The mHC-powered model delivered consistent improvements over the baseline across all evaluations and outperformed the unconstrained HC on most benchmarks. The reasoning tasks showed especially notable gains, suggesting that the broader residual pathways actively enhance the model’s expressiveness.

(Source: Adapted from Xie et al., 2025, Table 4)
mHC outperforms both the baseline (standard residual connections) and unconstrained HC on every benchmark except MATH.

4.3 Predictable and Robust Scaling

A crucial evaluation for any new deep-learning architecture is whether it follows established scaling laws. Certain design decisions that work well for a 3B parameter model may break down or backfire at 27B parameters. To verify this, the team plotted compute scaling curves for 3B, 9B, and 27B parameter models. The graphs below show that the relative loss improvement remains consistent across all scales, supporting the claim that mHC is a scalable architectural enhancement.

(Source: Adapted from Xie et al., 2025, Figure 6 (a))
Left: Plot of absolute loss difference between mHC and baseline vs training compute (in FLOPs).
Right: Plot of relative loss difference between mHC and baseline vs training compute (in FLOPs).

4.4 Taming the Signal Explosion

As a final validation, the authors directly tested one of their core claims: that signal amplification should not grow uncontrollably when stacked across many layers. With standard unconstrained HC, signals could be amplified by up to 3,000 times, completely disrupting gradient flow during training. To determine whether mHC resolved this issue, DeepSeek monitored signal propagation dynamics layer by layer. The results confirmed expectations: thanks to the doubly-stochastic mapping matrices, signal gain was limited to approximately 1.6 throughout the network, ensuring stability even after accumulation across multiple layers.

(Source: Adapted from Xie et al., 2025, Figure 7)
Graph a: Plot of Signal Gain factor vs Layer (by index). The plot shows a per-layer gain close to 1, that is, almost no amplification at any layer, demonstrating that doubly-stochastic matrices effectively prevent signal explosion.
Graph b: Plot of Signal Gain factor vs Layer (by index, compounded). The plot reveals that when compounded, the signal reaches a gain of about 1.6 at layer 20—still well within a healthy and manageable range for training.
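A toy version of this measurement can be run in plain Python. The sketch below is not the paper's experiment. It assumes compounded gain is measured as the largest absolute row sum of the product of all per-layer residual matrices, uses small random matrices with a deliberately non-negative perturbation so that the unconstrained case amplifies, and reuses a 20-iteration Sinkhorn-Knopp projection for the constrained case. All sizes and names are hypothetical.

```python
import math
import random

def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def sinkhorn_knopp(logits, num_iters=20):
    """Truncated Sinkhorn-Knopp: exponentiate, then alternate row/column scaling."""
    n = len(logits)
    m = [[math.exp(v) for v in row] for row in logits]
    for _ in range(num_iters):
        for row in m:                       # rows -> sum to 1
            total = sum(row)
            for j in range(n):
                row[j] /= total
        for j in range(n):                  # columns -> sum to 1
            total = sum(m[i][j] for i in range(n))
            for i in range(n):
                m[i][j] /= total
    return m

def compounded_gain(layers):
    """Largest absolute row sum of the product of all per-layer matrices."""
    n = len(layers[0])
    composite = [[float(i == j) for j in range(n)] for i in range(n)]
    for layer in layers:
        composite = matmul(layer, composite)
    return max(sum(abs(v) for v in row) for row in composite)

rng = random.Random(0)
n, num_layers = 4, 30                       # hypothetical sizes

# Unconstrained toy matrices: identity plus a small non-negative perturbation.
raw_layers = [[[float(i == j) + rng.uniform(0.0, 0.1) for j in range(n)]
               for i in range(n)] for _ in range(num_layers)]
constrained_layers = [sinkhorn_knopp(layer) for layer in raw_layers]

gain_unconstrained = compounded_gain(raw_layers)
gain_constrained = compounded_gain(constrained_layers)
print(f"unconstrained compounded gain: {gain_unconstrained:.1f}")
print(f"constrained compounded gain:   {gain_constrained:.6f}")
```

In this small, well-conditioned setup the projection converges almost completely, so the constrained gain sits essentially at 1. The sketch does not try to reproduce the roughly 1.6 reported for real training runs.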

5. Counterfactuals: The “Gotchas” and Trade-offs

Before wrapping up, let’s examine some limitations of mHC, since every engineering decision involves trade-offs. While mHC offers a stable alternative to Hyper-Connections, it comes with several important caveats worth noting.

The 6.7% Time Tax: DeepSeek proudly (and justifiably) highlights that their infrastructure optimizations reduced training time overhead to just 6.7% compared to a baseline model. While this sounds impressively low, at the scale of training massive LLMs (hundreds of billions of parameters)—where GPU compute costs reach tens of millions of dollars—a 6.7% increase translates into a very real and substantial financial burden. You’re paying a premium for that added representational power.
Massive Engineering Complexity: You can’t simply open a standard PyTorch script, write a few lines of code, and expect efficient results. To make mHC practical, the DeepSeek team had to develop custom, low-level fused GPU kernels using TileLang, manually manage memory, and rework their pipeline scheduling. This dramatically raises the barrier to entry. For smaller teams or researchers without dedicated infrastructure engineers, implementing mHC efficiently poses a significant challenge.
The Math is an Approximation: In theory, the Sinkhorn-Knopp algorithm transforms the residual mapping matrix into a perfect doubly stochastic matrix. However, achieving perfection technically requires infinite iterations. To maintain speed, the researchers limit it to 20 iterations. Due to this approximation, the matrix isn’t mathematically flawless in practice. If it were perfect, we’d observe exactly 1.0 signal gain—but we don’t. Instead, signal gain gradually increases to around 1.6 across layers. It’s absolutely bounded and safe at this scale, but for even larger models (current LLMs exceed 500B parameters), this approximation may drift further from ideal behavior.
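To put the 6.7% time tax into money terms, here is a back-of-the-envelope calculation. The baseline budgets are hypothetical figures picked from the "tens of millions of dollars" range mentioned above, and the sketch assumes compute cost scales linearly with training time.

```python
overhead = 0.067  # training-time overhead reported for mHC

# Hypothetical baseline compute budgets in US dollars (illustrative only).
for baseline_cost in (10_000_000, 30_000_000, 50_000_000):
    extra = baseline_cost * overhead
    print(f"baseline ${baseline_cost:,} -> extra ${extra:,.0f}")
```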

6. Conclusion: Final Thoughts and Adoptability

Ultimately, the “mHC: Manifold-Constrained Hyper-Connections” paper represents a significant research contribution from DeepSeek. It powerfully illustrates what it takes to push the boundaries of foundational models today: a deep grasp of pure mathematics to identify theoretical flaws, combined with rigorous systems engineering to make the solution run efficiently on real hardware.

Standard residual connections have served the field well for over a decade, but as we enter the trillion-parameter era, we need pathways capable of carrying richer, wider representations without compromising network stability. DeepSeek has shown one effective way to achieve broader, more expressive pathways—innovating on an architectural component long considered fixed.

As for adoption, will mHC be rapidly embraced? Probably not. Its heavy dependence on custom GPU kernels and intricate pipeline scheduling creates a steep barrier that will likely take time to abstract into a user-friendly, plug-and-play module for the broader community. That said, DeepSeek has already demonstrated its effectiveness on models of up to 27B parameters built on the DeepSeek-V3 architecture.

Given the clear gains in reasoning benchmarks and training stability, I fully expect well-funded AI labs to begin integrating and experimenting with mHC in their next-generation architectures. It’s a meaningful leap forward—and proof that there’s still ample room to innovate on the most fundamental building blocks of neural networks.

7. References

Xie, Z., Wei, Y., Cao, H., et al. (2025). mHC: Manifold-Constrained Hyper-Connections. DeepSeek-AI. arXiv preprint arXiv:2512.24880.
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770-778).
Zhu, D., Huang, H., Huang, Z., et al. (2024). Hyper-connections. arXiv preprint arXiv:2409.19606. (The original ByteDance paper proposing unconstrained HC).
Liu, A., Feng, B., Xue, B., et al. (2024). DeepSeek-V3 Technical Report. arXiv preprint arXiv:2412.19437.
Sinkhorn, R., & Knopp, P. (1967). Concerning nonnegative matrices and doubly stochastic matrices. Pacific Journal of Mathematics, 21(2), 343-348. (The foundational mathematics behind the matrix projection).
Wang, L., Cheng, Y., Shi, Y., et al. (2025). TileLang: A composable tiled programming model for AI systems. arXiv preprint arXiv:2504.17577. (The framework used for mHC’s custom GPU kernel fusion).
