mHC: How DeepSeek Scaled Residual Connections Without Breaking Training

The success of modern deep learning rests on a deceptively simple idea: the residual connection. Yet after a decade of stacking layers deeper and deeper, researchers at DeepSeek asked a different question—what if we could scale width instead? Their answer, Manifold-Constrained Hyper-Connections (mHC), solves a fundamental instability problem that has blocked this path for years.

In this post, I'll break down the evolution from basic residuals to mHC, explaining why each step was necessary and how exactly DeepSeek's solution works at scale.

TL;DR: Hyper-Connections expand residual streams into multiple parallel flows for faster convergence, but break the identity mapping property that keeps training stable. mHC restores stability by constraining mixing matrices to the Birkhoff Polytope (doubly stochastic matrices) using the differentiable Sinkhorn-Knopp algorithm—achieving only 6.7% training overhead with 4 parallel streams.


The Foundation: Why Residual Connections Work

Before we dive into what mHC fixes, we first need to understand what it builds on.

The Depth Problem

Stacking more layers should increase a model's capacity to learn complex functions. In practice, very deep networks become harder to train—not because they lack capacity, but because gradient-based optimization fails to find good parameters. Gradients either vanish (shrinking to near-zero) or explode (growing unboundedly) as they propagate through many layers.

The Residual Solution

The ResNet paper introduced an elegant fix: instead of learning a direct mapping, learn the residual—the difference from identity:

[Figure: Standard Residual Connection]

The key insight is the identity shortcut. When the residual function F(x) outputs zero, the layer becomes a perfect pass-through. This provides:

  1. Gradient Highway: Gradients flow directly through the shortcut, avoiding the vanishing gradient problem
  2. Easy Optimization: If identity is optimal, the network just learns F(x) → 0

This single architectural change enabled training networks with hundreds of layers.
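This pass-through behavior is easy to check numerically. A minimal NumPy sketch (the ReLU branch and weight matrix `W` are illustrative stand-ins for any residual function \(F\)):

```python
import numpy as np

def residual_block(x, W):
    # F(x) = ReLU(x @ W) stands in for any learned sub-layer.
    return x + np.maximum(x @ W, 0.0)

x = np.random.randn(4, 8)

# When F's weights are zero, F(x) = 0 and the block is an exact identity.
assert np.allclose(residual_block(x, np.zeros((8, 8))), x)
```

Because the gradient of the shortcut term is exactly 1, backpropagation always has an undiminished path back to earlier layers.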


The Transformer Complication: Layer Normalization Placement

Transformers added a new variable: where to put Layer Normalization (LN). This seemingly minor decision creates a fundamental trade-off.

Post-LN vs Pre-LN Trade-offs

| Variant | LN Placement | Advantage | Key Limitation |
| --- | --- | --- | --- |
| Post-LN | After the residual add | High model capacity | Gradient vanishing: LN in the main path rescales gradients at every layer |
| Pre-LN | Before the residual block | Excellent stability | Representation collapse: features become similar across layers |

The ResiDual architecture attempted to solve this by using dual residual paths—one Pre-LN for stability, one Post-LN for capacity. But it was still limited to a single residual stream. What if we could have multiple parallel streams?
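The trade-off shows up even in a toy setting. A NumPy sketch of the two placements (the zero-output branch `f_zero` is a hypothetical sub-layer that has learned to "do nothing"):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def post_ln_block(x, f):
    # Post-LN: LN sits on the main path, after the residual add.
    return layer_norm(x + f(x))

def pre_ln_block(x, f):
    # Pre-LN: LN only touches the branch input; the shortcut is untouched.
    return x + f(layer_norm(x))

f_zero = lambda x: np.zeros_like(x)   # a branch that outputs zero
x = np.random.randn(2, 8) * 3 + 1

# Pre-LN degenerates to a perfect identity; Post-LN still rescales x,
# which is exactly the per-layer gradient rescaling described above.
assert np.allclose(pre_ln_block(x, f_zero), x)
assert not np.allclose(post_ln_block(x, f_zero), x)
```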


Hyper-Connections: The Width Revolution

Hyper-Connections (HC) took a fundamentally different approach: instead of just adding depth, expand the residual stream width.

[Figure: Hyper-Connections Architecture]

What is a "Stream"?

In standard transformers, the input token embeddings form a single \(d\)-dimensional vector representing the token's features. This single vector sequence is the "residual stream" that passes through every block.

In Hyper-Connections, a stream is simply one of \(n\) parallel instantiations of this state.

How do we get them? At the very beginning of the network, the initial input embedding vector is exactly replicated \(n\) times (where \(n\) is the "expansion rate", typically 4). This transforms the standard \(d\)-dimensional hidden state into an \(n \times d\) "hyper hidden matrix".

As these \(n\) initially identical streams pass through the network's transformer layers, they are dynamically aggregated, routed, and expanded differently by the mechanisms below. This causes them to immediately diverge and capture distinct representation pathways.
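In code, stream creation is a single replication step (sizes here are illustrative):

```python
import numpy as np

d, n = 8, 4                    # hidden size, expansion rate (the paper uses n = 4)
token_embedding = np.random.randn(d)

# Replicate the d-dim embedding n times -> the (n, d) hyper hidden matrix.
hyper_hidden = np.tile(token_embedding, (n, 1))

assert hyper_hidden.shape == (n, d)
# All streams start identical; the learned mappings below make them diverge.
assert np.allclose(hyper_hidden[0], hyper_hidden[n - 1])
```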

Core Mechanisms

Instead of a single residual pathway, HC maintains these \(n\) parallel streams flowing throughout the entire network. At each transformer block, it applies three operations controlled by small learnable weights:

  1. Aggregation (\(H_{pre}\) - Pre-mapping): The \(n\) incoming parallel streams are compressed into a single, unified input vector for the transformer block using a learnable matrix \(H_{pre}\). This acts as an input filter, where each stream is multiplied by a learnable importance weight.
  2. Expansion (\(H_{post}\) - Post-mapping): After this single vector passes through the core transformer block (Attention or MLP), its output is broadcasted into \(n\) separate streams using a learnable matrix \(H_{post}\). This acts as an output gate, with each stream receiving the output scaled by a unique learnable weight.
  3. Mixing (Inter-stream Routing): Finally, these newly expanded streams are merged with the original residual streams. An \(n \times n\) learnable "feature router" matrix (\(\mathbf{H}^{res}\)) controls how information from each stream bleeds into the others, cross-pollinating features before the next layer.

The mixing matrix H acts as a traffic controller, dynamically routing features between streams based on learned patterns. This creates a much richer flow of information than a single residual path.
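The three operations reduce to a few matrix products. A shape-level NumPy sketch (static weights only; HC also learns dynamic, input-dependent weights, omitted here):

```python
import numpy as np

def hc_block(streams, h_pre, h_post, h_res, f):
    """One Hyper-Connections block over an (n, d) hyper hidden matrix.

    h_pre:  (n,)   aggregation weights -> one block input
    h_post: (n,)   expansion weights   -> n update streams
    h_res:  (n, n) mixing matrix routing features between streams
    """
    block_in = h_pre @ streams             # 1. aggregate n streams into one vector
    block_out = f(block_in)                # 2. ordinary Attention/MLP computation
    update = np.outer(h_post, block_out)   #    broadcast the output to n streams
    return h_res @ streams + update        # 3. mix residual streams, add update

n, d = 4, 8
streams = np.random.randn(n, d)
out = hc_block(streams, np.full(n, 1 / n), np.ones(n), np.eye(n), np.tanh)
assert out.shape == (n, d)
```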

The Results

[Figure: HC Performance]

HC achieves ~1.8× faster convergence compared to standard residuals. The parallel streams provide more pathways for gradient flow and allow the network to maintain more diverse representations.

The Catch

But there's a critical issue: HC is unstable at scale.


Why Hyper-Connections Break

The flexibility that makes HC powerful also destroys the property that makes residuals trainable.

[Figure: HC Instability Problem]

The Math of Instability

In standard residuals, we have:

\[x_{l+1} = x_l + F(x_l)\]

When \(F(x) \rightarrow 0\), this becomes identity: \(x_{l+1} = x_l\). The signal passes through unchanged.

In Hyper-Connections, the residual path includes matrix multiplication:

\[x_{l+1} = \mathbf{H}^{res}_l \cdot x_l + \dots\]

Over \(L\) layers, the residual path alone composes into a product of mixing matrices:

\[x_L = \mathbf{H}^{res}_{L-1} \times \mathbf{H}^{res}_{L-2} \times \dots \times \mathbf{H}^{res}_0 \times x_0 + \dots\]

If these matrices deviate even slightly from the identity, the product either:

  • Explodes: per-layer gains above 1.0 compound exponentially
  • Vanishes: per-layer gains below 1.0 decay exponentially

The DeepSeek team measured this with "Amax Gain Magnitude"—a metric tracking the maximum ratio of output to input signal magnitude across all layers. In standard HC, this metric hits a staggering ~3000 in deep networks. At that point, training becomes practically impossible.
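The compounding is easy to reproduce numerically. A sketch with illustrative per-layer gains of 1.1 and 0.9 (the metric below is a crude stand-in for Amax Gain):

```python
import numpy as np

def amax_gain(mixing_matrices, n=4):
    # Compose the residual path across layers and report the max magnitude.
    product = np.eye(n)
    for H in mixing_matrices:
        product = H @ product
    return np.abs(product).max()

L = 60
slightly_above = [np.eye(4) * 1.1 for _ in range(L)]   # entries just above 1.0
slightly_below = [np.eye(4) * 0.9 for _ in range(L)]   # entries just below 1.0

assert amax_gain(slightly_above) > 100    # 1.1**60 ≈ 304: explodes
assert amax_gain(slightly_below) < 0.01   # 0.9**60 ≈ 0.002: vanishes
assert amax_gain([np.eye(4)] * L) == 1.0  # identity composes to identity
```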

The Root Cause: Loss of Identity

The core problem: unconstrained matrices can have arbitrary values—negative numbers, large magnitudes, anything. We need a way to constrain them to "well-behaved" matrices that preserve signal energy like the identity matrix does.


The mHC Solution: Geometric Constraints

The insight behind mHC is that we can have flexible routing and stability—if we constrain the mixing matrices to a specific mathematical structure: the Birkhoff Polytope (the set of all doubly stochastic matrices—matrices where every row and column sums to 1, with all elements non-negative).

[Figure: The mHC Solution]

The Three Constraints

mHC constrains the mixing matrix H^res to be doubly stochastic—a matrix where all entries are non-negative and every row and column sums to exactly 1. This enforces three properties simultaneously:

| Constraint | Rule | Why It Matters |
| --- | --- | --- |
| Positivity | All elements ≥ 0 | Prevents sign oscillation that destabilizes gradients |
| Row Sum = 1 | Each row sums to 1.0 | Normalizes output contribution: no single stream dominates |
| Column Sum = 1 | Each column sums to 1.0 | Normalizes input distribution: all streams contribute fairly |

The critical outcome: Energy In = Energy Out. Signal magnitude is preserved deep into the network, eliminating the exponential explosion problem.

This constraint has powerful mathematical implications:

  1. Spectral norm ≤ 1: The spectral norm (largest singular value) bounds signal amplification—doubly stochastic matrices are mathematically non-expanding
  2. Closed under multiplication: Composing doubly stochastic matrices produces another doubly stochastic matrix
  3. Weighted averaging: The operation becomes a convex combination (weighted average where weights sum to 1) of inputs, preserving total signal magnitude
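All three implications can be verified directly. A NumPy check on a hand-picked doubly stochastic matrix:

```python
import numpy as np

def is_doubly_stochastic(M, tol=1e-9):
    return bool((M >= -tol).all()
                and np.allclose(M.sum(axis=0), 1.0)
                and np.allclose(M.sum(axis=1), 1.0))

A = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.5, 0.3],
              [0.3, 0.2, 0.5]])      # rows and columns each sum to 1

assert is_doubly_stochastic(A)
# Closure: products of doubly stochastic matrices stay doubly stochastic.
assert is_doubly_stochastic(A @ A @ A)
# Non-expansion: the spectral norm (largest singular value) never exceeds 1.
assert np.linalg.norm(A, ord=2) <= 1.0 + 1e-9
```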

The Sinkhorn-Knopp Algorithm

The challenge: how do we force a learnable matrix to be doubly stochastic while keeping it differentiable? The answer is the Sinkhorn-Knopp algorithm—an iterative projection that converges to doubly stochastic form in just a few steps.

[Figure: Sinkhorn Algorithm Detailed]

Here's how it works with a concrete example:

Step 1: Positivity — Apply exp() to the raw weights, ensuring all elements are strictly positive:

Raw Matrix           →    Positive Matrix
[-0.5  2.1  0.8]          [0.61  8.17  2.23]  Σ≈11.0
[ 1.3 -4.0  1.9]    exp   [3.67  0.02  6.69]  Σ≈10.4
[ 0.1  0.6 -0.2]    →     [1.11  1.82  0.82]  Σ≈3.7

Step 2: Row Normalization — Divide each row by its sum:

Positive Matrix      →    Row Normalized
[0.61  8.17  2.23]        [0.06  0.74  0.20]  Σ=1.0
[3.67  0.02  6.69]  /row  [0.35  0.00  0.64]  Σ=1.0
[1.11  1.82  0.82]   →    [0.30  0.49  0.22]  Σ=1.0
                           Σ=0.70 Σ=1.23 Σ=1.07  ← columns not yet = 1

Step 3: Column Normalization — Divide each column by its sum:

Row Normalized       →    Column Normalized
[0.06  0.74  0.20]        [0.08  0.60  0.19]  Σ≈0.87
[0.35  0.00  0.64]  /col  [0.50  0.00  0.60]  Σ≈1.10
[0.30  0.49  0.22]   →    [0.42  0.39  0.21]  Σ≈1.02
                           Σ=1.0 Σ=1.0 Σ=1.0   ← rows have drifted slightly

Step 4: Iterate — Repeat steps 2-3 for t_max iterations (typically 20). Row and column sums converge rapidly toward 1.0, yielding a doubly stochastic matrix.

The entire process is differentiable, allowing gradients to flow through during training. The Sinkhorn-Knopp algorithm is also computationally efficient, adding minimal overhead to the training loop.
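A reference implementation of the steps above is compact (plain NumPy; the paper's fused TileLang kernel differs in implementation, not in the math). The raw matrix reuses the worked example:

```python
import numpy as np

def sinkhorn_knopp(logits, t_max=20):
    """Project raw weights toward the Birkhoff Polytope, differentiably."""
    M = np.exp(logits)                         # Step 1: strict positivity
    for _ in range(t_max):                     # Step 4: iterate steps 2-3
        M = M / M.sum(axis=1, keepdims=True)   # Step 2: row normalization
        M = M / M.sum(axis=0, keepdims=True)   # Step 3: column normalization
    return M

raw = np.array([[-0.5,  2.1,  0.8],
                [ 1.3, -4.0,  1.9],
                [ 0.1,  0.6, -0.2]])
P = sinkhorn_knopp(raw)

assert (P > 0).all()
assert np.allclose(P.sum(axis=0), 1.0, atol=1e-6)   # columns sum to 1
assert np.allclose(P.sum(axis=1), 1.0, atol=1e-6)   # rows sum to 1
```

In a real training loop the same chain of exp and divisions is expressed in the framework's autograd ops, so gradients flow through every normalization.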

Beyond the projection algorithm, proper initialization is critical for training stability.

Initialization Refinements

To ensure training starts stable:

  • Sigmoid over Tanh: Ensures coefficients are non-negative and bounded (0 to 1)
  • Scalar 2 multiplier: Sigmoid outputs ~0.5 at initialization; multiplying by 2 gives initial weight ~1.0, matching identity behavior
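The arithmetic behind these two choices, as a quick check:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# At a raw-parameter initialization of 0, sigmoid gives exactly 0.5;
# the scalar-2 multiplier recovers an identity-like weight of 1.0.
assert 2.0 * sigmoid(0.0) == 1.0

# The coefficient stays non-negative and bounded in (0, 2) for any input,
# unlike tanh, which can flip the sign of a stream's contribution.
assert 0.0 < 2.0 * sigmoid(-10.0) < 2.0 * sigmoid(10.0) < 2.0
assert np.tanh(-10.0) < 0.0
```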

Complete mHC Architecture

Putting it all together:

[Figure: mHC Complete Architecture]

The flow through each block is:

  1. Input: \(n\) parallel residual streams enter the layer.
  2. Aggregation (\(H_{pre}\)): The \(n\) streams are combined into a single vector via a weighted sum using the \(H_{pre}\) matrix. In mHC, these aggregation weights are locally constrained (\(\sigma(\cdot)\)) to be non-negative to prevent unnatural scaling and destructive interference.
  3. Computation: The standard Transformer block (Attention or MLP) processes the single aggregated vector.
  4. Expansion (\(H_{post}\)): The block's single output is broadcasted and scaled out to form \(n\) separate update streams using the \(H_{post}\) matrix, which is also constrained to be non-negative.
  5. Mixing (\(H_{res}\) Routing): The streams share information via an \(n \times n\) mixing matrix \(\mathbf{H}^{res}\). In mHC, this matrix is strictly constrained to the Birkhoff Polytope (doubly stochastic), ensuring signal energy is perfectly conserved.
  6. Output: The updated \(n\) streams proceed to the next layer without exploding or vanishing.

The key difference from standard HC: all mixing and aggregation operations are mathematically constrained (passing through Sinkhorn constraints or similar normalizations), guaranteeing signal stability across hundreds of layers.
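A simplified end-to-end sketch of one such block (NumPy; sigmoid gating and a Sinkhorn projection as described above, with dynamic per-token weights and the scalar-2 initialization trick omitted for brevity):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sinkhorn(logits, t_max=20):
    M = np.exp(logits)
    for _ in range(t_max):
        M = M / M.sum(axis=1, keepdims=True)
        M = M / M.sum(axis=0, keepdims=True)
    return M

def mhc_block(streams, p_pre, p_post, p_res, f):
    """One mHC block; p_* are raw, unconstrained learnable parameters."""
    h_pre = sigmoid(p_pre)           # 2. non-negative aggregation weights
    h_post = sigmoid(p_post)         # 4. non-negative expansion weights
    h_res = sinkhorn(p_res)          # 5. doubly stochastic mixing matrix
    block_out = f(h_pre @ streams)   # 3. Attention/MLP on the aggregated vector
    return h_res @ streams + np.outer(h_post, block_out)

n, d = 4, 8
streams = np.random.randn(n, d)
out = mhc_block(streams, np.zeros(n), np.zeros(n), np.zeros((n, n)), np.tanh)

assert out.shape == (n, d)
# The residual path cannot amplify energy: spectral norm of the mixer is <= 1.
assert np.linalg.norm(sinkhorn(np.zeros((n, n))), ord=2) <= 1.0 + 1e-9
```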


Infrastructure: Making It Practical

Expanding to n=4 streams creates significant overhead. Each stream needs its own memory, and Sinkhorn adds 20 iterations per layer. The DeepSeek team solved this with aggressive optimization:

Kernel Fusion

Using TileLang, they fused Sinkhorn iterations with mixed-precision multiplications into specialized CUDA kernels. This minimizes round-trips to high-bandwidth memory (HBM), which is often the actual bottleneck in modern training.

Selective Recomputation

Storing all intermediate Sinkhorn states for backpropagation would explode memory usage. Instead, mHC:

  • Frees intermediate activations after the forward pass
  • Recomputes them on-the-fly during the backward pass

A modified DualPipe schedule overlaps this recomputation with gradient communication, hiding latency.

Results

With these optimizations, expansion rate n=4 runs with only 6.7% training overhead compared to the baseline—proving complex topological routing is practical at scale.


Empirical Validation

So, do these theoretical guarantees actually map to real-world improvements?

When looking at the raw training dynamics, the difference is significant. Without constraints, deep networks using standard Hyper-Connections see their signal magnitude (Amax Gain) explode to roughly 3,000, leading to massive instability and frequent loss spikes. By enforcing the doubly stochastic constraint, mHC tames this explosion, keeping the Amax Gain at a rock-solid ~1.6 throughout training.

But stability doesn't matter if the model's actual performance degrades. To test representational capacity, the team evaluated an mHC-27B model (built on the DeepSeek-V3 architecture) against both standard residual and unconstrained HC baselines. On rigorous reasoning benchmarks like GSM8K and MATH, mHC consistently comes out on top. This confirms the core hypothesis: the performance gains from parallel stream routing are real, and with Sinkhorn constraints, we can finally train these extremely wide residual pathways at scale.


Trade-offs and Considerations

Of course, mHC isn't a free lunch. Here are the main trade-offs to keep in mind:

  1. Computational overhead: While 6.7% is incredibly low for what it accomplishes, it's still additional overhead compared to standard residuals.
  2. Implementation complexity: You can't just write this in native PyTorch and expect it to be fast. Getting that low overhead requires custom, finely-tuned CUDA kernels.
  3. Strong inductive bias: The doubly stochastic constraint forces strict signal conservation. If you're working on a task that genuinely requires signal amplification deeper in the network, this constraint will actively fight you.

Key Takeaways

  1. Residual connections work because of identity mapping—the ability to pass signals through unchanged
  2. Hyper-Connections scale width instead of depth, enabling faster convergence through multi-stream routing
  3. The flexibility of HC destroys identity mapping, causing signal explosion in deep networks
  4. mHC constrains mixing matrices to the Birkhoff Polytope, mathematically guaranteeing stability
  5. Sinkhorn-Knopp makes the constraint differentiable, enabling end-to-end training
  6. Aggressive infrastructure optimization (kernel fusion, selective recomputation) makes it practical at scale

For practitioners: if you're hitting limits with depth scaling and have access to custom kernel development, mHC offers a principled way to scale model capacity through width while maintaining training stability.

