2026-01-21 · Updated 2026-07-15

Manifold-Constrained Hyper-Connections (mHC): DeepSeek Residual Scaling Explained

Modern deep learning rests on the residual connection. Hyper-Connections (HC) explore another architectural dimension: widen the residual state into several interacting streams. DeepSeek’s Manifold-Constrained Hyper-Connections (mHC) paper studies how to keep that routing stable at larger training scales.

This post starts with standard residual connections, then introduces Hyper-Connections and the instability they create. That sequence makes the final mHC constraint and its implementation easier to understand.

TL;DR: HC replaces one residual state with multiple streams and learnable read, write, and mixing maps. Those unconstrained maps can amplify or attenuate signals when composed across layers. mHC approximately projects the residual mixing map onto the Birkhoff polytope with 20 Sinkhorn-Knopp iterations. The paper reports stable 3B, 9B, and 27B pretraining experiments and 6.7% additional training time for four streams after custom kernel and scheduling work.

Why residual connections work

Before we get to what mHC fixes, we need to look at what it builds on.

The depth problem

Adding layers can increase capacity, but it also makes optimization and signal propagation harder. Depending on initialization, normalization, and architecture, forward activations or backward gradients can shrink, grow, or become poorly conditioned across depth.

The residual solution

The ResNet paper introduced an elegant fix. Instead of learning a direct mapping, learn the residual, the difference from identity:

$x_{l+1} = x_l + F(x_l)$

Standard Residual Connection

The useful property is the identity shortcut. When the residual function $F(x)$ outputs zero, the layer becomes a pass-through. Two consequences follow:

A direct gradient term: backpropagation includes a path through the identity component.
A simple fallback mapping: the residual branch can stay near zero when a layer does not need to change the state much.

This does not eliminate every optimization problem, but it made substantially deeper networks practical.
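A minimal sketch of the identity shortcut in plain Python. The two branch functions are hypothetical stand-ins for $F$, chosen only to show the pass-through behavior:

```python
def residual_block(x, branch):
    """Return x + branch(x) for a state stored as a list of floats."""
    update = branch(x)
    return [xi + ui for xi, ui in zip(x, update)]


def zero_branch(x):
    # Hypothetical branch that outputs nothing, so the layer is a pass-through.
    return [0.0 for _ in x]


def small_branch(x):
    # Hypothetical branch that nudges the state by 10%.
    return [0.1 * xi for xi in x]


x = [1.0, -2.0, 0.5]

# Fifty stacked layers with a zero branch leave the state untouched.
passthrough = x
for _ in range(50):
    passthrough = residual_block(passthrough, zero_branch)

# A single layer with a small branch makes a small change.
nudged = residual_block(x, small_branch)

print(passthrough)
print(nudged)
```

However many layers are stacked, a zero branch returns the input exactly, which is the fallback mapping described above.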

How layer normalization changes the residual path

Transformers added a new variable: where to put Layer Normalization (LN). The decision looks minor and isn’t.

Post-LN vs Pre-LN Trade-offs

Variant	LN Placement	Advantage	Key Limitation
Post-LN	After the residual addition	Strong depth contribution	Can be harder to optimize at depth
Pre-LN	Inside the residual branch, before the sublayer	More direct residual path	Adjacent-layer representations can become increasingly similar
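Written as update rules, in the commonly used formulation where $F$ is the attention or feed-forward sublayer and $\mathrm{LN}$ is Layer Normalization, the two placements are:

$\text{Post-LN:}\quad x_{l+1} = \mathrm{LN}(x_l + F(x_l))$

$\text{Pre-LN:}\quad x_{l+1} = x_l + F(\mathrm{LN}(x_l))$

In the Pre-LN form, $x_l$ reaches $x_{l+1}$ without passing through a normalization, which is the more direct residual path listed in the table.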

The ResiDual architecture combines Pre-LN and Post-LN residual paths. HC takes a different route by widening the residual state into multiple streams.

Hyper-Connections add parallel residual streams

Hyper-Connections (HC) expand the residual stream’s width rather than adding depth.

Hyper-Connections Architecture

What a stream means

In a standard Transformer, each token has a $d$-dimensional state that passes through the blocks. In Hyper-Connections, a stream is one of $n$ parallel copies of this state.

How do we get them? At the start of the network, the input embedding is replicated $n$ times, where $n$ is the “expansion rate” (typically 4). The standard $d$-dimensional hidden state becomes an $n \times d$ “hyper hidden matrix”.

The copies begin identically, then diverge as learned maps read from, write to, and mix the streams. The paper interprets them as multiple connection patterns across depth; it does not require each stream to acquire a fixed human-readable role.

Core mechanisms

Instead of one residual pathway, HC keeps $n$ parallel streams flowing through the entire network. At each transformer block, three operations run, each controlled by small learnable weights:

Read ($\mathcal{H}^{pre}$): aggregate the $n$ streams into the $d$-dimensional input consumed by the attention or feed-forward block.
Write ($\mathcal{H}^{post}$): map that block’s output back into updates for the $n$ streams.
Mix ($\mathcal{H}^{res}$): apply an $n \times n$ residual map before adding the block update.

These maps can be static parameters plus input-dependent terms. The residual map is the stability-critical part because it is multiplied repeatedly across depth.
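The three operations can be sketched as a single layer update, $x_{l+1} = \mathcal{H}^{res} x_l + (\mathcal{H}^{post})^\top F(\mathcal{H}^{pre} x_l)$. The sketch below assumes static weights only and leaves out the input-dependent terms. The function `hc_layer`, the stand-in block `double`, and all weight values are hypothetical and only illustrate the shapes.

```python
def hc_layer(streams, h_pre, h_post, h_res, block):
    """One hyper-connection step on an n x d state (list of n streams).

    h_pre:  n read weights   (streams -> block input)
    h_post: n write weights  (block output -> stream updates)
    h_res:  n x n residual mixing weights
    """
    n, d = len(streams), len(streams[0])
    # Read: a weighted sum of the n streams gives one d-dimensional input.
    block_in = [sum(h_pre[i] * streams[i][k] for i in range(n)) for k in range(d)]
    # Compute: the ordinary attention or feed-forward function.
    block_out = block(block_in)
    # Mix: each new stream is a weighted combination of the old streams.
    mixed = [[sum(h_res[i][j] * streams[j][k] for j in range(n)) for k in range(d)]
             for i in range(n)]
    # Write: add a scaled copy of the block output to every stream.
    return [[mixed[i][k] + h_post[i] * block_out[k] for k in range(d)]
            for i in range(n)]


def double(v):
    # Hypothetical stand-in for the Transformer block F.
    return [2.0 * a for a in v]


# Replicate one embedding into n = 4 identical streams.
embedding = [1.0, 2.0, 3.0]
n = 4
streams = [list(embedding) for _ in range(n)]

identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
h_pre = [1.0, 0.0, 0.0, 0.0]   # read only stream 0
h_post = [1.0, 0.0, 0.0, 0.0]  # write only to stream 0

out = hc_layer(streams, h_pre, h_post, identity, double)
print(out[0])  # stream 0 behaves like x + F(x)
print(out[1])  # the other streams keep the embedding
```

With an identity mixing map and one-hot read and write weights, stream 0 reduces to the standard residual update. In this sketch, HC's extra flexibility comes from letting those three maps move away from this special case.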

What the HC paper reports

HC Performance

The HC paper reports 1.8× faster convergence for its OLMoE-1B-7B DHC×4 configuration relative to its baseline, plus downstream gains at 500B tokens. That is one evaluated configuration, not a general speed multiplier for four streams.

The scaling problem

The mHC paper reports instability when it scales unconstrained HC to its 27B setup. The next section explains the mechanism it identifies.

Why unconstrained HC can become unstable

The flexibility that powers HC is also its weak point. An unconstrained residual map no longer guarantees the identity path that makes residual networks trainable in the first place.

HC Instability Problem

The composite-map problem

In standard residuals:

$x_{l+1} = x_l + F(x_l)$

When $F(x) \rightarrow 0$, this is identity: $x_{l+1} = x_l$. Signal passes through unchanged.

In Hyper-Connections, the residual path includes a matrix multiplication:

$x_{l+1} = \mathcal{H}^{res}_l x_l + \dots$

Over $L$ layers, ignoring the block updates, the residual path becomes:

$x_L = \mathcal{H}^{res}_{L-1} \mathcal{H}^{res}_{L-2} \cdots \mathcal{H}^{res}_0 \, x_0$

The behavior depends on the composite matrix, not on whether individual entries sit above or below 1. If successive maps have operator gains above one along an aligned direction, signals can grow; gains below one can attenuate them. Negative entries can also introduce cancellation.

The mHC paper measures this with Amax Gain Magnitude: the maximum absolute row sum for forward propagation and column sum for backward propagation in a composite residual map. In its 27B HC experiment, the peak approaches 3,000 and coincides with unstable training behavior.
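To make the metric concrete, here is a toy sketch that composes 2×2 residual maps and reports the maximum absolute row and column sums. It assumes "absolute row sum" means the sum of absolute entries. The matrices `leaky` and `balanced` and the depth of 60 are hypothetical, and the numbers are not meant to reproduce the paper's measurements.

```python
def matmul(a, b):
    """Multiply two square list-of-lists matrices."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def amax_gain(m):
    """Return (max absolute row sum, max absolute column sum) of m."""
    row = max(sum(abs(v) for v in r) for r in m)
    col = max(sum(abs(v) for v in c) for c in zip(*m))
    return row, col


def composite(layers):
    """Compose residual maps so that layers[0] is applied first."""
    n = len(layers[0])
    total = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for h in layers:
        total = matmul(h, total)
    return total


# Hypothetical unconstrained map: near identity, but rows sum to 1.15.
leaky = [[1.0, 0.15], [0.15, 1.0]]
# Doubly stochastic map with the same off-diagonal mixing.
balanced = [[0.85, 0.15], [0.15, 0.85]]

depth = 60
print(amax_gain(composite([leaky] * depth)))
print(amax_gain(composite([balanced] * depth)))
```

Each `leaky` layer looks harmless on its own, but its row sum of 1.15 compounds to a gain in the thousands after 60 layers. The `balanced` composite keeps both gains at one.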

The Root Cause: Loss of Identity

The design goal is therefore narrower than forcing every map to be the identity: allow inter-stream mixing while bounding amplification across compositions.

The mHC constraint

mHC keeps inter-stream routing but constrains each residual mixing matrix to the Birkhoff polytope, the set of doubly stochastic matrices. Its entries are non-negative, and each row and column sums to one. The constraint makes each output stream a convex combination of input streams and bounds the residual map’s spectral norm by one.

The mHC Solution

What double stochasticity guarantees

Each part of the doubly stochastic constraint on $\mathcal{H}^{res}$ contributes a distinct property:

Constraint	Rule	Consequence
Non-negativity	All entries are at least zero	Mixing weights cannot cancel through opposite signs
Row sum = 1	Each row sums to one	With non-negativity, each output stream is a convex combination of input streams; a constant signal across streams remains constant
Column sum = 1	Each column sums to one	The global mean across streams is conserved

This is not literal Euclidean-energy conservation. A doubly stochastic map can smooth differences between streams. What it provides is mean conservation plus non-expansive routing under the stated norm bound.
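A small numeric check of that distinction, using one scalar per stream. The matrix `h` is a hand-picked doubly stochastic example and the input `x` is arbitrary. Both are illustrative assumptions.

```python
import math


def apply(h, x):
    """Mix one scalar per stream: y_i = sum_j h[i][j] * x[j]."""
    return [sum(hij * xj for hij, xj in zip(row, x)) for row in h]


def mean(v):
    return sum(v) / len(v)


def norm(v):
    return math.sqrt(sum(a * a for a in v))


def spread(v):
    return max(v) - min(v)


# Every row and every column of this matrix sums to one.
h = [[0.7, 0.2, 0.1],
     [0.2, 0.5, 0.3],
     [0.1, 0.3, 0.6]]

x = [3.0, 0.0, -1.5]
y = apply(h, x)

print(mean(x), mean(y))      # the cross-stream mean is preserved
print(norm(x), norm(y))      # the Euclidean norm shrinks here
print(spread(x), spread(y))  # the gap between streams narrows
```

The mean survives the mixing step while the norm and the spread both shrink. That is the smoothing behavior described above, not energy conservation.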

This constraint also has useful mathematical consequences:

Spectral norm ≤ 1: the residual routing map cannot amplify the Euclidean norm.
Closed under multiplication: a product of doubly stochastic matrices remains doubly stochastic, so the constraint survives composition across depth.
Convex mixing: by the Birkhoff-von Neumann theorem, the map lies in the convex hull of permutation matrices.
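The first two properties follow from short arguments. Write $\mathbf{1}$ for the all-ones vector (notation introduced here), so that a doubly stochastic $H$ satisfies $H\mathbf{1} = \mathbf{1}$ and $\mathbf{1}^\top H = \mathbf{1}^\top$ with non-negative entries. For two such matrices $A$ and $B$, the product has non-negative entries and:

$(AB)\mathbf{1} = A(B\mathbf{1}) = A\mathbf{1} = \mathbf{1}, \qquad \mathbf{1}^\top (AB) = (\mathbf{1}^\top A)B = \mathbf{1}^\top B = \mathbf{1}^\top$

For the norm bound, a standard inequality relates the spectral norm to the maximum absolute column sum $\|H\|_1$ and the maximum absolute row sum $\|H\|_\infty$:

$\|H\|_2 \le \sqrt{\|H\|_1 \, \|H\|_\infty} = \sqrt{1 \cdot 1} = 1$

Because $H\mathbf{1} = \mathbf{1}$, the bound is attained: the constant direction passes through with gain exactly one.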

Sinkhorn-Knopp projection

The learnable residual logits are unconstrained. mHC first exponentiates them to obtain a positive matrix, then alternates row and column normalization. With enough iterations, this Sinkhorn-Knopp process approaches a doubly stochastic matrix; the paper uses 20 iterations as an approximate, differentiable projection.

Sinkhorn Algorithm Detailed

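For raw logits $A$, the procedure is:

S = exp(A)                    # elementwise, so every entry is positive
repeat 20 times:
    S = S / row_sum(S)        # divide each row by its own sum
    S = S / column_sum(S)     # divide each column by its own sum
return S                      # approximately doubly stochastic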

The operations are differentiable, but they are not free. mHC relies on a fused forward kernel and a custom backward kernel that recomputes the intermediate normalization states on chip.
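For reference, the loop above can be written as a minimal runnable sketch in plain Python. This is a naive version for small matrices. It is not the paper's fused kernel, and it has no numerical safeguards for large logits. The helper `max_sum_error` and the example logits are hypothetical.

```python
import math


def sinkhorn(logits, iters=20):
    """Alternate row and column normalization on exp(logits)."""
    s = [[math.exp(a) for a in row] for row in logits]
    for _ in range(iters):
        # Row step: divide every row by its own sum.
        row_sums = [sum(row) for row in s]
        s = [[v / r for v in row] for row, r in zip(s, row_sums)]
        # Column step: divide every column by its own sum.
        col_sums = [sum(col) for col in zip(*s)]
        s = [[v / c for v, c in zip(row, col_sums)] for row in s]
    return s


def max_sum_error(m):
    """Largest deviation of any row or column sum from one."""
    rows = [abs(sum(r) - 1.0) for r in m]
    cols = [abs(sum(c) - 1.0) for c in zip(*m)]
    return max(rows + cols)


# Hypothetical raw logits for n = 4 streams.
logits = [[2.0, -1.0, 0.5, 0.0],
          [0.0, 1.5, -0.5, 1.0],
          [-1.0, 0.0, 2.5, 0.5],
          [1.0, 0.5, 0.0, -2.0]]

for iters in (1, 5, 20):
    print(iters, max_sum_error(sinkhorn(logits, iters)))
```

Because the loop ends on a column step, column sums are exact up to floating-point error and the remaining deviation sits in the row sums. That is one way a fixed iteration count gives only an approximate projection.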

Parameterization details

Residual map: exponentiation plus Sinkhorn normalization produces the approximately doubly stochastic $\mathcal{H}^{res}$.
Read and write maps: sigmoid parameterization keeps $\mathcal{H}^{pre}$ and $\mathcal{H}^{post}$ non-negative, reducing cancellation from mixed-sign coefficients.

Complete mHC architecture

All together:

mHC Complete Architecture

The flow through each block:

Input: $n$ parallel residual streams enter the layer.
Read ($\mathcal{H}^{pre}$): the $n$ streams combine into the input consumed by the layer function. A sigmoid makes the coefficients non-negative.
Computation: the standard Transformer block (Attention or MLP) processes the single aggregated vector.
Write ($\mathcal{H}^{post}$): the block output is mapped into updates for the $n$ streams, again with non-negative coefficients.
Mix ($\mathcal{H}^{res}$): the approximately doubly stochastic residual map mixes the incoming streams before the update is added.
Output: the updated stream matrix moves to the next layer.

Only the residual mixing map uses the Sinkhorn projection. The read and write maps use non-negative parameterizations. That distinction matters because the paper’s composition guarantee applies to $\mathcal{H}^{res}$ .

Infrastructure required for the reported overhead

Four streams increase residual-state memory access, activation storage, and pipeline communication. The paper’s 6.7% timing result depends on the following co-designed implementation.

Kernel fusion

The implementation fuses operations that share memory access, uses mixed precision where appropriate, and implements most custom kernels with TileLang. The Sinkhorn loop and its custom backward pass run inside dedicated kernels to reduce memory traffic and launch overhead.

Selective recomputation

Storing every intermediate Sinkhorn state for backpropagation would blow up memory. Instead, mHC:

Frees intermediate activations after the forward pass.
Recomputes them on the fly during the backward pass.

An extended DualPipe schedule overlaps parts of communication, recomputation, and layer work at pipeline boundaries. The achieved overlap is specific to this training system.

Reported system result

For the paper’s large-scale setup, expansion rate $n=4$ adds 6.7% training time relative to its baseline. This is a system result, not the overhead of a plain framework implementation.

What the experiments establish

In the 27B comparison, unconstrained HC reaches a peak composite Amax Gain near 3,000. With 20-step approximate Sinkhorn projection, mHC’s composite backward gain deviates from one but remains bounded at roughly 1.6 in the reported analysis.

The authors also train 3B, 9B, and 27B DeepSeek-V3-inspired MoE variants. At 27B, mHC beats the standard residual baseline on all eight reported downstream benchmarks and beats HC on six of eight; HC is slightly higher on GSM8K and MATH. These are in-house pretraining experiments from the proposing team, so independent replication and comparisons on other architectures remain open.

Trade-offs and open questions

mHC is not a drop-in win for every model. Four questions remain:

System overhead: 6.7% is the optimized paper result; another runtime, device topology, or model shape may see a different cost.
Implementation complexity: a reference implementation can express the method, but matching the reported throughput requires custom kernels, recomputation, and schedule changes.
Mixing bias: double stochasticity preserves the cross-stream mean and prevents expansion through $\mathcal{H}^{res}$ , but it can smooth differences between streams. The block update still changes the overall representation.
Evidence scope: the strongest evidence is language-model pretraining on DeepSeek-V3-inspired MoE architectures. Generalization to other model families is not yet established by this paper.

Key takeaways

Residual connections work because of identity mapping: the ability to pass signals through unchanged.
Hyper-Connections widen the residual state instead of adding depth; the HC paper reports faster convergence for its evaluated configuration.
Unconstrained HC can lose the identity-path guarantee, so composite residual maps can amplify or attenuate signals across depth.
mHC constrains residual mixing to the Birkhoff polytope, conserving the cross-stream mean and bounding amplification.
Sinkhorn-Knopp provides an approximate, differentiable projection onto that constraint, enabling end-to-end training.
The reported 6.7% overhead is a systems achievement, not an architecture-only property.

mHC is a promising way to study wider residual topology while keeping the repeated residual map well conditioned. Whether it is worthwhile for another model depends on independent quality gains and the cost of reproducing its systems stack.

References

mHC: Manifold-Constrained Hyper-Connections - Xie et al. (DeepSeek)
Deep Residual Learning for Image Recognition - He et al. (ResNet)
Hyper-Connections - Original HC paper
TileLang - CUDA kernel optimization framework
DualPipe - Pipeline parallelism scheduler for DeepSeek-V3
ResiDual - Dual residual path architecture