Skip to content

Paper Review

mHC: How DeepSeek Scaled Residual Connections Without Breaking Training

The success of modern deep learning rests on a deceptively simple idea: the residual connection. Yet after a decade of stacking layers deeper and deeper, researchers at DeepSeek asked a different question—what if we could scale width instead? Their answer, Manifold-Constrained Hyper-Connections (mHC), solves a fundamental instability problem that has blocked this path for years.

In this post, I'll break down the evolution from basic residuals to mHC, explaining why each step was necessary and how exactly DeepSeek's solution works at scale.