mHC: How DeepSeek Scaled Residual Connections Without Breaking Training
Modern deep learning rests on the residual connection. After a decade of stacking layers ever deeper, researchers at DeepSeek asked a different question: what if we scaled the width of the residual stream instead? Their answer, Manifold-Constrained Hyper-Connections (mHC), addresses the training instability that emerges when the residual stream is widened into multiple parallel streams.
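To fix ideas before diving in, here is a minimal sketch of the two concepts the post builds on: a classic residual connection, and a toy "widened residual stream" where `n` parallel streams are mixed by a small matrix. The function names (`block`, `residual_step`, `hyper_step`) and the exact mixing scheme are illustrative assumptions of mine, not the published Hyper-Connections or mHC formulation.

```python
import numpy as np


def block(x, W):
    """Stand-in for a transformer sublayer (attention or MLP): linear map + ReLU."""
    return np.maximum(W @ x, 0.0)


def residual_step(x, W):
    """Classic residual connection: add the sublayer's output to its input,
    so the identity path is always available and gradients flow through it."""
    return x + block(x, W)


def hyper_step(X, W, M):
    """Toy width-scaled variant (illustrative only, not the paper's scheme).

    X: (n, d) array — n parallel residual streams instead of one.
    M: (n, n) mixing matrix over streams; M = identity keeps streams separate.
    The sublayer reads a pooled stream and its output is added back to all.
    """
    mixed = M @ X                        # mix information across streams
    h = block(mixed.mean(axis=0), W)     # sublayer sees one pooled stream
    return mixed + h                     # broadcast-add the update to each stream


# With a zero-weight sublayer and identity mixing, both variants reduce
# to the pure identity path — the property that makes residuals trainable.
x = np.array([1.0, -2.0, 3.0])
assert np.allclose(residual_step(x, np.zeros((3, 3))), x)
X = np.stack([x, 2 * x])
assert np.allclose(hyper_step(X, np.zeros((3, 3)), np.eye(2)), X)
```

The point of the toy `hyper_step` is only to show where instability can enter: once `M` departs from the identity, the residual path is no longer a pure identity map, which is exactly the tension mHC is designed to resolve.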
In this post, I'll walk through the evolution from basic residuals to mHC, explaining why each step was necessary and how DeepSeek's solution actually works at scale.