Signal amplification of 3000x destroyed a 27-billion-parameter model during training.1 DeepSeek's researchers watched unconstrained Hyper-Connections cause catastrophic divergence, with gradients spiraling beyond any hope of recovery. The solution they developed may reshape how the industry builds foundation models.
TL;DR
DeepSeek published a technical paper on December 31, 2025, introducing Manifold-Constrained Hyper-Connections (mHC), a framework that projects neural network connection matrices onto a mathematical manifold using the Sinkhorn-Knopp algorithm.2 The approach solves training instability that plagued previous Hyper-Connection architectures, controlling signal amplification to 1.6x compared to 3000x with unconstrained methods.3 Testing on 3B, 9B, and 27B parameter models demonstrated a 2.1% improvement on BIG-Bench Hard reasoning benchmarks with only 6.7% additional training overhead.4 CEO Liang Wenfeng co-authored the paper, signaling mHC will likely appear in DeepSeek's next flagship model.
The Residual Connection Problem
Every major language model today relies on residual connections, a technique introduced in 2015 with ResNet that fundamentally changed deep learning.5 The concept appears simple: allow information to bypass layers by adding input directly to output, creating "skip connections" that let gradients flow more easily during training.6
Kaiming He's original ResNet paper demonstrated that residual connections solved the "degradation problem" that had plagued deep networks.7 Without skip connections, adding more layers to a network paradoxically increased training error: in the ResNet experiments, a plain 56-layer network produced higher training and test error than a plain 20-layer network, despite having more capacity.8
Residual connections enabled the training of networks with hundreds of layers. The technique proved so fundamental that all transformer architectures incorporate residual connections.9 GPT, BERT, Claude, and every other large language model depends on skip connections to function.10
The Limitation
The standard residual connection adds input directly to output with a fixed weight of 1.0. This constraint ensures stable training but limits expressiveness. The network cannot learn that some layers should contribute more than others or that connections between non-adjacent layers might improve performance.11
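To make the fixed-weight constraint concrete, here is a minimal PyTorch sketch of a standard residual block (an illustrative example, not any specific production implementation): the skip path simply adds the input back with an implicit weight of 1.0.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Standard transformer-style residual block: the skip path is a fixed identity."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        # The input is added back with an implicit, non-learnable weight of 1.0.
        return x + self.ff(self.norm(x))
```

Nothing in this block lets the network learn how strongly the skip path should contribute, which is the limitation the table below summarizes.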
| Architecture | Year | Residual Type | Connection Weight |
|---|---|---|---|
| ResNet | 2015 | Fixed skip | 1.0 (constant)12 |
| Highway Network | 2015 | Gated skip | Learned gate (0-1)13 |
| DenseNet | 2016 | All-to-all | Equal contribution14 |
| Transformer | 2017 | Fixed skip | 1.0 (constant)15 |
| Hyper-Connections | 2024 | Variable width | Learned matrices16 |
Researchers attempted various modifications. Highway networks added learnable gates to control information flow.17 DenseNet connected every layer to every subsequent layer.18 These approaches improved performance but introduced computational overhead or training challenges at scale.19
Hyper-Connections: The Failed Revolution
Hyper-Connections (HC), introduced in 2024, represented an ambitious attempt to make residual connections fully learnable.20 Instead of fixed skip connections with weight 1.0, HC allowed neural networks to learn arbitrary connection strengths between layers through weight matrices.21
The theory held promise. If networks could learn optimal connection patterns, they might discover architectures that humans would never design manually.22 Early experiments showed substantial performance gains on smaller models.23
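In simplified terms, HC can be pictured as maintaining several parallel residual streams and mixing them with an unconstrained learned matrix at every layer. The sketch below illustrates that general idea only; it is not the exact formulation in the Hyper-Connections paper.

```python
import torch
import torch.nn as nn

class NaiveHyperConnection(nn.Module):
    """Simplified illustration: parallel residual streams mixed by an
    unconstrained learned matrix before each layer (not the exact HC design)."""
    def __init__(self, dim, n_streams=4):
        super().__init__()
        self.mix = nn.Parameter(torch.eye(n_streams) + 0.01 * torch.randn(n_streams, n_streams))
        self.block = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim))

    def forward(self, streams):  # streams: (n_streams, batch, dim)
        mixed = torch.einsum('ij,jbd->ibd', self.mix, streams)
        # The layer output is added back to every stream.
        return mixed + self.block(mixed.mean(dim=0))
```

Because nothing constrains `self.mix`, applying it repeatedly across hundreds of layers can amplify the streams without bound, which is exactly the failure mode described next.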
The problem emerged at scale.
Catastrophic Instability
When DeepSeek researchers attempted to train a 27-billion-parameter model with unconstrained Hyper-Connections, signal amplification exceeded 3000x.24 The network's internal representations exploded in magnitude, causing gradients to become infinite and training to collapse entirely.25
The mathematical explanation centers on eigenvalues. When arbitrary matrices are multiplied together across hundreds of layers, any eigenvalue with magnitude greater than 1.0 drives exponential growth.26 In a 27B-parameter model with unconstrained connection matrices, the probability that every eigenvalue stays at or below 1.0 in magnitude approaches zero.27
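A quick back-of-the-envelope calculation (illustrative numbers, not figures from the paper) shows how little per-layer amplification it takes for the compounding to become catastrophic:

```python
# Even a 3% per-layer amplification compounds explosively with depth.
gain_per_layer = 1.03
effective_depth = 270          # layers times connection applications, illustrative
print(gain_per_layer ** effective_depth)   # ~2.9e3, i.e. roughly 3000x
```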
| Model Size | HC Signal Gain | Training Outcome |
|---|---|---|
| 3B parameters | ~50x | Completed with degraded performance28 |
| 9B parameters | ~300x | Completed with significant instability29 |
| 27B parameters | ~3000x | Catastrophic divergence30 |
The identity mapping property that made residual connections work had been destroyed.31 Standard residual connections preserve signal magnitude by adding input to output. Hyper-Connections' arbitrary matrices broke this guarantee, and larger models amplified the problem exponentially.32
The mHC Solution
DeepSeek's Manifold-Constrained Hyper-Connections framework addresses instability by constraining connection matrices to a specific mathematical structure.33 Instead of allowing arbitrary learned matrices, mHC projects connections onto the Birkhoff Polytope, the space of doubly stochastic matrices.34
A doubly stochastic matrix has rows and columns that each sum to 1.0.35 This constraint guarantees that signal magnitude cannot grow or shrink as information passes through the network.36 Each output stream becomes a convex combination of the input streams, and each input stream's total contribution is preserved. The identity mapping property returns, but with learned flexibility about how information routes between layers.37
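A small numerical check makes the preservation property concrete (illustrative values, not from the paper): multiplying a vector of stream activations by a doubly stochastic matrix redistributes the signal but leaves its total unchanged.

```python
import numpy as np

P = np.array([[0.7, 0.3],
              [0.3, 0.7]])      # doubly stochastic: every row and column sums to 1
x = np.array([2.0, 5.0])        # toy "signal" carried by two residual streams
print(x.sum(), (P @ x).sum())   # 7.0 and 7.0: the total is preserved, only redistributed
```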
The Sinkhorn-Knopp Algorithm
Converting arbitrary matrices to doubly stochastic form requires the Sinkhorn-Knopp algorithm, an iterative procedure developed in 1967 for matrix normalization.38 The algorithm alternates between normalizing rows and normalizing columns until convergence.39
- Input: a non-negative matrix A
- Repeat until convergence:
  1. Normalize each row to sum to 1
  2. Normalize each column to sum to 1
- Output: a doubly stochastic matrix
DeepSeek's implementation uses 20 iterations of Sinkhorn-Knopp normalization, which experimental results showed provides sufficient accuracy without excessive computation.40 The algorithm integrates into the training loop, projecting learned connection weights onto the Birkhoff Polytope at each step.41
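DeepSeek's production kernels are not public, but the numerical core of the projection can be sketched in a few lines of NumPy, here with the same 20-iteration default. This is an illustration of the algorithm, not the paper's code.

```python
import numpy as np

def sinkhorn_knopp(A, n_iters=20, eps=1e-9):
    """Project a non-negative matrix onto (approximately) doubly stochastic form
    by alternately normalizing rows and columns."""
    M = A.astype(np.float64).copy()
    for _ in range(n_iters):
        M /= M.sum(axis=1, keepdims=True) + eps   # rows sum to 1
        M /= M.sum(axis=0, keepdims=True) + eps   # columns sum to 1
    return M

M = sinkhorn_knopp(np.random.rand(4, 4))
print(M.sum(axis=1), M.sum(axis=0))   # both approximately [1, 1, 1, 1]
```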
Infrastructure Optimization
Raw Sinkhorn-Knopp normalization would add unacceptable overhead to training. DeepSeek engineers developed several optimizations to make mHC practical at scale.42
Kernel Fusion: Multiple normalization operations merge into single GPU kernel calls, eliminating memory transfer overhead between operations.43
Mixed Precision: TileLang-based kernels enable efficient FP8 computation for matrix operations while maintaining FP32 precision for numerically sensitive normalization steps.44
Selective Recomputation: Rather than storing all intermediate values, the system recomputes certain tensors during the backward pass, trading compute for memory.45
DualPipe Communication Overlap: Multi-GPU training overlaps Sinkhorn-Knopp computation with inter-device communication, hiding normalization latency.46
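As a generic illustration of the selective-recomputation trade-off, the snippet below uses PyTorch's built-in activation checkpointing (not DeepSeek's custom kernels) to drop a block's intermediate activations and recompute them during the backward pass:

```python
import torch
from torch.utils.checkpoint import checkpoint

def block(x, weight):
    return torch.relu(x @ weight)

x = torch.randn(8, 512, requires_grad=True)
w = torch.randn(512, 512, requires_grad=True)

# Intermediate activations inside `block` are not stored; they are recomputed
# during backward, trading extra compute for lower peak memory.
y = checkpoint(block, x, w, use_reentrant=False)
y.sum().backward()
```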
| Optimization | Overhead Reduction |
|---|---|
| Kernel fusion | ~40% latency reduction47 |
| Mixed precision | ~30% memory reduction48 |
| Selective recompute | ~25% memory reduction49 |
| Communication overlap | ~50% hidden latency50 |
The combined optimizations reduce mHC's training overhead to 6.7% above baseline, making the technique viable for production-scale training.51
Experimental Results
DeepSeek tested mHC against baseline architectures and unconstrained Hyper-Connections across three model scales: 3B, 9B, and 27B parameters.52 All models used the DeepSeek-V3 architecture as their foundation, incorporating Multi-Head Latent Attention (MLA) and Mixture-of-Experts (MoE) components.53
Training Stability
The most dramatic improvement appeared in training stability metrics. Signal gain measurements track how much internal representations grow as information passes through the network.54
| Model | Baseline | HC | mHC |
|---|---|---|---|
| 3B signal gain | 1.2x | 48x | 1.5x55 |
| 9B signal gain | 1.3x | 287x | 1.6x56 |
| 27B signal gain | 1.4x | 3012x | 1.6x57 |
mHC-trained models maintained signal gains near the theoretical ideal of 1.0x regardless of model size.58 Unconstrained Hyper-Connections showed exponentially increasing instability with scale, while mHC demonstrated consistent behavior from 3B to 27B parameters.59
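The paper's exact measurement protocol is not detailed in the public summaries; one simple proxy, sketched below under that assumption, is to track the ratio of hidden-state norm to input norm after each block.

```python
import torch

@torch.no_grad()
def signal_gain_profile(blocks, x):
    """Rough proxy for signal gain: hidden-state norm relative to the input norm
    after each block (assumes `blocks` is an iterable of modules)."""
    reference = x.norm()
    gains, h = [], x
    for block in blocks:
        h = block(h)
        gains.append((h.norm() / reference).item())
    return gains
```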
Benchmark Performance
Performance improvements appeared across reasoning-focused benchmarks where architectural advances typically show the largest gains.60
| Benchmark | Baseline | mHC | Improvement |
|---|---|---|---|
| BIG-Bench Hard (27B) | 43.8% | 51.0% | +7.2 points61 |
| DROP | 78.2% | 81.4% | +3.2 points62 |
| GSM8K | 82.1% | 84.9% | +2.8 points63 |
| MMLU | 79.4% | 80.8% | +1.4 points64 |
The largest improvements appeared on BIG-Bench Hard, a benchmark specifically designed to test complex, multi-step reasoning.65 DROP, which requires numerical reasoning over long passages, showed the second-largest gain.66 GSM8K mathematical reasoning and MMLU general knowledge benchmarks demonstrated smaller but consistent improvements.67
Training Efficiency
Despite the additional Sinkhorn-Knopp computations, mHC added only 6.7% overhead to total training time.68 The overhead remained constant across model scales, suggesting the technique scales efficiently to even larger models.69
| Model Size | Training Time (Baseline) | Training Time (mHC) | Overhead |
|---|---|---|---|
| 3B | 100 hours | 106.5 hours | 6.5%70 |
| 9B | 280 hours | 298.8 hours | 6.7%71 |
| 27B | 840 hours | 896.3 hours | 6.7%72 |
Loss curves showed mHC achieving lower final loss than both the baseline and HC approaches.73 The mHC 27B model reached a final loss 0.021 lower than baseline, consistent with the benchmark improvements observed.74
Implications for Foundation Model Development
DeepSeek CEO Liang Wenfeng co-authored the mHC paper, a signal that the technique will likely appear in the company's next flagship model.75 Analysts expect either DeepSeek R2 or V4 to incorporate mHC architecture, potentially launching during Chinese New Year in February 2026.76
The broader implications extend beyond DeepSeek. mHC addresses a fundamental constraint that has limited architectural innovation in large language models. For the past decade, researchers have largely avoided modifying residual connections because any change that broke identity mapping caused training instability at scale.77
Unlocking Architectural Innovation
mHC demonstrates that learnable connection patterns can work at scale when properly constrained.78 The Birkhoff Polytope projection maintains the mathematical properties that make training stable while allowing networks to discover optimal information routing patterns.79
Future research directions opened by mHC include:
Layer-specific connection strengths: Models could learn that early layers benefit from stronger skip connections while deeper layers need different routing patterns.80
Dynamic connections: Connection patterns could vary based on input content, routing different types of information through different paths.81
Cross-attention modifications: The mHC framework could extend to attention mechanisms, potentially improving how models combine information across sequence positions.82
Training Cost Implications
DeepSeek has established a track record of training competitive models at dramatically lower costs than Western competitors.83 DeepSeek-V3 trained for $6 million compared to an estimated $100 million for GPT-4.84 DeepSeek-R1 reportedly cost just $294,000 to train.85
mHC continues this efficiency focus. The 6.7% overhead adds marginal cost while potentially enabling larger models that would otherwise be unstable.86 If mHC allows training 100B+ parameter models that would diverge without the framework, the effective cost-per-capability could decrease substantially.87
Industry Response
The paper generated immediate attention from AI researchers worldwide. Sebastian Raschka, a prominent ML researcher and author, called mHC "a striking breakthrough that could shape the evolution of foundational models."88
Analysts remain divided on specific implications. Lian Jye Su of ABI Research suggested DeepSeek's track record means mHC will "definitely be implemented in their new model."89 Wei Sun of Counterpoint Research offered more caution: "There is most likely no standalone R2 coming," suggesting the technique might appear in V4 instead.90
For organizations operating AI infrastructure, mHC signals continued rapid evolution in model architectures. Teams planning GPU deployments should account for the possibility that next-generation models may have different computational profiles than current architectures.91
Organizations scaling AI infrastructure across multiple regions can consult Introl, which operates across 257 global locations with the capability to deploy 100,000 GPUs, for deployment strategies that anticipate these architectural shifts.
Technical Implementation Considerations
Organizations seeking to implement mHC or similar constrained optimization techniques should consider several technical factors.92
Hardware Requirements
Sinkhorn-Knopp normalization adds compute operations but reduces memory pressure compared to unconstrained alternatives.93 The iterative nature of the algorithm benefits from high-bandwidth memory architectures found in modern AI accelerators.94
| Hardware | mHC Suitability | Notes |
|---|---|---|
| NVIDIA H100 | Excellent | Native FP8 support matches mHC precision requirements95 |
| NVIDIA H200 | Excellent | Additional HBM3e memory beneficial for large models96 |
| NVIDIA B200 | Excellent | Blackwell architecture optimized for iterative algorithms97 |
| AMD MI300X | Good | Requires custom kernel development98 |
Framework Support
DeepSeek has not released public implementation code as of January 2026. Organizations interested in experimenting with mHC techniques must implement the method from the paper's algorithmic descriptions.99
Key implementation components include the following; a minimal sketch combining the first two appears after the list:
- Sinkhorn-Knopp kernels: Efficient CUDA implementations of iterative row/column normalization
- Backward pass modifications: Gradient computation through the projection operation
- Mixed precision handling: Maintaining numerical stability across precision levels
- Distributed training integration: Coordinating projections across multi-GPU configurations
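A hypothetical end-to-end sketch of the first two components in plain PyTorch: autograd through the unrolled Sinkhorn iterations stands in for custom forward/backward kernels, and all names and shapes are illustrative assumptions rather than the paper's API.

```python
import torch
import torch.nn as nn

def sinkhorn(logits, n_iters=20):
    """Differentiable Sinkhorn-Knopp: autograd backpropagates through the
    unrolled row/column normalizations (a stand-in for custom backward kernels)."""
    M = logits.exp()                              # keep entries positive
    for _ in range(n_iters):
        M = M / M.sum(dim=-1, keepdim=True)       # rows sum to 1
        M = M / M.sum(dim=-2, keepdim=True)       # columns sum to 1
    return M

class ConstrainedConnection(nn.Module):
    """Hypothetical mHC-style mixing layer: learned connection logits are projected
    to (approximately) doubly stochastic form before mixing residual streams."""
    def __init__(self, n_streams=4, n_iters=20):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(n_streams, n_streams))
        self.n_iters = n_iters

    def forward(self, streams):                   # streams: (n_streams, batch, dim)
        mix = sinkhorn(self.logits, self.n_iters)
        return torch.einsum('ij,jbd->ibd', mix, streams)
```

Mixed-precision handling and distributed coordination, the remaining two components, would sit on top of a kernel like this and are omitted from the sketch.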
Training Infrastructure
mHC's 6.7% overhead applies to forward and backward passes uniformly.100 Organizations should budget for this additional compute when planning training runs. For a hypothetical 1000 H100-hour training job, mHC would require approximately 1067 H100-hours.101
The overhead remains constant regardless of model size, suggesting efficient scaling to larger configurations.102 Organizations planning future models exceeding current scale should consider mHC's stability benefits against the modest compute cost increase.
Key Takeaways
For infrastructure planners:

- Budget for 6-7% additional training compute for mHC-style architectures
- Prioritize high-bandwidth memory GPUs (H200, B200) for iterative algorithms
- Plan for architectural shifts in next-generation foundation models

For operations teams:

- Monitor emerging framework support for constrained optimization techniques
- Develop expertise in Sinkhorn-Knopp and related iterative algorithms
- Prepare infrastructure for hybrid precision training requirements

For strategic planning:

- Track DeepSeek model releases for mHC integration signals
- Evaluate training stability improvements against compute overhead
- Consider how architectural advances affect GPU procurement timelines

For research teams:

- Experiment with Birkhoff Polytope projections in smaller-scale models
- Investigate extensions to attention mechanisms and cross-layer connections
- Monitor follow-up papers exploring mHC variations
References
- Analytics Vidhya - DeepSeek mHC: Stabilizing Large Language Model Training
- South China Morning Post - DeepSeek kicks off 2026 with paper signalling push to train bigger models for less
- GeeksforGeeks - Residual Networks (ResNet) - Deep Learning
- Dive into Deep Learning - Residual Networks (ResNet) and ResNeXt
- Michael Brenndoerfer - Residual Connections: Enabling Training of Very Deep Neural Networks
- EITCA Academy - How do residual connections in ResNet architectures facilitate training
- viso.ai - ResNet: Revolutionizing Deep Learning in Image Recognition
- Data Basecamp - ResNet: Residual Neural Networks - easily explained
- Towards Data Science - Residual blocks — Building blocks of ResNet
- DEV Community - The Geometry of Stability: Why Manifold-Constrained Hyper-Connections Are the Future
- Neuronad - Taming the Chaos: How Manifold-Constrained Hyper-Connections Are Evolving AI Architecture
- editorialge - mHC Manifold Constrained Hyper Connections by DeepSeek
- Medium - DeepSeek mHC Explained: How Manifold-Constrained Hyper-Connections Redefine Residual Connections
- Analytics Vidhya - Scaling Transformers, Laws and Challenges
- Medium - What is DeepSeek mHC? DeepSeek's new paper changes LLMs forever
- SIAM Journal - The Sinkhorn–Knopp Algorithm: Convergence and Applications
- alphaXiv - mHC: Manifold-Constrained Hyper-Connections overview
- Emergent Mind - Sinkhorn-Knopp-Style Algorithm applications
- Analytics India Mag - New DeepSeek Research Shows Architectural Fix Can Boost Reasoning
- AI Discoveries - DeepSeek's New Training Method: What It Means for 2026
- Scientific American - Secrets of Chinese AI Model DeepSeek