Recommended reading! The latest DeepSeek model seems to come with several interesting innovations!
"Key Takeaways
- Hybrid CSA and HCA attention cuts KV cache to 10% of DeepSeek-V3.2 at 1M tokens.
- Manifold-Constrained Hyper-Connections (mHC) replace residual connections for more stable deep layer training.
- The Muon optimizer replaces AdamW for most parameters, delivering faster convergence and training stability.
- Post-training uses On-Policy Distillation from 10+ domain experts instead of traditional mixed RL.
- DeepSeek-V4-Flash-Base outperforms DeepSeek-V3.2-Base despite having 3x fewer activated parameters.
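To make the first takeaway concrete, here is a rough back-of-the-envelope sizing of a 1M-token KV cache. The layer count, head count, head dimension, and dtype below are illustrative assumptions, not published DeepSeek-V4 numbers; only the 10% ratio comes from the claim above.

```python
# Back-of-the-envelope KV-cache sizing. All model dimensions here are
# illustrative assumptions, not published DeepSeek numbers.

def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    # K and V each hold seq_len * n_kv_heads * head_dim values per layer.
    return 2 * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_elem

# Hypothetical dense baseline: 60 layers, 8 KV heads of dim 128, fp16.
dense = kv_cache_bytes(seq_len=1_000_000, n_layers=60, n_kv_heads=8, head_dim=128)
sparse = 0.10 * dense  # the claimed 10% footprint

print(f"dense : {dense / 2**30:.1f} GiB")   # ~228.9 GiB
print(f"sparse: {sparse / 2**30:.1f} GiB")  # ~22.9 GiB
```

Even under these made-up dimensions, the difference between ~229 GiB and ~23 GiB is what makes 1M-token contexts practical on a single node.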
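On the second point: hyper-connections generalize the single residual stream `x + f(x)` to n parallel streams with learnable mixing weights. The sketch below follows the general recipe from the Hyper-Connections literature; the "manifold constraint" of mHC is not specified in the post, so the normalization used here (a softmax over the stream-mixing weights) is purely a placeholder assumption.

```python
import torch
import torch.nn as nn

class HyperConnection(nn.Module):
    """Minimal sketch of a hyper-connection wrapper: n residual streams
    instead of one, with learnable mixing. The softmax on `beta` stands
    in for mHC's manifold constraint, whose details are not public here.
    With n_streams=1, beta=1, alpha=1, mix=I this reduces to x + layer(x)."""

    def __init__(self, n_streams=4):
        super().__init__()
        # beta mixes the n streams into the wrapped layer's input.
        self.beta = nn.Parameter(torch.ones(n_streams) / n_streams)
        # alpha scales the layer output added back onto each stream.
        self.alpha = nn.Parameter(torch.ones(n_streams))
        # mix lets streams exchange information (width connection).
        self.mix = nn.Parameter(torch.eye(n_streams))

    def forward(self, streams, layer):
        # streams: (n, batch, seq, d_model); layer: the attention/FFN block.
        x = torch.einsum("n,nbsd->bsd", self.beta.softmax(0), streams)
        y = layer(x)
        streams = torch.einsum("nm,mbsd->nbsd", self.mix, streams)
        return streams + self.alpha[:, None, None, None] * y.unsqueeze(0)
```

The extra streams cost activation memory but no extra parameters in the wrapped blocks, which is why they are pitched as a drop-in replacement for plain residuals.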
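On the Muon point: publicly available Muon implementations keep a momentum buffer per 2D weight matrix and orthogonalize the update with a quintic Newton-Schulz iteration before applying it. A minimal sketch follows; the hyperparameters are illustrative, and the post does not state DeepSeek's actual settings.

```python
import torch

def newton_schulz(G, steps=5, eps=1e-7):
    """Approximately orthogonalize a 2D matrix via the quintic
    Newton-Schulz iteration used in public Muon implementations."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + eps)  # scale so the iteration converges
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(param, grad, momentum, lr=0.02, beta=0.95):
    """One simplified Muon update for a 2D weight matrix:
    accumulate momentum, orthogonalize it, take the step."""
    momentum.mul_(beta).add_(grad)
    param.add_(newton_schulz(momentum), alpha=-lr)
    return momentum
```

Because the orthogonalized update has roughly uniform singular values, Muon is typically used for hidden weight matrices only, with AdamW kept for embeddings and scalars, which matches the "most parameters" wording above.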
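On the post-training point: in generic on-policy distillation, the student generates rollouts and a frozen domain-expert teacher scores the same tokens, with the loss typically a per-token reverse KL on those student-generated sequences. Below is a sketch of that generic loss, not DeepSeek's exact objective, which the post does not detail.

```python
import torch
import torch.nn.functional as F

def on_policy_distill_loss(student_logits, teacher_logits, mask):
    """Reverse KL(student || teacher) per token, averaged over valid
    positions. Both logit tensors score the *student's own* rollouts
    (on-policy), shapes (batch, seq, vocab); mask is (batch, seq)."""
    s_logp = F.log_softmax(student_logits, dim=-1)
    t_logp = F.log_softmax(teacher_logits, dim=-1)
    # Full-vocab reverse KL at each position: sum_v p_s * (log p_s - log p_t).
    kl = (s_logp.exp() * (s_logp - t_logp)).sum(-1)  # (batch, seq)
    mask = mask.to(kl.dtype)
    return (kl * mask).sum() / mask.sum()
```

Training on the student's own samples avoids the exposure-bias problem of off-policy distillation, and swapping in a different expert teacher per domain is what replaces the mixed RL stage described above.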