The compute scheduling part of the paper is also very good, especially the way they balance load to keep compute and communication in check.<p>A lot of thought also went into the many small optimizations that reduce memory usage, and into using FP8 effectively without significant loss of precision or dynamic range.<p>None of the techniques is really mind-blowing by itself, but the whole is very well done.<p>The DeepSeek-V3 paper is really a good read: <a href="https://github.com/deepseek-ai/DeepSeek-V3/blob/main/DeepSeek_V3.pdf">https://github.com/deepseek-ai/DeepSeek-V3/blob/main/DeepSee...</a>