Crazy amount of innovations in one technical report:

- successful FP8 quantized training for a SOTA model
- multi-token prediction, mostly to improve training results, but also to enable speculative decoding
- very high sparsity per request (37B activated params out of 671B total; see the sketch after this list)
- using reasoning data (from DeepSeek R1) to fine-tune and improve results on math & coding
- manual balancing of compute / communication in their infrastructure, down to the SM level
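For intuition on the sparsity point, here's a minimal sketch of generic top-k MoE routing (toy sizes, not DeepSeek's actual architecture or code): only the experts the router picks for a given token contribute parameters to that token's forward pass, which is why the activated count is a small fraction of the total.

```python
# Toy top-k mixture-of-experts routing sketch (illustrative only, not DeepSeek's code).
import numpy as np

rng = np.random.default_rng(0)

n_experts, top_k = 16, 2          # toy numbers; the real model uses far more experts
d_model, d_ff = 64, 256

# Each expert is a tiny feed-forward block; together the experts hold most of the params.
experts = [(rng.standard_normal((d_model, d_ff)) * 0.02,
            rng.standard_normal((d_ff, d_model)) * 0.02) for _ in range(n_experts)]
router = rng.standard_normal((d_model, n_experts)) * 0.02

def moe_forward(x):
    """Route a single token through only its top-k experts."""
    logits = x @ router
    chosen = np.argsort(logits)[-top_k:]               # indices of the top-k experts
    weights = np.exp(logits[chosen]) / np.exp(logits[chosen]).sum()
    out = np.zeros_like(x)
    for w, idx in zip(weights, chosen):
        w1, w2 = experts[idx]
        out += w * (np.maximum(x @ w1, 0.0) @ w2)      # ReLU FFN, weighted by the gate
    return out

token = rng.standard_normal(d_model)
_ = moe_forward(token)

total_params = sum(w1.size + w2.size for w1, w2 in experts)
active_params = top_k * (experts[0][0].size + experts[0][1].size)
# 2/16 = 12.5% in this toy; roughly 37B/671B ≈ 5.5% for DeepSeek-V3
print(f"activated / total expert params: {active_params / total_params:.1%}")
```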
The big news here is the training cost: $5.576M total, equivalent to 2.788M GPU hours on H800s at $2 per hour. That's for a model that is (according to DeepSeek's own benchmarks) SOTA among open-source models.
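Back-of-the-envelope check of that figure, using only the numbers quoted above:

```python
# Sanity check on the quoted training-cost figure.
gpu_hours = 2_788_000       # H800 GPU hours, as quoted
cost_per_gpu_hour = 2.0     # USD per GPU hour, the rental-price assumption
print(f"${gpu_hours * cost_per_gpu_hour / 1e6:.3f}M")  # -> $5.576M
```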