The GPU-hours stat here allows us to back out some interesting figures around electricity usage and carbon emissions if we make a few assumptions.<p>2,788,000 GPU-hours * 350W TDP of H800 = 975,800,000 GPU watt-hours<p>975,800,000 GPU Wh * (1.2 to account for non-GPU hardware) * (1.3 PUE [1]) = 1,522,248,000 total Wh, or 1,522,248 kWh to train DeepSeek-V3<p>(1,522,248 kWh) * (0.582 kg CO2eq/kWh in China [2]) = 885,948 kg CO2 equivalents to train DeepSeek-V3<p>A typical US passenger vehicle emits about 4.6 metric tons of CO2 per year. [3]<p>885,948 kg CO2 per DeepSeek / 4,600 kg CO2 per car = 192.6 cars per DeepSeek<p>So, the final training run for DeepSeek-V3 emitted as much greenhouse gas as running about 193 extra cars on the road for a year.<p>I also did some more math and found that this training run used about as much electricity as 141 US households would use over the course of a year. [4]<p>[1] <a href="https://enviliance.com/regions/east-asia/cn/report_10060" rel="nofollow">https://enviliance.com/regions/east-asia/cn/report_10060</a><p>[2] <a href="https://ourworldindata.org/grapher/carbon-intensity-electricity" rel="nofollow">https://ourworldindata.org/grapher/carbon-intensity-electric...</a><p>[3] <a href="https://www.epa.gov/greenvehicles/greenhouse-gas-emissions-typical-passenger-vehicle" rel="nofollow">https://www.epa.gov/greenvehicles/greenhouse-gas-emissions-t...</a><p>[4] divided total kWh by the value here: <a href="https://www.eia.gov/tools/faqs/faq.php?id=97&t=3" rel="nofollow">https://www.eia.gov/tools/faqs/faq.php?id=97&t=3</a>
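The arithmetic above is easy to check in a few lines of Python. All constants come from the comment and its cited sources; note that the 1.2 non-GPU overhead factor is an assumption in the original estimate, and the 10,791 kWh/yr household figure is my assumed value for the EIA average referenced in [4] (the comment doesn't state the number it used).

```python
# Back-of-envelope estimate of DeepSeek-V3's training electricity use and
# CO2 emissions, reproducing the figures above.

GPU_HOURS = 2_788_000            # H800 GPU-hours for the final training run
TDP_W = 350                      # H800 TDP, watts
NON_GPU_OVERHEAD = 1.2           # assumed factor for non-GPU hardware
PUE = 1.3                        # data-center power usage effectiveness [1]
CO2_KG_PER_KWH = 0.582           # China grid carbon intensity, kg CO2eq/kWh [2]
CAR_KG_PER_YEAR = 4_600          # typical US passenger vehicle, kg CO2/yr [3]
HOUSEHOLD_KWH_PER_YEAR = 10_791  # assumed US household average, see [4]

total_kwh = GPU_HOURS * TDP_W * NON_GPU_OVERHEAD * PUE / 1000
co2_kg = total_kwh * CO2_KG_PER_KWH

print(f"{total_kwh:,.0f} kWh")                                      # 1,522,248 kWh
print(f"{co2_kg:,.0f} kg CO2eq")                                    # 885,948 kg CO2eq
print(f"{co2_kg / CAR_KG_PER_YEAR:.1f} car-years")                  # 192.6 car-years
print(f"{total_kwh / HOUSEHOLD_KWH_PER_YEAR:.0f} household-years")  # 141 household-years
```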
Re DeepSeek-V3 0324 - I made some 2.7bit dynamic quants (230GB in size) for those interested in running them locally via llama.cpp! Tutorial on getting and running them: <a href="https://docs.unsloth.ai/basics/tutorial-how-to-run-deepseek-v3-0324-locally">https://docs.unsloth.ai/basics/tutorial-how-to-run-deepseek-...</a>
The paper hasn't been updated for the -0324 release, unfortunately; diff-pdf shows only a few small additions (and the consequent layout shift) in the Feb 18 arxiv revision.
I like that they give advice to hardware manufacturers:
- offload communication to a dedicated co-processor
- support higher precision when accumulating fp8 operations
- finer-grained quantization
...
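On the finer-grained quantization point: a rough NumPy sketch of what per-block (rather than per-tensor) scaling buys you. int8 stands in for fp8 here, and the block size of 128 echoes the tile sizes the paper uses but is otherwise arbitrary; none of this is DeepSeek's actual kernel code.

```python
# Finer-grained (blockwise) quantization: one scale per small block of values
# instead of one scale for the whole tensor, so a single outlier only degrades
# precision within its own block.
import numpy as np

def quantize_blockwise(x, block=128):
    """Quantize a 1-D float array to int8 with one scale per block."""
    pad = (-len(x)) % block
    xp = np.pad(x, (0, pad)).reshape(-1, block)
    scales = np.abs(xp).max(axis=1, keepdims=True) / 127.0
    scales[scales == 0] = 1.0                  # all-zero block: avoid div by zero
    q = np.round(xp / scales).astype(np.int8)
    return q, scales, pad

def dequantize_blockwise(q, scales, pad):
    out = (q.astype(np.float32) * scales).reshape(-1)
    return out[:out.size - pad] if pad else out

rng = np.random.default_rng(0)
x = rng.normal(size=1000).astype(np.float32)
x[10] = 100.0                                  # outlier that dominates a per-tensor scale

q, scales, pad = quantize_blockwise(x)
err_block = np.abs(dequantize_blockwise(q, scales, pad) - x).mean()

# Per-tensor baseline for comparison: one scale for the whole array.
s = np.abs(x).max() / 127.0
err_tensor = np.abs(np.round(x / s) * s - x).mean()
# err_block comes out far smaller than err_tensor: only the outlier's own
# block pays the precision cost of the large scale.
```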