I'm pretty surprised by the claimed memory usage for 300B parameters (table 1).
If we compare similar models:

- Llama 3.1 with 405B parameters: 2 TB of memory (FP32), 500 GB (FP8)

- DeepSeek R1 with 671B parameters: 1.3 TB (FP16); scaling linearly, that's around 600 GB for 300B parameters

Ling claims no more than 96 GB of memory, most likely for inference. 96 GB against roughly 600 GB is about an 84% reduction, far more than 20%. Am I missing something?
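For reference, here's the back-of-envelope I'm using: weights only, decimal GB, standard byte widths per dtype, ignoring KV cache, activations, and serving overhead (which is why these raw numbers land a bit under the figures above).

    # Weights-only memory: params * bytes per param, in decimal GB.
    # Ignores KV cache, activations, and any serving overhead.
    BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "fp8": 1.0, "int4": 0.5}

    def weight_gb(params_billions: float, dtype: str) -> float:
        return params_billions * BYTES_PER_PARAM[dtype]

    for params, dtype in [(405, "fp32"), (405, "fp8"),
                          (671, "fp16"), (300, "fp16"), (300, "int4")]:
        print(f"{params}B @ {dtype}: ~{weight_gb(params, dtype):,.0f} GB")
    # 405B fp32 ~1,620 GB; 405B fp8 ~405 GB; 671B fp16 ~1,342 GB
    # 300B fp16 ~600 GB; 300B int4 ~150 GB

Even 4-bit weights for all 300B parameters come out around 150 GB, still well above 96 GB, so quantization alone doesn't seem to explain it.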