科技回声

11 comments

SnowflakeOnIce, 9 months ago

> you can get 100% GPU utilization by just reading/writing to memory while doing 0 computations

Indeed! Utilization is a proxy for what you actually want (which is good use of available hardware). 100% GPU utilization doesn't actually indicate this.

On the other hand, if you aren't getting 100% GPU utilization, you aren't making good use of the hardware.

antognini, 9 months ago

When understanding the performance of your model, it's very helpful to look at a roofline plot [1]. The roofline plot will show you the floating-point performance as a function of arithmetic intensity for the various ops in your model. The plot has two regimes: a memory-bound regime on the left and a compute-bound regime on the right. This can help to identify memory-bound ops that are taking a significant fraction of compute time.

[1]: https://en.wikipedia.org/wiki/Roofline_model
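As a rough illustration of the two regimes described above, the roofline ceiling for an op can be sketched as the minimum of peak compute and arithmetic intensity times memory bandwidth. The hardware numbers and op intensities below are made-up assumptions for illustration, not a real spec sheet or measured values.

```python
def attainable_flops(arithmetic_intensity, peak_flops, peak_bandwidth):
    """Roofline ceiling in FLOP/s for an op.

    arithmetic_intensity: FLOPs performed per byte moved to/from memory.
    peak_flops: peak compute throughput in FLOP/s.
    peak_bandwidth: peak memory bandwidth in bytes/s.
    """
    return min(peak_flops, arithmetic_intensity * peak_bandwidth)


def ridge_point(peak_flops, peak_bandwidth):
    """Arithmetic intensity at which the two regimes meet (FLOP/byte)."""
    return peak_flops / peak_bandwidth


def regime(arithmetic_intensity, peak_flops, peak_bandwidth):
    """Classify an op as memory-bound (left of the ridge) or compute-bound."""
    if arithmetic_intensity < ridge_point(peak_flops, peak_bandwidth):
        return "memory-bound"
    return "compute-bound"


# Hypothetical accelerator: 100 TFLOP/s peak compute, 2 TB/s memory bandwidth.
PEAK_FLOPS = 100e12
PEAK_BW = 2e12

# Hypothetical ops with assumed arithmetic intensities (FLOP/byte).
for name, ai in [("elementwise add", 0.25), ("large matmul", 200.0)]:
    ceiling = attainable_flops(ai, PEAK_FLOPS, PEAK_BW)
    print(f"{name}: {regime(ai, PEAK_FLOPS, PEAK_BW)}, ceiling {ceiling / 1e12:.1f} TFLOP/s")
```

An op far to the left of the ridge point cannot get close to peak FLOP/s no matter how busy the GPU looks, which is why it can be worth checking where the most time-consuming ops sit on the plot.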

sundalia, 9 months ago

Application-specific metrics are the way to go. For ML training this is one example: https://cloud.google.com/blog/products/ai-machine-learning/goodput-metric-as-measure-of-ml-productivity

sergiotapia, 9 months ago

Running GPU models and maximizing utilization is pretty opaque to me as a layman coming into the scene.

Take this example: https://gist.github.com/sergiotapia/efc9b3f7163ba803a260b481470255c1 - I'm running a fairly simple model that takes only 70 ms per image pair, but because I have 300 images it becomes a big time sink.

By using ThreadPoolExecutor, I cut that down to about 16 seconds. I wonder if there is a fairly obvious way to truly utilize my beefy L40S GPU! Is it MPS? I haven't been successful at even running the MPS daemon on my Linux server yet. Very opaque for sure!

DamonsJ, 9 months ago

> "If we have a CUDA kernel that continuously runs for 10 seconds but only uses 1 SM, on an H100, this would register 100% utilization, but the SM efficiency would be 1 / 132 = 0.7%."

Does this situation register 100% utilization? BTW, SM occupancy is also a metric you need to care about if you are concerned with kernel efficiency.
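The arithmetic in the quoted passage can be checked with a small sketch. It assumes a simplified model of the two metrics: utilization is the fraction of time at least one kernel is running, and SM efficiency is the time-weighted fraction of SMs that are active. The `summarize` helper and its interval format are hypothetical, and this is not how the driver computes its counters. Under that model, one SM out of 132 gives about 0.76%, which the quote truncates to 0.7%.

```python
def summarize(intervals, total_sms):
    """Return (utilization, sm_efficiency) under a simplified model.

    intervals: list of (seconds, active_sms) pairs covering the sample window.
    utilization: fraction of time with at least one SM busy.
    sm_efficiency: time-weighted average fraction of SMs that are busy.
    """
    total_time = sum(seconds for seconds, _ in intervals)
    busy_time = sum(seconds for seconds, sms in intervals if sms > 0)
    sm_seconds = sum(seconds * sms for seconds, sms in intervals)
    return busy_time / total_time, sm_seconds / (total_time * total_sms)


H100_SMS = 132  # SM count used in the quoted example

# The quoted scenario: one kernel running for 10 seconds on a single SM.
util, sm_eff = summarize([(10.0, 1)], H100_SMS)
print(f"utilization: {util:.0%}, SM efficiency: {sm_eff:.2%}")
```

In this model the device counts as fully utilized because a kernel is resident for the whole window, even though almost all SMs are idle.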

saagarjha, 9 months ago

If you have a basic understanding of what your kernels are supposed to do, looking at pipe usage and roofline analysis in Nsight Compute is often helpful, since it will show you how hard you’re saturating those.

pavelstoev, 9 months ago

I recommend the hidet backend in torch.compile; it implements many advanced model-specific optimizations automatically. https://github.com/hidet-org/hidet

areichenbach, 9 months ago

I’ve recently been trusting gpu watt usage over utilization. Any idea how good that is as a simple proxy (if I’m just looking at nvidia-smi)?
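For anyone wanting to look at power draw next to utilization, here is a minimal sketch that parses the CSV output of an `nvidia-smi --query-gpu` call. The sample text is invented for illustration, the `parse_power_rows` helper is hypothetical, and the sketch does not handle fields reported as `[N/A]`. It only puts the two numbers side by side and makes no claim about how good power draw is as a proxy.

```python
import csv
import io

# Command whose output this sketch expects (one CSV row per GPU).
QUERY_CMD = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,power.draw,power.limit",
    "--format=csv,noheader,nounits",
]


def parse_power_rows(text):
    """Parse 'util, draw, limit' rows into dicts with a power fraction."""
    rows = []
    for fields in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not fields:
            continue  # skip blank lines
        util, draw, limit = (float(f) for f in fields)
        rows.append({
            "util_pct": util,
            "power_w": draw,
            "power_limit_w": limit,
            "power_frac": draw / limit,
        })
    return rows


# Invented sample: two GPUs both reporting 100% utilization at very different power draw.
SAMPLE = "100, 120.00, 350.00\n100, 340.50, 350.00\n"

for row in parse_power_rows(SAMPLE):
    print(f"util {row['util_pct']:.0f}%  power {row['power_w']:.1f} W "
          f"({row['power_frac']:.0%} of limit)")
```

On a machine with an NVIDIA driver installed, the sample string could be replaced with `subprocess.run(QUERY_CMD, capture_output=True, text=True, check=True).stdout`.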

danielvaughn, 9 months ago

We ran into a similar problem with CPU utilization at my job. Created an alert for when our systems hit 90% CPU util, and ended up with a ton of noise. We realized that for some of our workloads, this was normal and expected.

ScoutOrgo, 9 months ago

As someone that is familiar with using nvidia-smi to track util, what are some commands people use to track the SM efficiency? The end of the article had some references, but no examples of what to use explicitly.

AeZ1E, 9 months ago

GPU utilization is not everything, people! MFU (model FLOPs utilization) is where it's at. Time to recalibrate those expectations and tap into the true potential of your GPUs. Brace yourselves, the real efficiency is yet to come!
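Assuming the comment above refers to model FLOPs utilization (MFU), the metric is the ratio of the FLOP/s a training run actually achieves to the hardware's peak FLOP/s. The sketch below uses the common approximation of about 6 FLOPs per parameter per token for a transformer's forward and backward pass. The model size, throughput, and peak figure are made-up assumptions for illustration.

```python
def transformer_train_flops_per_s(n_params, tokens_per_s):
    """Approximate training FLOP/s using ~6 FLOPs per parameter per token.

    This is a rule of thumb for dense transformers (forward + backward)
    and ignores attention FLOPs; treat it as an estimate.
    """
    return 6 * n_params * tokens_per_s


def mfu(achieved_flops_per_s, peak_flops_per_s):
    """Model FLOPs utilization: achieved FLOP/s over peak hardware FLOP/s."""
    return achieved_flops_per_s / peak_flops_per_s


# Hypothetical run: 7B-parameter model, 10,000 tokens/s, 1 PFLOP/s peak.
achieved = transformer_train_flops_per_s(7e9, 10_000)
print(f"MFU: {mfu(achieved, 1e15):.0%}")
```

Unlike the utilization number reported by nvidia-smi, this ratio only goes up when the model does more useful arithmetic per second.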

GPU utilization can be a misleading metric
