That's interesting and reflects my personal training optimization workflow pretty well.
Usually I'll check nvidia-smi and make sure GPU utilization is high; if not, I check, in order:

* That my batch transfers to VRAM are done in a sensible way in the dataloader and don't hide CPU-bound preprocessing

* That my batch size is large enough

* That the model is adequate for the GPU (even convolutional models can be faster on the CPU at specific sizes)

That's good enough to go from a CPU-bound pattern to a GPU-bound one, but I don't really get a detailed understanding of the spectrum between the two, so I'm definitely going to try this tool in the future, especially since it's so easy to add.

On the subject of optimization tricks, I haven't found any magic bullets. You can't always increase the batch size to reach 100% util because of the performance implications. FP16 precision has never done anything for me, weirdly. My preprocessing is never CPU-bound unless I do dumb shit in it, so rewriting it in C++ would do nothing.
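The first checklist item, sensible batch transfers, can be sketched roughly as follows in PyTorch. This is a minimal illustration with a made-up toy dataset, not anyone's actual training loop; the idea is that pinned host memory plus `non_blocking=True` lets the host-to-VRAM copy overlap with compute instead of stalling the training thread:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Toy dataset standing in for real preprocessed samples.
xs = torch.randn(256, 3, 32, 32)
ys = torch.randint(0, 10, (256,))
ds = TensorDataset(xs, ys)

# pin_memory=True keeps batches in page-locked host RAM so the copy to
# VRAM can run asynchronously. In a real run you'd also set
# num_workers > 0 so CPU-side preprocessing happens off the training
# thread (kept at 0 here to keep the sketch self-contained).
loader = DataLoader(
    ds,
    batch_size=64,
    num_workers=0,
    pin_memory=torch.cuda.is_available(),
)

for x, y in loader:
    # non_blocking=True makes the transfer asynchronous when the source
    # tensor is pinned; with an unpinned source it silently falls back
    # to a synchronous copy, which is one way transfers end up hiding
    # CPU-bound work.
    x = x.to(device, non_blocking=True)
    y = y.to(device, non_blocking=True)
    # ... forward/backward would go here ...
```

The overlap only pays off if the GPU has queued work to execute during the copy, which is why this check comes before tuning batch size: a slow dataloader makes every downstream knob look ineffective.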