Conceptually there should be a predictable tradeoff curve between memory size and execution time. This could be quite useful (e.g. if 100 hrs is OK, so is 200 hrs); after all, you don't train such a model every day.

In practice, though, this curve is probably very non-linear at the low end. This writeup shows nicely how lumpy the various steps (loading kernels, the model, etc.) are:

https://huggingface.co/docs/transformers/perf_train_gpu_one
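For what it's worth, the two main memory-for-time knobs that doc covers (gradient accumulation and gradient checkpointing) are each a one-liner in the Trainer API. A minimal sketch, assuming the transformers library; the batch size and step counts are illustrative, not tuned:

    from transformers import TrainingArguments

    args = TrainingArguments(
        output_dir="out",  # placeholder path
        # Shrink the per-device batch and accumulate gradients instead:
        # same effective batch size (1 x 16), less activation memory,
        # at the cost of more forward/backward passes per optimizer step.
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        # Recompute activations during the backward pass rather than
        # storing them; the linked doc cites roughly a 20% slowdown in
        # exchange for a large cut in activation memory.
        gradient_checkpointing=True,
    )

That's also part of why the low end is lumpy: these options aren't a dial, they're discrete switches with step-function costs.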