Ok, I work on PyTorch, so I should probably clear up some misconceptions in this thread.<p>1. In PyTorch (and other array programming libraries like Numpy), the objects being passed around are tensors/arrays (i.e. large chunks of memory). Thus, += is overloaded to mean "in-place write" to the arrays.<p>So, `+` vs `+=` is the equivalent of<p><pre><code> a: float[1000]
b: float[1000]
for i in range(1000):
    b[i] = a[i] + 2
</code></pre>
vs.<p><pre><code> a: float[1000]
for i in range(1000):
    a[i] = a[i] + 2
</code></pre>
The main performance advantage comes from 1. not needing to allocate an extra array, and 2. using less memory overall, so the various cache levels can work better. It has nothing to do with Python bytecodes.<p>2. As for whether it generally makes sense to do this optimization manually... Usually, PyTorch users don't use in-place operations, as they're a bit uglier mathematically and have various foot-guns/restrictions that users find confusing. Generally, it's best to have this optimization done automatically by an optimizing compiler.<p>3. PyTorch in general <i>does</i> support using in-place operations during training, albeit with some caveats.<p>(PS) 4. Putting everything on one line (as some folks suggest) is almost certainly not going to help performance - the primary performance bottlenecks here have almost nothing to do with CPU perf.
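In PyTorch terms, a minimal sketch of the two spellings (assuming a plain CPU tensor; `x.add_(2)` is the explicit spelling of the in-place form):<p><pre><code> import torch

 x = torch.randn(1000)

 # Out-of-place: allocates a fresh tensor to hold the result,
 # then rebinds the name x to it. The old storage becomes garbage.
 x = x + 2

 # In-place: writes into x's existing storage, no new allocation.
 # Equivalent to x.add_(2).
 x += 2
</code></pre>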
Plot twist: it breaks the code...?<p>> Changing this back to the original implementation fixed an error I was getting when doing textual inversion on Windows<p><a href="https://github.com/lstein/stable-diffusion/commit/62863ac586194a43ff952eba17a83cecf9956500#commitcomment-83696307" rel="nofollow">https://github.com/lstein/stable-diffusion/commit/62863ac586...</a>
I see lots of people answering why it's faster, but not many saying why the engineers chose the slower version.<p>As everyone said, this is more performant because x is being modified in place. The reason it was not done in place originally is that you can't train a neural network if an operation is done in place. During training, the network literally walks back through all the operations that were performed to see how each contributed, so the weights can be adjusted using a secondary value called a gradient; this happens during the backward pass. If you replace something in place, you're essentially overwriting the input values that were passed to that function - and by extension, the output values of the function called before it - essentially breaking the network chain, unless you also copy the inputs together with the gradients, which would cause an even worse performance hit and be a memory hog.<p>The breakage bug later in the issue is proof of this: when sampling to generate an image, only the forward pass is run on the network, but textual inversion requires you to train the network and therefore run the backward pass, triggering the error since the dependency graph is broken. I should also note that technically the add operation should be safe to do in place, as it's reversible, but I'm not a PyTorch expert so I'm not sure exactly what's going on in there.
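A minimal sketch of that failure mode (the exact error text varies by PyTorch version): sigmoid saves its <i>output</i> for the backward pass, so overwriting that output in place trips autograd's version check.<p><pre><code> import torch

 x = torch.ones(3, requires_grad=True)
 y = x.sigmoid()   # autograd saves y itself to compute sigmoid's gradient
 y += 1            # in-place write clobbers the saved output

 # Raises: RuntimeError: one of the variables needed for gradient
 # computation has been modified by an inplace operation
 y.sum().backward()
</code></pre>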
Because of operator overloading, "+=" can call a more optimized method than "+". If this code were written in a language without operator overloading, I don't think this would be a very interesting pull request. This could be an example of why some people don't like operator overloading, and why some programming languages (Java, Zig, etc.) do not implement the feature.
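For anyone unfamiliar with the mechanism, a minimal sketch in plain Python (`Buf` is a made-up class): "+" dispatches to __add__, which returns a new object, while "+=" dispatches to __iadd__, which is free to mutate in place.<p><pre><code> class Buf:
     def __init__(self, data):
         self.data = data

     def __add__(self, other):
         # "+" builds and returns a brand-new object
         return Buf([v + other for v in self.data])

     def __iadd__(self, other):
         # "+=" mutates self's storage and returns self
         for i in range(len(self.data)):
             self.data[i] += other
         return self
</code></pre>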
If they're seeing these kinds of gains from relatively minor changes to their Python code, I can't help but wonder how much faster the model would run in a compiled language or a language with a good JIT (way more optimization work has gone into the mainstream JavaScript runtimes than into CPython).<p>I'd assumed that overall performance in Stable Diffusion was limited by the code running on the GPU, with Python performance being a fairly minor factor -- but I guess that's not the case?
This isn't a Python issue, this is a "I'm copying when I don't need to" issue. As I mention elsewhere, you can write this sort of "bug" in almost any language pretty easily (as I demonstrate with Rust).<p>This isn't a case of "The Python interpreter is bad", it's just that the code is doing what the user asked it to do - create a completely new copy of the data, then overwrite the old copy with it. Immutable operations like this are slow; mutating the value (what += does) is fast.<p>Granted, a compiled language could recognize that you're doing this, but it also might not - are `+` and `+=` semantically identical such that the compiler can replace one with the other? <i>Maybe</i>? Probably not, if I had to guess. The correct answer is to just use the faster operation, as it is with all languages.<p>I don't know the <i>type</i> of `x`, but I'd suggest another optimization here would be to:<p>a) Preallocate the buffer before mutating it 3x (which is still likely forcing some allocations)<p>b) Reuse that buffer if it's so important: store it in `self` and clear it before use, as sketched below.
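A sketch of (b) in PyTorch terms, assuming `x` is a tensor (the `Block` class and `buf` attribute here are hypothetical; torch.add's out= parameter writes the result into an existing tensor):<p><pre><code> import torch

 class Block:
     def __init__(self, n):
         # hypothetical scratch buffer, allocated once and reused per call
         self.buf = torch.empty(n)

     def forward(self, x):
         # writes x + 2 into the preallocated buffer, no fresh allocation
         torch.add(x, 2, out=self.buf)
         return self.buf
</code></pre>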
This StackOverflow answer [1] goes into performance details of INPLACE_ADD versus BINARY_ADD.<p>[1] <a href="https://stackoverflow.com/a/15376520" rel="nofollow">https://stackoverflow.com/a/15376520</a>
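You can see the two opcodes yourself with the dis module (on CPython 3.11+ both compile to a generic BINARY_OP, but the split below holds on the versions that answer discusses):<p><pre><code> import dis

 dis.dis("x = x + y")   # emits BINARY_ADD on CPython <= 3.10
 dis.dis("x += y")      # emits INPLACE_ADD on CPython <= 3.10
</code></pre>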
I wonder what version of Python they were using?<p>I'm wondering because recent versions have improved performance a lot. 3.11 is much faster than 3.10, and what's in 3.12 is already much faster than 3.11.
It’s not clear a JIT compiled language would help much here unless the operations were baked into the JIT itself (which would have to identify the memory savings of an in-place call).
One comment asks about putting it all on one line, and this is where interpreted languages without a JIT kinda blow.<p>Many times I have had to choose between making my Python code more legible and getting free performance.<p>The thing I like about JavaScript is that I can _usually_ trust the JIT to make my code faster than I could, meaning I can focus entirely on writing clean code.<p>P.S. you can always hand-optimize. If you do, just comment the heck out of it.
Lincoln Stein. Now that's a name I've not heard in a long time. A long time.<p>He's the author of the essay "How Perl Saved the Genome Project", the books "Network Programming with Perl" and "Writing Apache Modules with Perl and C", and a number of Perl packages including CGI.pm - which helped power the dot-com era - and GD.pm.
But wait... x += y is equivalent to x = x + y, not to x = y + x. Only if + is commutative are the three equivalent. Are we sure the + operation is commutative for this type of data? And does the compiler know it?<p>It would be interesting to check whether changing every expression to x = x + y gives performance more similar to += or to ... + x.
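For PyTorch tensors specifically, elementwise + is commutative, but += still isn't always interchangeable with the out-of-place form - a sketch of one edge case, where broadcasting makes the result bigger than x:<p><pre><code> import torch

 x = torch.zeros(3)
 y = torch.zeros(2, 3)

 z = x + y   # fine: the result broadcasts to shape (2, 3)
 x += y      # RuntimeError: output with shape [3] doesn't match
             # the broadcast shape [2, 3]
</code></pre>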
Is this a lookup-overhead thing or a memcpy-based overhead regression? In the latter case, it seems like this may result in an unexpected mutation of the source data?
There is a case in C# where using compound assignment is actually slower [0]. Based on the comments, this should be fixed in .NET 7, though I haven't checked it myself.<p>[0]: <a href="https://mobile.twitter.com/badamczewski01/status/1561817158442782720" rel="nofollow">https://mobile.twitter.com/badamczewski01/status/15618171584...</a>
Whenever I see things like this in highly visible code that people exclaim about across the internet, it makes me really take a moment to absorb how much time I spend agonizing over minutiae in my daily work, and how people who really are just lucky can get away with much worse. Just a reminder that the idea that "tech" is a meritocracy was never really true.