Some important things I think people should note before blindly commenting:<p>* The example code is obviously contrived. The real gist is that massive deallocations on the UI thread cause lag, which the example code demonstrates. That very thing can easily happen in the real world.<p>* I didn't see any difference on my machine between a debug build and a release build.<p>* The example is performing 1 _million_ deallocations. That's why it's so pathological. It's not just a "large" vector. It's a vector of 1 million vectors. While that may seem contrived, consider a vector of 1 million strings, something that's not too uncommon, and which would likely suffer the same performance penalty.<p>* Rust is not copying anything, nor duplicating the structures here. In the example code the structures are moved, not copied, which costs next to nothing. The deallocation is taking up 99% of the time.<p>* As an aside, compilers have used the trick of not freeing data structures before, because it provides a significant performance boost. Instead of calling free on all those billions of tiny data structures a compiler generates during its lifetime, they just let them leak. Since a compiler is short-lived it's not a problem, they get a free lunch (no pun intended), and the OS takes care of cleaning up after all is said and done. My point is that this post isn't theoretical; we do deallocation trickery in the real world.
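To see the pathology in isolation, here's a minimal sketch that times only the deallocation of a vector of 1 million strings (numbers will vary by machine and allocator):<p><pre><code> use std::time::Instant;

 fn main() {
     // 1 million small heap allocations, as in a vector of strings.
     let v: Vec&lt;String&gt; = (0..1_000_000).map(|i| i.to_string()).collect();

     let t = Instant::now();
     drop(v); // 1 million individual frees happen here
     println!("drop took {:?}", t.elapsed());
 }
</code></pre>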
This is the standard problem with tracing data structures to free them. You frequently run into it with systems based on malloc/free or reference counting. The underlying problem is that freeing the structure takes time proportional to the number of pointers in the structure that it has to chase.<p>Generational/compacting GC has the opposite problem. Garbage collection takes time proportional to the live set, and the amount of memory collected is unimportant.<p>There's actually a lot to be said for Rust in that the ownership system lets you transfer freeing responsibility off-thread safely and cheaply, so it doesn't block the critical path.<p>But overall, there's nothing really unexpected here if you're familiar with memory management.
The title is slightly wrong: it's not going to make your code <i>faster</i>, it's going to reduce <i>latency</i> on the given thread.<p>It may be a net win if this is the UI thread of a desktop app, but overall it will come at a performance cost: modern allocators have thread-local memory pools, and now you're moving away from them. And if you're running your code on a NUMA system (most servers nowadays), when moving from one thread to another you can end up freeing non-local memory instead of local memory. Also, you won't have any backpressure on your allocations, so you risk running out of memory (especially because your deallocations now occur more slowly than they should).<p>Main takeaway: if you use it blindly it's an anti-pattern, but it can be a good idea in its niche: the UI thread of a GUI.
Just be careful, because moving heavy things to be dropped on another thread can change the <i>semantics</i> of the program. For instance, consider what happens if that heavy thing contained a BufWriter: unless its buffer is empty, dropping it writes the buffer, so now your file is being written and closed at some arbitrary moment in the future, instead of being guaranteed to have been sent to the kernel and closed when the function returns.<p>And it can be even worse if it's holding a limited resource, like a file descriptor or a database connection. That is, I wouldn't recommend using this trick unless you're sure that the only thing the "heavy thing" is holding is memory (and even then, keep in mind that <i>memory</i> can also be a limited resource).
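A minimal sketch of the BufWriter hazard (the filename is made up; note that BufWriter's Drop flushes but silently ignores any write error):<p><pre><code> use std::fs::File;
 use std::io::{BufWriter, Write};
 use std::thread;

 fn main() -> std::io::Result&lt;()&gt; {
     let mut writer = BufWriter::new(File::create("out.txt")?);
     writer.write_all(b"important data")?; // may still sit in the buffer

     // Dropped off-thread: the buffer is flushed and the file closed at
     // some unspecified later point, and any write error is swallowed.
     thread::spawn(move || drop(writer));

     // Safer: call writer.flush()? on this thread before handing off the drop.
     Ok(())
 }
</code></pre>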
I've seen variations on this trick multiple times. Using threads, using a message sent to self, using a list and a timer to do the work "later", using a list and waiting for idle time...<p>They all have one thing in common: papering over a bad design.<p>In the particular example given, the sub-vectors probably come from a common source. One could keep a big buffer (a single allocation) and an array of internal pointers. For an example of such a design holding a large array of text strings, see this blog entry and its associated GitHub repo:<p><pre><code> https://www.spiria.com/en/blog/desktop-software/optimizing-shared-data/
https://github.com/pierrebai/FastTextContainer
</code></pre>
Roughly it is this:<p><pre><code> struct TextHolder
 {
     const char* common_buffer;                   // single allocation holding all the text
     std::vector&lt;const char*&gt; internal_pointers;  // views into common_buffer
 };
</code></pre>
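A rough Rust analogue of the same idea (these names are mine, not from the linked repo); dropping it is two frees no matter how many strings it holds:<p><pre><code> struct TextHolder {
     common_buffer: String,                // single allocation holding all the text
     internal_spans: Vec&lt;(usize, usize)&gt;,  // (start, end) byte offsets into it
 }

 impl TextHolder {
     fn get(&self, i: usize) -> &str {
         let (start, end) = self.internal_spans[i];
         &self.common_buffer[start..end]
     }
 }
</code></pre>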
This is of course addressing the example, but the underlying message is generally applicable: change your flawed design, don't hide your flaws.
Why would you ever write a get_size function that drops the object you call it on? Surely in an actual, non-contrived use case, spawning another thread and letting the drop occur there would just be plain worse?
For those wanting a real-world example where this can be useful:<p>I am writing a static site generator. When run in "watch" mode, it deletes everything and starts over (I'd like to avoid this with partial updates, but that's not always possible). Moving that cleanup to a thread would make "watch" more responsive.
There is nothing unique to Rust about this; it is a very old technique. It is usually much inferior to the "arena allocator" method, where all the discarded allocations are coalesced and released in a single, cheap operation that could as well be done without another thread. That method is practical in many languages, Rust possibly included. C++ supports it in the Standard Library, for all the standard containers.<p>If important work must be done in the destructors, it is still better to farm the work out to a thread pool, rather than starting another thread. Again, C++ supports this in its Standard Library, as I think Rust does too.<p>One could suggest that the only reason to present the idea in Rust is the cynical one that Rust articles get free upvotes on HN.
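For the thread-pool point above, a sketch assuming the rayon crate (an ecosystem crate, not the standard library):<p><pre><code> // Drop on a shared pool instead of paying thread-spawn cost every time.
 fn get_len(things: std::collections::HashMap&lt;usize, Vec&lt;usize&gt;&gt;) -> usize {
     let len = things.len();
     rayon::spawn(move || drop(things)); // runs on rayon's global thread pool
     len
 }
</code></pre>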
Speedup numbers should be given when optimizing constant factors -- e.g. "I made this operation 5X faster using SIMD" or "By employing readahead, I sped up this file copy by 10X".<p>The points raised in this article are really different:<p>* don't do slow stuff in your latency-critical path<p>* threads are a nice way to unload slow stuff that you don't need done right away (especially if you have spare cores)<p>* dropping can be slow<p>The first and second points are good, but not really related to rust, deallocations, or the number 10000.<p>The last point is worth discussing, but still not really related to the number 10000 and barely related to rust. Rust encourages an eager deallocation strategy (kind of like C), whereas many other languages would use a more deferred strategy (like many GCs).<p>It seems like deferred (e.g. GC) would be better here, because after the main object is dropped, the GC doesn't bother to traverse all of the tiny allocations because they are all dead (unreachable by the root), and it just discards them. But that's not the full story either.<p>It's not terribly common to build up zillions of allocations and then immediately free them. What's more common is to keep the structure (and its zillions of allocations) around for a while, perhaps making small random modifications, and then eventually freeing them all at once. If using a GC, while the large structure is alive, the GC needs to scan all of those objects, causing a pause each time, which is not great. The eager strategy is also not great: it only needs to traverse the structure once (at deallocation time), but it needs to individually deallocate.<p>The answer here is to recognize that all of the objects in the structure will be deallocated together. Use a separate region/arena/heap for the entire structure, and wipe out that region/arena/heap when the structure gets dropped. You don't need to traverse anything while the structure is alive, or when it gets dropped.<p>In rust, probably the most common way to approximate this is by using slices into a larger buffer rather than separate allocations. I wish there was a little better way of doing this, though. It would be awesome if you could make new heaps specific to an object (like a hash table), then allocate the keys/values on that heap. When you drop the structure, the memory disappears without traversal.
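A sketch of the region/arena approach using the bumpalo crate (one arena option among several; an assumption on my part, not something from the article):<p><pre><code> use bumpalo::Bump;

 fn main() {
     let arena = Bump::new();

     // A million strings, all carved out of the arena's large chunks.
     let keys: Vec&lt;&str&gt; = (0..1_000_000)
         .map(|i| &*arena.alloc_str(&i.to_string()))
         .collect();
     println!("{}", keys.len());

     // Dropping `arena` releases everything in a few large frees,
     // with no traversal of the individual strings.
 }
</code></pre>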
If I seriously wanted to move object destruction off-thread, I would use at least a dedicated thread with a channel, so I could make sure the dropper is done at some point (before the program terminates, at the latest). It also avoids starting and stopping threads constantly.<p>Something like this: <a href="https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=bf35e0f30b0a730f41b215e5bff8270a" rel="nofollow">https://play.rust-lang.org/?version=stable&mode=debug&editio...</a><p>You could have an even more advanced version spawning tasks into something like rayon's thread pool, I assume.
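A rough reconstruction of that idea (the linked playground has the author's actual version):<p><pre><code> use std::any::Any;
 use std::sync::mpsc;
 use std::thread;

 fn main() {
     // One long-lived dropper thread fed by a channel.
     let (tx, rx) = mpsc::channel::&lt;Box&lt;dyn Any + Send&gt;&gt;();
     let dropper = thread::spawn(move || {
         for thing in rx {
             drop(thing); // freed here, off the hot path
         }
     });

     let heavy: Vec&lt;Vec&lt;usize&gt;&gt; = (0..1_000_000).map(|i| vec![i]).collect();
     tx.send(Box::new(heavy)).unwrap();

     drop(tx);                // closing the channel ends the loop...
     dropper.join().unwrap(); // ...so every pending drop finishes first
 }
</code></pre>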
I'm not very familiar with Rust, but I don't understand why you wouldn't just use a reference-to-HeavyThing as the function argument, so that the object isn't moved and then dropped in the `get_size` function?
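A sketch of what that looks like (assuming `HeavyThings` is the article's map type); with a borrow, nothing is dropped inside the function at all:<p><pre><code> use std::collections::HashMap;

 type HeavyThings = HashMap&lt;usize, Vec&lt;usize&gt;&gt;;

 // Borrow instead of taking ownership: the caller keeps the value.
 fn get_len(things: &HeavyThings) -> usize {
     things.len()
 }
</code></pre>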
One thing I just noticed is that the example doesn't make sure to actually run the new thread to completion before the main thread exits.<p>This means that if you do a "drop in other thread" and then main exits, the drop might never run. Which is often fine, as the exit of main causes process termination and as such will free the memory normally anyway.<p>But it would be a problem on some systems where memory cleanup on process exit is less reliable. Though such systems are rarer by now, I think.
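A minimal sketch of the fix: keep the JoinHandle and join it before exiting.<p><pre><code> use std::thread;

 fn main() {
     let heavy: Vec&lt;Vec&lt;usize&gt;&gt; = (0..1_000_000).map(|i| vec![i]).collect();

     // Keep the handle so the drop is guaranteed to finish before exit.
     let handle = thread::spawn(move || drop(heavy));

     // ... latency-sensitive work happens here ...

     handle.join().unwrap();
 }
</code></pre>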
It'd be interesting to implement this on a type that would defer all of these drop threads (or one big drop thread built off a bunch of futures) until the end of some major action, like sending the http response on an actix-web thread. Could be a great way to get the fastest possible response time, since then the client has their response before any delay on cleanup.
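A hypothetical sketch of such a type (all names invented here, not from actix-web):<p><pre><code> use std::any::Any;
 use std::thread;

 // Collect things to drop during a request; flush after the response is sent.
 struct DropLater(Vec&lt;Box&lt;dyn Any + Send&gt;&gt;);

 impl DropLater {
     fn new() -> Self {
         DropLater(Vec::new())
     }
     fn defer&lt;T: Send + 'static&gt;(&mut self, thing: T) {
         self.0.push(Box::new(thing));
     }
     fn flush(self) {
         // One thread for the whole batch, after the client has its response.
         thread::spawn(move || drop(self.0));
     }
 }
</code></pre>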
Looks like Evan Wallace ran into the same issue in practice in esbuild<p><a href="https://news.ycombinator.com/item?id=22336284" rel="nofollow">https://news.ycombinator.com/item?id=22336284</a><p><i>I actually originally wrote esbuild in Rust and Go, and Go was the clear winner.</i><p><i>The parser written in Go was both faster to compile and faster to execute than the parser in Rust. The Go version compiled something like 100x faster than Rust and ran at something around 10% faster (I forget the exact numbers, sorry). Based on a profile, it looked like the Go version was faster because GC happened on another thread while Rust had to run destructors on the same thread.</i><p>ESBuild is a really impressive performance-oriented project:<p><a href="https://github.com/evanw/esbuild" rel="nofollow">https://github.com/evanw/esbuild</a><p><i>The Rust version also had other problems. Many places in my code had switch statements that branched over all AST nodes and in Rust that compiles to code which uses stack space proportional to the total stack space used by all branches instead of just the maximum stack space used by any one branch</i>: <a href="https://github.com/rust-lang/rust/issues/34283" rel="nofollow">https://github.com/rust-lang/rust/issues/34283</a><p>(copy of lobste.rs comment)
I used to do this sometimes in C++ when I realized that clearing out a vector with lots of objects was slow. Is Rust basically based on unique_ptr? One problem with this approach was that you still had to wait for these threads when the application shut down.
There's a worse case in deallocation. In a long-running program, tracing through a data structure being released can cause page faults, because the untouched data has been swapped out. This is part of why some programs take far too long to exit.
Completely different dynamic (since Rust has no GC), but this reminds me of how Twitch made their server, written in Go, a lot faster by allocating a bunch of dummy memory at startup so the garbage collector doesn't trigger nearly as often:<p><a href="https://news.ycombinator.com/item?id=21670110" rel="nofollow">https://news.ycombinator.com/item?id=21670110</a>
Contrived examples like this are ridiculous. Creating such a heavy thing is likely even more expensive than tearing it down. So unless you create it on a separate thread, you probably shouldn't be freeing it on a separate one. It's not going to solve your interactivity problem. If you are creating the object on a separate thread then it's already going to be natural to free it on a separate one too.
I guess choosing when or how a program deallocates is important in a language that's close to the metal.<p>Rust tries to be zero-cost while providing abstractions that make it seem like a high-level language, but ultimately things like this show that it's not exactly zero-cost, because abstractions can incur hidden penalties. There needs to be some syntax that allows a Rust user to explicitly control deallocation when needed.<p>If I started reading code where people would randomly move a value into another thread and essentially do nothing, I would be extremely confused. Any language that begins to rely on "tricks" or "hacks" as standard patterns exposes a design flaw.<p>Maybe Rust could provide special syntax that a function can be decorated with so that it does deallocation on another thread automatically? Or maybe a built-in function called drop_async...? This would make this pattern an explicit part of the language rather than a strange hack/trick.
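A sketch of that hypothetical drop_async (nothing like this exists in the standard library; the name comes from the suggestion above):<p><pre><code> use std::thread;

 // Hypothetical helper: makes the off-thread-drop pattern explicit at call sites.
 fn drop_async&lt;T: Send + 'static&gt;(thing: T) {
     thread::spawn(move || drop(thing));
 }
</code></pre>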
Hmm, my first thought is that having to do this is a lot like C and cleaning up my own allocations. This feels like something Rust should automatically do for me?
In other words, Rust's automagical memory deallocation is NOT a zero-cost abstraction:<p><pre><code> fn get_len1(things: HeavyThings) -> usize {
     things.len()
 } // `things` is dropped here, on the calling thread

 fn get_len2(things: HeavyThings) -> usize {
     let len = things.len();
     thread::spawn(move || drop(things)); // dropped on the new thread instead
     len
 }
</code></pre>
The OP shows an example in which a function like get_len2 is 10000x faster than a function like get_len1 for a hashmap with 1M keys.<p>See also this comment by chowells: <a href="https://news.ycombinator.com/item?id=23362925" rel="nofollow">https://news.ycombinator.com/item?id=23362925</a>
Why should I ever need to drop a heavy object just to get its size? Not in C++, and not in Rust either; the different-thread idea is just creative stupidity, sorry
It seems that this would be a great reason to not pass the entire heavy object through your function, and to instead pass it as a reference. When passing an object (rather than a reference to an object) there's a lot more work going on both in function setup, and in object dropping. I'm not a rust guru, so I don't know the precise wording, but it's simple enough to realize that if this function, as claimed, must drop all the sub-objects within the `HeavyObject` type, then those objects must have been copied from the original object.<p>If you instead define the function to take in a reference (by adding just two `&` characters into your program), the single-threaded case is now almost 100x faster than the multithreaded case.<p>Here's a link to a Rust Playground with just those two characters changed: <a href="https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=9bbe169c77cd31b6cd25b74c24635e12" rel="nofollow">https://play.rust-lang.org/?version=stable&mode=debug&editio...</a><p>Note that the code that drops the data in a separate thread is not timing the amount of time your CPU is spinning, dropping the data. So while this does decrease the latency of the original thread, the best solution is to avoid copying and then freeing large, complex objects as much as possible. While it is of course necessary to do this sometimes, this particular example is just not one of them. :)<p>As an aside, I'm somewhat surprised that the Rust compiler isn't inlining and eliminating all the copying and dropping; this would seem to be a classic case where compiler analysis should be able to determine that `a.size()` should be computable without copying `a`, and it should be able to eliminate the function call cost as well. Manually doing this gives the exact same timing as my gist above, so I assume that this is happening when passing a reference, but not happening when passing the object itself.