Some important things I think people should note before blindly commenting:<p>* The example code is obviously contrived. The real gist is that massive deallocations on the UI thread cause lag, which the example code demonstrates. That very thing can easily happen in the real world.<p>* I didn't see any difference on my machine between a debug build and a release build.<p>* The example is performing 1 _million_ deallocations. That's why it's so pathological. It's not just a "large" vector. It's a vector of 1 million vectors. While that may seem contrived, consider a vector of 1 million strings, something that's not too uncommon, and which would likely suffer the same performance penalty.<p>* Rust is not copying anything, nor duplicating the structures here. In the example code the structures are moved, not copied, which costs next to nothing. The deallocation is taking up 99% of the time.<p>* As an aside, compilers have used the trick of not freeing data structures before, because it provides a significant performance boost. Instead of calling free on all those billions of tiny data structures a compiler generates during its lifetime, they just let them leak. Since a compiler is short-lived it's not a problem, they get a free lunch (no pun intended), and the OS takes care of cleaning up after all is said and done. My point is that this post isn't theoretical; we do deallocation trickery in the real world.
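To see the pathology in isolation, here's a minimal sketch that times only the deallocation of a vector of 1 million strings (numbers will vary by machine and allocator):<p><pre><code> use std::time::Instant;

 fn main() {
     // 1 million small heap allocations, as in a vector of strings.
     let v: Vec&lt;String&gt; = (0..1_000_000).map(|i| i.to_string()).collect();

     let t = Instant::now();
     drop(v); // 1 million individual frees happen here
     println!("drop took {:?}", t.elapsed());
 }
</code></pre>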
This is the standard problem with tracing data structures to free them. You frequently run into it with systems based on malloc/free or reference counting. The underlying problem is that freeing the structure takes time proportional to the number of pointers in the structure that it has to chase.<p>Generational/compacting GC has the opposite problem. Garbage collection takes time proportional to the live set, and the amount of memory collected is unimportant.<p>There's actually a lot to be said for Rust in that the ownership system lets you transfer freeing responsibility off-thread safely and cheaply, so it doesn't block the critical path.<p>But overall, there's nothing really unexpected here if you're familiar with memory management.
The title is slightly wrong: it's not going to make your code <i>faster</i>, it's going to reduce <i>latency</i> on the given thread.<p>It may be a net win if this is the UI thread of a desktop app, but overall it will come at a performance cost: modern allocators have thread-local memory pools, and now you're moving away from them. And if you're running your code on a NUMA system (most servers nowadays), when moving from one thread to another you can end up freeing non-local memory instead of local memory. Also, you won't have any backpressure on your allocations, so you risk running out of memory (especially because your deallocations now occur more slowly than they should).<p>Main takeaway: if you use it blindly it's an anti-pattern, but it can be a good idea in its niche: the UI thread of a GUI.
Just be careful, because moving heavy things to be dropped on another thread can change the <i>semantics</i> of the program. For instance, consider what happens if that heavy thing contained a BufWriter: unless its buffer is empty, dropping it writes the buffer, so now your file is being written and closed at some arbitrary moment in the future, instead of being guaranteed to have been sent to the kernel and closed when the function returns.<p>And it can be even worse if it's holding a limited resource, like a file descriptor or a database connection. That is, I wouldn't recommend using this trick unless you're sure that the only thing the "heavy thing" is holding is memory (and even then, keep in mind that <i>memory</i> can also be a limited resource).
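A minimal sketch of the BufWriter hazard (the filename is made up; note that BufWriter's Drop flushes but silently ignores any write error):<p><pre><code> use std::fs::File;
 use std::io::{BufWriter, Write};
 use std::thread;

 fn main() -> std::io::Result&lt;()&gt; {
     let mut writer = BufWriter::new(File::create("out.txt")?);
     writer.write_all(b"important data")?; // may still sit in the buffer

     // Dropped off-thread: the buffer is flushed and the file closed at
     // some unspecified later point, and any write error is swallowed.
     thread::spawn(move || drop(writer));

     // Safer: call writer.flush()? on this thread before handing off the drop.
     Ok(())
 }
</code></pre>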
I've seen variations on this trick multiple times. Using threads, using a message sent to self, using a list and a timer to do the work "later", using a list and waiting for idle time...<p>They all have one thing in common: papering over a bad design.<p>In the particular example given, the sub-vectors probably come from a common source. One could keep a big buffer (a single allocation) and an array of internal pointers. For an example of such a design holding a large array of text strings, see this blog entry and its associated GitHub repo:<p><pre><code> https://www.spiria.com/en/blog/desktop-software/optimizing-shared-data/
https://github.com/pierrebai/FastTextContainer
</code></pre>
Roughly it is this:<p><pre><code> struct TextHolder
 {
     const char* common_buffer;                   // single allocation holding all the text
     std::vector&lt;const char*&gt; internal_pointers;  // views into common_buffer
 };
</code></pre>
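A rough Rust analogue of the same idea (these names are mine, not from the linked repo); dropping it is two frees no matter how many strings it holds:<p><pre><code> struct TextHolder {
     common_buffer: String,                // single allocation holding all the text
     internal_spans: Vec&lt;(usize, usize)&gt;,  // (start, end) byte offsets into it
 }

 impl TextHolder {
     fn get(&self, i: usize) -> &str {
         let (start, end) = self.internal_spans[i];
         &self.common_buffer[start..end]
     }
 }
</code></pre>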
This is of course addressing the example, but the underlying message is generally applicable: change your flawed design, don't hide your flaws.
Why would you ever write a get_size function that drops the object you call it on? Surely in an actual, non-contrived use case, spawning another thread and letting the drop occur there would just be plain worse?
For those wanting a real-world example where this can be useful:<p>I am writing a static site generator. When run in "watch" mode, it deletes everything and starts over (I'd like to avoid this with partial updates, but that's not always possible). Moving that cleanup to a thread would make "watch" more responsive.
There is nothing unique to Rust about this; it is a very old technique. It is usually much inferior to the "arena allocator" method, where all the discarded allocations are coalesced and released in a single, cheap operation that could as well be done without another thread. That method is practical in many languages, Rust possibly included. C++ supports it in the Standard Library, for all the standard containers.<p>If important work must be done in the destructors, it is still better to farm the work out to a thread pool, rather than starting another thread. Again, C++ supports this in its Standard Library, as I think Rust does too.<p>One could suggest that the only reason to present the idea in Rust is the cynical one that Rust articles get free upvotes on HN.
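For the thread-pool point above, a sketch assuming the rayon crate (an ecosystem crate, not the standard library):<p><pre><code> // Drop on a shared pool instead of paying thread-spawn cost every time.
 fn get_len(things: std::collections::HashMap&lt;usize, Vec&lt;usize&gt;&gt;) -> usize {
     let len = things.len();
     rayon::spawn(move || drop(things)); // runs on rayon's global thread pool
     len
 }
</code></pre>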
Speedup numbers should be given when optimizing constant factors -- e.g. "I made this operation 5X faster using SIMD" or "By employing readahead, I sped up this file copy by 10X".<p>The points raised in this article are really different:<p>* don't do slow stuff in your latency-critical path<p>* threads are a nice way to unload slow stuff that you don't need done right away (especially if you have spare cores)<p>* dropping can be slow<p>The first and second points are good, but not really related to rust, deallocations, or the number 10000.<p>The last point is worth discussing, but still not really related to the number 10000 and barely related to rust. Rust encourages an eager deallocation strategy (kind of like C), whereas many other languages would use a more deferred strategy (like many GCs).<p>It seems like deferred (e.g. GC) would be better here, because after the main object is dropped, the GC doesn't bother to traverse all of the tiny allocations because they are all dead (unreachable by the root), and it just discards them. But that's not the full story either.<p>It's not terribly common to build up zillions of allocations and then immediately free them. What's more common is to keep the structure (and its zillions of allocations) around for a while, perhaps making small random modifications, and then eventually freeing them all at once. If using a GC, while the large structure is alive, the GC needs to scan all of those objects, causing a pause each time, which is not great. The eager strategy is also not great: it only needs to traverse the structure once (at deallocation time), but it needs to individually deallocate.<p>The answer here is to recognize that all of the objects in the structure will be deallocated together. Use a separate region/arena/heap for the entire structure, and wipe out that region/arena/heap when the structure gets dropped. You don't need to traverse anything while the structure is alive, or when it gets dropped.<p>In rust, probably the most common way to approximate this is by using slices into a larger buffer rather than separate allocations. I wish there was a little better way of doing this, though. It would be awesome if you could make new heaps specific to an object (like a hash table), then allocate the keys/values on that heap. When you drop the structure, the memory disappears without traversal.
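A sketch of the region/arena approach using the bumpalo crate (one arena option among several; an assumption on my part, not something from the article):<p><pre><code> use bumpalo::Bump;

 fn main() {
     let arena = Bump::new();

     // A million strings, all carved out of the arena's large chunks.
     let keys: Vec&lt;&str&gt; = (0..1_000_000)
         .map(|i| &*arena.alloc_str(&i.to_string()))
         .collect();
     println!("{}", keys.len());

     // Dropping `arena` releases everything in a few large frees,
     // with no traversal of the individual strings.
 }
</code></pre>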
If I seriously wanted to move object destruction off-thread, I would use at least a dedicated thread with a channel, so I could make sure the dropper is done at some point (before the program terminates, at the latest). It also avoids starting and stopping threads constantly.<p>Something like this: <a href="https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=bf35e0f30b0a730f41b215e5bff8270a" rel="nofollow">https://play.rust-lang.org/?version=stable&mode=debug&editio...</a><p>You could have an even more advanced version spawning tasks into something like rayon's thread pool, I assume.
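A rough reconstruction of that idea (the linked playground has the author's actual version):<p><pre><code> use std::any::Any;
 use std::sync::mpsc;
 use std::thread;

 fn main() {
     // One long-lived dropper thread fed by a channel.
     let (tx, rx) = mpsc::channel::&lt;Box&lt;dyn Any + Send&gt;&gt;();
     let dropper = thread::spawn(move || {
         for thing in rx {
             drop(thing); // freed here, off the hot path
         }
     });

     let heavy: Vec&lt;Vec&lt;usize&gt;&gt; = (0..1_000_000).map(|i| vec![i]).collect();
     tx.send(Box::new(heavy)).unwrap();

     drop(tx);                // closing the channel ends the loop...
     dropper.join().unwrap(); // ...so every pending drop finishes first
 }
</code></pre>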
I'm not very familiar with Rust, but I don't understand why you wouldn't just use a reference-to-HeavyThing as the function argument, so that the object isn't moved and then dropped in the `get_size` function?
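A sketch of what that looks like (assuming `HeavyThings` is the article's map type); with a borrow, nothing is dropped inside the function at all:<p><pre><code> use std::collections::HashMap;

 type HeavyThings = HashMap&lt;usize, Vec&lt;usize&gt;&gt;;

 // Borrow instead of taking ownership: the caller keeps the value.
 fn get_len(things: &HeavyThings) -> usize {
     things.len()
 }
</code></pre>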
One thing I just noticed is that the example doesn't make sure to actually run the new thread to completion before the main thread exits.<p>This means that if you do a "drop in other thread" and then main exits, the drop might never run. Which is often fine, as the exit of main causes process termination and as such will free the memory normally anyway.<p>But it would be a problem on some systems where memory cleanup on process exit is less reliable. Though such systems are rarer by now, I think.
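A minimal sketch of the fix: keep the JoinHandle and join it before exiting.<p><pre><code> use std::thread;

 fn main() {
     let heavy: Vec&lt;Vec&lt;usize&gt;&gt; = (0..1_000_000).map(|i| vec![i]).collect();

     // Keep the handle so the drop is guaranteed to finish before exit.
     let handle = thread::spawn(move || drop(heavy));

     // ... latency-sensitive work happens here ...

     handle.join().unwrap();
 }
</code></pre>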
It'd be interesting to implement this on a type that would defer all of these drop threads (or one big drop thread built off a bunch of futures) until the end of some major action, like sending the http response on an actix-web thread. Could be a great way to get the fastest possible response time, since then the client has their response before any delay on cleanup.
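A hypothetical sketch of such a type (all names invented here, not from actix-web):<p><pre><code> use std::any::Any;
 use std::thread;

 // Collect things to drop during a request; flush after the response is sent.
 struct DropLater(Vec&lt;Box&lt;dyn Any + Send&gt;&gt;);

 impl DropLater {
     fn new() -> Self {
         DropLater(Vec::new())
     }
     fn defer&lt;T: Send + 'static&gt;(&mut self, thing: T) {
         self.0.push(Box::new(thing));
     }
     fn flush(self) {
         // One thread for the whole batch, after the client has its response.
         thread::spawn(move || drop(self.0));
     }
 }
</code></pre>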
Looks like Evan Wallace ran into the same issue in practice in esbuild<p><a href="https://news.ycombinator.com/item?id=22336284" rel="nofollow">https://news.ycombinator.com/item?id=22336284</a><p><i>I actually originally wrote esbuild in Rust and Go, and Go was the clear winner.</i><p><i>The parser written in Go was both faster to compile and faster to execute than the parser in Rust. The Go version compiled something like 100x faster than Rust and ran at something around 10% faster (I forget the exact numbers, sorry). Based on a profile, it looked like the Go version was faster because GC happened on another thread while Rust had to run destructors on the same thread.</i><p>ESBuild is a really impressive performance-oriented project:<p><a href="https://github.com/evanw/esbuild" rel="nofollow">https://github.com/evanw/esbuild</a><p><i>The Rust version also had other problems. Many places in my code had switch statements that branched over all AST nodes and in Rust that compiles to code which uses stack space proportional to the total stack space used by all branches instead of just the maximum stack space used by any one branch</i>: <a href="https://github.com/rust-lang/rust/issues/34283" rel="nofollow">https://github.com/rust-lang/rust/issues/34283</a><p>(copy of lobste.rs comment)
I used to do this sometimes in C++ when I realized that clearing out a vector with lots of objects was slow. Is Rust basically based on unique_ptr? One problem with this approach was that you still had to wait for these threads when the application shut down.
There's a worse case in deallocation. In a long-running program, tracing through a data structure being released can cause page faults, because the untouched data has been swapped out. This is part of why some programs take far too long to exit.
Completely different dynamic (since Rust has no GC), but this reminds me of how Twitch made their server, written in Go, a lot faster by allocating a bunch of dummy memory at startup so the garbage collector doesn't trigger nearly as often:<p><a href="https://news.ycombinator.com/item?id=21670110" rel="nofollow">https://news.ycombinator.com/item?id=21670110</a>
Contrived examples like this are ridiculous. Creating such a heavy thing is likely even more expensive than tearing it down. So unless you create it on a separate thread, you probably shouldn't be freeing it on a separate one. It's not going to solve your interactivity problem. If you are creating the object on a separate thread then it's already going to be natural to free it on a separate one too.
I guess choosing when or how a program deallocates is important in a language that's close to the metal.<p>Rust tries to be zero-cost while providing abstractions that make it seem like a high-level language, but ultimately things like this show that it's not exactly zero-cost, because abstractions can incur hidden penalties. There needs to be some syntax that allows a Rust user to explicitly control deallocation when needed.<p>If I started reading code where people would randomly move a value into another thread and essentially do nothing, I would be extremely confused. Any language that begins to rely on "tricks" or "hacks" as standard patterns exposes a design flaw.<p>Maybe Rust could provide special syntax that a function can be decorated with so that it does deallocation on another thread automatically? Or maybe a built-in function called drop_async...? This would make this pattern an explicit part of the language rather than a strange hack/trick.
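A sketch of that hypothetical drop_async (nothing like this exists in the standard library; the name comes from the suggestion above):<p><pre><code> use std::thread;

 // Hypothetical helper: makes the off-thread-drop pattern explicit at call sites.
 fn drop_async&lt;T: Send + 'static&gt;(thing: T) {
     thread::spawn(move || drop(thing));
 }
</code></pre>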
Hmm, my first thought is that having to do this is a lot like C and cleaning up my own allocations. This feels like something Rust should automatically do for me?
In other words, Rust's automagical memory deallocation is NOT a zero-cost abstraction:<p><pre><code> fn get_len1(things: HeavyThings) -> usize {
     things.len()
 } // `things` is dropped here, on the calling thread

 fn get_len2(things: HeavyThings) -> usize {
     let len = things.len();
     thread::spawn(move || drop(things)); // dropped on the new thread instead
     len
 }
</code></pre>
The OP shows an example in which a function like get_len2 is 10000x faster than a function like get_len1 for a hashmap with 1M keys.<p>See also this comment by chowells: <a href="https://news.ycombinator.com/item?id=23362925" rel="nofollow">https://news.ycombinator.com/item?id=23362925</a>
Why should I ever need to drop a heavy object just to get its size? Not in C++, and not in Rust either; the different-thread idea is just creative stupidity, sorry
It seems that this would be a great reason to not pass the entire heavy object through your function, and to instead pass it as a reference. When passing an object (rather than a reference to an object) there's a lot more work going on both in function setup, and in object dropping. I'm not a rust guru, so I don't know the precise wording, but it's simple enough to realize that if this function, as claimed, must drop all the sub-objects within the `HeavyObject` type, then those objects must have been copied from the original object.<p>If you instead define the function to take in a reference (by adding just two `&` characters into your program), the single-threaded case is now almost 100x faster than the multithreaded case.<p>Here's a link to a Rust Playground with just those two characters changed: <a href="https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=9bbe169c77cd31b6cd25b74c24635e12" rel="nofollow">https://play.rust-lang.org/?version=stable&mode=debug&editio...</a><p>Note that the code that drops the data in a separate thread is not timing the amount of time your CPU is spinning, dropping the data. So while this does decrease the latency of the original thread, the best solution is to avoid copying and then freeing large, complex objects as much as possible. While it is of course necessary to do this sometimes, this particular example is just not one of them. :)<p>As an aside, I'm somewhat surprised that the Rust compiler isn't inlining and eliminating all the copying and dropping; this would seem to be a classic case where compiler analysis should be able to determine that `a.size()` should be computable without copying `a`, and it should be able to eliminate the function call cost as well. Manually doing this gives the exact same timing as my gist above, so I assume that this is happening when passing a reference, but not happening when passing the object itself.