Should small Rust structs be passed by-copy or by-borrow? (2019)

264 points by aloukissas over 2 years ago

40 comments

bjackman over 2 years ago

A potential lesson here (i.e. I am applying confirmation bias to retroactively view this article as justification for a strongly held opinion, lol):

Unless you are gonna benchmark something, for details like this you should pretty much always just trust the damn compiler and write the code in the most maintainable way.

This comes up in code review a LOT at my work:

- "you can write this simpler with XYZ"
- "but that will be slower because it's a copy/a function call/an indirect branch/a channel send/a shared memory access/some other combination of assumptions about what the compiler will generate and what is slow on a CPU"

I always ask them to either prove it or write the simple thing. If the code in question isn't hot enough to bother benchmarking it, the performance benefits probably aren't worth it _even if they exist_.

celeritascelery over 2 years ago

I don’t feel like this gave a satisfactory answer to the question. Since everything was inlined, the argument passing convention made no difference in the micro benchmarks. But what happens when it does not inline? Then you would actually be testing by-borrow vs. by-copy instead of how good Rust is at optimizing.
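
A minimal sketch of how the non-inlined case could be set up, assuming a three-float `Vec3` like the one in the article. The function names `dot_by_copy` and `dot_by_borrow` are hypothetical, and `#[inline(never)]` is a strong hint to the compiler rather than a hard guarantee.

```rust
use std::hint::black_box;

#[derive(Clone, Copy, Debug, PartialEq)]
struct Vec3 {
    x: f32,
    y: f32,
    z: f32,
}

// `#[inline(never)]` asks the compiler not to inline these calls,
// so the argument-passing convention is actually exercised.
#[inline(never)]
fn dot_by_copy(a: Vec3, b: Vec3) -> f32 {
    a.x * b.x + a.y * b.y + a.z * b.z
}

#[inline(never)]
fn dot_by_borrow(a: &Vec3, b: &Vec3) -> f32 {
    a.x * b.x + a.y * b.y + a.z * b.z
}

fn main() {
    let a = Vec3 { x: 1.0, y: 2.0, z: 3.0 };
    let b = Vec3 { x: 4.0, y: 5.0, z: 6.0 };

    // `black_box` hides the arguments from the optimizer so the
    // calls are less likely to be constant-folded away.
    let by_copy = dot_by_copy(black_box(a), black_box(b));
    let by_borrow = dot_by_borrow(black_box(&a), black_box(&b));

    println!("by-copy: {by_copy}, by-borrow: {by_borrow}");
}
```

Timing these two functions in a loop would then compare the calling conventions themselves, though the result would still depend on the target and compiler version.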

dwheeler over 2 years ago

This is one advantage of Ada, where parameters are abstractly declared as "in" or "in out" or "out". The compiler can then decide how to best implement it for that specific size and architecture.

mcguire over 2 years ago

This is one of those questions where you really, honestly, do need to look at a very low level.

Back in the ancient days, I worked at IBM doing benchmarking for an OS project that was never released. We were using PPC601 Sandalfoots (Sandalfeet?) as dev machines. A perennial fight was devs writing their own memcpy using *dst++ = *src++ loops rather than the one in the library, which was written by one of my coworkers and consisted of 3 pages of assembly that used at least 18 registers.

The simple loop was something like X cycles/byte, while the library version was P + (Q cycles/byte), but the difference was such that the crossover point was about 8 bytes. So, scraping out the simple memcpy implementations from the code was about a weekly thing for me.

At this point, we discovered that our C compiler would pass structs by value (this was the early-ish days of ANSI C and was a surprise to some of my older coworkers) and benchmarked that.

And discovered that its copy code was worse than the simple *dst++ = *src++ loops. By about a factor of 4. (The simple loop would be optimized to work with word-sized ints, while the compiler was generating code that copied each byte individually.)

If you are doing something where this matters, something like VTune is very important. So is the ability to convince people who do stupid things to stop doing the stupid things.

lukaszwojtow over 2 years ago

I always prefer by-borrow. That's because in the future this struct may become non-copy and that means some unnecessary refactoring. My thinking is a bit like "don't take ownership if not needed" - the "not needed" part is the most important thing. Don't require things that are not needed.
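
To illustrate the refactoring concern, here is a small sketch with a hypothetical `Particle` struct that cannot be `Copy` because it holds a `String`. The `height` function is also made up for the example; the point is only that a by-borrow signature keeps compiling whether or not the struct is `Copy`.

```rust
// Hypothetical struct: it has no `Copy` derive, e.g. because it
// later gained a heap-allocated `String` field.
#[derive(Debug, Clone)]
struct Particle {
    position: [f32; 3],
    label: String,
}

// Borrowing only requires read access, so this signature works
// regardless of whether `Particle` implements `Copy`.
fn height(p: &Particle) -> f32 {
    p.position[1]
}

fn main() {
    let p = Particle {
        position: [0.0, 2.5, 0.0],
        label: String::from("a"),
    };

    let h1 = height(&p);
    let h2 = height(&p); // `p` is still usable: nothing was moved

    println!("{} {} {}", p.label, h1, h2);
}
```

Had `height` taken `Particle` by value, the second call would fail to compile once the `Copy` derive was removed.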

arcticbull over 2 years ago

> Blech! Having to explicitly borrow temporary values is super gross.

I don’t think you ever have to write code like this. Implement your math traits in terms of both value and reference types like the standard library does.

Go down to Trait Implementations for scalar types, for instance i32 [1]:

  impl Add<&i32> for &i32
  impl Add<&i32> for i32
  impl Add<i32> for &i32
  impl Add<i32> for i32

Once you do that your ergonomics should be exactly the same as with built in scalar types.

[1] https://doc.rust-lang.org/std/primitive.i32.html
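
As a sketch of what that could look like for a user-defined type, assume a three-float `Vec3` similar to the one in the article. The reference variants here simply forward to the by-value implementation; this is one possible arrangement, not necessarily how the standard library writes its own impls.

```rust
use std::ops::Add;

#[derive(Clone, Copy, Debug, PartialEq)]
struct Vec3 {
    x: f32,
    y: f32,
    z: f32,
}

// Value + value: the one "real" implementation.
impl Add<Vec3> for Vec3 {
    type Output = Vec3;
    fn add(self, rhs: Vec3) -> Vec3 {
        Vec3 { x: self.x + rhs.x, y: self.y + rhs.y, z: self.z + rhs.z }
    }
}

// The three reference variants dereference (a cheap copy, since
// `Vec3` is `Copy`) and forward to the by-value implementation.
impl Add<&Vec3> for Vec3 {
    type Output = Vec3;
    fn add(self, rhs: &Vec3) -> Vec3 {
        self + *rhs
    }
}

impl Add<Vec3> for &Vec3 {
    type Output = Vec3;
    fn add(self, rhs: Vec3) -> Vec3 {
        *self + rhs
    }
}

impl Add<&Vec3> for &Vec3 {
    type Output = Vec3;
    fn add(self, rhs: &Vec3) -> Vec3 {
        *self + *rhs
    }
}

fn main() {
    let a = Vec3 { x: 1.0, y: 2.0, z: 3.0 };
    let b = Vec3 { x: 10.0, y: 20.0, z: 30.0 };
    let r = &a;

    // All four operand combinations now compile.
    let sums = [a + b, a + &b, r + b, r + &b];
    println!("{:?}", sums);
}
```

In practice the repetition is often generated with a small macro, and the same pattern would apply to `Sub` and `Mul`.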

forrestthewoods over 2 years ago

Oh neat, that’s my blog. My old posts don’t resurface on HN that often.

Lots of criticism of my methodology in the comments here. That’s fine. That post was more of a self nerd snipe that went way deeper than I expected.

I hoped that my post would lead to a more definitive answer from some actual experts in the field. Unfortunately that never happened, afaik. Bummer.

ergonaught over 2 years ago

It's compiled, so, without any investigation at all, I would have been disappointed if there were any significant difference in the code emitted in these cases. I would expect the compiler to do the efficient thing based on usage rather than the particular syntax. I may have too much faith in the compiler.

spuz over 2 years ago

I'd be interested to know what the benchmarks of the two Rust solutions are when inlining is disabled, so we can get an idea of the different performance characteristics of each function call even if it's not a very realistic scenario.

The other question I have is which style should you use when writing a library? It's obviously not possible to benchmark all the software that will call your library, but you still want to consider readability and performance as well as other factors such as common convention.

ptero over 2 years ago

I would go with the version that gives the clean user interface (that is, by copy in this case). If it turns out that the other version is significantly more performant and this additional performance is critical for the end users, consider adding the by-borrow option.

The clarity of the code using a particular library is such a big (but often under-appreciated) benefit that I would heavily lean in this direction when considering interface options. My 2c.

Rustwerks over 2 years ago

I just went through all of this when building a raytracer.

* Sprinkling & around everything in math expressions does make them ugly. Maybe Rust needs an asBorrow or similar?
* If you inline everything then the speed is the same.
* Link time optimizations are also an easy win.

https://github.com/mcallahan/lightray

zamalek over 2 years ago

The benchmarks lack the standard deviation, so the results may well be equivalent. Don't roll your own micro-benchmark runners.

References may get optimized to copies where possible and sound (i.e. blittable and const); a common heuristic involves the size of a cache line (64b on most modern ISAs, including x86_64).

Using a Vector4 would have pushed the structure size beyond the 64b heuristic. You would also need to disable inlining for the measured methods.
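
For reference, the statistic being asked for is easy to state. The sketch below computes the mean and sample standard deviation of repeated timings; `mean_and_std_dev` and the placeholder workload are hypothetical, and an established benchmarking harness would report this (and more) without hand-rolled code.

```rust
use std::hint::black_box;
use std::time::Instant;

/// Returns (mean, sample standard deviation) of `samples`.
/// Assumes `samples` is non-empty; with a single sample the
/// deviation is reported as 0.0.
fn mean_and_std_dev(samples: &[f64]) -> (f64, f64) {
    let n = samples.len() as f64;
    let mean = samples.iter().sum::<f64>() / n;
    if samples.len() < 2 {
        return (mean, 0.0);
    }
    // Sample variance uses the n - 1 (Bessel-corrected) denominator.
    let variance = samples
        .iter()
        .map(|s| (s - mean).powi(2))
        .sum::<f64>()
        / (n - 1.0);
    (mean, variance.sqrt())
}

fn main() {
    let mut samples_ms = Vec::new();
    for _ in 0..10 {
        let start = Instant::now();
        // Placeholder workload; substitute the code under test.
        let total: u64 = (0..1_000_000u64).map(black_box).sum();
        black_box(total);
        samples_ms.push(start.elapsed().as_secs_f64() * 1e3);
    }

    let (mean, sd) = mean_and_std_dev(&samples_ms);
    println!("mean = {mean:.3} ms, std dev = {sd:.3} ms");
}
```

If the difference between two variants is smaller than the spread of either one, the measurements alone do not show that one is faster.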

kibwen over 2 years ago

Note that this is from 2019, so it's probably worth re-benchmarking to see if anything has changed in the interim. Can we get the year added to the title?

eloff over 2 years ago

For this code, the compiler inlined the call. So there should be no difference between pass by copy or pass by reference, which is what was measured. Where it could matter is when the code isn’t inlined. But with small structs it might not matter all that much.

It does sometimes matter though. One optimization I’ve seen in a few places is to box the error type, so that a result doesn’t copy the (usually empty) error by value on the stack. That actually makes a small performance difference, on the order of about 5-10%.
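
A small sketch of the boxed-error idea, using a hypothetical `BigError` type and two made-up parse functions. It only shows the size difference of the `Result`; whether that translates into a measurable speedup would depend on the program.

```rust
use std::mem::size_of;

// Hypothetical large error type, for illustration only.
#[allow(dead_code)]
#[derive(Debug)]
struct BigError {
    code: u32,
    context: [u8; 120],
}

// The error is stored inline, so every `Result` is as large as `BigError`.
fn parse_unboxed(input: &str) -> Result<u32, BigError> {
    input
        .parse::<u32>()
        .map_err(|_| BigError { code: 1, context: [0; 120] })
}

// The error lives on the heap, so the `Result` only holds a pointer.
// The allocation is paid only on the (hopefully rare) error path.
fn parse_boxed(input: &str) -> Result<u32, Box<BigError>> {
    input
        .parse::<u32>()
        .map_err(|_| Box::new(BigError { code: 1, context: [0; 120] }))
}

fn main() {
    println!("unboxed: {} bytes", size_of::<Result<u32, BigError>>());
    println!("boxed:   {} bytes", size_of::<Result<u32, Box<BigError>>>());
    println!("{:?} {:?}", parse_unboxed("42").is_ok(), parse_boxed("x").is_err());
}
```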

BooneJS over 2 years ago

Folks, processors continue to give smaller and smaller gains every year. Something has to give. If you have critical path code that absolutely must max out the core, then this type of analysis (as pedantic as it is) is useful in the long run.

cryptonector over 2 years ago

It's not like you can do arithmetic with references, so maybe the ergonomics of by-value vs. by-reference shouldn't really be that different.

The cost of by-value lies in memory copies, while the cost of by-reference lies in dereferencing pointers where the values are needed, which might mean many more memory reads are needed than with by-value (depends on what you're doing). So it's just hard to tell which will do better in general -- there's no answer to that.

For a library, maybe providing by-value and by-reference interfaces should be good (except that will bloat the library). For everything else just use by-value as it has the best ergonomics.

FpUser over 2 years ago

I did the test on my computer:

Rust - By-Copy: 14124, By-Borrow: 8150
C++ - By-Copy: 12160, By-Ref: 11423

P.S. Just built it using LLVM under CLion IDE and the results are:

  G:\temp\cpp\rust-cpp-bench\cpp\cmake\cmake-build-release\fts_cmake_cpp_bench.exe
  Totals:
    Overlaps: 220384338
    By-Copy: 4397
    By-Ref: 4396
    Delta: -0.0227428%
  Process finished with exit code 0

francasso over 2 years ago

This actually bothers me. I think the rust performance here is praise worthy. What bothers me is that we piled complexity over complexity at the hardware and compiler levels, and ended up in a situation where you got no way to get a reasonable understanding of how low level code will perform. Nowadays the main reason to program in a "low level" language is that you know that on average the compiler will be able to do a better job because the language doesn't have abstractions that map poorly to the hardware model. But for much of it you can forget about "I know what the hardware is going to do"

jgerrish over 2 years ago

I'm late to this discussion, sorry.

But at the risk of loss of respect, I'll wait for Rust2ShinyNewLanguage to solve this.

All I know is I hope I'm smart enough to understand ShinyNewLanguage's compiler. Or maybe even build it.

I've got several projects that could use some additional Boxes of structures, and borrow instead of move, and maybe a few more complex reference counting mechanics.

Rust forced me to understand what that meant. That's good for building a better engineer.

But it's not fun to work with.

I hope the next experience is better. Sorry Rustaceans.

cmrdporcupine over 2 years ago

There is no single answer to this question because it's going to depend completely on call patterns further up. Especially in regards to how much of the rest of the running program's data fits in L1 cache, and most especially in regards to what's going on in terms of concurrency.

The benchmark made here could completely fall apart once more threads are added.

Modern computer architectures are non-uniform in terms of any kind of memory accesses. The same logical operations can have extremely varied costs depending on how the whole program flow goes.

im3w1l over 2 years ago

My first thought was "now what is the calling convention for float parameters again? they are passed in registers right? the compiler can probably arrange so they don't have to actually be copied" and then I realized it will probably even inline it.

Anyway, assuming it's not inlined I would guess pass-by-copy, maybe with an occasional exception in code with heavy register pressure.

Edit: Actually since it's a structure, the calling convention is to memory allocate it and pass a pointer, doh. So it should actually compile the same.

kolbe over 2 years ago

Anyone know why seemingly knowledgeable people (like the person who wrote this article) don't use micro benchmarking frameworks when they run these tests?

Also, whenever you do one of these, please post the full source with it. There's no reason to leave your readers in the dark, wondering what could be going on, which is exactly what I'm doing now, because there's almost no excuse for C++ to be slower in a task than Rust. It's just a matter of how much work you need to put in to make it get there.

Veedrac over 2 years ago

The general usability impact matters slightly less than it looks here, in part because the `do_math` with references in the article has two extra &s, and in part because methods autoreference when called like x.f().

Performance-wise, if you're likely to touch every element in a type anyway, err on the side of copies. They are going to have to end up in registers eventually anyway, so you might as well let the caller find out the best way to put them there.
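
The autoreference point can be seen in a short sketch. The `Vec3` type and the `length_squared` method are assumptions made up for this example, not code from the article.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
struct Vec3 {
    x: f32,
    y: f32,
    z: f32,
}

impl Vec3 {
    // Takes `&self`, yet callers can write `v.length_squared()`
    // without an explicit `&`.
    fn length_squared(&self) -> f32 {
        self.x * self.x + self.y * self.y + self.z * self.z
    }
}

fn main() {
    let v = Vec3 { x: 1.0, y: 2.0, z: 2.0 };

    // Method-call syntax: the compiler inserts the borrow.
    let implicit = v.length_squared();

    // Fully qualified syntax: the borrow has to be written out.
    let explicit = Vec3::length_squared(&v);

    println!("{implicit} {explicit}");
}
```

So the extra `&` noise shows up mainly with free functions and operators, not with methods.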

yobbo over 2 years ago

The Rust-test implements the traits Add, Sub, Mul by value. This makes the few references less important in the total test. The ergonomics argument is motivated by using these traits. Otherwise, references would have had the same ergonomics.

But also, the struct is 3x32 bits, and Rust auto-implements the Copy-trait for it. It is barely larger than u64, which is the size of the reference.

But life is only simpler when Copy and Clone can be auto-implemented.
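
The size comparison can be checked directly. This sketch assumes a `Vec3` of three `f32` fields; note that `Copy` and `Clone` are requested here with an explicit `#[derive]`, since Rust does not add them to a struct unless asked.

```rust
use std::mem::size_of;

// `Copy` and `Clone` are derived explicitly in this sketch.
#[allow(dead_code)]
#[derive(Clone, Copy)]
struct Vec3 {
    x: f32,
    y: f32,
    z: f32,
}

fn main() {
    // Three 32-bit floats: 12 bytes.
    println!("Vec3:  {} bytes", size_of::<Vec3>());
    // A reference is pointer-sized: 8 bytes on a 64-bit target.
    println!("&Vec3: {} bytes", size_of::<&Vec3>());
}
```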

YesThatTom2 over 2 years ago

I covet Ada’s feature where you just specify if a parameter is in, out, or inout; the compiler figures out whether to copy or pass a pointer.

afdbcreid over 2 years ago

I haven't benchmarked that, but in Rust `ScalarPair`s (i.e., structs that have up to two scalars) are passed in two registers, while bigger structs are always passed by pointer. Therefore, passing bigger structs by move will require the compiler to copy them, while with references it is not required to, so references may be faster in that case.

lowbloodsugar over 2 years ago

I understand that this is an example for the purposes of answering the given question, but when actually doing things with 3D vertices one should be thinking in terms of structures of arrays. As someone said here already: good generals worry about strategy and great generals worry about logistics.
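
For readers unfamiliar with the term, a structure-of-arrays layout might look like the sketch below. The `Vertex` and `Vertices` types and their methods are hypothetical names chosen for the example.

```rust
// Array-of-structures layout: each vertex keeps x, y, z together.
#[derive(Clone, Copy, Debug)]
struct Vertex {
    x: f32,
    y: f32,
    z: f32,
}

// Structure-of-arrays layout: each component lives in its own
// contiguous buffer.
#[derive(Debug, Default)]
struct Vertices {
    xs: Vec<f32>,
    ys: Vec<f32>,
    zs: Vec<f32>,
}

impl Vertices {
    // Converts from the array-of-structures form.
    fn from_aos(vertices: &[Vertex]) -> Self {
        let mut soa = Vertices::default();
        for v in vertices {
            soa.xs.push(v.x);
            soa.ys.push(v.y);
            soa.zs.push(v.z);
        }
        soa
    }

    // Reads only `xs`; the `ys` and `zs` buffers are not touched.
    fn sum_x(&self) -> f32 {
        self.xs.iter().sum()
    }
}

fn main() {
    let aos = [
        Vertex { x: 1.0, y: 2.0, z: 3.0 },
        Vertex { x: 4.0, y: 5.0, z: 6.0 },
    ];
    let soa = Vertices::from_aos(&aos);
    println!("{} {} {}", soa.sum_x(), soa.ys.len(), soa.zs.len());
}
```

With this layout the by-copy vs. by-borrow question mostly disappears for bulk work, since functions take slices of components rather than individual vectors.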

redox99 over 2 years ago

I'm surprised he tested MSVC and Clang, and not GCC which usually generates faster code than those two.

datafulman over 2 years ago

You are comparing two completely different compilers; I wouldn't worry all that much about the difference between rust and C++. If you do want to compare them directly, why not use LLVM for C++ as well? That will highlight any language-specific differences.

dang over 2 years ago

Discussed at the time:

Should small Rust structs be passed by-copy or by-borrow? - https://news.ycombinator.com/item?id=20798033 - Aug 2019 (107 comments)

tuetuopay over 2 years ago

This is not really surprising in such a case. The Rust compiler is pretty good at optimizing out unneeded copies. Here it does see that the copied value is not used after the function call, so it should simply not emit the copies in the final assembly.

ardel95 over 2 years ago

Minor nit: many of the differences in the article aren't really specific to Rust vs C++, but rather differences between LLVM and whatever compiler backend is used by MSVC.

the__alchemist over 2 years ago

Interesting! Of note, my `Vec3` and `Quaternion` types (f32 and f64) have `Copy` APIs, but I've wondered about this since their inception.

TEP_Kim_Il_Sung over 2 years ago

I don't know much about this, but by-copy sounds nice. I'd rather own stuff and be happy.

throwawaybycopy over 2 years ago

Should have also tried pass-by-move.

29athrowaway over 2 years ago

A more direct comparison would have been an r-value reference.

cuteboy19 over 2 years ago

So c++ is less complicated than rust in some cases?

amelius over 2 years ago

This is one of the problems I have with writing rust code. You have to think about so many mundane details that you barely have time left to think about more important and more interesting things.

birdyrooster over 2 years ago

I guess by-copy bc I’m cool

m00dy over 2 years ago

It is a problem of statistics and depends on internals of underlying operating system. I’m not sure you really need that sort of optimisation
