The Performance Impact of C++'s `final` Keyword

247 points by hasheddan about 1 year ago

38 comments

mgaunard about 1 year ago

What `final` enables is devirtualization in certain cases. The main advantage of devirtualization is that it is necessary for inlining.

Inlining has other requirements as well -- LTO pretty much covers it.

The article doesn't have sufficient data to tell whether the test case is built in such a way that any of these optimizations can happen or are beneficial.
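
A minimal sketch of the devirtualization case described above. The `Shape` and `Square` types and both function names are hypothetical, chosen only for illustration; whether a given compiler actually emits a direct or inlined call is something to confirm in the generated assembly.

```cpp
// Hypothetical types for illustration; not taken from the article's code.
struct Shape {
    virtual ~Shape() = default;
    virtual int sides() const = 0;
};

// `final` promises that nothing derives from Square.
struct Square final : Shape {
    int sides() const override { return 4; }
};

// The static type is the base class: the call normally goes through the
// vtable unless the compiler can prove the dynamic type some other way.
int sides_via_base(const Shape& s) { return s.sides(); }

// The static type is a final class: the compiler is allowed to call
// Square::sides directly, which makes the call a candidate for inlining.
int sides_via_final(const Square& s) { return s.sides(); }
```

Both functions return the same value; the only difference is how much the compiler is allowed to assume about the call target.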

tombert about 1 year ago

I don't do much C++, but I have definitely found that engineers will just assert that something is "faster" without any evidence to back that up.

Quick example: a few years ago I got into an argument with someone who claimed that in C# a `switch` was better than an `if(x==1) elseif(x==2)...` because switch was "faster", and rejected my PR. I mentioned that that doesn't appear to be true, and we went back and forth until I did a compile-then-decompile of a minimal test with equality-based ifs and showed that the compiler actually converts equality-based ifs to `switch` behind the scenes. The guy accepted my PR after that.

But there's tons of stuff like this in CS, and I kind of blame professors for a lot of it [1]. A large part of becoming a decent engineer [2] for me was learning to stop trusting what professors taught me in college. Most of what they said was fine, but you can't assume that; what they tell you could be out of date, or simply never correct to begin with, and as far as I can tell you have to always test these things.

It doesn't help that a lot of these "it's faster" arguments are often reductive because they are only faster in extremely minimal tests. Sometimes a microbenchmark will show that something is faster, and there's value in that, but I think it's important to remember that it can also be a small percentage of the total program; compilers are obscenely good at optimizing nowadays, it can be difficult to determine when something will be optimized, and your assertion that something is "faster" might not actually be true in a non-trivial program.

This is why I don't really like doing any kind of major optimizations before the program actually works. I try to keep the program in a reasonable Big-O and I try to minimize network calls cuz of latency, but I don't bother with any kind of micro-optimizations in the first draft. I don't mess with bitwise, I don't concern myself with which version of a particular data structure is a millisecond faster, I don't focus too much on whether I can get away with a smaller sized float, etc. Once I know that the program is correct, then I benchmark to see if any kind of micro-optimizations will actually matter, and often they really don't.

[1] That includes me up to about a year ago.

[2] At least I like to pretend I am.
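
The comparison in that anecdote was made in C#; a rough C++ analogue is sketched below. The function names and return values are made up for illustration. The sketch only shows that the two forms are behaviourally equivalent; whether a particular compiler lowers them to identical machine code is an assumption to check by inspecting the output, not something the sketch proves.

```cpp
// Two ways of writing the same dispatch on an integer value.
// They are constexpr only so the equivalence can be checked at compile time.
constexpr int describe_if(int x) {
    if (x == 1) return 10;
    else if (x == 2) return 20;
    else if (x == 3) return 30;
    return 0;
}

constexpr int describe_switch(int x) {
    switch (x) {
        case 1: return 10;
        case 2: return 20;
        case 3: return 30;
        default: return 0;
    }
}

static_assert(describe_if(2) == describe_switch(2), "same result for a matched case");
static_assert(describe_if(7) == describe_switch(7), "same result for the fallback");
```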

andrewla about 1 year ago

I'm surprised that it has any impact on performance at all, and I'd love to see the codegen differences between the applications.

Mostly the `final` keyword serves as a compile-time assertion. The compiler (sometimes linker) is perfectly capable of seeing that a class has no derived classes, but what `final` assures is that if you attempt to derive from such a class, you will raise a compile-time error.

This is similar to how `inline` works in practice -- rather than providing a useful hint to the compiler (though the compiler is free to treat it that way) it provides an assertion that if you do non-inlinable operations (e.g. non-tail recursion) then the compiler can flag that.

All of this is to say that `final` can speed up runtimes -- but it does so by forcing you to organize your code such that the guarantees apply. By using `final` classes, in places where dynamic dispatch can be reduced to static dispatch, you force the developer to not introduce patterns that would prevent static dispatch.
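
The "compile-time assertion" reading of `final` can be sketched as follows. `Base`, `Leaf`, and `Deeper` are hypothetical names; `std::is_final` is the standard type trait (C++14) that reports whether a class is marked `final`.

```cpp
#include <type_traits>

struct Base {
    virtual ~Base() = default;
    virtual int value() const { return 1; }
};

struct Leaf final : Base {
    int value() const override { return 2; }
};

// Uncommenting the next line is a compile-time error, because Leaf is final:
// struct Deeper : Leaf {};

static_assert(std::is_final<Leaf>::value, "Leaf is marked final");
static_assert(!std::is_final<Base>::value, "Base can still be derived from");
```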

mgraczyk about 1 year ago

The main case where I use final and where I would expect benefits (not covered well by the article) is when you are using an external library with pure virtual interfaces that you implement.

For example, the AWS C++ SDK uses virtual functions for everything. When you subclass their classes, marking your classes as final allows the compiler to devirtualize your own calls to your own functions (GCC does this reliably).

I'm curious to understand better how clang is producing worse code in these cases. The code used for the blog post is a bit too complicated for me to look at, but I would love to see some microbenchmarks. My guess is that there is some kind of icache or code size problem, where inlining more produces worse code.
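
A sketch of that library-interface pattern, under stated assumptions: `vendor::RequestHandler` is a made-up stand-in for a third-party interface (it is not a real AWS SDK class), and `MyHandler` and `handle_twice` are likewise hypothetical.

```cpp
// Stand-in for a third-party interface; the names here are invented.
namespace vendor {
struct RequestHandler {
    virtual ~RequestHandler() = default;
    virtual int handle(int request) = 0;
};
}  // namespace vendor

// Our implementation. Because the class is final, calls made through a
// MyHandler pointer or reference (including `this`) have a known target.
class MyHandler final : public vendor::RequestHandler {
public:
    int handle(int request) override { return request * 2; }

    // Calls our own virtual function twice. Inside a final class the
    // compiler may resolve handle() statically instead of via the vtable.
    int handle_twice(int request) { return handle(handle(request)); }
};
```

Calls the library makes through `vendor::RequestHandler&` are still virtual; only the calls made through the final type can benefit.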

lionkor about 1 year ago

The only thing worse than no benchmark is a bad benchmark.

I don't think this really shows what `final` does, not to code generation, not to performance, not to the actual semantics of the program. There is no magic bullet - if putting `final` on every single class would always make it faster, it wouldn't be a keyword, it'd be a compiler optimization.

`final` does one specific thing: It tells a compiler that it can be sure that the given object is not going to have anything derive from it.

akoboldfrying about 1 year ago

I would expect "final" to have no effect on this type of code at all. That it does in some cases cause measurable differences I put down to randomly hitting internal compiler thresholds (perhaps one of the inlining heuristics is "Don't inline a function with more than 100 tokens", and the "final" keyword pushes a couple of functions to 101).

Why would I expect no performance difference? I haven't looked at the code, but I would expect that for each pixel, it iterates through an array/vector/list etc. of objects that implement some common interface, and calls one or more methods (probably something called intersectRay() or similar) on that interface. By design, that interface cannot be made final, and that's what counts. Whether the concrete derived classes are final or not makes no difference.

In order to make this a good test of "final", the pointer type of that container should be constrained to a concrete object type, like Sphere. Of course, this means the scene is limited to spheres.

The only case where final can make a difference, by devirtualising a call that couldn't otherwise be devirtualised, is when you hold a pointer to that type, and the object it points at was allocated "uncertainly", e.g., by the caller. (If the object was allocated in the same basic block where the method call later occurs, the compiler already knows its runtime type and will devirtualise the call anyway, even without "final".)
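
The two container shapes contrasted above can be sketched like this. `Hittable`, the `hit(double)` signature, and both function names are assumptions made for the example, loosely modelled on a ray tracer rather than taken from the article's code.

```cpp
#include <memory>
#include <vector>

struct Hittable {  // hypothetical interface
    virtual ~Hittable() = default;
    virtual bool hit(double t) const = 0;
};

struct Sphere final : Hittable {
    double radius = 1.0;
    bool hit(double t) const override { return t < radius; }
};

// Elements are typed as the interface, so the compiler generally cannot pick
// a static call target here, whether or not Sphere is final.
int count_hits_generic(const std::vector<std::unique_ptr<Hittable>>& scene, double t) {
    int hits = 0;
    for (const auto& obj : scene) hits += obj->hit(t) ? 1 : 0;
    return hits;
}

// Elements are typed as the final class, so the call target is known
// statically and the call may be devirtualized.
int count_hits_spheres(const std::vector<std::unique_ptr<Sphere>>& scene, double t) {
    int hits = 0;
    for (const auto& s : scene) hits += s->hit(t) ? 1 : 0;
    return hits;
}
```

The second form is the one that would actually exercise `final`, at the cost of restricting the scene to a single concrete type.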

ein0p about 1 year ago

You should use final to express design intent. In fact I’d rather it were the default in C++, and there was some sort of an opposite (‘derivable’?) keyword instead, but that ship sailed a long time ago. Any measurable negative perf impact should be filed as a bug and fixed.

ndesaulniers about 1 year ago

As an LLVM developer, I really wish the author filed a bug report and waited for some analysis BEFORE publishing an article (that may never get amended) that recommends not using this keyword with clang for performance reasons. I suspect there's just a bug in clang.

mastax about 1 year ago

Changes in the layout of the binary can have large impacts on the program performance [0] so it's possible that the unexpected performance decrease is caused by unpredictable changes in the layout of the binary between compilations. I think there is some tool which helps ensure layout is consistent for benchmarking, but I can't remember what it's called.

[0]: https://research.facebook.com/publications/bolt-a-practical-binary-optimizer-for-data-centers-and-beyond/

jeffbee about 1 year ago

I profiled this project and there are abundant opportunities for devirtualization. The virtual interface `IHittable` is the hot one. However, the WITH_FINAL define is not sufficient, because the hot call is still virtual. At `hit_object |= _objects[node->object_index()]->hit` I am still seeing `mov (%rdi),%rax; call *0x18(%rax)` so the application of final here was not sufficient to do the job. Whatever differences are being measured are caused by bogons.

bluGill about 1 year ago

I use final more for communication: don't look for deeper derived classes, as there are none. That it results in slower code is an annoying surprise.

leni536 about 1 year ago

This is the gist of the difference in code generation when final is involved:

https://godbolt.org/z/7xKj6qTcj

edit: And a case involving inlining:

https://godbolt.org/z/E9qrb3hKM

alex_smart about 1 year ago

One thing I wish the article had mentioned is the size of the compiled binary with and without final. The only reason I would expect the final version to be slower is that we are emitting more code because of inlining, and that is resulting in more instruction cache misses.

Also, now that I think of it, they should have run the code under perf and compared the stats.

magnat about 1 year ago

> I created a "large test suite" to be more intensive. On my dev machine it needed to run for 8 hours.

During such long and compute-intensive tests, how are thermal considerations mitigated? Not saying that this was the case here, but I can see how after saturating all cores for 8 hours, the whole PC might get hot to the point that the CPU starts throttling, so when you reboot to the next OS or start another batch, overall performance could be a bit lower.

JackYoustra about 1 year ago

I really wish he'd listed all the flags he used. To add on to the flags already listed by some other commenters, `-mcpu` and related flags are really crucial in these microbenchmarks: over such a small change and such a small set of tight loops, you could just be regressing on coincidences in the microarchitecture scheduler vs higher level assumptions.

gpderetta about 1 year ago

1% is nothing to scoff at. But I suspect that the variability of compilation (specifically quirks of instruction selection, register allocation and function alignment) more than masks any gains.

The clang regression might be explainable by final allowing some additional inlining and clang making a hash of it.

fransje26 about 1 year ago

I'm actually more worried about Clang being close to 100% slower than GCC on Linux. That doesn't seem right.

I am prepared to believe that there is some performance difference between the two, varying per case, but I would expect a few percent difference, not twice the run time.

lanza about 1 year ago

If you're measuring a compiler you need to post the flags and version used. Otherwise the entire experiment is in the noise.

sfink about 1 year ago

tldr: sprinkled a keyword around in the hopes that it "does something" to speed things up, tested it, got noisy results but no miraculous speedup.

I started skimming this article after a while, because it seemed to be going into the weeds of performance comparison without ever backing up to look at what the change might be doing. Which meant that I couldn't tell if I was going to be looking at the usual random noise of performance testing or something real.

For `final`, I'd want to at least see if it's changing the generated code by replacing indirect vtable calls with direct or inlined calls. It might be that the compiler is already figuring it out and the keyword isn't doing anything. It might be that the compiler is changing code, but the target address was already well-predicted and it's perturbing code layout enough that it gets slower (or faster). There could be something interesting here, but I can't tell without at least a little assembly output (or perhaps a relevant portion of some intermediate representation, not that I would know which one to look at).

If it's not changing anything, then perhaps there could be an interesting investigation into the variance of performance testing in this scenario. If it's changing something, then there could be an interesting investigation into when that makes things faster vs slower. As it is, I can't tell what I should be looking for.

pklausler about 1 year ago

Mildly related programming language trivia:

Fortran has virtual functions ("type bound procedures"), and supports a NON_OVERRIDABLE attribute on them that is basically "final". (FINAL exists but means something else.) But it also has a means for localizing the non-overridable property.

If a type bound procedure is declared in a module, and is PRIVATE, then overrides in subtypes ("extended derived types") work as usual for subtypes in the same module, but can't be affected by overrides that appear in other modules. This allows a compiler to notice when a type has no subtypes in the same module, and basically infer that it is non-overridable locally, and thus resolve calls at compilation time.

Or it would, if compilers implemented this feature correctly. It's not well described in the standard, and only half of the Fortran compilers in the wild actually support it. So like too many things in the Fortran world, it might be useful, but it's not portable.

jeffbee about 1 year ago

It's difficult to discuss this stuff because the impact can be negligible or negative for one person, but large and consistently positive for another. You can only usefully discuss it on a given baseline, and for something like final I would hope that baseline would be a project that already enjoys PGO, LTO, and BOLT.

pcvarmint about 1 year ago

Each of the test cases measured needs to be run at least 3 times in a row, to warm caches (not just CPU but OS too) and to detect and remove noise.

In fact, I would run the same test repeatedly, keeping track of the k fastest times (k being ~3-7), and only stopping when the first and the kth fastest times are within a certain tolerance (as low as 1%). This ensures repeatability.

One sample of performance data for each test is not enough. This study provides no new insights.

Performance analyst
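
The stopping rule described above (keep sampling until the fastest and the k-th fastest times agree within a tolerance) could be sketched roughly as follows. The function names, the `max_runs` safety cap, and the default parameter values are assumptions added for this sketch, not part of the comment.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <vector>

// True once the k fastest samples agree to within `tolerance`
// (e.g. 0.01 for 1%), comparing the fastest with the k-th fastest.
bool converged(std::vector<double> samples, std::size_t k, double tolerance) {
    if (k == 0 || samples.size() < k) return false;
    std::sort(samples.begin(), samples.end());
    const double fastest = samples[0];
    const double kth = samples[k - 1];
    return fastest > 0.0 && (kth - fastest) / fastest <= tolerance;
}

// Runs `work` repeatedly until the stopping rule holds or max_runs is
// reached, and returns the fastest time seen, in seconds.
template <typename F>
double measure(F&& work, std::size_t k = 5, double tolerance = 0.01,
               std::size_t max_runs = 1000) {
    std::vector<double> samples;
    while (samples.size() < max_runs) {
        const auto start = std::chrono::steady_clock::now();
        work();
        const std::chrono::duration<double> elapsed =
            std::chrono::steady_clock::now() - start;
        samples.push_back(elapsed.count());
        if (converged(samples, k, tolerance)) break;
    }
    if (samples.empty()) return 0.0;  // only possible when max_runs == 0
    return *std::min_element(samples.begin(), samples.end());
}
```

The cap on runs is a practical guard: on a noisy machine the k fastest samples may never agree, and the loop should still terminate.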

chris_wot about 1 year ago

Surely "final" is a conceptual thing... in other words, you don't want anyone else to derive from the class for good reasons. It's for conceptual understanding, surely?

MathMonkeyMan about 1 year ago

I think it was Chandler Carruth who said "If you're not measuring, then you don't care about performance." I agree, and by that measure, nobody I've ever worked with cares about performance.

The best I'll see is somebody who cooked up a naive microbenchmark to show that style 1 takes fewer wall nanoseconds than style 2 on his laptop.

People I've worked with don't use profilers, claiming that they can't trust them. Really they just can't be bothered to run one and interpret the output.

The truth is, most of us don't write C++ because of performance; we write C++ because that's the language the code is written in.

The performance gained by different C++ techniques seldom matters, and when it does you have to measure. Profiler reports almost always surprise me the first few times -- your mental model of what's going on and what matters is probably wrong.

jcalvinowens about 1 year ago

That's interesting. Maybe final enabled more inlining, and clang is being too aggressive about it for the icache sizes in play here? I'd love to see a comparison of the generated code.

I'm disappointed the author's conclusion is "don't use final", not "something is wrong with clang".

indigoabstract about 1 year ago

If it does have a noticeable impact, that would be surprising, a bit like going back to the days when 'inline' was supposed to tell the compiler to inline the designated functions (no longer its main use case nowadays).

account42 about 1 year ago

I'm amused at the AI advert spam in the comments here that can't even be bothered to make the spam even vaguely normal looking comments.

pineapple_sauce about 1 year ago

What should be evaluated is removing indirection and tightly packing your data. I'm sure you'll gain a better performance improvement. Virtual calls and shared_ptr are littered throughout the codebase.

In this way, you can avoid the need for the `final` keyword and do the optimization the keyword enables (de-virtualize calls).

> Yes, it is very hacky and I am disgusted by this myself. I would never do this in an actual product

Why? What's with the C++ community and their disgust for macros without any underlying reasoning? It reminds me of everyone blindly saying "Don't use goto; it creates spaghetti code".

Sure, if macros are overused, code can be hard to read and maintain. But for something simple like this, you shouldn't be thinking "I would never do this in an actual product".
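
For readers who have not seen the article, the kind of macro under discussion probably looks something like the sketch below. `WITH_FINAL` and `IHittable` are names mentioned elsewhere in this thread; the `FINAL` macro name and the `Sphere` body are assumptions made for the sketch, not the article's actual code.

```cpp
#include <type_traits>

// WITH_FINAL is the build switch named in the thread; the FINAL macro
// name is an assumption made for this sketch.
#ifdef WITH_FINAL
#define FINAL final
#else
#define FINAL
#endif

struct IHittable {
    virtual ~IHittable() = default;
    virtual bool hit() const = 0;
};

// Expands to `struct Sphere final : IHittable` only when WITH_FINAL is defined.
struct Sphere FINAL : IHittable {
    bool hit() const override { return true; }
};
```

Building once with `-DWITH_FINAL` and once without gives two binaries that differ only in the keyword, which is what makes an A/B comparison possible.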

p0w3n3d about 1 year ago

I would say `constexpr` would have the most performance impact, followed by `const`. I wouldn't bet any money on `final`, which in C++ is a guard on inheritance; the address of a C++ function invocation is resolved through the `vtable`, hence final wouldn't change anything. Maybe the author confused it with the `final` keyword in Java.

teeuwen about 1 year ago

I do not see how the final keyword would make a difference in performance at all in this case. The compiler should be able to build an inheritance tree and determine by itself which classes are to be treated as final.

Now for libraries, this is a different story. There I can imagine the final keyword could have an impact.

juliangmp about 1 year ago

> Personally, I'm not turning it on. And would in fact, avoid using it. It doesn't seem consistent.

I feel like we'd have to repeat these tests quite a few times to get to a decent conclusion. Hell, small variations in performance could be caused by all sorts of things outside the actual program.

jey about 1 year ago

I wonder if LTO was turned on when using Clang? Might lead to a performance improvement.

AtNightWeCode about 1 year ago

Most benchmarks are wrong. I doubt this is correct. Final should have been the default in the lang I think though.

There are tons of these suggestions. Like always using sealed in C# or never using private in Java.

headline about 1 year ago

re: final macro

> I would never do this in an actual product

what, why?

kasajian about 1 year ago

I'm surprised by this article. The author genuinely believes that a language construct to benefit performance was added to the language without anyone ever running any metrics to verify. "just trust me bro", is the quote.

It's an insane level of ignorance about how these things are decided by the standards committee.

manlobster about 1 year ago

This seems like a reasonable use of the preprocessor to me. I've seen similar use in high-quality codebases. I wonder why the author is so disgusted by it.

LorenDB about 1 year ago

Man, I wish this blog had an RSS feed.

kookamamie about 1 year ago

> And probably, that reason is performance.

That's the first problem I see with the article. C++ isn't a fast language, as it is. There are far too many issues with e.g. aliasing rules, lack of proper vectorization (for the runtime arch), etc.

If you wish to have relatively good performance for your code, try ISPC, which still allows you to get great performance with vectorization up to AVX-512, without turning to intrinsics.
