"Repurposing Top Bits" - don't do that. Honest.<p>The IBM 360 shipped with 32-bit addresses but only 24 bits decoded. "Hey, there's a whole byte up top that nobody's using today, let's put some stuff there!" When they wanted the address space IBM found themselves architecturally hamstrung, and the cost to dig out was significant.<p>The 128K Macintosh used a 68000; it had 32-bit addresses but only 24 bits were decoded. "Hey, there's a whole byte up top that nobody's using today, let's put some stuff there!" When Apple needed the address space they found themselves hamstrung by pieces of MacOS that <i>did</i> use those bits, and many applications that did, too. The cost to dig out was significant.<p>It is practically guaranteed that "Hey, there's 16 whole bits up there that nobody's using today" will wind up the same, because this industry just never learns.<p>You can do things with lower bits and usually get away with it; many systems put GC tags and type bits down there. But those upper address bits do not belong to you.
For everyone who enjoyed this, there's an entire free online MIT course called Performance Engineering of Software Systems[1] where you'll learn plenty more tricks and common pitfalls like these. You'll also learn how to use tools to debug the low-level performance of your programs: looking at cache misses, CPU utilization, time spent per assembly instruction, and so on. It's pretty cool :)<p>[1] <a href="https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-172-performance-engineering-of-software-systems-fall-2010/" rel="nofollow">https://ocw.mit.edu/courses/electrical-engineering-and-compu...</a>
Very good article; the facts looked correct and the advice useful.<p>I'd add: keep things local. Don't access memory (or cache lines) outside your core's caches (L1 & L2), your NUMA region, or your processor socket unnecessarily.<p>Keep networking, GPU, etc. code in the same NUMA region as the physical adapters.<p>Use memory like tape: stream through it. CPU hardware prefetchers love that kind of access pattern.<p>Oh, and perhaps most importantly: use a profiler that can access the CPU's internal performance counters. Do this on different system types, from low-power laptops to servers with 2 or more CPU sockets.<p>One annoying thing, though: remember that the fastest thing in a microbenchmark might not be the fastest thing on a real system, where different code modules fight for shared, limited resources like memory bandwidth, caches, and inter-core communication links.
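<p>To make the "use memory like tape" point concrete, a hedged sketch (the function and sizes are my own illustration, not from the article):

    #include <cstddef>
    #include <vector>

    // Summing a matrix in the order it is laid out in memory streams
    // sequentially through RAM, which the hardware prefetcher handles
    // well. Swapping the two loops strides by cols * sizeof(float)
    // bytes per access and is typically several times slower once the
    // matrix no longer fits in cache.
    float sum_row_major(const std::vector<float>& m,
                        std::size_t rows, std::size_t cols) {
        float s = 0.0f;
        for (std::size_t r = 0; r < rows; ++r)
            for (std::size_t c = 0; c < cols; ++c)
                s += m[r * cols + c];  // sequential, cache-friendly
        return s;
    }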
Last time this came up (<a href="https://news.ycombinator.com/item?id=20808778" rel="nofollow">https://news.ycombinator.com/item?id=20808778</a>) and disappeared almost instantly, I wrote:<p>A discussion of systems programming optimization that doesn't start with single-writer ring buffers starts on the wrong foot.<p>Those other tricks are excellent, and I use all of them, in cases where they work at all. But, e.g., seeking a way not to need to take a lock at all should come before discovering a low-contention locking protocol.<p>Readers should note that packing spare bits into the bottom bits of suitably aligned pointers is more portable than using high bits. Any page-aligned pointer has at least 12 bits free at the bottom, and any malloc'd pointer has at least 2, more often 4.<p>Counters indexing a power-of-2 sized ring buffer can be incremented without bound, enabling use of ordinary arithmetic on them, with the high bits masked off cheaply on each use. [But use 64 bits!]<p>Probably the most neglected primitive data structure is the bitmapped set. A `uint32_t` gives you a universe of 32 elements; a byte is enough for the days of the week. The popcount native instruction is very valuable here, usually expressed as `__builtin_popcount` in source code. C++98's `std::bitset` provides Standard, portable access to it, but C++20 offers `std::popcount` directly.<p>[I add here that storing things in high bits of addresses is very likely to fail on systems with ASLR, and that I have learned MSVC bitsets have a very slow popcount.]
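<p>To make the ring-buffer point concrete, here is a minimal single-writer, single-reader sketch (my own illustration): the counters increment without bound, the power-of-2 capacity turns modulo into a mask, and no lock is needed at all.

    #include <atomic>
    #include <cstddef>
    #include <cstdint>

    template <typename T, std::size_t N>  // N must be a power of two
    struct SpscRing {
        static_assert(N && (N & (N - 1)) == 0, "N must be a power of 2");
        T buf[N];
        std::atomic<std::uint64_t> head{0};  // advanced only by the reader
        std::atomic<std::uint64_t> tail{0};  // advanced only by the writer

        bool push(const T& v) {              // called by the one writer
            std::uint64_t t = tail.load(std::memory_order_relaxed);
            if (t - head.load(std::memory_order_acquire) == N)
                return false;                // full
            buf[t & (N - 1)] = v;            // mask high bits, no modulo
            tail.store(t + 1, std::memory_order_release);
            return true;
        }

        bool pop(T& v) {                     // called by the one reader
            std::uint64_t h = head.load(std::memory_order_relaxed);
            if (tail.load(std::memory_order_acquire) == h)
                return false;                // empty
            v = buf[h & (N - 1)];
            head.store(h + 1, std::memory_order_release);
            return true;
        }
    };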
Good article. Basics that everyone can benefit from knowing.<p>Just one nit/warning... breaking coarse locks into fine-grained locks <i>can</i> be taken too far. There is a point of diminishing returns where you end up spending increased time acquiring/releasing/waiting-for locks. At some point you want to clump together under a single lock resources that tend to often be used together, even if you often end up locking an extra resource or two unnecessarily. As always, benchmark workloads are your friends.
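<p>One middle ground between a single big lock and one lock per resource is lock striping; a hedged sketch (the names and the stripe count are mine, and the stripe count is exactly the knob you'd benchmark):

    #include <array>
    #include <cstddef>
    #include <functional>
    #include <mutex>

    template <typename Key, std::size_t kStripes = 16>
    class StripedLocks {
        std::array<std::mutex, kStripes> locks_;
    public:
        // Resources that hash to the same stripe share a lock, trading
        // some false contention for far fewer locks to acquire/manage.
        std::mutex& for_key(const Key& key) {
            return locks_[std::hash<Key>{}(key) % kStripes];
        }
    };

    // Usage: std::lock_guard<std::mutex> g(stripes.for_key(k));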
In most cases your compiler will do the clever work of turning your division or modulo operation into cheaper bit twiddling... but only if you're dividing by a known constant. Powers of two are best, but other constant divisors can be handled by stringing together single-cycle multiplies, shifts, and adds in ways that are still far faster than a hardware divide.
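<p>A sketch of what the compiler emits (the magic constant below is the standard fixed-point reciprocal for 9; double-check my arithmetic on a compiler explorer before relying on it):

    #include <cstdint>

    // x / 9 as a multiply by a rounded-up reciprocal plus a shift:
    // ceil(2^33 / 9) = 0x38E38E39, exact for all 32-bit x.
    std::uint32_t div9(std::uint32_t x) {
        return static_cast<std::uint32_t>(
            (std::uint64_t{x} * 0x38E38E39ull) >> 33);
    }

    // Powers of two are even cheaper: x % 8 is just a mask.
    std::uint32_t mod8(std::uint32_t x) {
        return x & 7u;
    }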
Instead of repurposing top bits, you can also repurpose the bits below the alignment. E.g. 32-bit integers are aligned to 4 bytes, so you can use the lower two bits of pointers to them instead.
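<p>A minimal sketch of that trick (assuming 4-byte-aligned targets; the names are mine):

    #include <cassert>
    #include <cstdint>

    // With 4-byte alignment the low two pointer bits are always zero,
    // so they can carry a 2-bit tag -- as long as the tag is masked
    // off before every dereference.
    std::uintptr_t tag_ptr(std::uint32_t* p, unsigned tag) {
        std::uintptr_t bits = reinterpret_cast<std::uintptr_t>(p);
        assert((bits & 0x3) == 0 && tag < 4);
        return bits | tag;
    }

    std::uint32_t* untag_ptr(std::uintptr_t v) {
        return reinterpret_cast<std::uint32_t*>(v & ~std::uintptr_t{0x3});
    }

    unsigned get_tag(std::uintptr_t v) {
        return static_cast<unsigned>(v & 0x3);
    }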
> Part of why the change couldn’t be enabled by default is because various high performance programs, notably various JavaScript engines and LuaJIT, use this repurposing trick to pack some extra data into pointers.<p>Does anyone know if this sentence can be backed up by a citation?<p>I know that the NaN-tagging trick assumes that pointers have 48 bits (small enough to fit inside a floating-point mantissa), but was this ever a factor in deciding whether 5-level page tables should be added to the Linux kernel or not?
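<p>For readers unfamiliar with the trick: an IEEE 754 double has a 52-bit mantissa, and a quiet NaN leaves almost all of it as free payload, so a 48-bit pointer fits inside a non-numeric double. 5-level page tables extend virtual addresses to 57 bits, which is exactly what breaks the assumption. A hedged sketch (the constants are the usual quiet-NaN bit pattern, not from any particular engine):

    #include <cassert>
    #include <cstdint>

    // Quiet NaN: exponent all ones plus the top mantissa bit set.
    constexpr std::uint64_t kQuietNaN = 0x7FF8000000000000ull;
    constexpr std::uint64_t kPtrMask  = (std::uint64_t{1} << 48) - 1;

    std::uint64_t box_ptr(void* p) {
        std::uint64_t bits = reinterpret_cast<std::uintptr_t>(p);
        assert((bits & ~kPtrMask) == 0);  // fails if addresses exceed 48 bits!
        return kQuietNaN | bits;
    }

    void* unbox_ptr(std::uint64_t v) {
        return reinterpret_cast<void*>(v & kPtrMask);
    }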
The false-sharing macro in the example expands to __attribute__((aligned(/* etc. */))) or __declspec(align(/* etc. */)). Is there a reason these are preferred over the alignas specifier introduced in C++11?
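<p>For comparison, a standard-C++ version of the same idea (64 is an assumed cache-line size; C++17's std::hardware_destructive_interference_size names it portably):

    #include <atomic>
    #include <cstdint>

    // Force each hot counter onto its own (assumed 64-byte) cache line
    // so writers on different cores don't false-share.
    struct Counters {
        alignas(64) std::atomic<std::uint64_t> a;
        alignas(64) std::atomic<std::uint64_t> b;
    };

    static_assert(sizeof(Counters) == 128, "each counter on its own line");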
> Now, to support multiple processors on a single machine reading and writing from the same memory in a coherent way, only one processor on a machine can have exclusive access to a given cache line.<p>Does this also apply when multiple processors are only reading memory?