"Repurposing Top Bits" - don't do that. Honest.<p>The IBM 360 shipped with 32-bit addresses but only 24 bits decoded. "Hey, there's a whole byte up top that nobody's using today, let's put some stuff there!" When they wanted the address space IBM found themselves architecturally hamstrung, and the cost to dig out was significant.<p>The 128K Macintosh used a 68000; it had 32-bit addresses but only 24 bits were decoded. "Hey, there's a whole byte up top that nobody's using today, let's put some stuff there!" When Apple needed the address space they found themselves hamstrung by pieces of MacOS that <i>did</i> use those bits, and many applications that did, too. The cost to dig out was significant.<p>It is practically guaranteed that "Hey, there's 16 whole bits up there that nobody's using today" will wind up the same, because this industry just never learns.<p>You can do things with lower bits and usually get away with it; many systems put GC tags and type bits down there. But those upper address bits do not belong to you.
For everyone who enjoyed this, there's an entire free online MIT course called Performance Engineering of Software Systems[1] where you'll learn plenty more tricks and common pitfalls like these. You'll also learn how to use tools to debug the low-level performance of your programs: looking at cache misses, CPU utilization, time spent per assembly instruction, and so on. It's pretty cool :)<p>[1] <a href="https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-172-performance-engineering-of-software-systems-fall-2010/" rel="nofollow">https://ocw.mit.edu/courses/electrical-engineering-and-compu...</a>
Very good article; the facts looked correct and the advice useful.<p>I'd add: keep things local. Don't access memory (or cache lines) outside your core's caches (L1 & L2), your NUMA region, or your processor socket unnecessarily.<p>Keep networking, GPU, etc. code in the same NUMA region as the physical adapters.<p>Use memory like tape: stream through it. CPU hardware prefetchers love that kind of access pattern.<p>Oh, and perhaps most importantly: use a profiler that can access the CPU's internal performance counters. Do this on different system types, from low-power laptops to servers with 2 or more CPU sockets.<p>One annoying thing, though: remember that the fastest thing in a microbenchmark might not be the fastest thing on a real system, where different code modules fight for shared, limited resources like memory bandwidth, caches, and inter-core communication links.
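<p>To make the "use memory like tape" point concrete, a hedged sketch (the function and sizes are my own illustration, not from the article):

    #include <cstddef>
    #include <vector>

    // Summing a matrix in the order it is laid out in memory streams
    // sequentially through RAM, which the hardware prefetcher handles
    // well. Swapping the two loops strides by cols * sizeof(float)
    // bytes per access and is typically several times slower once the
    // matrix no longer fits in cache.
    float sum_row_major(const std::vector<float>& m,
                        std::size_t rows, std::size_t cols) {
        float s = 0.0f;
        for (std::size_t r = 0; r < rows; ++r)
            for (std::size_t c = 0; c < cols; ++c)
                s += m[r * cols + c];  // sequential, cache-friendly
        return s;
    }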
Last time this came up (<a href="https://news.ycombinator.com/item?id=20808778" rel="nofollow">https://news.ycombinator.com/item?id=20808778</a>) and disappeared almost instantly, I wrote:<p>A discussion of systems programming optimization that doesn't start with single-writer ring buffers starts on the wrong foot.<p>Those other tricks are excellent, and I use all of them, in cases where they work at all. But, e.g., seeking a way not to need to take a lock at all should come before discovering a low-contention locking protocol.<p>Readers should note that packing spare bits into the bottom bits of suitably aligned pointers is more portable than using high bits. Any page-aligned pointer has at least 12 bits free at the bottom, and any malloc'd pointer has at least 2, more often 4.<p>Counters indexing a power-of-2 sized ring buffer can be incremented without bound, enabling use of ordinary arithmetic on them, with the high bits masked off cheaply on each use. [But use 64 bits!]<p>Probably the most neglected primitive data structure is the bitmapped set. A `uint32_t` gives you a universe of 32 elements; a byte is enough for the days of the week. The popcount native instruction is very valuable here, usually expressed as `__builtin_popcount` in source code. C++98's `std::bitset` provides Standard, portable access to it, but C++20 offers `std::popcount` directly.<p>[I add here that storing things in high bits of addresses is very likely to fail on systems with ASLR, and that I have learned MSVC bitsets have a very slow popcount.]
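<p>To make the ring-buffer point concrete, here is a minimal single-writer, single-reader sketch (my own illustration): the counters increment without bound, the power-of-2 capacity turns modulo into a mask, and no lock is needed at all.

    #include <atomic>
    #include <cstddef>
    #include <cstdint>

    template <typename T, std::size_t N>  // N must be a power of two
    struct SpscRing {
        static_assert(N && (N & (N - 1)) == 0, "N must be a power of 2");
        T buf[N];
        std::atomic<std::uint64_t> head{0};  // advanced only by the reader
        std::atomic<std::uint64_t> tail{0};  // advanced only by the writer

        bool push(const T& v) {              // called by the one writer
            std::uint64_t t = tail.load(std::memory_order_relaxed);
            if (t - head.load(std::memory_order_acquire) == N)
                return false;                // full
            buf[t & (N - 1)] = v;            // mask high bits, no modulo
            tail.store(t + 1, std::memory_order_release);
            return true;
        }

        bool pop(T& v) {                     // called by the one reader
            std::uint64_t h = head.load(std::memory_order_relaxed);
            if (tail.load(std::memory_order_acquire) == h)
                return false;                // empty
            v = buf[h & (N - 1)];
            head.store(h + 1, std::memory_order_release);
            return true;
        }
    };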
Good article. Basics that everyone can benefit from knowing.<p>Just one nit/warning... breaking coarse locks into fine-grained locks <i>can</i> be taken too far. There is a point of diminishing returns where you end up spending increased time acquiring/releasing/waiting-for locks. At some point you want to clump together under a single lock resources that tend to often be used together, even if you often end up locking an extra resource or two unnecessarily. As always, benchmark workloads are your friends.
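<p>One middle ground between a single big lock and one lock per resource is lock striping; a hedged sketch (the names and the stripe count are mine, and the stripe count is exactly the knob you'd benchmark):

    #include <array>
    #include <cstddef>
    #include <functional>
    #include <mutex>

    template <typename Key, std::size_t kStripes = 16>
    class StripedLocks {
        std::array<std::mutex, kStripes> locks_;
    public:
        // Resources that hash to the same stripe share a lock, trading
        // some false contention for far fewer locks to acquire/manage.
        std::mutex& for_key(const Key& key) {
            return locks_[std::hash<Key>{}(key) % kStripes];
        }
    };

    // Usage: std::lock_guard<std::mutex> g(stripes.for_key(k));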
In most cases your compiler will do the clever work of turning your division or modulo operation into cheaper bit twiddling... but only if you're dividing by a known constant. Powers of two are best, but other constant divisors can be handled by stringing together single-cycle multiplies, shifts, and adds in ways that are still far faster than a hardware divide.
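<p>A sketch of what the compiler emits (the magic constant below is the standard fixed-point reciprocal for 9; double-check my arithmetic on a compiler explorer before relying on it):

    #include <cstdint>

    // x / 9 as a multiply by a rounded-up reciprocal plus a shift:
    // ceil(2^33 / 9) = 0x38E38E39, exact for all 32-bit x.
    std::uint32_t div9(std::uint32_t x) {
        return static_cast<std::uint32_t>(
            (std::uint64_t{x} * 0x38E38E39ull) >> 33);
    }

    // Powers of two are even cheaper: x % 8 is just a mask.
    std::uint32_t mod8(std::uint32_t x) {
        return x & 7u;
    }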
Instead of repurposing top bits, you can also repurpose the bits below the alignment. E.g. 32-bit integers are aligned to 4 bytes, so you can use the lower two bits of pointers to them instead.
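<p>A minimal sketch of that trick (assuming 4-byte-aligned targets; the names are mine):

    #include <cassert>
    #include <cstdint>

    // With 4-byte alignment the low two pointer bits are always zero,
    // so they can carry a 2-bit tag -- as long as the tag is masked
    // off before every dereference.
    std::uintptr_t tag_ptr(std::uint32_t* p, unsigned tag) {
        std::uintptr_t bits = reinterpret_cast<std::uintptr_t>(p);
        assert((bits & 0x3) == 0 && tag < 4);
        return bits | tag;
    }

    std::uint32_t* untag_ptr(std::uintptr_t v) {
        return reinterpret_cast<std::uint32_t*>(v & ~std::uintptr_t{0x3});
    }

    unsigned get_tag(std::uintptr_t v) {
        return static_cast<unsigned>(v & 0x3);
    }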
> Part of why the change couldn’t be enabled by default is because various high performance programs, notably various JavaScript engines and LuaJIT, use this repurposing trick to pack some extra data into pointers.<p>Does anyone know if this sentence can be backed up by a citation?<p>I know that the NaN-tagging trick assumes that pointers have 48 bits (small enough to fit inside a floating-point mantissa), but was this ever a factor in deciding whether 5-level page tables should be added to the Linux kernel or not?
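<p>For readers unfamiliar with the trick: an IEEE 754 double has a 52-bit mantissa, and a quiet NaN leaves almost all of it as free payload, so a 48-bit pointer fits inside a non-numeric double. 5-level page tables extend virtual addresses to 57 bits, which is exactly what breaks the assumption. A hedged sketch (the constants are the usual quiet-NaN bit pattern, not from any particular engine):

    #include <cassert>
    #include <cstdint>

    // Quiet NaN: exponent all ones plus the top mantissa bit set.
    constexpr std::uint64_t kQuietNaN = 0x7FF8000000000000ull;
    constexpr std::uint64_t kPtrMask  = (std::uint64_t{1} << 48) - 1;

    std::uint64_t box_ptr(void* p) {
        std::uint64_t bits = reinterpret_cast<std::uintptr_t>(p);
        assert((bits & ~kPtrMask) == 0);  // fails if addresses exceed 48 bits!
        return kQuietNaN | bits;
    }

    void* unbox_ptr(std::uint64_t v) {
        return reinterpret_cast<void*>(v & kPtrMask);
    }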
The false-sharing macro in the example expands to __attribute__((aligned(/* etc. */))) or __declspec(align(/* etc. */)). Is there a reason these are preferred over the alignas specifier introduced in C++11?
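<p>For comparison, a standard-C++ version of the same idea (64 is an assumed cache-line size; C++17's std::hardware_destructive_interference_size names it portably):

    #include <atomic>
    #include <cstdint>

    // Force each hot counter onto its own (assumed 64-byte) cache line
    // so writers on different cores don't false-share.
    struct Counters {
        alignas(64) std::atomic<std::uint64_t> a;
        alignas(64) std::atomic<std::uint64_t> b;
    };

    static_assert(sizeof(Counters) == 128, "each counter on its own line");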
> Now, to support multiple processors on a single machine reading and writing from the same memory in a coherent way, only one processor on a machine can have exclusive access to a given cache line.<p>Does this also apply when multiple processors are only reading memory?