A discussion of systems programming optimization that doesn't start with single-writer ring buffers starts on the wrong foot.<p>Those other tricks are excellent, and I use all of them, in cases where they work at all. But, e.g., seeking a way not to need to take a lock at all should come before discovering a low-contention locking protocol.<p>Readers should note that packing spare bits into the bottom bits of suitably aligned pointers is more portable than using high bits. Any page-aligned pointer has at least 12 bits free at the bottom, and any malloc'd pointer has at least 2, often 4.<p>Ring buffer counters into a power-of-2 sized buffer can be incremented without bound, enabling use of ordinary arithmetic on them, and high bits masked off cheaply on each use.<p>Probably the most neglected primitive data structure is the bitmapped set. A `uint32_t` gives you a universe of 32 elements; a byte is enough for the days of the week. The popcount native instruction is very valuable here, usually expressed as `__builtin_popcount` in source code. C++98's `std::bitset` provides Standard, portable access to it, but C++20 offers `std::popcount` directly.