Somewhat related, but a ridiculous number of platforms/systems ship with a derivative of the Dinkumware C++ standard library, MSVC included. When you look up the company behind it, it's basically a small shop that seems to be primarily one guy up in MA.
The SSO capacity tables seem obviously wrong to me. For clang/libc++ we see 11 or 22 bytes; in each case it's the size of the whole data structure, minus one byte for the bit flag and length value, and another byte for the zero (ASCII NUL) that's obligatory in C++.<p>But for MSVC and GCC we're told 16 bytes, as each has a 16-byte buffer. However, they still need that obligatory zero byte for the ASCII NUL, so surely the table should show 15.<p>The most popular string in C++, the one which makes SSO almost obligatory as a design feature for the language's string object, is the empty string. On a modern (64-bit) machine Clang can store that inline in a local 24-byte object, while MSVC and GCC need 32 bytes. Without SSO it would probably mean 24 bytes <i>and</i> a heap allocation. That's hard to swallow.
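The capacity question above is easy to check on whatever toolchain you have. A minimal sketch, using only standard `std::string` calls; the helper names `stored_inline` and `inline_capacity` are made up for illustration:

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>

// Hypothetical helper: true when s keeps its characters inside the
// std::string object itself (the SSO buffer) rather than on the heap.
inline bool stored_inline(const std::string& s) {
    const char* first = reinterpret_cast<const char*>(&s);
    const char* last = first + sizeof(std::string);
    std::less<const char*> before;  // total order even across unrelated objects
    return !before(s.data(), first) && before(s.data(), last);
}

// Capacity of an empty string: on an SSO implementation this is the number
// of characters that fit inline, not counting the terminating NUL.
inline std::size_t inline_capacity() {
    return std::string().capacity();
}
```

Printing `sizeof(std::string)` next to `inline_capacity()` should show the usable capacity coming out smaller than the raw buffer size, because `capacity()` never counts the NUL. On typical 64-bit release builds I would expect 32 and 15 for libstdc++ and MSVC, and 24 and 22 for libc++, but treat those as figures to verify rather than guarantees.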
This is great. It would be a good C/C++ interview question to compare these. Of course you can't expect a Raymond Chen-level performance, but it should give some insight into a candidate's experience with low-level programming.
This is so interesting!<p>I wonder if anyone has ever tried an implementation that prioritizes even more minimal memory usage for small strings?<p>Of the three implementations discussed, the smallest (on a 64-bit system) is 24 bytes.<p>How about 8 bytes: the union of a single pointer and a char[8]?<p>For strings of 6 or fewer characters, it uses the first 6 bytes to store the string, one for the null terminator, and the last byte to store the length, but with a trick (see below).<p>For strings of 7 or more characters, it uses all 8 bytes to store the address of a larger block of memory holding the capacity, size, and string bytes.<p>The trick is to use the last byte as a way to know whether the bytes encode a short string or a pointer. Since a pointer to an aligned heap block is never odd, we just store an odd number in that byte. For example, you could store (size << 1) | 0x1. (This assumes the last byte is the one holding the pointer's low bits, as on a big-endian machine; on little-endian that is the first byte, so the tag would go there and the characters would follow it.)<p>So if the low bit is 1, it's a short string. The size is bytes[7] >> 1, and the data() pointer is just the address of the object itself.<p>If the low bit is 0, treat the whole 8 bytes as a pointer to a structure that encodes the capacity and size and then the string bytes as usual.
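The 8-byte idea above can be sketched roughly as follows. This is an illustration under stated assumptions, not an existing library: `TinyString` and its members are hypothetical names, pointers are assumed to be 64 bits wide with `malloc` returning even addresses, the string is immutable, and the heap block stores only a size (no separate capacity) to keep it short. The tag goes in whichever byte holds a pointer's least significant bits, so the short layout is tag-first on little-endian and tag-last on big-endian.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>

// Hypothetical 8-byte string (not from any real library).
class TinyString {
public:
    explicit TinyString(const char* s) {
        const std::size_t n = std::strlen(s);
        if (n <= kMaxShort) {
            std::memset(bytes_, 0, sizeof bytes_);         // also supplies the NUL
            std::memcpy(bytes_ + data_index(), s, n);
            bytes_[tag_index()] = static_cast<unsigned char>((n << 1) | 1u);
        } else {
            // Heap block layout: [size_t size][chars...][NUL]
            void* block = std::malloc(sizeof(std::size_t) + n + 1);
            if (!block) std::abort();
            std::memcpy(block, &n, sizeof n);
            std::memcpy(static_cast<char*>(block) + sizeof n, s, n + 1);
            std::memcpy(bytes_, &block, sizeof block);     // even address: tag bit is 0
        }
    }
    ~TinyString() { if (!is_short()) std::free(heap()); }
    TinyString(const TinyString&) = delete;
    TinyString& operator=(const TinyString&) = delete;

    bool is_short() const { return (bytes_[tag_index()] & 1u) != 0; }

    std::size_t size() const {
        if (is_short()) return bytes_[tag_index()] >> 1;
        std::size_t n;
        std::memcpy(&n, heap(), sizeof n);
        return n;
    }

    const char* data() const {
        if (is_short()) return reinterpret_cast<const char*>(bytes_) + data_index();
        return static_cast<const char*>(heap()) + sizeof(std::size_t);
    }

private:
    static constexpr std::size_t kMaxShort = 6;

    // Index of the byte holding a pointer's least significant bits
    // (0 on little-endian, 7 on big-endian).
    static std::size_t tag_index() {
        const std::uintptr_t one = 1;
        unsigned char probe[sizeof one];
        std::memcpy(probe, &one, sizeof one);
        return probe[0] == 1 ? 0 : sizeof one - 1;
    }
    // Short strings start right after the tag when the tag comes first.
    static std::size_t data_index() { return tag_index() == 0 ? 1 : 0; }

    void* heap() const {
        void* p;
        std::memcpy(&p, bytes_, sizeof p);
        return p;
    }

    alignas(8) unsigned char bytes_[8];
};

static_assert(sizeof(void*) == 8 && sizeof(std::uintptr_t) == 8,
              "sketch assumes 64-bit pointers");
static_assert(sizeof(TinyString) == 8, "the whole object is one pointer wide");
```

The obvious cost is that every `size()` and `data()` call on a long string pays an extra indirection, and a mutable version would need to carry the capacity in the heap block as the comment suggests.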
The most interesting thing imo is what they all do similarly: they all store the size, instead of the end pointer -- unlike, say, std::vector. Exercise for the reader as to why this is the right tradeoff.
GCC putting a pointer at the top of the structure seems reminiscent of the way Pascal stored strings. A PString is the address of a character buffer like C, but the length of the string is stored at a negative offset. I may be remembering wrong but I think there was an older C++ STL that also used negative offsets.<p>As much as these snippets make clang look heavier, I wonder what it compiles to in practice when the compiler can make better inferences. If you can prove the state of the `is_small` bit, those branches disappear. Even at runtime, which implementation is more performant? Real-world profiling may favor clang with branch prediction and speculative execution. Then again, speculation has become a dirty word lately.[1]<p>[1] Get it? "Dirty" because of the cache. I'm sorry, that pun was entirely unintentional.
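For anyone who hasn't seen the negative-offset layout mentioned above, here is a minimal sketch of the general technique. The `prefixed_*` helpers are hypothetical and not taken from any Pascal runtime or C++ standard library; they only show the idea of handing out a pointer to the characters while the length sits just in front of it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Block layout: [size_t length][chars...][NUL]
// The handle returned to the caller points at the first character, so it
// can be passed to C APIs directly; the length lives at a negative offset.
inline char* prefixed_new(const char* s) {
    const std::size_t n = std::strlen(s);
    char* block = static_cast<char*>(std::malloc(sizeof(std::size_t) + n + 1));
    if (!block) std::abort();
    std::memcpy(block, &n, sizeof n);         // header in front of the text
    std::memcpy(block + sizeof n, s, n + 1);  // text plus terminating NUL
    return block + sizeof n;
}

// Reads the length back from just before the characters, in O(1).
inline std::size_t prefixed_length(const char* p) {
    std::size_t n;
    std::memcpy(&n, p - sizeof n, sizeof n);
    return n;
}

// Frees the whole block, starting from the hidden header.
inline void prefixed_free(char* p) {
    std::free(p - sizeof(std::size_t));
}
```

The appeal is that the handle is a plain `char*` that still knows its own length; the catch is that only pointers produced by `prefixed_new` may be passed to the other two functions.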
Obligatory link to a must watch, the CppCon 2016 talk on the complexities of std::string: <a href="https://youtu.be/kPR8h4-qZdk?si=x2DbgNIZcKyK5PKt" rel="nofollow">https://youtu.be/kPR8h4-qZdk?si=x2DbgNIZcKyK5PKt</a>
Note that EASTL (an alternative STL used in game development) does std::string the "Clang way": <a href="https://github.com/electronicarts/EASTL/blob/master/include/EASTL/string.h">https://github.com/electronicarts/EASTL/blob/master/include/...</a>
tl;dr: libc++ is just bad, libstdc++ and MSVC trade punches for first place, with the eyeball win going to the FSF.<p>Though really, the performance bottleneck in string-heavy code tends to be the heap allocator and not the string library itself.