UTF-8 is one of the most momentous and underappreciated / relatively unknown achievements in software.<p>A sketch on a diner placemat has led to every person in the world being able to communicate written language digitally using a common software stack. Thanks to Ken Thompson and Rob Pike we have avoided the deeply siloed and incompatible world that code pages, wide chars, and other insufficient encoding schemes were guiding us towards.
UTF-8 is one of the most brilliant things I've ever seen. I only wish it had been invented and caught on before so many influential bodies started using UCS-2 instead.
Fun fact: Ken Thompson and Rob Pike of Unix, Plan 9, Go, and other fame had a heavy influence on the standard while working on Plan 9. To quote Wikipedia:<p>> Thompson's design was outlined on September 2, 1992, on a placemat in a New Jersey diner with Rob Pike.<p>If that isn't a classic story of an international standard's creation/impactful update, then I don't know what is.<p><a href="https://en.wikipedia.org/wiki/UTF-8#FSS-UTF" rel="nofollow">https://en.wikipedia.org/wiki/UTF-8#FSS-UTF</a>
Recently I learned about UTF-16 when doing some stuff with PowerShell on Windows.<p>In parallel with my annoyance with Microsoft, I realized how long it’s been since I encountered any kind of text encoding drama. As a regular typer of åäö, many hours of my youth were spent configuring shells, terminal emulators, and IRC clients to use compatible encodings.<p>The wide adoption of UTF-8 has been truly awesome. Let’s just hope it’s another 15-20 years until I have to deal with UTF-16 again…
I never understood why UTF-8 did not use the <i>much</i> simpler encoding of:<p><pre><code> - 0xxxxxxx -> 7 bits, ASCII compatible (same as UTF-8)
- 10xxxxxx -> 6 bits, more bits to come
- 11xxxxxx -> final 6 bits.
</code></pre>
It has multiple benefits:<p><pre><code> - It encodes more bits per octet: 7, 12, 18, 24 vs 7, 11, 16, 21 for UTF-8
- It is easily extensible for more bits.
 - Such an extra-bits extension is backward compatible for reasonable implementations.
</code></pre>
The last point is key: UTF-8 would need to invent a new prefix to go beyond 21 bits. Old software would not know the new prefix or what to do with it. With the simpler scheme, it could potentially work out of the box up to at least 30 bits (that's a billion code points, much more than the mere million of 21 bits).
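Roughly, an encoder for that scheme could look like this (a hypothetical sketch of the proposal above, not real UTF-8):<p><pre><code>// Hypothetical encoder for the scheme sketched above (NOT real UTF-8):
//   0xxxxxxx -> 7-bit ASCII as-is
//   10xxxxxx -> 6 payload bits, more octets to come
//   11xxxxxx -> final 6 payload bits
function encodeSimple(codePoint: number): number[] {
  if (codePoint < 0x80) return [codePoint]; // plain ASCII
  // Split the value into 6-bit groups, most significant first.
  const groups: number[] = [];
  let v = codePoint;
  while (v > 0) {
    groups.unshift(v & 0x3f);
    v >>= 6;
  }
  // All but the last group get the 10 prefix; the last gets 11.
  return groups.map((g, i) =>
    i === groups.length - 1 ? 0xc0 | g : 0x80 | g);
}

// U+1F602 (17 bits of payload) fits in three octets under this scheme
console.log(encodeSimple(0x1F602).map(b => b.toString(2).padStart(8, "0")));
// [ "10011111", "10011000", "11000010" ]
</code></pre>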
Rob Pike wrote up his version of its inception almost 20 years ago.<p>The history of UTF-8 as told by Rob Pike (2003): <a href="http://doc.cat-v.org/bell_labs/utf-8_history" rel="nofollow">http://doc.cat-v.org/bell_labs/utf-8_history</a><p>Recent HN discussion: <a href="https://news.ycombinator.com/item?id=26735958" rel="nofollow">https://news.ycombinator.com/item?id=26735958</a>
Excellent presentation! One improvement to consider is that many usages of "code point" should be "Unicode scalar value" instead. Basically, you don't want to use UTF-8 to encode UTF-16 surrogate code points (which are not scalar values).<p>Fun fact, UTF-8's prefix scheme can cover up to 31 payload bits. See <a href="https://en.wikipedia.org/wiki/UTF-8#FSS-UTF" rel="nofollow">https://en.wikipedia.org/wiki/UTF-8#FSS-UTF</a> , section "FSS-UTF (1992) / UTF-8 (1993)".<p>A manifesto that was much more important ~15 years ago when UTF-8 hadn't completely won yet: <a href="https://utf8everywhere.org/" rel="nofollow">https://utf8everywhere.org/</a>
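For instance (assuming a JS runtime with TextEncoder, e.g. a browser or Node), a lone surrogate is not a scalar value, so the encoder substitutes U+FFFD instead of emitting an invalid sequence for the surrogate itself:<p><pre><code>const enc = new TextEncoder();

// U+0041 'A' is a scalar value: encodes normally.
console.log(enc.encode("A"));      // Uint8Array [ 0x41 ]

// U+D83D is a lone UTF-16 surrogate, not a scalar value:
// it is replaced with U+FFFD, which encodes as EF BF BD.
console.log(enc.encode("\uD83D")); // Uint8Array [ 0xEF, 0xBF, 0xBD ]
</code></pre>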
The following article is one of my favorite primers on Character sets/Unicode :
<a href="https://www.joelonsoftware.com/2003/10/08/the-absolute-minim" rel="nofollow">https://www.joelonsoftware.com/2003/10/08/the-absolute-minim</a>...
I spent 2 hours last Friday trying to wrap my head around what UTF-8 was (<a href="https://www.joelonsoftware.com/2003/10/08/the-absolute-minim" rel="nofollow">https://www.joelonsoftware.com/2003/10/08/the-absolute-minim</a> is great, but doesn't explain the inner workings like this does) and completely failed, could not understand it. This made it super easy to grok, thank you!
><i>NOTE: You can always find a character boundary from an arbitrary point in a stream of octets by moving left an octet each time the current octet starts with the bit prefix 10 which indicates a tail octet. At most you'll have to move left 3 octets to find the nearest header octet.</i><p>This is incorrect. You can only find boundaries between code points this way.<p>Until you learn that not all "user-perceived characters" (grapheme clusters) can be expressed as a single code point, Unicode seems cool. These UTF-8 explanations explain the encoding but leave out this unfortunate detail. The author might not even know it, because they deal with a subset of Unicode in their life.<p>If you want to split text between two user-perceived characters, not just between two code points, this tutorial does not help.<p>Unicode encodings are great if you want to handle a subset of languages and characters; if you want to be complete, it's a mess.
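To see the difference in practice (a sketch assuming a runtime with Intl.Segmenter, e.g. a recent browser or Node):<p><pre><code>const flag = "\u{1F1FA}\u{1F1F8}"; // the US flag emoji: one user-perceived character

// Code points: two regional indicators, U+1F1FA and U+1F1F8.
console.log([...flag].length); // 2

// UTF-8 octets: eight (four per code point).
console.log(new TextEncoder().encode(flag).length); // 8

// Grapheme clusters: one, but you need a segmenter to know that.
const seg = new Intl.Segmenter("en", { granularity: "grapheme" });
console.log([...seg.segment(flag)].length); // 1
</code></pre>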
Not sure if the issue is with Chrome or my local config generally (bog-standard Windows, nothing fancy), but the US-flag example doesn't render as intended. It shows as "US", with the components in the next step being "U" and "S" (not the ASCII characters U & S; the encoding is as intended, but those glyphs are being shown in place of the intended ones).<p>It displays as I assume intended in Firefox on the same machine: the American flag emoji, then, when broken down in the next step, U-in-a-box & S-in-a-box. The other examples seem fine in Chrome.<p>Take care when using relatively new additions to the Unicode emoji set; test to make sure your intentions are correctly displayed in all the browsers you might expect your audience to be using.
Great explanation. The only part that tripped me up was in determining the number of octets to represent the codepoint. From the post:<p>>From the previous diagram the value 0x1F602 falls in the range for a 4 octets header (between 0x10000 and 0x10FFFF)<p>Using the diagram in the post would be a crutch to rely on. It seems easier to remember the maximum number of "data" bits that each octet layout can support (7, 11, 16, 21). Then by knowing that 0x1F602 maps to 11111011000000010, which is 17 bits, you know it must fit into the 4-octet layout, which can hold 21 bits.
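That rule of thumb is easy to turn into code (a rough sketch using the 7/11/16/21 payload limits):<p><pre><code>// Number of UTF-8 octets needed for a scalar value, by payload bit count.
function utf8Length(codePoint: number): number {
  const bits = codePoint.toString(2).length; // 0x1F602 -> 17 bits
  if (bits <= 7)  return 1;
  if (bits <= 11) return 2;
  if (bits <= 16) return 3;
  return 4;                                  // up to 21 bits
}

console.log(utf8Length(0x1F602)); // 4
</code></pre>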
Excellent article, really helped me learn.<p>I'd like to add a correction. The binary/ascii/utf-8 value of 'a' (hex 0x61) is not 01010111, but instead 01100001.<p>This is used incorrectly in both the Giant reference card, and in the "ascii encoding" diagram above it.
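Easy to double-check in a console, for what it's worth:<p><pre><code>console.log("a".charCodeAt(0).toString(16));                  // "61"
console.log("a".charCodeAt(0).toString(2).padStart(8, "0"));  // "01100001"
</code></pre>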
There is an error in the first example under the Giant Reference Card.<p>The bytes come out as:<p>0xF0 0x9F 0x87 0xBA 0xF0 0x9F 0x87 0xBA<p>but the bits directly above them all show the bit pattern: 010 10111
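For comparison, here is what an encoder actually produces for that example (U+1F1FA followed by U+1F1F8), assuming a runtime with TextEncoder:<p><pre><code>const bytes = new TextEncoder().encode("\u{1F1FA}\u{1F1F8}"); // the US flag emoji
console.log([...bytes].map(b => b.toString(16)).join(" "));
// f0 9f 87 ba f0 9f 87 b8
</code></pre>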
I love Tom Scott's explanation of Unicode: <a href="https://www.youtube.com/watch?v=MijmeoH9LT4" rel="nofollow">https://www.youtube.com/watch?v=MijmeoH9LT4</a>
If you’re more into watching a presentation, I recorded “A Brief History of Unicode” last year, And there’s a YouTube recording of it as well as the slides:<p><a href="https://speakerdeck.com/alblue/a-brief-history-of-unicode-4524a734-aac3-4ce9-8c4a-6f4ada04f464" rel="nofollow">https://speakerdeck.com/alblue/a-brief-history-of-unicode-45...</a><p><a href="https://youtu.be/NN3g4JbbjTE" rel="nofollow">https://youtu.be/NN3g4JbbjTE</a>
Great post and intuitive visuals! I recently had to rack my brain around UTF-8 encoding and decoding when building the Unicode ETH Project (<a href="https://github.com/devstein/unicode-eth" rel="nofollow">https://github.com/devstein/unicode-eth</a>) and this post would have been very useful
BTW here’s a surprise I had to learn at some point: strings in JS are UTF-16. Keep that in mind if you want to use the console to follow this great article: you’ll get the surrogate pair for the emoji instead.
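For example, in a browser or Node console:<p><pre><code>const s = "\u{1F602}"; // 😂, a single code point above U+FFFF

console.log(s.length);                        // 2 — UTF-16 code units, not characters
console.log(s.charCodeAt(0).toString(16));    // "d83d" — high surrogate
console.log(s.charCodeAt(1).toString(16));    // "de02" — low surrogate
console.log(s.codePointAt(0)!.toString(16));  // "1f602" — the code point itself
</code></pre>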
I wonder how large a font must be to display all UTF-8 characters...<p>I'm also waiting for new emojis; they've recently added more and more that can be used as icons, which is simpler than integrating PNG or SVG icons.
Is there any standard system where each byte/word maps to one character/grapheme? I feel there is a general sentiment that not being able to jump to the Nth character is programmatically irritating and disappointing. I'm sure such a system wouldn't support some languages - but in the words of Lord Farquaad, "That's a sacrifice I'm willing to make". Most of the world's languages would do just fine, and it'd make sense to exclude right-to-left Arabic ligatured text in, for instance, your monospaced computer code.<p>I'm guessing you could extract a subset of UTF-8 - but has anyone done anything like that?