UTF-8 is one of the most momentous and underappreciated / relatively unknown achievements in software.<p>A sketch on a diner placemat has led to every person in the world being able to communicate written language digitally using a common software stack. Thanks to Ken Thompson and Rob Pike we have avoided the deeply siloed and incompatible world that code pages, wide chars, and other insufficient encoding schemes were guiding us towards.
UTF-8 is one of the most brilliant things I've ever seen. I only wish it had been invented and caught on before so many influential bodies started using UCS-2 instead.
Fun fact: Ken Thompson and Rob Pike of Unix, Plan 9, Go, and other fame had a heavy influence on the standard while working on Plan 9. To quote Wikipedia:<p>> Thompson's design was outlined on September 2, 1992, on a placemat in a New Jersey diner with Rob Pike.<p>If that isn't a classic story of an international standard's creation/impactful update, then I don't know what is.<p><a href="https://en.wikipedia.org/wiki/UTF-8#FSS-UTF" rel="nofollow">https://en.wikipedia.org/wiki/UTF-8#FSS-UTF</a>
Recently I learned about UTF-16 when doing some stuff with PowerShell on Windows.<p>In parallel with my annoyance with Microsoft, I realized how long it’s been since I encountered any kind of text encoding drama. As a regular typer of åäö, many hours of my youth were spent configuring shells, terminal emulators, and IRC clients to use compatible encodings.<p>The wide adoption of UTF-8 has been truly awesome. Let’s just hope it’s another 15-20 years until I have to deal with UTF-16 again…
I never understood why UTF-8 did not use the <i>much</i> simpler encoding of:<p><pre><code> - 0xxxxxxx -> 7 bits, ASCII compatible (same as UTF-8)
- 10xxxxxx -> 6 bits, more bits to come
- 11xxxxxx -> final 6 bits.
</code></pre>
It has multiple benefits:<p><pre><code> - It encodes more bits per octet: 7, 12, 18, 24 vs 7, 11, 16, 21 for UTF-8
- It is easily extensible for more bits.
 - Such an extra-bits extension is backward compatible for reasonable implementations.
</code></pre>
The last point is key: UTF-8 would need to invent a new prefix to go beyond 21 bits. Old software would not know the new prefix or what to do with it. With the simpler scheme, it could potentially work out of the box up to at least 30 bits (that's a billion code points, much more than the mere million of 21 bits).
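Roughly, an encoder for that scheme could look like this (a hypothetical sketch of the proposal above, not real UTF-8):<p><pre><code>// Hypothetical encoder for the scheme sketched above (NOT real UTF-8):
//   0xxxxxxx -> 7-bit ASCII as-is
//   10xxxxxx -> 6 payload bits, more octets to come
//   11xxxxxx -> final 6 payload bits
function encodeSimple(codePoint: number): number[] {
  if (codePoint < 0x80) return [codePoint]; // plain ASCII
  // Split the value into 6-bit groups, most significant first.
  const groups: number[] = [];
  let v = codePoint;
  while (v > 0) {
    groups.unshift(v & 0x3f);
    v >>= 6;
  }
  // All but the last group get the 10 prefix; the last gets 11.
  return groups.map((g, i) =>
    i === groups.length - 1 ? 0xc0 | g : 0x80 | g);
}

// U+1F602 (17 bits of payload) fits in three octets under this scheme
console.log(encodeSimple(0x1F602).map(b => b.toString(2).padStart(8, "0")));
// [ "10011111", "10011000", "11000010" ]
</code></pre>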
Rob Pike wrote up his version of its inception almost 20 years ago.<p>The history of UTF-8 as told by Rob Pike (2003): <a href="http://doc.cat-v.org/bell_labs/utf-8_history" rel="nofollow">http://doc.cat-v.org/bell_labs/utf-8_history</a><p>Recent HN discussion: <a href="https://news.ycombinator.com/item?id=26735958" rel="nofollow">https://news.ycombinator.com/item?id=26735958</a>
Excellent presentation! One improvement to consider is that many usages of "code point" should be "Unicode scalar value" instead. Basically, you don't want to use UTF-8 to encode UTF-16 surrogate code points (which are not scalar values).<p>Fun fact, UTF-8's prefix scheme can cover up to 31 payload bits. See <a href="https://en.wikipedia.org/wiki/UTF-8#FSS-UTF" rel="nofollow">https://en.wikipedia.org/wiki/UTF-8#FSS-UTF</a> , section "FSS-UTF (1992) / UTF-8 (1993)".<p>A manifesto that was much more important ~15 years ago when UTF-8 hadn't completely won yet: <a href="https://utf8everywhere.org/" rel="nofollow">https://utf8everywhere.org/</a>
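For instance (assuming a JS runtime with TextEncoder, e.g. a browser or Node), a lone surrogate is not a scalar value, so the encoder substitutes U+FFFD instead of emitting an invalid sequence for the surrogate itself:<p><pre><code>const enc = new TextEncoder();

// U+0041 'A' is a scalar value: encodes normally.
console.log(enc.encode("A"));      // Uint8Array [ 0x41 ]

// U+D83D is a lone UTF-16 surrogate, not a scalar value:
// it is replaced with U+FFFD, which encodes as EF BF BD.
console.log(enc.encode("\uD83D")); // Uint8Array [ 0xEF, 0xBF, 0xBD ]
</code></pre>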
The following article is one of my favorite primers on Character sets/Unicode :
<a href="https://www.joelonsoftware.com/2003/10/08/the-absolute-minim" rel="nofollow">https://www.joelonsoftware.com/2003/10/08/the-absolute-minim</a>...
I spent 2 hours last Friday trying to wrap my head around what UTF-8 was (<a href="https://www.joelonsoftware.com/2003/10/08/the-absolute-minim" rel="nofollow">https://www.joelonsoftware.com/2003/10/08/the-absolute-minim</a> is great, but doesn't explain the inner workings like this does) and completely failed, could not understand it. This made it super easy to grok, thank you!
><i>NOTE: You can always find a character boundary from an arbitrary point in a stream of octets by moving left an octet each time the current octet starts with the bit prefix 10 which indicates a tail octet. At most you'll have to move left 3 octets to find the nearest header octet.</i><p>This is incorrect. You can only find boundaries between code points this way.<p>Until you learn that not all "user-perceived characters" (grapheme clusters) can be expressed as a single code point, Unicode seems cool. These UTF-8 explanations explain the encoding but leave out this unfortunate detail. The author might not even know it, because they deal with a subset of Unicode in their life.<p>If you want to split text between two user-perceived characters, not just between two code points, this tutorial does not help.<p>Unicode encodings are great if you want to handle a subset of languages and characters; if you want to be complete, it's a mess.
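To see the difference in practice (a sketch assuming a runtime with Intl.Segmenter, e.g. a recent browser or Node):<p><pre><code>const flag = "\u{1F1FA}\u{1F1F8}"; // the US flag emoji: one user-perceived character

// Code points: two regional indicators, U+1F1FA and U+1F1F8.
console.log([...flag].length); // 2

// UTF-8 octets: eight (four per code point).
console.log(new TextEncoder().encode(flag).length); // 8

// Grapheme clusters: one, but you need a segmenter to know that.
const seg = new Intl.Segmenter("en", { granularity: "grapheme" });
console.log([...seg.segment(flag)].length); // 1
</code></pre>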
Not sure if the issue is with Chrome or my local config generally (bog-standard Windows, nothing fancy), but the US-flag example doesn't render as intended. It shows as "US", with the components in the next step being "U" and "S" (not the ASCII characters U & S; the encoding is as intended, but those glyphs are being shown in place of the intended ones).<p>It displays as I assume intended in Firefox on the same machine: the American flag emoji, then, when broken down in the next step, U-in-a-box & S-in-a-box. The other examples seem fine in Chrome.<p>Take care when using relatively new additions to the Unicode emoji set; test to make sure your intentions are correctly displayed in all the browsers you might expect your audience to be using.
Great explanation. The only part that tripped me up was in determining the number of octets to represent the codepoint. From the post:<p>>From the previous diagram the value 0x1F602 falls in the range for a 4 octets header (between 0x10000 and 0x10FFFF)<p>Using the diagram in the post would be a crutch to rely on. It seems easier to remember the maximum number of "data" bits that each octet layout can support (7, 11, 16, 21). Then by knowing that 0x1F602 maps to 11111011000000010, which is 17 bits, you know it must fit into the 4-octet layout, which can hold 21 bits.
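That rule of thumb is easy to turn into code (a rough sketch using the 7/11/16/21 payload limits):<p><pre><code>// Number of UTF-8 octets needed for a scalar value, by payload bit count.
function utf8Length(codePoint: number): number {
  const bits = codePoint.toString(2).length; // 0x1F602 -> 17 bits
  if (bits <= 7)  return 1;
  if (bits <= 11) return 2;
  if (bits <= 16) return 3;
  return 4;                                  // up to 21 bits
}

console.log(utf8Length(0x1F602)); // 4
</code></pre>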
Excellent article, really helped me learn.<p>I'd like to add a correction. The binary/ascii/utf-8 value of 'a' (hex 0x61) is not 01010111, but instead 01100001.<p>This is used incorrectly in both the Giant reference card, and in the "ascii encoding" diagram above it.
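Easy to double-check in a console, for what it's worth:<p><pre><code>console.log("a".charCodeAt(0).toString(16));                  // "61"
console.log("a".charCodeAt(0).toString(2).padStart(8, "0"));  // "01100001"
</code></pre>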
There is an error in the first example under the Giant Reference Card.<p>The bytes come out as:<p>0xF0 0x9F 0x87 0xBA 0xF0 0x9F 0x87 0xBA<p>but the bits directly above them all show the bit pattern: 010 10111
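For comparison, here is what an encoder actually produces for that example (U+1F1FA followed by U+1F1F8), assuming a runtime with TextEncoder:<p><pre><code>const bytes = new TextEncoder().encode("\u{1F1FA}\u{1F1F8}"); // the US flag emoji
console.log([...bytes].map(b => b.toString(16)).join(" "));
// f0 9f 87 ba f0 9f 87 b8
</code></pre>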
I love Tom Scott's explanation of Unicode: <a href="https://www.youtube.com/watch?v=MijmeoH9LT4" rel="nofollow">https://www.youtube.com/watch?v=MijmeoH9LT4</a>
If you’re more into watching a presentation, I recorded “A Brief History of Unicode” last year, And there’s a YouTube recording of it as well as the slides:<p><a href="https://speakerdeck.com/alblue/a-brief-history-of-unicode-4524a734-aac3-4ce9-8c4a-6f4ada04f464" rel="nofollow">https://speakerdeck.com/alblue/a-brief-history-of-unicode-45...</a><p><a href="https://youtu.be/NN3g4JbbjTE" rel="nofollow">https://youtu.be/NN3g4JbbjTE</a>
Great post and intuitive visuals! I recently had to rack my brain around UTF-8 encoding and decoding when building the Unicode ETH Project (<a href="https://github.com/devstein/unicode-eth" rel="nofollow">https://github.com/devstein/unicode-eth</a>) and this post would have been very useful
BTW here’s a surprise I had to learn at some point: strings in JS are UTF-16. Keep that in mind if you want to use the console to follow this great article: you’ll get the surrogate pair for the emoji instead.
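For example, in a browser or Node console:<p><pre><code>const s = "\u{1F602}"; // 😂, a single code point above U+FFFF

console.log(s.length);                        // 2 — UTF-16 code units, not characters
console.log(s.charCodeAt(0).toString(16));    // "d83d" — high surrogate
console.log(s.charCodeAt(1).toString(16));    // "de02" — low surrogate
console.log(s.codePointAt(0)!.toString(16));  // "1f602" — the code point itself
</code></pre>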
I wonder how large a font must be to display all UTF-8 characters...<p>I'm also waiting for new emojis; they've recently added more and more that can be used as icons, which is simpler than integrating PNG or SVG icons.
Is there any standard system where each byte/word maps to one character/grapheme? I feel there is a general sentiment that not being able to jump to the Nth character is programmatically irritating and disappointing. I'm sure such a system wouldn't support some languages - but in the words of Lord Farquaad, "That's a sacrifice I'm willing to make". Most of the world's languages would do just fine, and it'd make sense to exclude right-to-left Arabic ligatured text in, for instance, your monospaced computer code.<p>I'm guessing you could extract a subset of UTF-8 - but has anyone done anything like that?