There's one part of this document that I would push extremely hard against, and that's the notion that "extended grapheme clusters" are the one true, right way to think of characters in Unicode, and therefore any language that views the length in any other way is doing it wrong.<p>The truth of the matter is that there are several different definitions of "character", depending on what you want to use it for. An extended grapheme cluster is largely defined as "this visually displays as a single unit", which isn't necessarily correct for things like "display size in a monospace font" or "thing that gets deleted when you hit backspace." Like so many other things in Unicode, the correct answer is use-case dependent.<p>(And for this reason, string iteration should be based on codepoints--it's the fundamental level on which Unicode works, and whatever algorithm you want to use to derive the correct answer for your purpose will be based on codepoint iteration. hsivonen's article (<a href="https://hsivonen.fi/string-length/" rel="nofollow noreferrer">https://hsivonen.fi/string-length/</a>), linked in this one, does try to explain why the extended grapheme cluster is the wrong primitive to use in a language.)
This is quite a good write up. An answer to one of the author's questions:<p>> Why does the fi ligature even have its own code point? No idea.<p>One of the principles of Unicode is round-trip compatibility. That is, you should be able to read in a file encoded with some obsolete coding system and write it out again properly. Maybe frob it a bit with your unicode-based tools first. This is a good principle, though less useful today.<p>So the fi ligature was in a legacy encoding system and thus must be in Unicode. That's also why things like digits with a circle around them exist: they were in some old Japanese character set. Nowadays we might compose them with some ZWJ or even just leave them to some higher-level formatting (my preference).
> People are not limited to a single locale. For example, I can read and write English (USA), English (UK), German, and Russian. Which locale should I set my computer to?<p>Ideally - the "English-World" locale is supposedly meant for us cosmopolitans. It's included with Windows 10 and 11.<p>Practically, as "English-World" was not available in the past (and still wasn't available on platforms other than Windows the last time I checked), I have always set the locale to en-US even though I have never been to America. This leads to a number of annoyances though. E.g. LibreOffice always creates new documents in the Letter paper format and I have to switch it to A4 manually every time. It's even worse on Linux, where locales appear to be less easy to customize than in Windows. Windows has always offered a handy configuration dialog to granularly tweak your locale, choosing what measurement system you prefer, whether your weeks begin on Sundays or Mondays, and even defining your preferred date-time format templates fully manually.<p>A less-spoken-about problem is Windows' system-wide setting for the default legacy codepage. I happen to use single-language legacy (non-Unicode) apps made by people from a number of very different countries. Some apps (e.g. I can remember the Intel UHD Windows driver config app) even use this setting (ignoring the system locale and system UI language) to detect your language and render their whole UI in it.<p>> English (USA), English (UK)<p>This deserves a separate discussion. I doubt many English speakers (let alone those who don't live in a particular anglophone country) care to distinguish between English dialects. To us, the presence of a huge number of these (don't forget en-AU, en-TT, en-ZW etc - there are more!) in the options lists brings only annoyance, especially when one chooses some non-US one and this opens another can of worms.<p>By the way, I wonder how string capitalization and comparison functions manage to work on the computers of people who use both English and Turkish actively (the Turkish locale distinguishes between dotted and undotted İ).
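To make that last question concrete: Python's built-in case mapping is locale-independent, so Turkish casing needs something like ICU. A minimal sketch follows - the PyICU module name and the UnicodeString.toLower(Locale) call are written from memory, so treat them as assumptions rather than a verified recipe:<p><pre><code>
# Default (locale-independent) casing vs. Turkish-tailored casing.
word = "DIŞ"  # Turkish for "exterior"; the correct lowercase is "dış"

print(word.lower())  # 'diş' -- wrong for Turkish (and a different word, "tooth")

# İ (U+0130) lowercases to 'i' + U+0307 under the default full case mapping:
print([hex(ord(c)) for c in "İ".lower()])  # ['0x69', '0x307']

# Locale-aware casing via ICU (assumed PyICU API):
try:
    import icu
    print(str(icu.UnicodeString(word).toLower(icu.Locale("tr"))))  # expected: dış
except ImportError:
    pass  # PyICU not installed
</code></pre>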
> many Chinese, Japanese, and Korean logograms that are written very differently get assigned the same code point<p>This leads to absolutely horrendous rendering of Chinese filenames in Windows if the system locale isn’t Chinese. The characters seem to be rendered in some variant of MS Gothic and it’s very obviously a mix of Chinese and Japanese glyphs (of somewhat different sizes and/or stroke widths IIRC). I think the Chinese locale avoids the issue by using Microsoft YaHei UI.
> The only modern language that gets it right is Swift:<p>I disagree.<p>What the "right" thing is is use-case dependent.<p>For UI it's glyph-based, kinda - more precisely, some good-enough abstraction over render width. Glyphs are not always good enough for that, but they're the best you can get without adding a ton of complexity.<p>But for pretty much every other use case you want the storage byte size.<p>I mean, in the UI you care about the length of a string because there is limited width to render a string.<p>But everywhere else you care about it because of (memory) resource limitations and costs in various ways. Whether that is bandwidth cost, storage cost, number of network packets, efficient index-ability, etc. In rare cases being able to type it, but then it's often US-ASCII only, too.
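To show those different "lengths" side by side - code points, encoded bytes, and an approximation of render width (the third-party wcwidth package is an assumption here, not something from the article):<p><pre><code>
s = "naïve"  # ï as the single code point U+00EF

print(len(s))                      # 5 code points
print(len(s.encode("utf-8")))      # 6 bytes in UTF-8 (ï takes two)
print(len(s.encode("utf-16-le")))  # 10 bytes = 5 UTF-16 code units

try:
    from wcwidth import wcswidth   # rough monospace display width
    print(wcswidth(s))             # 5 terminal columns
except ImportError:
    pass
</code></pre>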
> For example, é (a single grapheme) is encoded in Unicode as e (U+0065 Latin Small Letter E) + ´ (U+0301 Combining Acute Accent). Two code points!<p>It's a poor and misleading example, for it is definitely not how 'é' is encoded in 99.999% of all the text written in, say, French out there (French is the language where 'é' is the most common).<p>'é' is U+00E9, one codepoint, definitely not two.<p>Now you could say: but it is <i>also</i> the two-codepoint one. But that's precisely what makes Unicode the complete, total and utter clusterfuck that it is.<p>And hence even an article explaining what every programmer should know about Unicode cannot even get the most basic example right. Which is honestly quite ironic.
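For what it's worth, both spellings are valid Unicode and normalization converts between them; a quick check with Python's standard library:<p><pre><code>
import unicodedata

precomposed = "\u00e9"   # é as one code point (what NFC produces)
decomposed  = "e\u0301"  # e + combining acute accent (what NFD produces)

print(precomposed == decomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True
</code></pre>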
I tried to read the article since it seemed interesting. After exactly 30 seconds of trying I had to leave the page. Impossible to read more than two sentences with all those pointers moving around - and for folks with ADHD it's even more difficult. Sorry, but I couldn't make it :(
A real question is why IBM, Apple, and Microsoft poured millions into developing the unicode standard instead of treating character encoding like file formats as a venue for competition.<p>IBM and Apple in the early 1990's combined in Taligent to try to beat MS NT, but failed. But a lot of internationalization came out of that and was made open, at the perfect time for Java to adopt it.<p>Interestingly it wasn't just CJK but Thai language variants that drove much of the flexibility in early unicode, largely because some early developers took a fancy to it.<p>When you look at the actual variety in written languages, Unicode grapheme/code-point/byte seems rather elegant.<p>We're in the early days of term vectors, small floats, and differentiable numerics (not to mention big integers). Are lessons from the history of unicode relevant?
Another Unicode article that mentions Swift, but not Raku :(<p>Raku's Str type has a `.chars` method that counts graphemes. It has a separate `.codes` method to count codepoints. It also can do O(1) string indexing at the grapheme level.<p>That Zalgo "word" example is counted as 4 chars, and the different comparisons of "Å" are all True in Raku.<p>You can argue about the merits of its approach (indeed several commenters here disagree that graphemes are the "one true way" to count characters), but it feels lacking to not at least _mention_ Raku when talking about how different programming languages handle Unicode.
> The problem is, you don’t want to operate on code points. A code point is not a unit of writing; one code point is not always a single character. What you should be iterating on is called “extended grapheme clusters”, or graphemes for short.<p>It's best to avoid making overly-general claims like this. There are plenty of situations that warrant operating on code points, and it's likely that software trying and failing to make sense of grapheme clusters will result in a worse screwup. Codepoints are probably the <i>best</i> default. For example, it probably makes the most sense for programming languages to define strings as arrays of code points, and not characters or 16-bit chunks of an encoding, or whatever.
This is pretty good. One thing I would add is to mention that Unicode defines algorithms for bidirectional text, collation (sorting order), line breaking and other text segmentation (words and sentences, besides grapheme clusters). The main point here is to know that there are specifications one should take into account when topics like that come up, instead of just inventing your own algorithm.
> Unicode is a standard that aims to unify all human languages, both past and present, and make them work with computers.<p>This is doubly wrong.<p>First, it conflates languages and writing systems. Malay and English use the same writing system but are different languages. American Sign Language is a language, but it has no standard or widely-adopted writing system. Hakka is a language, but Hakka speakers normally write in Modern Standard Mandarin, a different language.<p>Second, it's not the case that Unicode aims to encode all writing systems. For example, there are many hobbyist neographies (constructed writing systems) which will not be included in Unicode.
Unicode is a total mess. In a sane system, "extended grapheme clusters" would equal "codepoints" and it wouldn't make a difference for 99% of languages. Now we ended up with grapheme clusters, normalization, decomposition, composition, Zalgo text, etc. But instead of deprecating this nonsense, Unicode doubled down with composed Emojis.
Prior to this article, I knew graphemes were a thing and that proper unicode software is supposed to count those instead of bytes or code points.<p>I didn't know that unicode changes the definition of grapheme in backwards incompatible fashion annually, so software which works by grapheme count is probably inconsistent with other software using a different version of the standard anyway.<p>I'm therefore going to continue counting bytes. And comparing by memcmp. If the bytes look like unicode to some reader, fine. Opaque string as far as my software is concerned.
> Before comparing strings or searching for a substring, normalize!<p>...and learn about the TR39 Skeleton Algorithm for Unicode Confusables. Far too few people writing spam-handling code know about that thing.<p>(Basically, it generates matching keys from arbitrary strings so that visually similar characters compare identical, so those Disqus/Facebook/etc. spam messages promoting things like BITCO1N pump-and-dumps or using esoteric Unicode characters to advertise work-from-home scams will be wasting their time trying to disguise their words.)<p>...and since it's based on a tabular plaintext definition file, you can write a simple parser and algorithm to work it in reverse and generate sample spam exploiting that approach if you want.<p><a href="https://www.unicode.org/Public/security/latest/confusables.txt" rel="nofollow noreferrer">https://www.unicode.org/Public/security/latest/confusables.t...</a><p>> and CD-ROM!<p>I think you mean Microsoft Windows's Joliet extensions to ISO9660 which, by the way, use UCS-2, not UTF-16. (Try generating an ISO on Linux (eg. using K3b) with the Joliet option enabled and watch as filenames with emoji outside the Basic Multilingual Plane cause the process to fail.)<p>The base ISO9660 filesystem uses bytewise-encoded filenames.
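For the curious, here is a rough sketch of what the skeleton transform could look like in Python. It assumes a locally downloaded confusables.txt with the "source ; target ; type" field layout described in UTS #39; it is illustrative, not a complete implementation of the spec:<p><pre><code>
import unicodedata

def load_confusables(path="confusables.txt"):
    # Each data line maps a source code point to a prototype sequence.
    table = {}
    with open(path, encoding="utf-8-sig") as f:
        for line in f:
            line = line.split("#", 1)[0].strip()  # drop comments and headers
            if not line:
                continue
            source, target = [field.strip() for field in line.split(";")][:2]
            src = "".join(chr(int(cp, 16)) for cp in source.split())
            dst = "".join(chr(int(cp, 16)) for cp in target.split())
            table[src] = dst
    return table

def skeleton(s, table):
    # UTS #39: NFD, replace each confusable character with its prototype, NFD again.
    s = unicodedata.normalize("NFD", s)
    s = "".join(table.get(ch, ch) for ch in s)
    return unicodedata.normalize("NFD", s)

# If (as I recall) both 'I' and '1' map to the same prototype in the data,
# the two spellings collapse to the same key:
#   table = load_confusables()
#   skeleton("BITCOIN", table) == skeleton("BITCO1N", table)
</code></pre>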
I wondered about how to do simple text centering / spacing justification given graphemes showing string lengths that don't match up human-perceived characters, like in 'Café' (python len('Café') returns 5, even though we see four letters).<p>Found this! good to know about. <a href="https://pypi.org/project/grapheme/" rel="nofollow noreferrer">https://pypi.org/project/grapheme/</a> "A Python package for working with user perceived characters. "<p>(apparently the article talks about this however the blog post is largely unreadable due to dozens of animated arrow pointers jumping all over the screen)
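A tiny usage sketch, assuming the package's grapheme.length API is as I remember it from its README:<p><pre><code>
import grapheme  # pip install grapheme

s = "Cafe\u0301"           # 'Café' with a decomposed é
print(len(s))               # 5 code points
print(grapheme.length(s))   # 4 user-perceived characters
</code></pre>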
Just had this come up at work --- needed a checkbox in Microsoft Word --- oddly the solution to entering it was to use the numeric keypad, hold down the alt key and then type out 128504 which yielded a check mark when the Arial font was selected _and_ unlike Insert Symbol and other techniques didn't change the font to Segoe UI Symbol or some other font with that symbol.<p>Oddly, even though the Word UI indicated it was Arial, exporting to a PDF and inspecting that revealed that Segoe UI Symbol was being used.<p>As I've noted in the past, "If typography was easy, Microsoft Word wouldn't be the foetid mess which it is."
Just a nitpick, because the page says: "Unicode is a standard that aims to unify all human languages, both past and present, and make them work with computers." But of course Unicode is only relevant to written languages, as opposed to spoken languages (and signed languages).<p>I wish that was the only thing wrong with that page.
Regarding UTF-8 encoding:<p>“And a couple of important consequences:<p>- You CAN’T determine the length of the string by counting bytes.<p>- You CAN’T randomly jump into the middle of the string and start reading.<p>- You CAN’T get a substring by cutting at arbitrary byte offsets. You might cut off part of the character.”<p>One of the things I had to get used to when learning the programming language Janet is that strings are just plain byte sequences, unaware of any encoding. So when I call `length` on a string of one character that is represented by 2 bytes in UTF-8 (e.g. `ä`), the function returns 2 instead of 1. Similar issues occur when trying to take a substring, as mentioned by the author.<p>As much as I love the approach Janet took here (it feels clean and simple and works well with their built-in PEGs), it is a bit annoying to work with outside of the ASCII range. Fortunately, there are libraries that can deal with this issue (e.g. <a href="https://github.com/andrewchambers/janet-utf8">https://github.com/andrewchambers/janet-utf8</a>), but I wish they would support conversion to/from UTF-8 out of the box, since I generally like Janet very much.<p>One interesting thing I learned from the article is that you can always tell from a byte's prefix whether it starts a new character (and, if so, how many bytes that character takes). I always wondered how you would recognize/separate a Unicode character in a Janet string, since it may be 1-4 bytes long, but I guess this is the answer.
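It is indeed the answer: continuation bytes always start with the bits 10, and a lead byte's prefix tells you how long its sequence is. A minimal sketch of walking a raw byte string that way (plain Python, nothing Janet-specific):<p><pre><code>
def utf8_sequences(data: bytes):
    """Yield each UTF-8 encoded character as its own byte slice."""
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:             # 0xxxxxxx: single-byte (ASCII)
            n = 1
        elif b >> 5 == 0b110:    # 110xxxxx: start of a 2-byte sequence
            n = 2
        elif b >> 4 == 0b1110:   # 1110xxxx: start of a 3-byte sequence
            n = 3
        elif b >> 3 == 0b11110:  # 11110xxx: start of a 4-byte sequence
            n = 4
        else:                    # 10xxxxxx: a continuation byte, not a valid start
            raise ValueError(f"byte {i} is not the start of a sequence")
        yield data[i:i + n]
        i += n

print(list(utf8_sequences("ä!".encode("utf-8"))))  # [b'\xc3\xa4', b'!']
</code></pre>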
Pretty clearly, "every software developer" doesn't need to understand Unicode with this level of familiarity, much like "every programmer" doesn't need to know the full contents of the 114 page Drepper paper. For example, I work on a GUID-addressed object store. Everything is in term of bytes and 128-bit UUIDs. Unicode is irrelevant to everyone on my team, and most adjacent teams. There is lots of software like this.
There is another 'modern' language that does utf8 right and has done it right for a <i>long</i> time. I know it's mostly fallen out of favour, but we're still out here: Perl.<p>$ perl -wle 'use utf8; print length("");'
1<p>Without use utf8:
$ perl -wle 'print length("");'
3<p>It's funny: it was after Perl fell out of favour that it got all its best stuff. It's still my preferred language for just about everything.
> The only modern language that gets it right is Swift:<p><pre><code> print("...".count)
// => 1
</code></pre>
And Erlang/Elixir! I guess they are not "cool" enough. But they correctly interpret that as one grapheme cluster.<p><pre><code> % erl +pc unicode
> string:length("...").
1
</code></pre>
(... here is the U+1F926 U+1F3FB U+200D U+2642 U+FE0F emoji)
Please don't refer to codepoints as characters. Some are, some are not, it isn't a useful or informative approximation, it's just wrong. Unicode is a table which assigns unique numbers to different <i>codepoints</i>, most of which are characters. ZWJ is not a character at all, and extended grapheme clusters made of several codepoints are.
Wonderful to learn more about Unicode.<p>Does anyone know how to write a function (preferably in swift) to remove emoji? This is surprisingly hard (if the string can be any language, like English or Chinese).<p>There’s been multiple attempts on Stackoverflow but they’re all missing some of them, as Unicode is so complex.
> Among them is assigning the same code point to glyphs that are supposed to look differently, like Cyrillic Lowercase K and Bulgarian Lowercase K (both are U+043A).<p>This is nonsense, Bulgaria has been using the Cyrillic alphabet since its creation in … Bulgaria!<p>What you’ve shown is two different fonts, and both renderings are perfectly fine in Bulgaria.<p>Read up more about it on wikipedia: <a href="https://en.wikipedia.org/wiki/Bulgarian_alphabet" rel="nofollow noreferrer">https://en.wikipedia.org/wiki/Bulgarian_alphabet</a>
> normalization<p>A quick war-story on this: We had a system which was taking web-user input for human names, and then some of it had to be sanitized for a crappy third-party system. However some of the names were getting mangled in unexpected ways.<p>One of the (multiple) issues was that we were sometimes entirely dropping accented characters even when a good alternative existed. This occurred when we were getting "é" (U+00E9) instead of "é" (U+0065 U+0301), a regular letter E plus a special accent modifier. By forcing the second form (NFD normalization) we were able to strip "just the accents" and avoid excessively-wrong names.<p>Going further with NFKD normalization, weird stuff like "⑧" (the digit 8 in a circle) becomes a regular number 8.
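The same trick in Python's standard library, for anyone who wants to reproduce it (strip combining marks after decomposition, and let NFKD fold the compatibility characters):<p><pre><code>
import unicodedata

def strip_accents(s: str) -> str:
    # Decompose, then drop the combining marks (category Mn).
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

print(strip_accents("José Müller"))             # Jose Muller
print(unicodedata.normalize("NFKD", "\u2467"))  # 8  (U+2467, circled digit eight)
</code></pre>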
<p><pre><code> > The only modern language that gets it right is Swift:
</code></pre>
arguably not true:<p><pre><code> julia> using Unicode
# for some reason HN doesn't allow emoji
julia> graphemes(" ")
length-1 GraphemeIterator{String} for " "
help?> graphemes
search: graphemes
graphemes(s::AbstractString) -> GraphemeIterator
Return an iterator over substrings of s that correspond to the extended graphemes in the string, as defined by Unicode UAX #29. (Roughly, these are what users would perceive as single characters, even though they may contain more than one codepoint; for example a letter combined
with an accent mark is a single grapheme.)</code></pre>
> The minimum every software developer must know about Unicode<p>Just a nitpick...<p>Once more, as it is typical on HN, web programming is confused with the entire universe of software development.<p>There are plenty of software realms where ASCII not only is enough, but it actually MUST be enough.
Quotes from the article illustrating what a train wreck Unicode has become:<p>"The problem is, in Unicode, some graphemes are encoded with multiple code points!"<p>"An Extended Grapheme Cluster is a sequence of one or more Unicode code points that must be treated as a single, unbreakable character."<p>"Starting roughly in 2014, Unicode has been releasing a major revision of their standard every year."<p>"Å" === "Å"
"Å" === "Å"
"Å" === "Å"
What do you get? False? You should get false, and it’s not a mistake.<p>"That’s why we need normalization."<p>"Unicode is locale-dependent"<p>The article forgot one: characters that switch presentation to right-to-left.
This is a lot more than the minimum that <i>every</i> software dev must know about Unicode. Even if you only do web frontends, you will do fine not knowing most of this. Still a nice read, though.
What an interesting mess!<p>It occurs to me that a canonical semantic representation of all known (extracted) language concepts would be useful too.<p>Now that we have multi-language LLM's it would be an interesting challenge to create/design a canonical representation for a minimum number of base concepts, their relations and orthogonal "voice" modifiers, extracted from the latent representations of an LLM across a whole training set, over all training languages.<p>While the best LLMs still have complex reasoning issues, their understanding of concepts and voice at the sentence level is highly intuitive and accurate. So the design process could be automated.<p>The result would be a human language agnostic, cross-culture concept inclusive, regularized & normalized (relatively speaking) semantic language. Call it SEMANTICODE.<p><i>We need to get this right, using one standard LLM lineage, before the Unicode people create a super standard that spans 150 different LLM's and 150 different latent spaces!</i> :O<p>Stability between updates would be guaranteed by including SEMANTICODE as a non-human language in training of future LLM's. Perhaps including a (highly) pre-normalized semantic artificial language would dramatically speed up and reduce the parameter count needed for future multi-language training?*<p>Then LLMs could use SEMANTICODE talk to each other more reliably, efficiently, and with greater concept specificity than any of our single languages.
It sounds like a generic length function in Unicode in 2023 is no longer a good idea. These articles complaining about the variety of lengths in Unicode are annoying at this point. Pretty much all of them can be summed up as, "Well, it depends." And, that isn't wrong. But nerds love to argue until they are blue in the face about the One Correct Answer. Sheesh.<p>This is the most interesting comparison article I have seen in years about Unicode processing in C++: <a href="https://thephd.dev/the-c-c++-rust-string-text-encoding-api-landscape" rel="nofollow noreferrer">https://thephd.dev/the-c-c++-rust-string-text-encoding-api-l...</a><p>The author is also the lead on an open source C++ Unicode library called ztd.txt: <a href="https://github.com/soasis/text">https://github.com/soasis/text</a>
The author seems to hate people with concentration issues and/or various visual conditions.<p>That collaboration tools show the moving mouse cursors of other participants even if they aren't needed/wanted is already pretty bad - why bring it to a website?
If you have to recognize a grapheme cluster, it will be easier to do that from a sequence of code points, than from UTF-8.<p>It's like saying that we don't need to tokenize, because you never want to deal with tokens anyway, but phrase structures!<p>Mmkay, whatever ...
>3 Grapheme Cluster Boundaries<p>>It is important to recognize that what the user thinks of as a “character”—a basic unit of a writing system for a language—may not be just a single Unicode code point. Instead, that basic unit may be made up of multiple Unicode code points. To avoid ambiguity with the computer use of the term character, this is called a user-perceived character. For example, “G” + grave-accent is a user-perceived character: users think of it as a single character, yet is actually represented by two Unicode code points. These user-perceived characters are approximated by what is called a grapheme cluster, which can be determined programmatically.
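In Python, the third-party regex module implements that segmentation: its \X pattern matches one extended grapheme cluster, so the quoted “G” + grave-accent example comes out as a single match:<p><pre><code>
import regex  # pip install regex

s = "G\u0300o"  # 'G' + combining grave accent, then 'o'
clusters = regex.findall(r"\X", s)
print(len(s))         # 3 code points
print(clusters)       # ['G̀', 'o'] -- 2 user-perceived characters
</code></pre>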
We need, desperately and without question, two Unicode symbols for bold and italic.<p>These are <i>part of language</i> and should not be an optional proprietary add on that can be skipped or deleted from text. We've been using the two "formats" to convey <i>important</i> information since the <i>sixteenth century</i>!!!<p>It boggles my mind that we can give flesh tone to emojis, yet not mark a word as bold or italic. It makes zero sense. Especially how easy it would be to implement. It would work exactly the same way: Letters following the mark would be formatted as bold or italic until a space character or equivalent.
"<i>the definition of graphemes changes from version to version</i>"<p>In what twisted reality did someone think this a good idea?<p>Doesn't it go against the whole premise of everyone in the world agreeing on how to represent a meaningful unit of text?<p>"<i>What’s sad for us is that the rules defining grapheme clusters change every year as well. What is considered a sequence of two or three separate code points today might become a grapheme cluster tomorrow! There’s no way to know! Or prepare!</i>"<p>"<i>Even worse, different versions of your own app might be running on different Unicode standards and report different string lengths!</i>"
> The simplest possible encoding for Unicode is UTF-32. It simply stores code points as 32-bit integers.<p>Skipping over UTF-32-BE and UTF-32-LE there...<p>(I mean, it might not be an issue if it's just being used as an internal representation, but still)
<a href="https://tonsky.me/blog/unicode/overview@2x.png" rel="nofollow noreferrer">https://tonsky.me/blog/unicode/overview@2x.png</a><p>Wow, what abominable mix of decimal and hexadecimal.
> what do you think "ẇ͓̞͒͟͡ǫ̠̠̉̏͠͡ͅr̬̺͚̍͛̔͒͢d̠͎̗̳͇͆̋̊͂͐".length should be?<p>This is a nice example of the kind of thing we need to think about when defining a measure of length for Unicode strings.
> Another unfortunate example of locale dependence is the Unicode handling of dotless i in the Turkish language.<p>This isn't quite Unicode's fault, as the alternative would be to have two codepoints each for `i` and `I`, one pair for the Latin versions and one for the Turkish versions, and that would be very annoying too.<p>Whereas the Russian/Bulgarian situation is different. There used to be language tags in Unicode for that, but IIRC they got deprecated, and maybe they'll have to get undeprecated.
I'm always gonna point out these overly broad titles assuming "every software developer" is some kind of internetty web dev type. I'm a game dev; I try to never touch strings at all, they are a nightmare data type. Strings in a game are like graphics or audio assets: your game might read them and show them to the player, but they should never come anywhere near your code or even be manipulated by it. I don't need to know any of that stuff about Unicode.
The Why is "Å" !== "Å" !== "Å"? section still strikes me as wrong.
The strings are equal even when the representations differ.
Can we please get a standard that describes how emoji are supposed to look?<p>Now they look different on every platform and many subtleties are lost in translation.
<i>The only modern language that gets it right is Swift:</i><p>Apple did a fairly good job with unicode string handling starting in Cocoa and Objective-C, by providing methods to get the number of code points and/or bytes:<p><a href="https://stackoverflow.com/questions/15582267/cfstring-count-of-characters-not-code-points-in-a-string/15582268#15582268" rel="nofollow noreferrer">https://stackoverflow.com/questions/15582267/cfstring-count-...</a><p>I feel that this support of both character count and buffer size in bytes is probably the way to go. But Python 3 went wrong by trying to abstract it away with encodings that have unintuitive pitfalls that broke compatibility with Python 2:<p><a href="https://blog.feabhas.com/2019/02/python-3-unicode-and-byte-strings/" rel="nofollow noreferrer">https://blog.feabhas.com/2019/02/python-3-unicode-and-byte-s...</a><p>There's also the normalization issue. Apple goofed (IMHO) when they used NFD in HFS+ filenames while everyone else went with NFC, but fixed that in APFS:<p><a href="https://unicode.org/faq/normalization.html" rel="nofollow noreferrer">https://unicode.org/faq/normalization.html</a><p><a href="https://medium.com/@sthadewald/the-utf-8-hell-of-mac-osx-feef5ea42407" rel="nofollow noreferrer">https://medium.com/@sthadewald/the-utf-8-hell-of-mac-osx-fee...</a>
> Unicode is locale-dependent<p>Well, there is a new fact that I learned and immediately hated.<p>The fuck were the authors thinking...<p>I am now firmly convinced the people developing Unicode hate developers. I suspected it before just due to how messy it was (the same character having different encodings? Really? Fuck you), but this cements it.
> Since everybody in the world agrees on which numbers correspond to which characters, and we all agree to use Unicode, we can read each other’s texts.<p>Hmm? I thought some code points combine to create a character. Even accented latin ones can be like that.<p>Also we need to agree on what is a character.
I am torn between supporting all languages (which easily leaks into supporting emojis) versus just using the ~90 Latin characters as the lingua franca.<p>Look, I would love to be able to read/write Sanskrit, Arabic, Chinese, Japanese etc. and share that content and have everyone render and see the same thing. The problem is that I feel like most of these are:<p>1. a kind of an open problem
2. very subjective
3. very, very subjective as what you see is mostly dictated by the implementation (fonts)<p>For example, why does the gun emoji look like a water gun? Why does the skull-and-crossbones symbol look so benign? In fact, it is often used as a meme (see deadass :skull:). Why is the basmala a single "character"?<p>In my opinion, people should just learn how to use kaomoji. Granted, kaomoji rely on a lot more than the Latin characters, but they are at least artful, skillful and a natural extension of the "actual" languages.<p>> inb4 languages evolve<p>Yes, but it mostly happens naturally. I feel like what happens today mostly happens at the whim of a few passionate people in the standard.
> “I know, I’ll use a library to do strlen()!” — nobody, ever.<p>The standard library provided by languages like C, C++ <i>is</i> a library. Features like character strings are present and it's a totally reasonable expectation for the length to give you the cluster count.
>That gives us a space of about 11 million code points.
>About 170,000, or 15%, are currently defined. An additional 11% are reserved for private use. The rest, about 800,000 code points, are not allocated at the moment. They could become characters in the future.<p>1.1 million?
One of my favorite interview questions is simply "What's the difference between Unicode and utf-8?" I feel like that's pretty mandatory knowledge for any specialty but it doesn't get answered correctly very often.
In Java/Kotlin, I've found this Grapheme Splitter library to be useful: <a href="https://github.com/hiking93/grapheme-splitter-lite">https://github.com/hiking93/grapheme-splitter-lite</a>
> The only modern language that gets it right is Swift<p>The only modern language (that he knows of) that gets it right, to be precise. Ruby also gets it right.<p>```
[1] pry(main)> "".size
=> 1
```
I once bought an O'Reilly book on encoding. It was like 2000 pages. I never read it; that was about 15 years ago. My takeaway is that encoding is really complex and I just kind of pray it works, which most of the time it does.
The number of grapheme clusters in a string depends on the font being used. The length of a string should be the number of code points, because that is not font-specific.<p>Better yet, there shouldn't be a function called length.
> The only modern language that gets it right is Swift:<p>Elixir too:<p><pre><code> Interactive Elixir (1.15.4) - press Ctrl+C to exit (type h() ENTER for help)
iex(1)> String.length "ẇ͓̞͒͟͡ǫ̠̠̉̏͠͡ͅr̬̺͚̍͛̔͒͢d̠͎̗̳͇͆̋̊͂͐"
4</code></pre>
Is there a way to read this with the mouse cursors disabled? It seems like great content but all the movement on the page is way too distracting.<p>EDIT: I've never been downvoted for asking a question before. Weird, but okay.
The article's background color deserves to be named: <a href="https://colornames.org/color/fddb29" rel="nofollow noreferrer">https://colornames.org/color/fddb29</a>
What's the point of having a separate codepoint for the Angstrom if it's specified to normalize back to the regular "capital A with ring above" codepoint anyway?
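The normalization in question is easy to see from Python, since the Angstrom sign's canonical decomposition points back at the letter:<p><pre><code>
import unicodedata

angstrom = "\u212b"  # ANGSTROM SIGN
a_ring   = "\u00c5"  # LATIN CAPITAL LETTER A WITH RING ABOVE

print(angstrom == a_ring)                                # False
print(unicodedata.normalize("NFC", angstrom) == a_ring)  # True
</code></pre>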
> The rest, about 800,000 code points, are not allocated at the moment. They could become characters in the future.<p>Why is Tengwar still not in Unicode officially? What's the problem with it?
> Hell, you can’t even spell café, piñata, or naïve without Unicode.<p>I must have missed something. All of those symbols are present in Extended ASCII (i.e. 8-bit).
Honestly the "what encoding is this! UTF-8" is still the only thing we need to know. len(emoji) is still a corner case that few will care about.
If just for the fact that it annoys people, I love the mouse cursor idea. But I also find it technically interesting. Is some kind of consent legally needed per GDPR or something for this? It for sure is tracking, literally. And a website has to ask to set cookies ...
I'm sorry but the website design is extremely distracting. The mouse pointers at least are easy to delete with the inspector; The background color is not the best choice for reading material, but the inexcusable part is the width of the content.<p>This content must be really awesome for someone to go through the trouble of interacting with such a site.
I don't want to be too full of myself here, but I'm a very skilled and highly paid backend software engineer who knows roughly nothing about unicode (I google what I need when a file seems f'd up), and it's never been a problem for me.<p>I'm sure the article is good but the title is nonsense.