There's one part of this document that I would push extremely hard against, and that's the notion that "extended grapheme clusters" are the one true, right way to think of characters in Unicode, and therefore any language that views the length in any other way is doing it wrong.<p>The truth of the matter is that there are several different definitions of "character", depending on what you want to use it for. An extended grapheme cluster is largely defined as "this visually displays as a single unit", which isn't necessarily correct for things like "display size in a monospace font" or "thing that gets deleted when you hit backspace." Like so many other things in Unicode, the correct answer is use-case dependent.<p>(And for this reason, string iteration should be based on codepoints--it's the fundamental level on which Unicode works, and whatever algorithm you want to use to derive the correct answer for your purpose will be based on codepoint iteration. hsivonen's article (<a href="https://hsivonen.fi/string-length/" rel="nofollow noreferrer">https://hsivonen.fi/string-length/</a>), linked in this one, does try to explain why the extended grapheme cluster is the wrong primitive to use in a language.)
This is quite a good write up. An answer to one of the author's questions:<p>> Why does the fi ligature even have its own code point? No idea.<p>One of the principles of Unicode is round-trip compatibility. That is, you should be able to read in a file encoded with some obsolete coding system and write it out again properly. Maybe frob it a bit with your unicode-based tools first. This is a good principle, though less useful today.<p>So the fi ligature was in a legacy encoding system and thus must be in Unicode. That's also why things like digits with a circle around them exist: they were in some old Japanese character set. Nowadays we might compose them with some ZWJ or even just leave them to some higher-level formatting (my preference).
> People are not limited to a single locale. For example, I can read and write English (USA), English (UK), German, and Russian. Which locale should I set my computer to?<p>Ideally - the "English-World" locale is supposedly meant for us cosmopolitans. It's included with Windows 10 and 11.<p>Practically, as "English-World" was not available in the past (and still wasn't available on platforms other than Windows the last time I checked), I have always set the locale to en-US even though I have never been to America. This leads to a number of annoyances though. E.g. LibreOffice always creates new documents in the Letter paper format and I have to switch it to A4 manually every time. It's even worse on Linux, where locales appear to be less easy to customize than in Windows. Windows has always offered a handy configuration dialog to granularly tweak your locale, choosing what measurement system you prefer, whether your weeks begin on Sundays or Mondays, and even defining your preferred date-time format templates fully manually.<p>A less-spoken-about problem is Windows' system-wide setting for the default legacy codepage. I happen to use single-language legacy (non-Unicode) apps made by people from a number of very different countries. Some apps (e.g. I can remember the Intel UHD Windows driver config app) even use this setting (ignoring the system locale and system UI language) to detect your language and render their whole UI in it.<p>> English (USA), English (UK)<p>This deserves a separate discussion. I doubt many English speakers (let alone those who don't live in a particular anglophone country) care to distinguish between English dialects. To us, the presence of a huge number of these (don't forget en-AU, en-TT, en-ZW etc - there are more!) in the options lists brings only annoyance, especially when one chooses some non-US one and this opens another can of worms.<p>By the way, I wonder how string capitalization and comparison functions manage to work on the computers of people who use both English and Turkish actively (the Turkish locale distinguishes between dotted and undotted İ).
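To make that last question concrete: Python's built-in case mapping is locale-independent, so Turkish casing needs something like ICU. A minimal sketch follows - the PyICU module name and the UnicodeString.toLower(Locale) call are written from memory, so treat them as assumptions rather than a verified recipe:<p><pre><code>
# Default (locale-independent) casing vs. Turkish-tailored casing.
word = "DIŞ"  # Turkish for "exterior"; the correct lowercase is "dış"

print(word.lower())  # 'diş' -- wrong for Turkish (and a different word, "tooth")

# İ (U+0130) lowercases to 'i' + U+0307 under the default full case mapping:
print([hex(ord(c)) for c in "İ".lower()])  # ['0x69', '0x307']

# Locale-aware casing via ICU (assumed PyICU API):
try:
    import icu
    print(str(icu.UnicodeString(word).toLower(icu.Locale("tr"))))  # expected: dış
except ImportError:
    pass  # PyICU not installed
</code></pre>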
> many Chinese, Japanese, and Korean logograms that are written very differently get assigned the same code point<p>This leads to absolutely horrendous rendering of Chinese filenames in Windows if the system locale isn’t Chinese. The characters seem to be rendered in some variant of MS Gothic and it’s very obviously a mix of Chinese and Japanese glyphs (of somewhat different sizes and/or stroke widths IIRC). I think the Chinese locale avoids the issue by using Microsoft YaHei UI.
> The only modern language that gets it right is Swift:<p>I disagree.<p>What the "right" thing is is use-case dependent.<p>For UI it's glyph-based, kinda - more precisely, some good-enough abstraction over render width. Glyphs are not always good enough for that, but they're the best you can get without adding a ton of complexity.<p>But for pretty much every other use case you want the storage byte size.<p>I mean, in the UI you care about the length of a string because there is limited width to render a string.<p>But everywhere else you care about it because of (memory) resource limitations and costs in various ways. Whether that is bandwidth cost, storage cost, number of network packets, efficient index-ability, etc. In rare cases being able to type it, but then it's often US-ASCII only, too.
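To show those different "lengths" side by side - code points, encoded bytes, and an approximation of render width (the third-party wcwidth package is an assumption here, not something from the article):<p><pre><code>
s = "naïve"  # ï as the single code point U+00EF

print(len(s))                      # 5 code points
print(len(s.encode("utf-8")))      # 6 bytes in UTF-8 (ï takes two)
print(len(s.encode("utf-16-le")))  # 10 bytes = 5 UTF-16 code units

try:
    from wcwidth import wcswidth   # rough monospace display width
    print(wcswidth(s))             # 5 terminal columns
except ImportError:
    pass
</code></pre>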
> For example, é (a single grapheme) is encoded in Unicode as e (U+0065 Latin Small Letter E) + ´ (U+0301 Combining Acute Accent). Two code points!<p>It's a poor and misleading example, for it is definitely not how 'é' is encoded in 99.999% of all the text written in, say, French out there (French is the language where 'é' is the most common).<p>'é' is U+00E9, one codepoint, definitely not two.<p>Now you could say: but it is <i>also</i> the two-codepoint one. But that's precisely what makes Unicode the complete, total and utter clusterfuck that it is.<p>And hence even an article explaining what every programmer should know about Unicode cannot even get the most basic example right. Which is honestly quite ironic.
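For what it's worth, both spellings are valid Unicode and normalization converts between them; a quick check with Python's standard library:<p><pre><code>
import unicodedata

precomposed = "\u00e9"   # é as one code point (what NFC produces)
decomposed  = "e\u0301"  # e + combining acute accent (what NFD produces)

print(precomposed == decomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True
</code></pre>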
I tried to read the article since it seemed interesting. After exactly 30 seconds of trying I had to leave the page. Impossible to read more than two sentences with all those pointers moving around - and for folks with ADHD it's even more difficult. Sorry, but I couldn't make it :(
A real question is why IBM, Apple, and Microsoft poured millions into developing the unicode standard instead of treating character encoding like file formats as a venue for competition.<p>IBM and Apple in the early 1990's combined in Taligent to try to beat MS NT, but failed. But a lot of internationalization came out of that and was made open, at the perfect time for Java to adopt it.<p>Interestingly it wasn't just CJK but Thai language variants that drove much of the flexibility in early unicode, largely because some early developers took a fancy to it.<p>When you look at the actual variety in written languages, Unicode grapheme/code-point/byte seems rather elegant.<p>We're in the early days of term vectors, small floats, and differentiable numerics (not to mention big integers). Are lessons from the history of unicode relevant?
Another Unicode article that mentions Swift, but not Raku :(<p>Raku's Str type has a `.chars` method that counts graphemes. It has a separate `.codes` method to count codepoints. It also can do O(1) string indexing at the grapheme level.<p>That Zalgo "word" example is counted as 4 chars, and the different comparisons of "Å" are all True in Raku.<p>You can argue about the merits of its approach (indeed several commenters here disagree that graphemes are the "one true way" to count characters), but it feels lacking to not at least _mention_ Raku when talking about how different programming languages handle Unicode.
> The problem is, you don’t want to operate on code points. A code point is not a unit of writing; one code point is not always a single character. What you should be iterating on is called “extended grapheme clusters”, or graphemes for short.<p>It's best to avoid making overly-general claims like this. There are plenty of situations that warrant operating on code points, and it's likely that software trying and failing to make sense of grapheme clusters will result in a worse screwup. Codepoints are probably the <i>best</i> default. For example, it probably makes the most sense for programming languages to define strings as arrays of code points, and not characters or 16-bit chunks of an encoding, or whatever.
This is pretty good. One thing I would add is to mention that Unicode defines algorithms for bidirectional text, collation (sorting order), line breaking and other text segmentation (words and sentences, besides grapheme clusters). The main point here is to know that there are specifications one should take into account when topics like that come up, instead of just inventing your own algorithm.
> Unicode is a standard that aims to unify all human languages, both past and present, and make them work with computers.<p>This is doubly wrong.<p>First, it conflates languages and writing systems. Malay and English use the same writing system but are different languages. American Sign Language is a language, but it has no standard or widely-adopted writing system. Hakka is a language, but Hakka speakers normally write in Modern Standard Mandarin, a different language.<p>Second, it's not the case that Unicode aims to encode all writing systems. For example, there are many hobbyist neographies (constructed writing systems) which will not be included in Unicode.
Unicode is a total mess. In a sane system, "extended grapheme clusters" would equal "codepoints" and it wouldn't make a difference for 99% of languages. Now we ended up with grapheme clusters, normalization, decomposition, composition, Zalgo text, etc. But instead of deprecating this nonsense, Unicode doubled down with composed Emojis.
Prior to this article, I knew graphemes were a thing and that proper unicode software is supposed to count those instead of bytes or code points.<p>I didn't know that unicode changes the definition of grapheme in backwards incompatible fashion annually, so software which works by grapheme count is probably inconsistent with other software using a different version of the standard anyway.<p>I'm therefore going to continue counting bytes. And comparing by memcmp. If the bytes look like unicode to some reader, fine. Opaque string as far as my software is concerned.
> Before comparing strings or searching for a substring, normalize!<p>...and learn about the TR39 Skeleton Algorithm for Unicode Confusables. Far too few people writing spam-handling code know about that thing.<p>(Basically, it generates matching keys from arbitrary strings so that visually similar characters compare identical, so those Disqus/Facebook/etc. spam messages promoting things like BITCO1N pump-and-dumps or using esoteric Unicode characters to advertise work-from-home scams will be wasting their time trying to disguise their words.)<p>...and since it's based on a tabular plaintext definition file, you can write a simple parser and algorithm to work it in reverse and generate sample spam exploiting that approach if you want.<p><a href="https://www.unicode.org/Public/security/latest/confusables.txt" rel="nofollow noreferrer">https://www.unicode.org/Public/security/latest/confusables.t...</a><p>> and CD-ROM!<p>I think you mean Microsoft Windows's Joliet extensions to ISO9660 which, by the way, use UCS-2, not UTF-16. (Try generating an ISO on Linux (eg. using K3b) with the Joliet option enabled and watch as filenames with emoji outside the Basic Multilingual Plane cause the process to fail.)<p>The base ISO9660 filesystem uses bytewise-encoded filenames.
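For the curious, here is a rough sketch of what the skeleton transform could look like in Python. It assumes a locally downloaded confusables.txt with the "source ; target ; type" field layout described in UTS #39; it is illustrative, not a complete implementation of the spec:<p><pre><code>
import unicodedata

def load_confusables(path="confusables.txt"):
    # Each data line maps a source code point to a prototype sequence.
    table = {}
    with open(path, encoding="utf-8-sig") as f:
        for line in f:
            line = line.split("#", 1)[0].strip()  # drop comments and headers
            if not line:
                continue
            source, target = [field.strip() for field in line.split(";")][:2]
            src = "".join(chr(int(cp, 16)) for cp in source.split())
            dst = "".join(chr(int(cp, 16)) for cp in target.split())
            table[src] = dst
    return table

def skeleton(s, table):
    # UTS #39: NFD, replace each confusable character with its prototype, NFD again.
    s = unicodedata.normalize("NFD", s)
    s = "".join(table.get(ch, ch) for ch in s)
    return unicodedata.normalize("NFD", s)

# If (as I recall) both 'I' and '1' map to the same prototype in the data,
# the two spellings collapse to the same key:
#   table = load_confusables()
#   skeleton("BITCOIN", table) == skeleton("BITCO1N", table)
</code></pre>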
I wondered about how to do simple text centering / spacing justification given graphemes showing string lengths that don't match up human-perceived characters, like in 'Café' (python len('Café') returns 5, even though we see four letters).<p>Found this! good to know about. <a href="https://pypi.org/project/grapheme/" rel="nofollow noreferrer">https://pypi.org/project/grapheme/</a> "A Python package for working with user perceived characters. "<p>(apparently the article talks about this however the blog post is largely unreadable due to dozens of animated arrow pointers jumping all over the screen)
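A tiny usage sketch, assuming the package's grapheme.length API is as I remember it from its README:<p><pre><code>
import grapheme  # pip install grapheme

s = "Cafe\u0301"           # 'Café' with a decomposed é
print(len(s))               # 5 code points
print(grapheme.length(s))   # 4 user-perceived characters
</code></pre>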
Just had this come up at work --- needed a checkbox in Microsoft Word --- oddly the solution to entering it was to use the numeric keypad, hold down the alt key and then type out 128504 which yielded a check mark when the Arial font was selected _and_ unlike Insert Symbol and other techniques didn't change the font to Segoe UI Symbol or some other font with that symbol.<p>Oddly, even though the Word UI indicated it was Arial, exporting to a PDF and inspecting that revealed that Segoe UI Symbol was being used.<p>As I've noted in the past, "If typography was easy, Microsoft Word wouldn't be the foetid mess which it is."
Just a nitpick, because the page says: "Unicode is a standard that aims to unify all human languages, both past and present, and make them work with computers." But of course Unicode is only relevant to written languages, as opposed to spoken languages (and signed languages).<p>I wish that was the only thing wrong with that page.
Regarding UTF-8 encoding:<p>“And a couple of important consequences:<p>- You CAN’T determine the length of the string by counting bytes.<p>- You CAN’T randomly jump into the middle of the string and start reading.<p>- You CAN’T get a substring by cutting at arbitrary byte offsets. You might cut off part of the character.”<p>One of the things I had to get used to when learning the programming language Janet is that strings are just plain byte sequences, unaware of any encoding. So when I call `length` on a string of one character that is represented by 2 bytes in UTF-8 (e.g. `ä`), the function returns 2 instead of 1. Similar issues occur when trying to take a substring, as mentioned by the author.<p>As much as I love the approach Janet took here (it feels clean and simple and works well with their built-in PEGs), it is a bit annoying to work with outside of the ASCII range. Fortunately, there are libraries that can deal with this issue (e.g. <a href="https://github.com/andrewchambers/janet-utf8">https://github.com/andrewchambers/janet-utf8</a>), but I wish they would support conversion to/from UTF-8 out of the box, since I generally like Janet very much.<p>One interesting thing I learned from the article is that you can always tell from a byte's prefix whether it starts a new character (and, if so, how many bytes that character takes). I always wondered how you would recognize/separate a Unicode character in a Janet string, since it may be 1-4 bytes long, but I guess this is the answer.
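It is indeed the answer: continuation bytes always start with the bits 10, and a lead byte's prefix tells you how long its sequence is. A minimal sketch of walking a raw byte string that way (plain Python, nothing Janet-specific):<p><pre><code>
def utf8_sequences(data: bytes):
    """Yield each UTF-8 encoded character as its own byte slice."""
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:             # 0xxxxxxx: single-byte (ASCII)
            n = 1
        elif b >> 5 == 0b110:    # 110xxxxx: start of a 2-byte sequence
            n = 2
        elif b >> 4 == 0b1110:   # 1110xxxx: start of a 3-byte sequence
            n = 3
        elif b >> 3 == 0b11110:  # 11110xxx: start of a 4-byte sequence
            n = 4
        else:                    # 10xxxxxx: a continuation byte, not a valid start
            raise ValueError(f"byte {i} is not the start of a sequence")
        yield data[i:i + n]
        i += n

print(list(utf8_sequences("ä!".encode("utf-8"))))  # [b'\xc3\xa4', b'!']
</code></pre>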
Pretty clearly, "every software developer" doesn't need to understand Unicode with this level of familiarity, much like "every programmer" doesn't need to know the full contents of the 114 page Drepper paper. For example, I work on a GUID-addressed object store. Everything is in term of bytes and 128-bit UUIDs. Unicode is irrelevant to everyone on my team, and most adjacent teams. There is lots of software like this.
There is another 'modern' language that does utf8 right and has done it right for a <i>long</i> time. I know it's mostly fallen out of favour, but we're still out here: Perl.<p>$ perl -wle 'use utf8; print length("");'
1<p>Without use utf8:
$ perl -wle 'print length("");'
3<p>It's funny: it was after Perl fell out of favour that it got all its best stuff. It's still my preferred language for just about everything.
> The only modern language that gets it right is Swift:<p><pre><code> print("...".count)
// => 1
</code></pre>
And Erlang/Elixir! I guess they are not "cool" enough. But they correctly interpret that as one grapheme cluster.<p><pre><code> % erl +pc unicode
> string:length("...").
1
</code></pre>
(... here is the U+1F926 U+1F3FB U+200D U+2642 U+FE0F emoji)
Please don't refer to codepoints as characters. Some are, some are not, it isn't a useful or informative approximation, it's just wrong. Unicode is a table which assigns unique numbers to different <i>codepoints</i>, most of which are characters. ZWJ is not a character at all, and extended grapheme clusters made of several codepoints are.
Wonderful to learn more about Unicode.<p>Does anyone know how to write a function (preferably in swift) to remove emoji? This is surprisingly hard (if the string can be any language, like English or Chinese).<p>There’s been multiple attempts on Stackoverflow but they’re all missing some of them, as Unicode is so complex.
> Among them is assigning the same code point to glyphs that are supposed to look differently, like Cyrillic Lowercase K and Bulgarian Lowercase K (both are U+043A).<p>This is nonsense, Bulgaria has been using the Cyrillic alphabet since its creation in … Bulgaria!<p>What you’ve shown is two different fonts, and both renderings are perfectly fine in Bulgaria.<p>Read up more about it on wikipedia: <a href="https://en.wikipedia.org/wiki/Bulgarian_alphabet" rel="nofollow noreferrer">https://en.wikipedia.org/wiki/Bulgarian_alphabet</a>
> normalization<p>A quick war-story on this: We had a system which was taking web-user input for human names, and then some of it had to be sanitized for a crappy third-party system. However some of the names were getting mangled in unexpected ways.<p>One of the (multiple) issues was that we were sometimes entirely dropping accented characters even when a good alternative existed. This occurred when we were getting "é" (U+00E9) instead of "é" (U+0065 U+0301), a regular letter E plus a special accent modifier. By forcing the second form (NFD normalization) we were able to strip "just the accents" and avoid excessively-wrong names.<p>Going further with NFKD normalization, weird stuff like "⑧" (the digit 8 in a circle) becomes a regular number 8.
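The same trick in Python's standard library, for anyone who wants to reproduce it (strip combining marks after decomposition, and let NFKD fold the compatibility characters):<p><pre><code>
import unicodedata

def strip_accents(s: str) -> str:
    # Decompose, then drop the combining marks (category Mn).
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

print(strip_accents("José Müller"))             # Jose Muller
print(unicodedata.normalize("NFKD", "\u2467"))  # 8  (U+2467, circled digit eight)
</code></pre>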
<p><pre><code> > The only modern language that gets it right is Swift:
</code></pre>
arguably not true:<p><pre><code> julia> using Unicode
# for some reason HN doesn't allow emoji
julia> graphemes(" ")
length-1 GraphemeIterator{String} for " "
help?> graphemes
search: graphemes
graphemes(s::AbstractString) -> GraphemeIterator
Return an iterator over substrings of s that correspond to the extended graphemes in the string, as defined by Unicode UAX #29. (Roughly, these are what users would perceive as single characters, even though they may contain more than one codepoint; for example a letter combined
with an accent mark is a single grapheme.)</code></pre>
> The minimum every software developer must know about Unicode<p>Just a nitpick...<p>Once more, as it is typical on HN, web programming is confused with the entire universe of software development.<p>There are plenty of software realms where ASCII not only is enough, but it actually MUST be enough.
Quotes from the article illustrating what a train wreck Unicode has become:<p>"The problem is, in Unicode, some graphemes are encoded with multiple code points!"<p>"An Extended Grapheme Cluster is a sequence of one or more Unicode code points that must be treated as a single, unbreakable character."<p>"Starting roughly in 2014, Unicode has been releasing a major revision of their standard every year."<p>"Å" === "Å"
"Å" === "Å"
"Å" === "Å"
What do you get? False? You should get false, and it’s not a mistake.<p>"That’s why we need normalization."<p>"Unicode is locale-dependent"<p>The article forgot one: characters that switch presentation to right-to-left.
This is a lot more than the minimum that <i>every</i> software dev must know about Unicode. Even if you only do web frontends, you will do fine not knowing most of this. Still a nice read, though.
What an interesting mess!<p>It occurs to me that a canonical semantic representation of all known (extracted) language concepts would be useful too.<p>Now that we have multi-language LLM's it would be an interesting challenge to create/design a canonical representation for a minimum number of base concepts, their relations and orthogonal "voice" modifiers, extracted from the latent representations of an LLM across a whole training set, over all training languages.<p>While the best LLMs still have complex reasoning issues, their understanding of concepts and voice at the sentence level is highly intuitive and accurate. So the design process could be automated.<p>The result would be a human language agnostic, cross-culture concept inclusive, regularized & normalized (relatively speaking) semantic language. Call it SEMANTICODE.<p><i>We need to get this right, using one standard LLM lineage, before the Unicode people create a super standard that spans 150 different LLM's and 150 different latent spaces!</i> :O<p>Stability between updates would be guaranteed by including SEMANTICODE as a non-human language in training of future LLM's. Perhaps including a (highly) pre-normalized semantic artificial language would dramatically speed up and reduce the parameter count needed for future multi-language training?*<p>Then LLMs could use SEMANTICODE talk to each other more reliably, efficiently, and with greater concept specificity than any of our single languages.
It sounds like a generic length function in Unicode in 2023 is no longer a good idea. These articles complaining about the variety of lengths in Unicode are annoying at this point. Pretty much all of them can be summed up as, "Well, it depends." And, that isn't wrong. But nerds love to argue until they are blue in the face about the One Correct Answer. Sheesh.<p>This is the most interesting comparison article I have seen in years about Unicode processing in C++: <a href="https://thephd.dev/the-c-c++-rust-string-text-encoding-api-landscape" rel="nofollow noreferrer">https://thephd.dev/the-c-c++-rust-string-text-encoding-api-l...</a><p>The author is also the lead on an open source C++ Unicode library called ztd.txt: <a href="https://github.com/soasis/text">https://github.com/soasis/text</a>
The author seems to hate people with concentration issues and/or various visual conditions.<p>That collaboration tools show the moving mouse cursors of other participants even if they aren't needed/wanted is already pretty bad - why bring it to a website?
If you have to recognize a grapheme cluster, it will be easier to do that from a sequence of code points, than from UTF-8.<p>It's like saying that we don't need to tokenize, because you never want to deal with tokens anyway, but phrase structures!<p>Mmkay, whatever ...
>3 Grapheme Cluster Boundaries<p>>It is important to recognize that what the user thinks of as a “character”—a basic unit of a writing system for a language—may not be just a single Unicode code point. Instead, that basic unit may be made up of multiple Unicode code points. To avoid ambiguity with the computer use of the term character, this is called a user-perceived character. For example, “G” + grave-accent is a user-perceived character: users think of it as a single character, yet is actually represented by two Unicode code points. These user-perceived characters are approximated by what is called a grapheme cluster, which can be determined programmatically.
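In Python, the third-party regex module implements that segmentation: its \X pattern matches one extended grapheme cluster, so the quoted “G” + grave-accent example comes out as a single match:<p><pre><code>
import regex  # pip install regex

s = "G\u0300o"  # 'G' + combining grave accent, then 'o'
clusters = regex.findall(r"\X", s)
print(len(s))         # 3 code points
print(clusters)       # ['G̀', 'o'] -- 2 user-perceived characters
</code></pre>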
We need, desperately and without question, two Unicode symbols for bold and italic.<p>These are <i>part of language</i> and should not be an optional proprietary add on that can be skipped or deleted from text. We've been using the two "formats" to convey <i>important</i> information since the <i>sixteenth century</i>!!!<p>It boggles my mind that we can give flesh tone to emojis, yet not mark a word as bold or italic. It makes zero sense. Especially how easy it would be to implement. It would work exactly the same way: Letters following the mark would be formatted as bold or italic until a space character or equivalent.
"<i>the definition of graphemes changes from version to version</i>"<p>In what twisted reality did someone think this a good idea?<p>Doesn't it go against the whole premise of everyone in the world agreeing on how to represent a meaningful unit of text?<p>"<i>What’s sad for us is that the rules defining grapheme clusters change every year as well. What is considered a sequence of two or three separate code points today might become a grapheme cluster tomorrow! There’s no way to know! Or prepare!</i>"<p>"<i>Even worse, different versions of your own app might be running on different Unicode standards and report different string lengths!</i>"
> The simplest possible encoding for Unicode is UTF-32. It simply stores code points as 32-bit integers.<p>Skipping over UTF-32-BE and UTF-32-LE there...<p>(I mean, it might not be an issue if it's just being used as an internal representation, but still)
<a href="https://tonsky.me/blog/unicode/overview@2x.png" rel="nofollow noreferrer">https://tonsky.me/blog/unicode/overview@2x.png</a><p>Wow, what abominable mix of decimal and hexadecimal.
> what do you think "ẇ͓̞͒͟͡ǫ̠̠̉̏͠͡ͅr̬̺͚̍͛̔͒͢d̠͎̗̳͇͆̋̊͂͐".length should be?<p>This is a nice example of the kind of thing we need to think about when defining a measure of length for Unicode strings.
> Another unfortunate example of locale dependence is the Unicode handling of dotless i in the Turkish language.<p>This isn't quite Unicode's fault, as the alternative would be to have two codepoints each for `i` and `I`, one pair for the Latin versions and one for the Turkish versions, and that would be very annoying too.<p>Whereas the Russian/Bulgarian situation is different. There used to be language tags in Unicode for that, but IIRC they got deprecated, and maybe they'll have to get undeprecated.
I'm always gonna point out these overly broad titles assuming "every software developer" is some kind of internetty web dev type. I'm a game dev; I try to never touch strings at all, they are a nightmare data type. Strings in a game are like graphics or audio assets: your game might read them and show them to the player, but they should never come anywhere near your code or even be manipulated by it. I don't need to know any of that stuff about Unicode.
The Why is "Å" !== "Å" !== "Å"? section still strikes me as wrong.
The strings are equal even when the representations differ.
Can we please get a standard that describes how emoji are supposed to look?<p>Now they look different on every platform and many subtleties are lost in translation.
<i>The only modern language that gets it right is Swift:</i><p>Apple did a fairly good job with unicode string handling starting in Cocoa and Objective-C, by providing methods to get the number of code points and/or bytes:<p><a href="https://stackoverflow.com/questions/15582267/cfstring-count-of-characters-not-code-points-in-a-string/15582268#15582268" rel="nofollow noreferrer">https://stackoverflow.com/questions/15582267/cfstring-count-...</a><p>I feel that this support of both character count and buffer size in bytes is probably the way to go. But Python 3 went wrong by trying to abstract it away with encodings that have unintuitive pitfalls that broke compatibility with Python 2:<p><a href="https://blog.feabhas.com/2019/02/python-3-unicode-and-byte-strings/" rel="nofollow noreferrer">https://blog.feabhas.com/2019/02/python-3-unicode-and-byte-s...</a><p>There's also the normalization issue. Apple goofed (IMHO) when they used NFD in HFS+ filenames while everyone else went with NFC, but fixed that in APFS:<p><a href="https://unicode.org/faq/normalization.html" rel="nofollow noreferrer">https://unicode.org/faq/normalization.html</a><p><a href="https://medium.com/@sthadewald/the-utf-8-hell-of-mac-osx-feef5ea42407" rel="nofollow noreferrer">https://medium.com/@sthadewald/the-utf-8-hell-of-mac-osx-fee...</a>
> Unicode is locale-dependent<p>Well, there is a new fact that I learned and immediately hated.<p>The fuck were the authors thinking...<p>I am now firmly convinced the people developing Unicode hate developers. I suspected it before just due to how messy it was (the same character having different encodings? Really? Fuck you), but this cements it.
> Since everybody in the world agrees on which numbers correspond to which characters, and we all agree to use Unicode, we can read each other’s texts.<p>Hmm? I thought some code points combine to create a character. Even accented latin ones can be like that.<p>Also we need to agree on what is a character.
I am torn between supporting all languages (which easily leaks into supporting emojis) versus just using the ~90 Latin characters as the lingua franca.<p>Look, I would love to be able to read/write Sanskrit, Arabic, Chinese, Japanese etc. and share that content and have everyone render and see the same thing. The problem is that I feel like most of these are:<p>1. a kind of an open problem
2. very subjective
3. very, very subjective as what you see is mostly dictated by the implementation (fonts)<p>For example, why does the gun emoji look like a water gun? Why does the skull-and-crossbones symbol look so benign? In fact, it is often used as a meme (see deadass :skull:). Why is the basmala a single "character"?<p>In my opinion, people should just learn how to use kaomoji. Granted, kaomoji rely on a lot more than the Latin characters, but they are at least artful, skillful and a natural extension of the "actual" languages.<p>> inb4 languages evolve<p>Yes, but it mostly happens naturally. I feel like what happens today mostly happens at the whim of a few passionate people in the standard.
> “I know, I’ll use a library to do strlen()!” — nobody, ever.<p>The standard library provided by languages like C, C++ <i>is</i> a library. Features like character strings are present and it's a totally reasonable expectation for the length to give you the cluster count.
>That gives us a space of about 11 million code points.
>About 170,000, or 15%, are currently defined. An additional 11% are reserved for private use. The rest, about 800,000 code points, are not allocated at the moment. They could become characters in the future.<p>1.1 million?
One of my favorite interview questions is simply "What's the difference between Unicode and utf-8?" I feel like that's pretty mandatory knowledge for any specialty but it doesn't get answered correctly very often.
In Java/Kotlin, I've found this Grapheme Splitter library to be useful: <a href="https://github.com/hiking93/grapheme-splitter-lite">https://github.com/hiking93/grapheme-splitter-lite</a>
> The only modern language that gets it right is Swift<p>The only modern language (that he knows of) that gets it right, to be precise. Ruby also gets it right.<p>```
[1] pry(main)> "".size
=> 1
```
I once bought an O'Reilly book on encoding. It was like 2000 pages. I never read it; that was about 15 years ago. My takeaway is that encoding is really complex and I just kind of pray it works, which most of the time it does.
The number of grapheme clusters in a string depends on the font being used. The length of a string should be the number of code points, because that is not font-specific.<p>Better yet, there shouldn't be a function called length.
> The only modern language that gets it right is Swift:<p>Elixir too:<p><pre><code> Interactive Elixir (1.15.4) - press Ctrl+C to exit (type h() ENTER for help)
iex(1)> String.length "ẇ͓̞͒͟͡ǫ̠̠̉̏͠͡ͅr̬̺͚̍͛̔͒͢d̠͎̗̳͇͆̋̊͂͐"
4</code></pre>
Is there a way to read this with the mouse cursors disabled? It seems like great content but all the movement on the page is way too distracting.<p>EDIT: I've never been downvoted for asking a question before. Weird, but okay.
The article's background color deserves to be named: <a href="https://colornames.org/color/fddb29" rel="nofollow noreferrer">https://colornames.org/color/fddb29</a>
What's the point of having a separate codepoint for the Angstrom if it's specified to normalize back to the regular "capital A with ring above" codepoint anyway?
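The normalization in question is easy to see from Python, since the Angstrom sign's canonical decomposition points back at the letter:<p><pre><code>
import unicodedata

angstrom = "\u212b"  # ANGSTROM SIGN
a_ring   = "\u00c5"  # LATIN CAPITAL LETTER A WITH RING ABOVE

print(angstrom == a_ring)                                # False
print(unicodedata.normalize("NFC", angstrom) == a_ring)  # True
</code></pre>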
> The rest, about 800,000 code points, are not allocated at the moment. They could become characters in the future.<p>Why is Tengwar still not in Unicode officially? What's the problem with it?
> Hell, you can’t even spell café, piñata, or naïve without Unicode.<p>I must have missed something. All of those symbols are present in Extended ASCII (i.e. 8-bit).
Honestly the "what encoding is this! UTF-8" is still the only thing we need to know. len(emoji) is still a corner case that few will care about.
If just for the fact that it annoys people, I love the mouse cursor idea. But I also find it technically interesting. Is some kind of consent legally needed per GDPR or something for this? It for sure is tracking, literally. And a website has to ask to set cookies ...
I'm sorry but the website design is extremely distracting. The mouse pointers at least are easy to delete with the inspector; The background color is not the best choice for reading material, but the inexcusable part is the width of the content.<p>This content must be really awesome for someone to go through the trouble of interacting with such a site.
I don't want to be too full of myself here, but I'm a very skilled and highly paid backend software engineer who knows roughly nothing about unicode (I google what I need when a file seems f'd up), and it's never been a problem for me.<p>I'm sure the article is good but the title is nonsense.