The official Unicode documentation on normalization is good reading, and quite readable. It's actually an even more complicated topic than OP reveals, but the Unicode Standard Annex #15 explains it well.<p><a href="http://unicode.org/reports/tr15/" rel="nofollow">http://unicode.org/reports/tr15/</a><p>OP has a significant error:<p>> You can choose whatever form you’d like, as long as you’re consistent, so the same input always leads to the same result.<p>Not so much! Do _not_ use the "Compatibility" (rather than "Canonical") normalization forms unless you know what you are doing! UAX15 will explain why, but they are "lossy". In general, NFC is the one to use as a default.
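To see what "lossy" means here, a quick sketch in Python using the standard-library `unicodedata` module: the compatibility forms (NFKC/NFKD) fold characters like ligatures and superscripts into their plain equivalents, which can't be undone.

```python
import unicodedata

s = "\ufb01 \u00b2"  # "ﬁ" (LATIN SMALL LIGATURE FI) and "²" (SUPERSCRIPT TWO)
print(unicodedata.normalize("NFC", s))   # canonical form: "ﬁ ²" — both characters survive
print(unicodedata.normalize("NFKC", s))  # compatibility form: "fi 2" — information lost
```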
> Thankfully, there’s an easy solution, which is normalizing the string into the “canonical form”.<p>Cool, problem solved!<p>> There are four standard normalization forms:<p>(╯°□°)╯︵ ┻━┻
> <i>Why use both [UTF-8 and UTF-16]? Western languages typically are most efficiently encoded with UTF-8 (since most characters would be represented with 1 byte only), while Asian languages can usually produce smaller files when using UTF-16 as encoding.</i><p>The second sentence is technically correct, but it's a strange followup here because it's not <i>why</i> UTF-8 and UTF-16 exist today. I don't know any Asian webpages that use UTF-16 to save bandwidth, e.g., Japanese Wikipedia is still UTF-8.<p>The major use of UTF-16 in 2019, AFAICT, is for legacy operating system interfaces.
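The byte-count claim itself is easy to check. A short Python sketch (sample strings are mine) comparing the encoded sizes:

```python
english = "Hello, world!"   # 13 ASCII characters
japanese = "こんにちは世界"    # 7 kana/kanji, each 3 bytes in UTF-8 but 2 in UTF-16

for s in (english, japanese):
    print(len(s.encode("utf-8")), len(s.encode("utf-16-le")))
# English:  13 bytes in UTF-8 vs 26 in UTF-16
# Japanese: 21 bytes in UTF-8 vs 14 in UTF-16
```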
Note that Apple's APFS doesn't normalize Unicode filenames:<p><a href="https://news.ycombinator.com/item?id=13953800" rel="nofollow">https://news.ycombinator.com/item?id=13953800</a><p>From what I understand, it stores them as-is but can look them up in any normalization form (so it is normalization-insensitive):<p><a href="https://medium.com/@yorkxin/apfs-docker-unicode-6e9893c9385d" rel="nofollow">https://medium.com/@yorkxin/apfs-docker-unicode-6e9893c9385d</a><p><a href="https://developer.apple.com/library/archive/documentation/FileManagement/Conceptual/APFS_Guide/FAQ/FAQ.html" rel="nofollow">https://developer.apple.com/library/archive/documentation/Fi...</a><p>This hit me a couple of years ago when I was working on a scraper and storing the title of the page as the filename. It looked fine, but would fail a JavaScript string comparison. I can't remember whether I was using HFS+, though, which I believe saved filenames as NFD:<p><a href="https://en.wikipedia.org/wiki/HFS_Plus#Criticisms" rel="nofollow">https://en.wikipedia.org/wiki/HFS_Plus#Criticisms</a><p>The same script might work today on APFS.
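The failure mode here is easy to reproduce without a filesystem at all — the sketch below (in Python rather than JavaScript) fakes the HFS+ behavior by applying NFD by hand:

```python
import unicodedata

title = "Zo\xeb"                                # "Zoë" with precomposed ë (NFC)
filename = unicodedata.normalize("NFD", title)  # what HFS+ would hand back: "Zoe" + U+0308

print(title == filename)                               # False: the code points differ
print(unicodedata.normalize("NFC", filename) == title) # True: canonically equivalent
```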
By the way, the last letter of Zoë is e with a diaeresis, not an umlaut. Like the second o in coöperate: it’s just an ordinary o with a marker telling you to pronounce it separately rather than form a diphthong.
Just tried this in Perl6; looks like string comparisons Do The Right Thing™.<p><pre><code> > "\x65\x301".contains("\xe9")
True</code></pre>
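For contrast, Python's `str` operations work on raw code points, so the same check fails there unless you normalize explicitly — a sketch of the equivalent comparison:

```python
import unicodedata

s = "\x65\u0301"   # "e" followed by COMBINING ACUTE ACCENT
print("\xe9" in s)                                # False: raw code points don't match "é"
print("\xe9" in unicodedata.normalize("NFC", s))  # True: NFC composes e + accent into "é"
```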
I still don't understand why Unicode allows two different ways to represent the same thing.<p>Naively, that looks like a major defect in Unicode.<p>Perhaps someone reading this knows why this was the right thing to do?
...in web apps (i.e. during presentation). Don’t do it at the storage layer:<p><a href="https://github.com/git/git/commit/76759c7dff53e8c84e975b88cb8245587c14c7ba" rel="nofollow">https://github.com/git/git/commit/76759c7dff53e8c84e975b88cb...</a><p>Edit: see comments below. My generalization is overbroad. Maybe a fairer statement is that some forms of normalization lead to aliasing, and sometimes you want that, but sometimes not. So be aware of whether you want different strings to be treated the “same” or not.<p>My thought was that you can always test for sameness after the fact, but once you’ve normalized into storage, you can’t undo it.
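The "test for sameness after the fact" approach can be sketched in Python (the helper name is mine, not from the git commit): keep the stored bytes untouched and normalize only transiently, at comparison time.

```python
import unicodedata

def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings for canonical equivalence without altering either one."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# The stored value stays byte-for-byte as supplied; no information is lost.
print(canonically_equal("Zoe\u0308", "Zo\xeb"))  # True: same text, different code points
print(canonically_equal("Zoe", "Zo\xeb"))        # False: genuinely different strings
```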
“The first of such conventions, or character encodings, was ASCII (American Standard Code for Information Interchange).”
The author may know better and is glossing over history, but when I see statements like this that are obviously incorrect (telegraph codes such as Baudot, and IBM's BCD encodings, predate ASCII by decades), I question everything else in the article.