Why to normalize Unicode strings

143 points by bibyte about 6 years ago

11 comments

jrochkind1 about 6 years ago
The official Unicode documentation on normalization is good reading, and quite readable. It's actually an even more complicated topic than OP reveals, but the Unicode Standard Annex #15 explains it well.

http://unicode.org/reports/tr15/

OP has a significant error:

> You can choose whatever form you’d like, as long as you’re consistent, so the same input always leads to the same result.

Not so much! Do _not_ use the "Compatibility" (rather than "Canonical") normalization forms unless you know what you are doing! UAX15 will explain why, but they are "lossy". In general, NFC is the one to use as a default.
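To make the "lossy" point concrete, here is a minimal sketch using Python's standard `unicodedata` module. The sample strings are illustrative choices, not taken from the article. It shows that the canonical forms only change how the same text is represented, while NFKC folds characters that were distinct on purpose.

```python
import unicodedata

# Two ways to write "é": precomposed U+00E9, or "e" plus combining acute U+0301.
composed = "\u00e9"
decomposed = "e\u0301"

# Canonical forms convert between the two representations of the same text.
to_nfc = unicodedata.normalize("NFC", decomposed)
to_nfd = unicodedata.normalize("NFD", composed)
print(to_nfc == composed, to_nfd == decomposed)

# Illustrative samples: superscript two, the "fi" ligature, circled digit one.
samples = ["x\u00b2", "\ufb01le", "\u2460"]

# NFC leaves these alone; NFKC replaces them with plain look-alikes.
nfc = [unicodedata.normalize("NFC", s) for s in samples]
nfkc = [unicodedata.normalize("NFKC", s) for s in samples]
print(nfc == samples)
print(nfkc)
```

Once "x²" has been stored as "x2", nothing in the string says it used to be a superscript, which is one reason to prefer NFC unless the folding is actually wanted.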
doodpants about 6 years ago
> Thankfully, there’s an easy solution, which is normalizing the string into the “canonical form”.

Cool, problem solved!

> There are four standard normalization forms:

(╯°□°)╯︵ ┻━┻
ken about 6 years ago
> Why use both [UTF-8 and UTF-16]? Western languages typically are most efficiently encoded with UTF-8 (since most characters would be represented with 1 byte only), while Asian languages can usually produce smaller files when using UTF-16 as encoding.

The second sentence is technically correct, but it's a strange followup here because it's not *why* UTF-8 and UTF-16 exist today. I don't know any Asian webpages that use UTF-16 to save bandwidth; e.g., Japanese Wikipedia is still UTF-8.

The major use of UTF-16 in 2019, AFAICT, is for legacy operating system interfaces.
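The size claim in the quoted sentence is easy to check. This is a rough sketch, not a benchmark: the helper `encoded_sizes` and the sample strings are made up for illustration, and the markup sample is only one plausible reason why UTF-8 can still win for real web pages.

```python
def encoded_sizes(text: str) -> dict:
    """Return the byte length of text in UTF-8 and in UTF-16 (without a BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
    }

english = "Hello, world"               # ASCII only: 1 byte per character in UTF-8
japanese = "こんにちは世界"              # kana and kanji: 3 bytes each in UTF-8, 2 in UTF-16
markup = '<a href="/wiki/">日本</a>'   # mostly ASCII markup around a little Japanese text

for label, text in [("english", english), ("japanese", japanese), ("markup", markup)]:
    print(label, encoded_sizes(text))
```

Pure Japanese text comes out smaller in UTF-16, but as soon as the text is wrapped in ASCII-heavy markup the balance tips back toward UTF-8.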
zackmorris about 6 years ago
Note that Apple's APFS doesn't normalize Unicode filenames:

https://news.ycombinator.com/item?id=13953800

From what I understand, it stores them as-is but can read any form (so it is normalization-insensitive):

https://medium.com/@yorkxin/apfs-docker-unicode-6e9893c9385d

https://developer.apple.com/library/archive/documentation/FileManagement/Conceptual/APFS_Guide/FAQ/FAQ.html

This hit me a couple of years ago when I was working on a scraper and storing the title of the page as the filename. It looked fine, but would fail a JavaScript string comparison. I can't remember if I was using HFS+ though, which I believe saved filenames as NFD:

https://en.wikipedia.org/wiki/HFS_Plus#Criticisms

The same script might work today on APFS.
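The failure mode described here can be reproduced without any filesystem. The sketch below is written in Python rather than JavaScript, and the helper `same_text` and both sample strings are hypothetical. It assumes the page served a precomposed "ë" while the name read back from disk was decomposed, roughly what an NFD-style filesystem would return.

```python
import unicodedata

def same_text(a: str, b: str) -> bool:
    """Compare two strings after normalizing both sides to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

title_from_page = "Zo\u00eb"    # precomposed ë (U+00EB), as a page might serve it
name_from_disk = "Zoe\u0308"    # e + combining diaeresis (U+0308), NFD-style

# The two strings render identically but differ code point by code point.
print(title_from_page == name_from_disk)
print(same_text(title_from_page, name_from_disk))
```

Normalizing both sides at comparison time avoids depending on whatever form the filesystem happens to hand back.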
gumby about 6 years ago
By the way, the last letter of Zoë is an e with a diaeresis, not an umlaut. Like the second o in coöperate: it’s just an ordinary o with a marker to tell you to pronounce it rather than form a diphthong.
athenot about 6 years ago
Just tried this in Perl6; looks like string comparisons Do The Right Thing™.

    > "\x65\x301".contains("\xe9")
    True
gwbas1c about 6 years ago
I still don't understand why Unicode allows two different ways to represent the same thing.

Naively, that appears as a major defect in Unicode.

Perhaps someone reading this knows why this was the right thing to do?
js2 about 6 years ago
...in web apps (i.e. during presentation). Don’t do it at the storage layer:

https://github.com/git/git/commit/76759c7dff53e8c84e975b88cb8245587c14c7ba

Edit: see comments below. My generalization is overbroad. Maybe a fairer statement is that some forms of normalization lead to aliasing, and sometimes you want that but sometimes not. So be aware of whether you want different strings to be treated the “same” or not.

My thought was that you can always test for sameness after the fact, but once you’ve normalized into storage, you can’t undo it.
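One way to read "test for sameness after the fact" is to keep every string exactly as it arrived and use the normalized form only as a lookup key. The sketch below illustrates that idea under that assumption. `NameStore` and `nfc_key` are hypothetical names, and this is not how Git or any particular system implements it.

```python
import unicodedata
from collections import defaultdict

def nfc_key(s: str) -> str:
    """Return the NFC form of s, used only for grouping, never for storage."""
    return unicodedata.normalize("NFC", s)

class NameStore:
    """Keeps strings exactly as given and groups them by their NFC form."""

    def __init__(self) -> None:
        self._by_key = defaultdict(list)

    def add(self, name: str) -> None:
        # Store the original code points untouched under the normalized key.
        self._by_key[nfc_key(name)].append(name)

    def aliases(self, name: str) -> list:
        # Every stored string that normalizes to the same NFC form.
        return list(self._by_key.get(nfc_key(name), []))

store = NameStore()
store.add("Zo\u00eb")     # precomposed
store.add("Zoe\u0308")    # decomposed
print([ascii(n) for n in store.aliases("Zo\u00eb")])
```

Because the originals are preserved, the decision about whether two aliases count as the "same" can still be changed later, which is not possible once only the normalized form has been written to storage.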
s1mon about 6 years ago
“The first of such conventions, or character encodings, was ASCII (American Standard Code for Information Interchange).” The author may know better and is glossing over history, but when I see statements like this that are obviously incorrect, I question everything else in the article.
WalterBright about 6 years ago
There shouldn't even be any such thing as normalized strings, i.e. two different Unicode sequences that are supposed to be the same character.
mises about 6 years ago
What is with the push to Unicode? Why not ASCII? It seems to give a lot less trouble, particularly wrt panagrams, normalization, etc.