The ü/ü Conundrum

179 points by firstSpeaker about 1 year ago

26 comments

re about 1 year ago

> Can you spot any difference between “blöb” and “blöb”?

It's tricky to determine this because normalization can end up getting applied unexpectedly (for instance, on Mac, Firefox appears to normalize copied text as NFC while Chrome does not), but by downloading the page with cURL and checking the raw bytes I can confirm that there is no difference between those two words :) Something in the author's editing or publishing pipeline is applying normalization and not giving her the end result that she was going for.

<pre><code>00009000: 0a3c 7020 6964 3d22 3066 3939 223e 4361  .&lt;p id="0f99"&gt;Ca
00009010: 6e20 796f 7520 7370 6f74 2061 6e79 2064  n you spot any d
00009020: 6966 6665 7265 6e63 6520 6265 7477 6565  ifference betwee
00009030: 6e20 e280 9c62 6cc3 b662 e280 9d20 616e  n ...bl..b... an
00009040: 6420 e280 9c62 6cc3 b662 e280 9d3f 3c2f  d ...bl..b...?&lt;/
</code></pre>

Let's see if I can get HN to preserve the different forms:

Composed: ü
Decomposed: ü

Edit: Looks like that worked!
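
The byte check described in this comment can be reproduced locally. The sketch below is only an illustration using Python's standard `unicodedata` module; the variable names are made up for the example. It prints the UTF-8 bytes of "blöb" in both forms, so the `c3 b6` pair seen in the dump can be recognized as the precomposed ö.

```python
import unicodedata

# Written with an escape so an editor cannot silently renormalize the literal.
word = "bl\u00f6b"

nfc = unicodedata.normalize("NFC", word)  # ö as one code point, U+00F6
nfd = unicodedata.normalize("NFD", word)  # o followed by U+0308 COMBINING DIAERESIS

for label, text in (("NFC", nfc), ("NFD", nfd)):
    code_points = " ".join(f"U+{ord(ch):04X}" for ch in text)
    print(label, text.encode("utf-8").hex(" "), code_points)

# The two strings render alike but are different code point sequences.
print(nfc == nfd)  # False
```

If both words on a page produce the NFC byte sequence, as in the dump, something upstream has already normalized them.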

mglz about 1 year ago

My last name contains an ü and it has been consistently horrible.

* When I try to preemptively replace ü with ue, many institutions and companies refuse to accept it because it does not match my passport.
* Especially in France, clerks try to emulate ü with the diacritics used for the trema e, ë. This makes it virtually impossible to find me in a system again.
* Sometimes I can enter my name as-is and there seems to be no problem, only for some other system to mangle it to � or a box. This often triggers errors downstream that I have no way of fixing.
* Sometimes, people print a u and add the diacritics by hand on the label. This is nice, but still somehow wrong.

I wonder what the solution is. Give up and ask people to consistently use an ASCII-only name? Allow everybody 1000+ Unicode characters as a name and go off that string? Officially change my name?

weinzierl about 1 year ago

This article is about a failure to do normalization properly and is not really about an issue with Unicode. Regardless of what some comments seem to allude to, an Umlaut-ü should always render exactly the same, no matter how it is encoded.

There is, however, a real ü/ü conundrum, regarding ü-Umlaut and ü-diaeresis. The ü's in the words Müll and aigüe should render differently. The dots in the French word are too close to the letter. In printed French material this is usually not the case.

Unfortunately Unicode does not capture the nuance of the semantic difference between an Umlaut and a Tréma or Diaeresis.

The Umlaut is a letter in its own right with its own space in the alphabet. A ü-Umlaut can never be replaced by a u alone. This would be just as wrong as replacing a p by a q. Just because they look similar does not mean they are interchangeable. [1]

The Tréma, on the other hand, is a modifier that helps with proper pronunciation of letter combinations. It is not a letter in its own right, just additional information. It can sometimes move over other adjacent letters too (aiguë=aigüe, both are possible).

Some say this should be handled by the rendering system, similar to Han unification, but I strongly disagree with this. French words are often used in German and vice versa. Currently there is no way to render a German loan word with Umlaut (e.g. führer) properly in French.

[1] The only acceptable replacement for ü-Umlaut is the combination ue.

noodlesUK about 1 year ago

One thing that is very unintuitive with normalization is that macOS is much more aggressive about normalizing Unicode than Windows or Linux distros. Even if you copy and paste non-normalized text into a text box in Safari on Mac, it will be normalized before it gets posted to the server. This leads to strange issues with string matching.
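
To make the string-matching issue concrete, here is a minimal sketch, assuming one side of the comparison arrives decomposed and the other composed. The helper name `same_text` is hypothetical; the only library call used is `unicodedata.normalize` from the Python standard library.

```python
import unicodedata


def same_text(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two strings after bringing both to the same normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


stored = "Mu\u0308ll"    # decomposed: u followed by U+0308
submitted = "M\u00fcll"  # composed: single code point U+00FC

print(stored == submitted)           # False: different code point sequences
print(same_text(stored, submitted))  # True: equal once both are NFC
```

Which form is used matters less than applying the same one to both sides of every comparison.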

jesprenj about 1 year ago

Should you really change the filenames of users' files and depend on the fact that they are valid UTF-8? Wouldn't it be better to keep the original filename and use that most of the time, except for searches and indexing?

Why don't you normalize Latin-alphabet filenames for indexing even further -- allow searching for "Führer" with queries like "Fuehrer" and "Fuhrer"?
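
One way to read this suggestion is to leave the stored filename alone and derive extra search keys from it. The sketch below is an assumption about how that could look, not anything from the article: `GERMAN_EXPANSIONS`, `strip_marks`, and `search_keys` are hypothetical names, and the expansion table only covers German conventions.

```python
import unicodedata

# Hypothetical German-style expansions; other languages would need other rules.
GERMAN_EXPANSIONS = {
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss",
}


def strip_marks(text: str) -> str:
    """Decompose, then drop combining marks, so 'ü' becomes 'u'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def search_keys(name: str) -> set[str]:
    """Return the index keys for a name; the name itself is never modified."""
    composed = unicodedata.normalize("NFC", name)
    expanded = "".join(GERMAN_EXPANSIONS.get(ch, ch) for ch in composed)
    return {composed.casefold(), expanded.casefold(), strip_marks(composed).casefold()}


print(sorted(search_keys("Führer")))
```

A query would be run through the same function and matched if any of its keys is found in the index, so "Fuehrer" and "Fuhrer" both reach the file whose name is stored as "Führer".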

josephcsible about 1 year ago

IMO, it was a mistake for Unicode to provide multiple ways to represent 100% identical-looking characters. After all, ASCII doesn't have separate "c"s for "hard c" and "soft c".

layer8 about 1 year ago

The more general solution is specified here: <a href="https://unicode.org/reports/tr10/#Searching" rel="nofollow">https://unicode.org/reports/tr10/#Searching</a>

blablabla123 about 1 year ago

As a German macOS user with a US keyboard I run into a related issue every now and then. What's nice about macOS is that I can easily type Umlaute, and also other common letters from European languages, without any extra configuration. But some (web) applications stumble over it during input, because the character is entered in two steps:

1. ¨ (Option-u)
2. ü (u pressed)

chuckadams about 1 year ago

Clearly the author already knows this, but it highlights the importance of always normalizing your input, and consistently using the same form instead of relying on the OS defaults.

userbinator about 1 year ago

> its[sic] 2024, and we are still grappling with Unicode character encoding problems

More like "because it's 2024." This wouldn't be a problem before the complexity of Unicode became prevalent.

_nalply about 1 year ago

Sometimes it makes sense to reduce to Unicode confusables.

For example, the Greek capital letter Alpha looks like an uppercase A. Or some characters look very similar, like the slash and the fraction slash. Yes, Unicode has separate scalar values for them.

There are open-source tools to handle confusables.

This is in addition to the search specified by Unicode.
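
A small sketch of the point about confusables, using the two pairs named in this comment. Normalization alone does not merge lookalikes, so a separate mapping is needed. The two-entry `CONFUSABLES` table and the `skeleton` helper are hypothetical stand-ins for illustration; a real tool would load the full confusables data that Unicode publishes rather than a hand-written dictionary.

```python
import unicodedata

latin_a, greek_alpha = "A", "\u0391"
slash, fraction_slash = "/", "\u2044"

for ch in (latin_a, greek_alpha, slash, fraction_slash):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# Even compatibility normalization keeps Greek Alpha distinct from Latin A.
print(unicodedata.normalize("NFKC", greek_alpha) == latin_a)  # False

# Hypothetical two-entry table, for illustration only.
CONFUSABLES = {"\u0391": "A", "\u2044": "/"}


def skeleton(text: str) -> str:
    """Map each known lookalike to its prototype after decomposing the text."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(CONFUSABLES.get(ch, ch) for ch in decomposed)


print(skeleton("\u0391BC") == skeleton("ABC"))  # True
```

Two strings are then treated as confusable when their skeletons are equal, which is a looser test than equality after normalization.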

Havoc about 1 year ago

For those intrigued by this sort of thing, check out the tech talk “Plain Text” by Dylan Beattie.

Absolute gem. His other talks are entertaining too.

mawise about 1 year ago

I ran into this building search for a family tree project. I found out that Rails provides `ActiveSupport::Inflector.transliterate()` which I could use for normalization.

anewhnaccount2 about 1 year ago

Reminded of this classic diveintomark post <a href="http://web.archive.org/web/20080209154953/http://diveintomark.org/archives/2004/07/06/nfc" rel="nofollow">http://web.archive.org/web/20080209154953/http://diveintomar...</a>

CoastalCoder about 1 year ago

Isn't ü/ü-encoding a solved problem on Unix systems?</joke>

philkrylov about 1 year ago

The article suggests using NFC normalization as a simple solution, but fails to mention that HFS+ always does NFD normalization to file names, and APFS kinda does not but some layer above it actually does (<a href="https://eclecticlight.co/2021/05/08/explainer-unicode-normalization-and-apfs/" rel="nofollow">https://eclecticlight.co/2021/05/08/explainer-unicode-normal...</a>), and ZFS has this behavior controlled by a dataset-level option. I don't see how applying its suggestion literally (just normalize to NFC before saving) can work.
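
Since a file system may hand back a name in a different form than the one the application wrote, one defensive option is to compare names in a common form at lookup time instead of relying on what was saved. The following is a sketch under that assumption; `find_entry` is a hypothetical helper, and it only uses `os.listdir` and `unicodedata.normalize` from the Python standard library.

```python
import os
import unicodedata
from typing import Optional


def find_entry(directory: str, wanted: str) -> Optional[str]:
    """Return the on-disk name that equals `wanted` after NFC normalization."""
    target = unicodedata.normalize("NFC", wanted)
    for entry in os.listdir(directory):
        if unicodedata.normalize("NFC", entry) == target:
            # Hand back the name exactly as the file system reports it,
            # so later open() calls use the bytes that are really on disk.
            return entry
    return None
```

For example, `find_entry(".", "bl\u00f6b.txt")` would locate a file whether its name is stored composed or decomposed. The linear scan is a simplification; an index keyed by the normalized name would avoid rescanning large directories.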

jph about 1 year ago

Normalizing can help with search. For example for Ruby I maintain this gem: <a href="https://rubygems.org/gems/sixarm_ruby_unaccent" rel="nofollow">https://rubygems.org/gems/sixarm_ruby_unaccent</a>

kazinator about 1 year ago

Oh that Mötley Ünicöde.

raffy about 1 year ago

I created a bunch of Unicode tools during development of ENSIP-15 for ENS (Ethereum Name Service):

* ENSIP-15 Specification: <a href="https://docs.ens.domains/ensip/15" rel="nofollow">https://docs.ens.domains/ensip/15</a>
* ENS Normalization Tool: <a href="https://adraffy.github.io/ens-normalize.js/test/resolver.html" rel="nofollow">https://adraffy.github.io/ens-normalize.js/test/resolver.htm...</a>
* Browser Tests: <a href="https://adraffy.github.io/ens-normalize.js/test/report-nf.html" rel="nofollow">https://adraffy.github.io/ens-normalize.js/test/report-nf.ht...</a>
* 0-dependency JS Unicode 15.1 NFC/NFD Implementation [10KB]: <a href="https://github.com/adraffy/ens-normalize.js/blob/main/dist/nf.min.js">https://github.com/adraffy/ens-normalize.js/blob/main/dist/n...</a>
* Unicode Character Browser: <a href="https://adraffy.github.io/ens-normalize.js/test/chars.html" rel="nofollow">https://adraffy.github.io/ens-normalize.js/test/chars.html</a>
* Unicode Emoji Browser: <a href="https://adraffy.github.io/ens-normalize.js/test/emoji.html" rel="nofollow">https://adraffy.github.io/ens-normalize.js/test/emoji.html</a>
* Unicode Confusables: <a href="https://adraffy.github.io/ens-normalize.js/test/confused.html" rel="nofollow">https://adraffy.github.io/ens-normalize.js/test/confused.htm...</a>

WalterBright about 1 year ago

> Can you spot any difference between “blöb” and “blöb”?

That's where Unicode lost its way and went into a ditch. Identical glyphs should always have the same code point (or sequence of code points).

Imagine all the coding time spent trying to deal with this nonsense.

ulrischa about 1 year ago

It is really so awful that we have to deal with encoding issues in 2024.

ComputerGuru about 1 year ago

ZFS can be configured to force the use of a particular normalized Unicode form for all filenames. Amazing filesystem.

NotYourLawyer about 1 year ago

ASCII should be enough for anyone.

earthboundkid about 1 year ago

This isn’t an encoding problem. It’s a search problem.

juujian about 1 year ago

I ran into encoding problems so many times that I just use ASCII aggressively now. There are still kanji, hanzi, etc., but at least for Western alphabets it's not worth the hassle.

keybored about 1 year ago

I try to avoid Unicode in filenames (I’m on Linux). It seems that a lot of normal users might have the same intuition as well? I get the sense that a lot will instinctually transcode to ASCII, like they do for URLs.
