Unicode is harder than you think

126 points by mcilloni almost 2 years ago

20 comments

Pannoniae almost 2 years ago

Most programs claim to support Unicode but actually don't. They miscount string lengths (type in a CJK character or an emoji and the string appears shorter than the program thinks it is), split strings in the wrong places, or fail in many other ways. It doesn't help that most programming languages also handle Unicode poorly by default, with the default APIs producing wrong results.

I'd take "we don't do Unicode at all" or "we only support the BMP" or "we don't support composite characters" any day over pretend-support that inevitably breaks because the program was never tested with anything non-ASCII.

(ninjaedit: to see how prevalent this is, even gigantic messaging apps such as Discord make this mistake. There are users on Discord whom you can't add as friends because the friend input field is limited to 32... something, probably bytes, yet elsewhere the program allows the name to be taken. This is easy to trigger with combining characters.)
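To make the miscounting concrete, here is a minimal Python sketch of how the "length" of one string differs depending on what is counted. The helper name `length_report` is made up for illustration, and `base_chars` only skips combining marks; it is a rough stand-in for user-perceived characters, not real grapheme segmentation.

```python
import unicodedata


def length_report(text: str) -> dict:
    """Return the length of `text` under several common definitions."""
    return {
        # What len() reports for a Python str.
        "code_points": len(text),
        # What a byte-sized database column or buffer sees.
        "utf8_bytes": len(text.encode("utf-8")),
        # What Java/JavaScript-style string lengths report.
        "utf16_units": len(text.encode("utf-16-le")) // 2,
        # Code points that are not combining marks (rough approximation).
        "base_chars": sum(1 for ch in text if unicodedata.combining(ch) == 0),
    }


for sample in ("abc", "日本", "😀", "e\u0301"):
    print(repr(sample), length_report(sample))
```

A limit enforced in bytes in one place and in code points in another would accept a name in one field and reject it in the other. That is consistent with the behaviour described above, though which unit Discord actually counts is a guess.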


spudlyo almost 2 years ago

If you found this essay interesting, you owe it to yourself to check out the super entertaining talk "Plain Text"[0] from NDC 2022 by Dylan Beattie. Rabbit hole warning: this video caused me to lose an entire Sunday watching Dylan's talks on YouTube, which are uniformly awesome.

[0]: https://www.youtube.com/watch?v=gd5uJ7Nlvvo


skitter almost 2 years ago

Annoyingly, Java, JavaScript, Windows file paths and more don't quite use UTF-16 (and even if they did, that would be annoying): they allow unpaired surrogates, which don't represent any Unicode character. So if you want to represent e.g. an arbitrary Windows file path in UTF-8, you can't; you have to use WTF-8 (Wobbly Transformation Format) instead.
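A small Python sketch of the problem: a lone surrogate is a legal element of a Python `str`, but strict UTF-8 refuses to encode it. Python's `surrogatepass` error handler is used here as a stand-in that behaves similarly to WTF-8 for lone surrogates; it is not a WTF-8 implementation, and the variable names are illustrative.

```python
lone = "\ud800"  # an unpaired high surrogate, as may appear in a Windows path

# Strict UTF-8 cannot represent it.
try:
    lone.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# The "surrogatepass" handler encodes the surrogate with the generalized
# three-byte UTF-8 pattern, and decodes it back the same way.
wobbly = lone.encode("utf-8", errors="surrogatepass")
round_trip = wobbly.decode("utf-8", errors="surrogatepass")

print(strict_ok, wobbly, round_trip == lone)
```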


dmitrygr almost 2 years ago

Favourite Unicode fact: properly rendering Unicode requires an understanding of the current geopolitical situation. Depending on whom you accept as a country and whom you do not, a pair of country-code letters may or may not render as a flag, and this changes from time to time in today's world. https://esham.io/2014/06/unicode-flags
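For reference, the mechanism behind this is that a flag is just two "regional indicator" code points, one per letter of the country code. The sketch below builds that pair; `flag_sequence` is a hypothetical helper name, and whether the result displays as a flag or as two boxed letters is entirely up to the font and platform.

```python
REGIONAL_INDICATOR_A = 0x1F1E6  # REGIONAL INDICATOR SYMBOL LETTER A


def flag_sequence(country_code: str) -> str:
    """Map a two-letter ASCII country code to its regional indicator pair."""
    code = country_code.upper()
    if len(code) != 2 or not code.isascii() or not code.isalpha():
        raise ValueError("expected a two-letter ASCII country code")
    return "".join(
        chr(REGIONAL_INDICATOR_A + ord(letter) - ord("A")) for letter in code
    )


# Two code points; the renderer decides whether they become one flag glyph.
print(flag_sequence("FR"), len(flag_sequence("FR")))
```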


jmclnx almost 2 years ago

No kidding, you have not lived until you try to explain UTF-8 to people who only believe in what they call "double-byte".

You think they get it, but the surprise comes when a database load fails while loading a Chinese character string into a field whose size was calculated at 2 bytes per character.
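The arithmetic behind that failure, as a minimal Python sketch. The function name `fits_in_field` and the sample string are made up for illustration; the point is only that common Chinese characters take 3 bytes each in UTF-8, while a legacy double-byte encoding such as GBK takes 2.

```python
def fits_in_field(text: str, field_bytes: int, encoding: str = "utf-8") -> bool:
    """Check whether the encoded form of `text` fits a byte-sized field."""
    return len(text.encode(encoding)) <= field_bytes


name = "中国银行"              # 4 characters
assumed_size = 2 * len(name)   # the "double-byte" estimate: 8 bytes

print(len(name.encode("gbk")))     # 8 bytes: the estimate holds for GBK
print(len(name.encode("utf-8")))   # 12 bytes: the estimate fails for UTF-8
print(fits_in_field(name, assumed_size))
```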


o1y32 almost 2 years ago

Regarding the title -- anecdotally, everyone I know is scared of encoding issues, and I don't know anyone (including myself) who claims to have a great understanding of Unicode or thinks it is easy. It is often overlooked for sure -- people don't realize there is a problem in the code until they run into a bug, and it turns out they have been treating strings wrong from the very beginning.


jkaptur almost 2 years ago

The logical next step here is to realize that if you want to be truly internationalized, pretty much every single method of the string class in your favorite language is an antipattern and should be used with extreme caution. Seriously!
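A short Python sketch of what goes wrong with everyday string operations, assuming a word typed with a combining accent (the decomposed form that some input methods and file systems produce). The variable names are illustrative.

```python
import unicodedata

word = "cafe\u0301"  # "café" written as 'e' followed by COMBINING ACUTE ACCENT

# Slicing by index silently drops the accent.
truncated = word[:4]

# Reversing detaches the accent from its base letter.
reversed_naive = word[::-1]

# Equality fails against the precomposed spelling until both are normalized.
precomposed = "caf\u00e9"
equal_raw = word == precomposed
equal_nfc = unicodedata.normalize("NFC", word) == precomposed

print(len(word), truncated, equal_raw, equal_nfc)
```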


nightpool almost 2 years ago

Thank you for being the first article I've ever actually read to explain the difference between NFC, NFD, NFKD and NFKC in a way that I actually understood. I was a little bored through the whole UCS/UTF* history lesson because I knew a lot of it already, but the normalization and collation examples were definitely worth it
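For anyone who wants to poke at the four forms directly, here is a minimal sketch using Python's standard `unicodedata` module; `normal_forms` is a hypothetical helper name. Canonical forms (NFC, NFD) only compose or decompose, while the compatibility forms (NFKC, NFKD) also replace characters such as ligatures with their plain equivalents.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def normal_forms(text: str) -> dict:
    """Return `text` normalized under each of the four Unicode forms."""
    return {form: unicodedata.normalize(form, text) for form in FORMS}


# U+00E9 (precomposed e-acute) and U+FB01 (the "fi" ligature).
for sample in ("\u00e9", "\ufb01"):
    for form, result in normal_forms(sample).items():
        print(form, [hex(ord(ch)) for ch in result])
```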


NovemberWhiskey almost 2 years ago

I used to work on a platform at a large financial services firm; it was essentially completely ignorant of anything Unicode with respect to string handling, and strings were null-terminated byte streams. The platform had CSV import capability for tabular data, and it had an integrated pivot table capability based on some widgets that had been grafted onto it.

Some of the users in Hong Kong discovered that you could import CSVs with Unicode text (e.g. index compositions with Chinese company names) and they'd display in the pivot table widgets and even be exportable to reports. But only most names. Some names were truncated or turned into garbage, and I was called upon to help debug this.

My first reaction was frank amazement that this "worked" at all: apparently, the path from the dumb CSV import code through to the Unicode-aware pivot table was sufficiently clean that much of the encoded text made it through OK. I can't remember the precise details now, but I think the problem turned out to be embedded nulls from UTF-16 encoding, and so was completely insoluble without a gut renovation of the platform.
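A sketch of how embedded nulls could produce "only most names" surviving, assuming the data was UTF-16 and the platform stopped reading at the first zero byte. This illustrates the mechanism only; it is not a reconstruction of the actual platform, and `c_string_view` and the sample names are invented for the example.

```python
def c_string_view(data: bytes) -> bytes:
    """Return what NUL-terminated string code sees: bytes up to the first 0x00."""
    return data.split(b"\x00", 1)[0]


# Many CJK characters contain no zero byte in UTF-16, so they pass through.
survives = "中国银行".encode("utf-16-le")
print(c_string_view(survives) == survives)

# U+4E00 encodes as 00 4E in UTF-16LE, so the string is cut at that character.
cut = "中国一汽".encode("utf-16-le")
print(c_string_view(cut).decode("utf-16-le"))

# ASCII text is hit immediately: 'H' encodes as 48 00.
print(c_string_view("HSBC".encode("utf-16-le")))

# UTF-8 never produces a zero byte for non-NUL characters.
print(b"\x00" in "中国一汽 HSBC".encode("utf-8"))
```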


Amorymeltzer almost 2 years ago

I'll always take an excuse to link to one of my favorite StackOverflow answers, to the question "Why does modern Perl avoid UTF-8 by default?": <https://stackoverflow.com/a/6163129/2521092>. It's from 2011 and Perl-centric, of course, but skip down to "𝔸 𝕤 𝕤 𝕦 𝕞 𝕖 𝔹 𝕣 𝕠 𝕜 𝕖 𝕟 𝕟 𝕖 𝕤 𝕤" for a thorough, if opinionated, list.

My favorite, similar to the issue of case-folding, is the idea of something being a "letter" and "lowercase" but NOT "lowercase letter."
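One character that appears to fit that description is U+00AA (FEMININE ORDINAL INDICATOR), inspected below with Python's `unicodedata`. Its general category is "Lo" (other letter), not "Ll" (lowercase letter). The `islower()` result is printed rather than promised: CPython is understood to base it on the Unicode Lowercase property, but treat that as an assumption and check it on your own interpreter.

```python
import unicodedata

ch = "\u00aa"

info = {
    "name": unicodedata.name(ch),
    # General category: "Lo" means a letter that is neither upper nor lower.
    "category": unicodedata.category(ch),
    "is_letter": ch.isalpha(),
    # Assumed to reflect the Unicode Lowercase property on CPython.
    "reported_lowercase": ch.islower(),
    # No uppercase mapping exists, so upper() leaves it unchanged.
    "upper_unchanged": ch.upper() == ch,
}

for key, value in info.items():
    print(key, value)
```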


ggm almost 2 years ago

If you see URLs that turn the Microsoft editor's choice of "single quote mark" into a triad of %xx values, you've seen how people wind up in unexpected Unicode/UTF land: the text editor swaps the ASCII character for an "uplifted" one in either a distorted ISO Latin-1 or Unicode/UTF-8, depending.
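The triad is easy to reproduce with Python's standard `urllib.parse.quote`. The sketch assumes the editor substituted U+2019 (RIGHT SINGLE QUOTATION MARK) for the ASCII apostrophe, and uses Windows-1252 as the "distorted Latin-1" case; both are guesses at what the comment describes.

```python
from urllib.parse import quote

ascii_text = "it's"
curly_text = "it\u2019s"  # apostrophe replaced by a typographic quote

print(quote(ascii_text))                     # one %xx value for the ASCII quote
print(quote(curly_text))                     # three %xx values: UTF-8 bytes E2 80 99
print(quote(curly_text, encoding="cp1252"))  # one %xx value: Windows-1252 byte 92
```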


eviks almost 2 years ago

Pity that bad design decisions like surrogate pairs pollute the future for generations instead of being cleanly cut off with a painful but temporary transition period.
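For context, this is the arithmetic UTF-16 uses to split a code point above U+FFFF into a surrogate pair, as a minimal Python sketch. The function name `to_surrogate_pair` is illustrative; the constants come from the UTF-16 encoding scheme.

```python
def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a supplementary-plane code point into UTF-16 high/low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000       # 20 significant bits remain
    high = 0xD800 + (offset >> 10)      # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1F600)  # U+1F600 GRINNING FACE
print(hex(high), hex(low))
```

The range U+D800 to U+DFFF is permanently reserved for this purpose, which is why those values can never be characters in their own right.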

zamadatix almost 2 years ago

TIL of UTF-1, what an odd specification.


qingcharles almost 2 years ago

What is most interesting to me is that some countries just give up completely on their native script:

https://en.wikipedia.org/wiki/Spread_of_the_Latin_script#21st_century

Any Indians here? I heard that most Hindi texting is done with a weird Latin character slang these days?

_nalply almost 2 years ago

Another important subject is confusables. "A" looks the same whether it is Latin A, Greek Alpha or Cyrillic A. If you need users to identify names by sight, you need to do something about confusables. It is even a security problem, for example for hostnames: imagine an attacker building a copy of your bank's website under a similar-looking URL.

https://news.ycombinator.com/item?id=32497414
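A crude Python sketch of one defence, flagging names that mix scripts. It guesses the script from the first word of each character's Unicode name, which is a heuristic invented for this example and not the confusable-detection procedure from the Unicode security specifications; `scripts_used` is a hypothetical helper.

```python
import unicodedata


def scripts_used(text: str) -> set[str]:
    """Guess the scripts in `text` from the first word of each letter's name."""
    scripts = set()
    for ch in text:
        if ch.isalpha():
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


genuine = "BANK"
spoofed = "B\u0410NK"  # second letter is CYRILLIC CAPITAL LETTER A

print(genuine == spoofed)       # visually alike, but different strings
print(scripts_used(genuine))
print(scripts_used(spoofed))    # more than one script is a warning sign
```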

solresol almost 2 years ago

The article didn't mention UTF-9 or UTF-18 for when you want to do Unicode on your PDP-6 or Univac. https://datatracker.ietf.org/doc/html/rfc4042

A reviewer for a scientific paper I wrote asked if I had considered the impact of UTF-9 on what I was doing. To this day I don't know if they were trolling me.

butlerm almost 2 years ago

With the benefit of hindsight, if the standard were done over again, either many of the unnecessary difficulties mentioned in this article should be avoided or the standard should be split in two. It is arguably too ambitious, too inefficient, and too unrealistic for a number of these things to be handled as recommended in all contexts. There are many examples: operating system kernels and system programming in general, to name two.

To start with, alternative representations for the same visibly identical character should have been excluded. In the base standard all supported characters should be precomposed, with no modifiers. The article points out the difficulties in something as simple (and as security critical) as determining whether two strings refer to the same visibly identical characters.

A character set that does not make that trivial is, unfortunately, not suitable for general use in system programming or for security-critical identifiers. It massively complicates programming in many programming languages as well, and even UTF-32 is not sufficient to remedy the problem, as the article well notes.

The inability to handle and process any arbitrary string of bytes in a universal character set is a serious problem as well. The world is complicated, and the inability to pass through incorrectly coded data, alternatively coded data, and general binary data is a major limitation with serious consequences. System code such as device drivers or filesystems typically can't deal with inefficiencies or limitations like that.

In addition, there probably should be two types of uppercase and lowercase conversion: a simple, predictable one for system programming, and a more complex one that deals with considerations in languages that do not follow the normal rules.

String collation should be done at two levels as well: a simple code point level suitable for such things as prefix matches on database indexes, and a more complex variation for applications where linguistically sensitive collation is critical.

In general, trying to solve all the higher-end use cases in a large body of software that does not need to deal with them, should not need to deal with them, or cannot deal with them is impractical and has exacerbated the common string processing issues we see today. A lightweight standard, perhaps a subset or profile of Unicode that supported arbitrary binary data as opaque code points, could be helpful in a lot of contexts.
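To illustrate the two tiers proposed above, here is a minimal Python sketch contrasting a predictable ASCII-only case conversion with the full Unicode mappings that Python's `str` methods apply, plus plain code point ordering. `ascii_lower` is a hypothetical helper written for this example, not a standard function.

```python
def ascii_lower(text: str) -> str:
    """Simple tier: only A-Z change, and the length never changes."""
    return "".join(
        chr(ord(ch) + 32) if "A" <= ch <= "Z" else ch for ch in text
    )


# Simple tier: non-ASCII letters pass through untouched.
print(ascii_lower("ISTANBUL"), ascii_lower("\u0130stanbul"))

# Full tier: the result can change length.
full_lower = "\u0130stanbul".lower()   # capital I with dot above
full_upper = "stra\u00dfe".upper()     # sharp s
print(len("\u0130stanbul"), len(full_lower), full_upper)

# Code point collation: cheap and stable, but not dictionary order.
print(sorted(["zebra", "\u00c4pfel", "apple"]))
```

Linguistically sensitive ordering needs locale data on top of this, which is the heavier tier the comment argues most system code should not have to carry.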

Roark66 almost 2 years ago

Indeed it is. One thing I use Unicode for is icons in console programs like (neo)vim. I was quite happy that xterm supports Unicode these days, so I can use a fast terminal that supports OSC52 system clipboard integration (none of the newer GNOME/KDE terminals do).

I was rather disappointed when I noticed my pretty Unicode icons would sometimes end up cut in half :-(

alcover almost 2 years ago

Currently working on a language, I feel dizzy after reading this.

My stdlib will provide a (byte) Buffer class with basic low-level methods, but I feel like iterating through it in fancy ways should be the concern of the user or 3rd-party libraries.

I fail to see this as part of a programming language.

Am I wrong here?


night-rider almost 2 years ago

It's not that it's hard, it's just people don't go out of their way to escape UTF-8 glyphs into ASCII when dealing with exotic glyphs in a text editor. It's more mundane and tedious, but not 'hard'.

Try working with raw UTF-8 in JS and find yourself in a world of pain. Mathias Bynens talks about these gotchas here:

https://mathiasbynens.be/notes/javascript-unicode
