The Python problem is amusing. Python 3 has three internal string representations (1-byte, 2-byte, and 4-byte) and picks the widest one the string's contents require. This is mostly to support string indexing. It probably would have been better to use UTF-8 and create an index array for the string when necessary.

You rarely need to index a string with an integer in Python. For loops don't need to. Regular expressions don't need to. Operations that return a position into the string could return an opaque type which acts as a string index. That type should support adding and subtracting integers (at least +1 and -1) by stepping through the string. That would take care of most of the use cases. Attempts to index a string with an int would generate index arrays internally. (Or, for short strings, just start at the beginning every time and count.)

Windows and Java have big problems. They really are 16-bit-char based. It's not Java's fault; they standardized when Unicode was 16 bits.
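To make the "opaque index" idea concrete, here is a rough Python sketch. This is my own illustration, not how CPython works; the names Utf8String and Utf8Index are made up. The text is stored as UTF-8 bytes, and positions can only be moved forward or backward a code point at a time.

```python
class Utf8Index:
    """Opaque position into a Utf8String; supports +1 / -1 stepping."""
    def __init__(self, data: bytes, byte_pos: int):
        self._data = data
        self._byte_pos = byte_pos

    def __add__(self, n: int) -> "Utf8Index":
        pos = self._byte_pos
        step = 1 if n >= 0 else -1
        for _ in range(abs(n)):
            pos += step
            # Skip UTF-8 continuation bytes (0b10xxxxxx) so we always
            # land on a code point boundary.
            while 0 <= pos < len(self._data) and (self._data[pos] & 0xC0) == 0x80:
                pos += step
        return Utf8Index(self._data, pos)

    def __sub__(self, n: int) -> "Utf8Index":
        return self + (-n)


class Utf8String:
    def __init__(self, text: str):
        self._data = text.encode("utf-8")

    def start(self) -> Utf8Index:
        return Utf8Index(self._data, 0)

    def char_at(self, idx: Utf8Index) -> str:
        # Decode exactly one code point starting at the index's byte position.
        end = (idx + 1)._byte_pos
        return self._data[idx._byte_pos:end].decode("utf-8")


s = Utf8String("naïve 🙂")
i = s.start() + 3          # steps over the two-byte 'ï' correctly
print(s.char_at(i))        # 'v'
```

An index array (byte offset per code point) could be built lazily the first time someone insists on integer indexing, exactly as suggested above.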
It's interesting how history seems to have repeated itself with UTF-16. With ASCII and its extensions, we had 128 "normal" characters and everything else was exotic text that caused problems.

Now with UTF-16, the "normal" characters are the ones in the Basic Multilingual Plane that fit in a single 16-bit code unit, and everything outside the BMP is the exotic text that needs surrogate pairs and causes problems.
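A quick Python illustration of that split (just counting code units; nothing platform-specific here):

```python
# BMP characters fit in one UTF-16 code unit (2 bytes); characters
# outside the BMP need a surrogate pair (two code units, 4 bytes).
for ch in ["é", "猫", "🙂"]:
    units = len(ch.encode("utf-16-le")) // 2
    print(f"{ch!r}: {units} UTF-16 code unit(s)")
# 'é': 1, '猫': 1, '🙂': 2
```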
This article was from 4 years ago. Since then, UTF-8 adoption has increased from 68% to 87% of the Alexa top 10 million websites:

https://w3techs.com/technologies/history_overview/character_encoding/ms/y
When you create a table in MySQL, a text column (VARCHAR etc.) is not encoded in UTF-8 by default.

I think UTF-8 should be the default and only format for storing text columns in all databases, and all other text encodings should be removed from database systems.
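Until that happens, the workaround is to spell the character set out explicitly. (One wrinkle: MySQL's legacy "utf8" charset is actually a 3-byte subset, utf8mb3; full UTF-8 is "utf8mb4".) A minimal sketch, with the table and column names invented:

```python
# Declare utf8mb4 explicitly rather than relying on the server default.
ddl = """
CREATE TABLE articles (
    id    INT PRIMARY KEY,
    title VARCHAR(255),
    body  TEXT
) CHARACTER SET utf8mb4
  COLLATE utf8mb4_unicode_ci;
"""

# With a driver such as PyMySQL (assumed installed and pointed at a real
# server), it would be executed roughly like this:
# import pymysql
# conn = pymysql.connect(host="localhost", user="me", password="...",
#                        database="test", charset="utf8mb4")
# with conn.cursor() as cur:
#     cur.execute(ddl)
print(ddl)
```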
I have the feeling that back in 1990, ISO 10646 wanted 32-bit characters but had no software folks on that committee, while the Unicode people were basically software folks but thought that 16 bits would be enough (this dates back to the original Unicode proposal from 1988). UTF-8 was only created in 1992, after the software folks rejected the original DIS 10646 in mid-1991.
Maybe off-topic, but I wonder if anyone is planning a redesign of Unicode for the far future? Or is there a better way to handle characters, so we don't require a giant library like ICU?
> In both UTF-8 and UTF-16 encodings, code points may take up to 4 bytes.

Not quite the whole story: UTF-16 takes up to 4 bytes, and UTF-8 as originally designed could take up to 6, although RFC 3629 now restricts UTF-8 to at most 4 bytes for valid Unicode code points.

> Cyrillic, Hebrew and several other popular Unicode blocks are 2 bytes both in UTF-16 and UTF-8.

Cyrillic, Hebrew and several other languages still have spaces and punctuation, which take a single byte in UTF-8. It's 2016 now: RAM and storage are cheap and getting cheaper, but a CPU branch misprediction still costs the same ~20 cycles and isn't going to decline.

> plain Windows edit control (until Vista)

Windows XP is 14 years old, and in 2016 its market share is less than 3%. Who cares what things were like before Vista?

> In C++, there is no way to return Unicode from std::exception::what() other than using UTF-8.

The exceptions that are part of the STL don't return Unicode at all; they are in English. If you throw custom exceptions that return non-English messages from exception::what() in UTF-8, then catch std::exception and call what(), you'll get English error messages for the STL-thrown exceptions and non-English messages for your custom exceptions.

I'm not sure mixing GUI languages in a single app is always the right thing.

> First, the application must be compiled as Unicode-aware

The oldest Visual Studio I have installed is 2008 (because I sometimes develop for WinCE). I've just created a new C++ console application project, and by default it is already Unicode-aware.

So for anyone using a Microsoft IDE, this requirement is not a problem.
After considering this problem in some detail in the past, I too favoured UTF-8 at the time.

I remember a project (circa 1999) I worked on which was a feature-phone HTML 3.4 browser and email client (one of the first). The browser/IP stack handled only ASCII/code-page characters to begin with. To my surprise it was decided to encode text on the platform using UTF-16, so the entire code base was converted to use 16-bit code units (UCS-2). On a resource-constrained platform (~300 KB of RAM, IIRC), it would have been better, I think, to update the renderer and email client to understand UTF-8.

Nice as it might be to imagine that a UTF-16 or UTF-32 code unit is a "character", it is, as has been pointed out, not the case; when you look into language you can see how it never can be that simple.
I quite like Swift's approach: Characters, where a character is "An extended grapheme cluster ... a sequence of one or more Unicode scalars that (when combined) produce a single human-readable character." In practice this seems to mean that things like multi-byte sequences and modified (combined) entries end up as a single entry.

As the trade-off, directly indexing into strings is either not possible or discouraged, and often relies on an opaque(?) indexing class.

The main weirdness I have encountered so far is that the Regex functions operate only on the old, Objective-C method of indexing, so a little swizzling is required to handle things properly.
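For contrast, here is the underlying mismatch that Swift's Character model hides, shown in Python (which counts code points, not grapheme clusters); this is just an illustration, not Swift code:

```python
import unicodedata

# "é" built from 'e' plus a combining acute accent: one human-readable
# character, but two Unicode code points.
decomposed = "e\u0301"
print(len(decomposed))                                # 2 code points
print(len(unicodedata.normalize("NFC", decomposed)))  # 1 after composition

# A flag emoji is two regional-indicator code points; normalization can't
# collapse it, so a code-point count still says 2, while Swift's
# grapheme-aware `count` reports 1.
flag = "\U0001F1EB\U0001F1F7"   # 🇫🇷
print(len(flag))                # 2
```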
Off-topic, but does anyone know of a way to make sure I don't introduce non-ASCII filenames, so they stay broadly portable across systems? I've had to resort to disabling UTF-8 on Linux to achieve that.
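One low-tech option is a checker you can run by hand or wire into a pre-commit hook; a small sketch (assumes Python 3.7+ for str.isascii(), and the root path is just an example):

```python
# Walk a directory tree and report any file or directory name that
# isn't pure ASCII; exit non-zero if anything is found.
import os
import sys

def non_ascii_names(root: str):
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            if not name.isascii():
                yield os.path.join(dirpath, name)

if __name__ == "__main__":
    root = sys.argv[1] if len(sys.argv) > 1 else "."
    offenders = list(non_ascii_names(root))
    for path in offenders:
        print(f"non-ASCII name: {path!r}")
    sys.exit(1 if offenders else 0)
```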
This militancy to *force* everyone to use UTF-8 is bad engineering. I'm thinking of GNOME 3, where you aren't even allowed the option of choosing ASCII as a default setting, only UTF-8 or ISO-8859-x. A default setting is just as important for what it filters out as for what it passes through. I use a lot of older tools on *nix that are ASCII-only, in tool chains that slurp and munge text. If the chain includes any of these UTF-8-only apps, I'm constantly dealing with the problem of invalid ASCII passing through.
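For whatever it's worth, one stopgap is a small gate at the head of the pipeline that rejects non-ASCII bytes before they reach the ASCII-only tools; a sketch, not a fix for GNOME's behaviour (the script name ascii_gate.py is made up):

```python
# Read stdin as raw bytes and refuse anything outside 7-bit ASCII, so an
# ASCII-only tool chain fails loudly at the boundary instead of silently
# mangling text downstream.
# Usage: some-producer | python ascii_gate.py | older-tool
import sys

data = sys.stdin.buffer.read()
try:
    data.decode("ascii")
except UnicodeDecodeError as err:
    sys.stderr.write(f"non-ASCII byte at offset {err.start}: {data[err.start]:#x}\n")
    sys.exit(1)

sys.stdout.buffer.write(data)
```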
Some days, I imagine a parallel universe where the ancient Chinese decided ideograms were a bad idea and went on to develop a proper alphabet. Unicode would be pretty much unnecessary.