The Python problem is amusing. Python 3 has three internal string representations (1-byte, 2-byte, and 4-byte) and picks the widest one the string's contents require. This is mostly to support string indexing. It probably would have been better to use UTF-8 and create an index array for the string when necessary.

You rarely need to index a string with an integer in Python. For loops don't need to. Regular expressions don't need to. Operations that return a position into the string could return an opaque type which acts as a string index. That type should support adding and subtracting integers (at least +1 and -1) by stepping through the string. That would take care of most of the use cases. Attempts to index a string with an int would generate index arrays internally. (Or, for short strings, just start at the beginning every time and count.)

Windows and Java have big problems. They really are 16-bit-char based. It's not Java's fault; they standardized when Unicode was 16 bits.
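To make the "opaque index" idea concrete, here is a rough Python sketch. This is my own illustration, not how CPython works; the names Utf8String and Utf8Index are made up. The text is stored as UTF-8 bytes, and positions can only be moved forward or backward a code point at a time.

```python
class Utf8Index:
    """Opaque position into a Utf8String; supports +1 / -1 stepping."""
    def __init__(self, data: bytes, byte_pos: int):
        self._data = data
        self._byte_pos = byte_pos

    def __add__(self, n: int) -> "Utf8Index":
        pos = self._byte_pos
        step = 1 if n >= 0 else -1
        for _ in range(abs(n)):
            pos += step
            # Skip UTF-8 continuation bytes (0b10xxxxxx) so we always
            # land on a code point boundary.
            while 0 <= pos < len(self._data) and (self._data[pos] & 0xC0) == 0x80:
                pos += step
        return Utf8Index(self._data, pos)

    def __sub__(self, n: int) -> "Utf8Index":
        return self + (-n)


class Utf8String:
    def __init__(self, text: str):
        self._data = text.encode("utf-8")

    def start(self) -> Utf8Index:
        return Utf8Index(self._data, 0)

    def char_at(self, idx: Utf8Index) -> str:
        # Decode exactly one code point starting at the index's byte position.
        end = (idx + 1)._byte_pos
        return self._data[idx._byte_pos:end].decode("utf-8")


s = Utf8String("naïve 🙂")
i = s.start() + 3          # steps over the two-byte 'ï' correctly
print(s.char_at(i))        # 'v'
```

An index array (byte offset per code point) could be built lazily the first time someone insists on integer indexing, exactly as suggested above.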
It's interesting how history seems to have repeated itself with UTF-16. With ASCII and its extensions, we had 128 "normal" characters and everything else was exotic text that caused problems.

Now with UTF-16, the "normal" characters are the ones in the Basic Multilingual Plane that fit in a single 16-bit code unit, and everything outside the BMP is the exotic text that needs surrogate pairs and causes problems.
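A quick Python illustration of that split (just counting code units; nothing platform-specific here):

```python
# BMP characters fit in one UTF-16 code unit (2 bytes); characters
# outside the BMP need a surrogate pair (two code units, 4 bytes).
for ch in ["é", "猫", "🙂"]:
    units = len(ch.encode("utf-16-le")) // 2
    print(f"{ch!r}: {units} UTF-16 code unit(s)")
# 'é': 1, '猫': 1, '🙂': 2
```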
This article was from 4 years ago. Since then, UTF-8 adoption has increased from 68% to 87% of the Alexa top 10 million websites:

https://w3techs.com/technologies/history_overview/character_encoding/ms/y
When you create a table in MySQL, a text column (VARCHAR etc.) is not encoded in UTF-8 by default.

I think UTF-8 should be the default and only format for storing text columns in all databases, and all other text encodings should be removed from database systems.
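Until that happens, the workaround is to spell the character set out explicitly. (One wrinkle: MySQL's legacy "utf8" charset is actually a 3-byte subset, utf8mb3; full UTF-8 is "utf8mb4".) A minimal sketch, with the table and column names invented:

```python
# Declare utf8mb4 explicitly rather than relying on the server default.
ddl = """
CREATE TABLE articles (
    id    INT PRIMARY KEY,
    title VARCHAR(255),
    body  TEXT
) CHARACTER SET utf8mb4
  COLLATE utf8mb4_unicode_ci;
"""

# With a driver such as PyMySQL (assumed installed and pointed at a real
# server), it would be executed roughly like this:
# import pymysql
# conn = pymysql.connect(host="localhost", user="me", password="...",
#                        database="test", charset="utf8mb4")
# with conn.cursor() as cur:
#     cur.execute(ddl)
print(ddl)
```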
I have the feeling that back in 1990, ISO 10646 wanted 32-bit characters but had no software folks on that committee, while the Unicode people were basically software folks but thought that 16 bits would be enough (this dates back to the original Unicode proposal from 1988). UTF-8 was only created in 1992, after the software folks rejected the original DIS 10646 in mid-1991.
Maybe off-topic, but I wonder if anyone is planning a redesign of Unicode for the far future? Or is there a better way to handle characters, so we don't require a giant library like ICU?
> In both UTF-8 and UTF-16 encodings, code points may take up to 4 bytes.

Not quite the whole story: UTF-16 takes up to 4 bytes, and UTF-8 as originally designed could take up to 6, although RFC 3629 now restricts UTF-8 to at most 4 bytes for valid Unicode code points.

> Cyrillic, Hebrew and several other popular Unicode blocks are 2 bytes both in UTF-16 and UTF-8.

Cyrillic, Hebrew and several other languages still have spaces and punctuation, which take a single byte in UTF-8. It's 2016 now: RAM and storage are cheap and getting cheaper, but a CPU branch misprediction still costs the same ~20 cycles and isn't going to decline.

> plain Windows edit control (until Vista)

Windows XP is 14 years old, and in 2016 its market share is less than 3%. Who cares what things were like before Vista?

> In C++, there is no way to return Unicode from std::exception::what() other than using UTF-8.

The exceptions that are part of the STL don't return Unicode at all; they are in English. If you throw custom exceptions that return non-English messages from exception::what() in UTF-8, then catch std::exception and call what(), you'll get English error messages for the STL-thrown exceptions and non-English messages for your custom exceptions.

I'm not sure mixing GUI languages in a single app is always the right thing.

> First, the application must be compiled as Unicode-aware

The oldest Visual Studio I have installed is 2008 (because I sometimes develop for WinCE). I've just created a new C++ console application project, and by default it is already Unicode-aware.

So for anyone using a Microsoft IDE, this requirement is not a problem.
After considering this problem in some detail in the past, I too favoured UTF-8 at the time.

I remember a project (circa 1999) I worked on which was a feature-phone HTML 3.4 browser and email client (one of the first). The browser/IP stack handled only ASCII/code-page characters to begin with. To my surprise it was decided to encode text on the platform using UTF-16, so the entire code base was converted to use 16-bit code units (UCS-2). On a resource-constrained platform (~300 KB of RAM, IIRC), it would have been better, I think, to update the renderer and email client to understand UTF-8.

Nice as it might be to imagine that a UTF-16 or UTF-32 code unit is a "character", it is, as has been pointed out, not the case; when you look into language you can see how it never can be that simple.
I quite like Swift's approach: Characters, where a character is "An extended grapheme cluster ... a sequence of one or more Unicode scalars that (when combined) produce a single human-readable character." In practice this seems to mean that things like multi-byte sequences and modified (combined) entries end up as a single entry.

As the trade-off, directly indexing into strings is either not possible or discouraged, and often relies on an opaque(?) indexing class.

The main weirdness I have encountered so far is that the Regex functions operate only on the old, Objective-C method of indexing, so a little swizzling is required to handle things properly.
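For contrast, here is the underlying mismatch that Swift's Character model hides, shown in Python (which counts code points, not grapheme clusters); this is just an illustration, not Swift code:

```python
import unicodedata

# "é" built from 'e' plus a combining acute accent: one human-readable
# character, but two Unicode code points.
decomposed = "e\u0301"
print(len(decomposed))                                # 2 code points
print(len(unicodedata.normalize("NFC", decomposed)))  # 1 after composition

# A flag emoji is two regional-indicator code points; normalization can't
# collapse it, so a code-point count still says 2, while Swift's
# grapheme-aware `count` reports 1.
flag = "\U0001F1EB\U0001F1F7"   # 🇫🇷
print(len(flag))                # 2
```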
Off-topic, but does anyone know of a way to make sure I don't introduce non-ASCII filenames, so they stay broadly portable across systems? I've had to resort to disabling UTF-8 on Linux to achieve that.
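One low-tech option is a checker you can run by hand or wire into a pre-commit hook; a small sketch (assumes Python 3.7+ for str.isascii(), and the root path is just an example):

```python
# Walk a directory tree and report any file or directory name that
# isn't pure ASCII; exit non-zero if anything is found.
import os
import sys

def non_ascii_names(root: str):
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            if not name.isascii():
                yield os.path.join(dirpath, name)

if __name__ == "__main__":
    root = sys.argv[1] if len(sys.argv) > 1 else "."
    offenders = list(non_ascii_names(root))
    for path in offenders:
        print(f"non-ASCII name: {path!r}")
    sys.exit(1 if offenders else 0)
```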
This militancy to *force* everyone to use UTF-8 is bad engineering. I'm thinking of GNOME 3, where you aren't even allowed the option of choosing ASCII as a default setting, only UTF-8 or ISO-8859-x. A default setting is just as important for what it filters out as for what it passes through. I use a lot of older tools on *nix that are ASCII-only, in tool chains that slurp and munge text. If the chain includes any of these UTF-8-only apps, I'm constantly dealing with the problem of invalid ASCII passing through.
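For whatever it's worth, one stopgap is a small gate at the head of the pipeline that rejects non-ASCII bytes before they reach the ASCII-only tools; a sketch, not a fix for GNOME's behaviour (the script name ascii_gate.py is made up):

```python
# Read stdin as raw bytes and refuse anything outside 7-bit ASCII, so an
# ASCII-only tool chain fails loudly at the boundary instead of silently
# mangling text downstream.
# Usage: some-producer | python ascii_gate.py | older-tool
import sys

data = sys.stdin.buffer.read()
try:
    data.decode("ascii")
except UnicodeDecodeError as err:
    sys.stderr.write(f"non-ASCII byte at offset {err.start}: {data[err.start]:#x}\n")
    sys.exit(1)

sys.stdout.buffer.write(data)
```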
Some days, I imagine a parallel universe where the ancient Chinese decided ideograms were a bad idea and went on to develop a proper alphabet. Unicode would be pretty much unnecessary.