Broad backwards compatibility with ASCII is a strong reason to prefer UTF-8 in most applications, but I find the issues with UTF-16 are overstated and its advantages ignored.<p>"UTF-16 has no practical advantages over UTF-8" - For Asian languages, UTF-16 is quite a bit more compact: common Chinese characters take 3 bytes each in UTF-8 but only 2 bytes in UTF-16.<p>"A related drawback is the loss of self-synchronization at the byte level." - Maybe this is a problem, maybe not. Maybe the failure of UTF-8 to be self-synchronising at the 4-bit level is a problem in some circumstances. I don't mean to be flippant, but the wider point is that with UTF-16 you really need to commit to a 16-bit code unit width.<p>"The encoding is inflexible" - I think the author has confused the fixed-width UCS-2 with the variable-width UTF-16.<p>"We lose backwards compatibility with code treating NUL as a string terminator." - Not true. NUL is a 16-bit code unit in UTF-16; use a C compiler that supports a 16-bit char width.
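The size difference is easy to check; a quick Python sketch (the sample string is mine, and note that characters outside the BMP would take 4 bytes in both encodings):

```python
# Compare encoded sizes of a short CJK string (example text is illustrative)
s = "你好，世界"  # five BMP characters

print(len(s.encode("utf-8")))      # 15 bytes: 3 bytes per character
print(len(s.encode("utf-16-le")))  # 10 bytes: 2 bytes per character
```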
> The Go designers knew what they were doing, so they chose UTF-8.<p>The authors of Go and the authors of UTF-8 are more or less the same people, so the choice was a no-brainer.
I keep hoping a string API will catch on in which combining marks are mostly treated as indivisible. Handling text one codepoint at a time is as bad an idea as handling US-ASCII one bit at a time--almost everything it lets you do is an elaborate way to misinterpret or corrupt your data.
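To make the corruption risk concrete, a minimal Python sketch (the example string is mine): slicing by codepoint can silently split a combining mark off its base character.

```python
import unicodedata

s = "e\u0301"  # 'e' + U+0301 combining acute: one user-perceived character
print(len(s))        # 2 - codepoint count, not what the user sees
print(s[:1])         # 'e' - codepoint slicing strips the accent
print(len(unicodedata.normalize("NFC", s)))  # 1 - here NFC composes to U+00E9
```

NFC only rescues you when a precomposed codepoint happens to exist; many grapheme clusters have none, which is exactly why a grapheme-aware string API matters.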
The article, at the end, claims that "ASCII was developed from telegraph codes by a committee." It turns out the story is much, much more complicated and interesting than that: <a href="http://www.wps.com/projects/codes/" rel="nofollow">http://www.wps.com/projects/codes/</a>
UCS-2, even though it is an outdated predecessor of UTF-16, has some unique qualities that make it useful for things like databases or other storage media where you are not mixing in a lot of low-code-point characters (as you do with XML and HTML markup).<p>One is that it is fair to all languages with respect to size, which matters when you are storing mostly standard Chinese, Korean, or Japanese characters.<p>When UTF-16 made UCS-2 variable length, a few of those nice properties were lost, but when you are dealing mostly with higher-code-point characters, UTF-16 may still save you space.
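The markup caveat is easy to demonstrate (a Python sketch with made-up sample text): once ASCII tags surround the CJK content, UTF-8 pulls back ahead.

```python
# ASCII markup dilutes the CJK size advantage of 2-byte code units
html = "<p>你好，世界</p>"          # 7 ASCII characters + 5 CJK characters

print(len(html.encode("utf-8")))     # 22 bytes: 7*1 + 5*3
print(len(html.encode("utf-16-le"))) # 24 bytes: 12*2
```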
may i dare make a conclusion with my observations? :)<p>utf-8 is good for network interchange and is becoming the de facto standard.<p>utf-16 is not bad for internal storage of strings in memory or in a database. not necessarily bad; maybe even better for some purposes.
Can someone link me to where Python uses UTF-16? It was my understanding that it defaulted to UTF-8.<p><a href="http://www.python.org/dev/peps/pep-3120/" rel="nofollow">http://www.python.org/dev/peps/pep-3120/</a>
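For context (my understanding, not something the linked PEP covers): PEP 3120 is about the default *source* encoding; the UTF-16 claim is about the *internal* string representation, where pre-3.3 "narrow" CPython builds stored strings as UTF-16 code units. A quick check:

```python
import sys

# On a pre-3.3 "narrow" CPython build, sys.maxunicode was 0xFFFF and
# astral characters were stored as surrogate pairs, so len() reported 2.
print(sys.maxunicode)     # 1114111 (0x10FFFF) on modern CPython (PEP 393)
print(len("\U0001F600"))  # 1 today; 2 on an old narrow build
```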
Correct me if I'm wrong, but I think Excel still outputs UTF-16 in some cases. I remember parsing generated .txt/.csv files and there were issues with the encoding and its endian order.
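If memory serves, Excel's "Unicode Text" export is UTF-16LE with a BOM. In Python, the generic "utf-16" codec uses the BOM to resolve the byte order (a sketch; the byte literal is fabricated for illustration):

```python
# A UTF-16-LE BOM (FF FE) followed by "hi"; bytes are an illustrative literal
data = b"\xff\xfeh\x00i\x00"

print(data.decode("utf-16"))     # 'hi' - codec reads the BOM, picks LE
print(data.decode("utf-16-le"))  # '\ufeffhi' - BOM survives as U+FEFF
```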