
UTF-8 good, UTF-16 bad

117 points by mattyb over 14 years ago

12 comments

chalst over 14 years ago

Broad backwards compatibility with ASCII is a strong reason to prefer UTF-8 in most applications; however, I find the issues with UTF-16 are overstated and its advantages ignored.

"UTF-16 has no practical advantages over UTF-8" - For Asian languages, UTF-16 is better, i.e., quite a bit more compact, since, e.g., Chinese characters always take 3 bytes in UTF-8 but 2 in UTF-16.

"A related drawback is the loss of self-synchronization at the byte level." - Maybe this is a problem, maybe not. Maybe the failure of UTF-8 to be self-synchronising at the 4-bit level is a problem in some circumstances. I don't mean to be flippant, but the wider point is that with UTF-16 you really need to commit to a 16-bit char width.

"The encoding is inflexible" - I think the author has confused the fixed-width UCS-2 with the variable-width UTF-16.

"We lose backwards compatibility with code treating NUL as a string terminator." - Not true. NUL is a 16-bit character in UTF-16. Use a C compiler that supports a 16-bit char width.
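The size comparison above is easy to verify directly. A quick Python sketch (the sample strings are my own) comparing byte counts per encoding, plus the NUL-terminator point:

```python
# Byte counts for the same text under each encoding.
ascii_text = "Hello, world"      # 12 ASCII characters
chinese_text = "统一码字符编码"    # 7 CJK characters

for label, text in [("ASCII", ascii_text), ("Chinese", chinese_text)]:
    u8 = len(text.encode("utf-8"))
    u16 = len(text.encode("utf-16-le"))  # -le: no BOM prepended
    print(f"{label}: {u8} bytes in UTF-8, {u16} bytes in UTF-16")
# ASCII: 12 vs 24 (UTF-8 wins); Chinese: 21 vs 14 (UTF-16 wins)

# The NUL point: 'A' in UTF-16-LE is the bytes 0x41 0x00, so a
# byte-oriented strlen() would stop after a single byte.
assert "A".encode("utf-16-le") == b"A\x00"
```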
wooster over 14 years ago

Better discussion here, IMO: http://research.swtch.com/2010/03/utf-8-bits-bytes-and-benefits.html
adobriyan over 14 years ago

> The Go designers knew what they were doing, so they chose UTF-8.

The authors of Go and the authors of UTF-8 are more or less the same people, so the choice was a no-brainer.
burgerbrain over 14 years ago
I honestly had no idea there were parts of the development community that actually used and preferred UTF-16...
prodigal_erik over 14 years ago
I keep hoping a string API will catch on in which combining marks are mostly treated as indivisible. Handling text one codepoint at a time is as bad an idea as handling US-ASCII one bit at a time--almost everything it lets you do is an elaborate way to misinterpret or corrupt your data.
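The combining-mark hazard is easy to demonstrate with Python's standard library: a user-perceived character can span multiple code points, and slicing by code point can split it.

```python
import unicodedata

# "é" written as a base letter plus a combining acute accent:
# two code points, but one user-perceived character.
decomposed = "e\u0301"
composed = unicodedata.normalize("NFC", decomposed)  # precomposed U+00E9

print(len(decomposed))  # 2 -- code points, not characters
print(len(composed))    # 1

# Slicing by code point strands the combining mark:
print(decomposed[:1])   # "e" -- the accent is silently dropped
```

Full grapheme-cluster segmentation (Unicode UAX #29) is not in the standard library; third-party packages handle the general case.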
thristian over 14 years ago

The article, at the end, claims that "ASCII was developed from telegraph codes by a committee." It turns out the story is much, much more complicated and interesting than that: http://www.wps.com/projects/codes/
zbowling over 14 years ago

UCS-2, even though it is an outdated predecessor to UTF-16, has some unique qualities that make it useful for things like databases or other storage media where you are not mixing in a lot of low-code-point characters (as you do with XML and HTML markup).

One is that it is fair to all languages with respect to size, which matters when you are storing your standard Chinese, Korean, and Japanese characters.

When UTF-16 made UCS-2 variable length, a few of those nice things were lost, but when dealing mostly with higher-code-point characters, UTF-16 may still save you space.
fedd over 14 years ago

may i dare make a conclusion from my observations? :)

utf-8 is good for network interchange and is de facto becoming the standard.

utf-16 is not bad for internal storage of strings in memory or in a database. not necessarily bad. maybe even better for some reasons
natmaster over 14 years ago

Can someone link me to where Python uses UTF-16? It was my understanding it defaulted to UTF-8.

http://www.python.org/dev/peps/pep-3120/
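For context: PEP 3120 only covers source-file encoding. At the time of this thread, "narrow" CPython builds stored strings internally as UTF-16 code units, so characters outside the BMP counted as two; since PEP 393 (Python 3.3) strings use a flexible internal representation instead. A quick check on a modern interpreter:

```python
import sys

# U+1F600 lies outside the BMP. On a pre-3.3 "narrow" build it was
# stored as a UTF-16 surrogate pair and len() reported 2; under
# PEP 393 it is a single code point.
emoji = "\U0001F600"
print(len(emoji))       # 1 on CPython 3.3+

# sys.maxunicode was 0xFFFF on narrow builds; now always 0x10FFFF.
print(hex(sys.maxunicode))
```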
wildmXranat over 14 years ago

Correct me if I'm wrong, but I think Excel still outputs UTF-16 in some cases. I remember parsing generated .txt/.csv files and there were issues with it and its endian order.
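Endianness in such exports is usually signalled by a byte-order mark, which you can sniff before decoding. A minimal sketch (the sample bytes are fabricated to resemble an Excel tab-separated export):

```python
def detect_bom(data: bytes) -> str:
    """Guess the encoding from a leading byte-order mark, if any."""
    if data.startswith(b"\xff\xfe"):
        return "utf-16-le"   # note: also matches a UTF-32-LE BOM prefix
    if data.startswith(b"\xfe\xff"):
        return "utf-16-be"
    if data.startswith(b"\xef\xbb\xbf"):
        return "utf-8-sig"
    return "unknown (no BOM)"

# Excel's "Unicode Text" export is UTF-16-LE with a BOM; simulate one:
sample = b"\xff\xfe" + "名前\t42\n".encode("utf-16-le")
print(detect_bom(sample))  # utf-16-le
```

With the variant known, `sample[2:].decode("utf-16-le")` (or decoding the whole buffer as `"utf-16"`, which consumes the BOM) recovers the text.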
mikecaron over 14 years ago

Good read! I always HATE it when things complain that my Visual Studio *.c/*.h files are BINARY! WTF!
fedd over 14 years ago

and i find it fair that American characters require 2 bytes in Java, just like everybody else's, not 1 as in utf-8! :)