
UTF-8: Bits, Bytes, and Benefits

36 points by alexkon over 14 years ago

7 comments

js2 over 14 years ago

The Wikipedia page on UTF-8 (http://en.wikipedia.org/wiki/UTF-8) is excellent, and includes a link to this wonderful anecdote from Rob Pike (http://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt):

> UTF-8 was designed, in front of my eyes, on a placemat in a New Jersey diner one night in September or so 1992.

> What happened was this. We had used the original UTF from ISO 10646 to make Plan 9 support 16-bit characters, but we hated it. We were close to shipping the system when, late one afternoon, I received a call from some folks, I think at IBM - I remember them being in Austin - who were in an X/Open committee meeting. They wanted Ken and me to vet their FSS/UTF design. We understood why they were introducing a new design, and Ken and I suddenly realized there was an opportunity to use our experience to design a really good standard and get the X/Open guys to push it out. We suggested this and the deal was, if we could do it fast, OK. So we went to dinner, Ken figured out the bit-packing, and when we came back to the lab after dinner we called the X/Open guys and explained our scheme. We mailed them an outline of our spec, and they replied saying that it was better than theirs (I don't believe I ever actually saw their proposal; I know I don't remember it) and how fast could we implement it? I think this was a Wednesday night and we promised a complete running system by Monday, which I think was when their big vote was.
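The bit-packing Pike describes Ken working out over dinner is compact enough to show in a few lines. Here is an illustrative Python sketch of the scheme (mine, not the original Plan 9 code): the leading byte's high bits announce the sequence length, and continuation bytes all start with `10`.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode a single Unicode code point using the UTF-8 bit-packing."""
    if cp < 0x80:        # 1 byte:  0xxxxxxx
        return bytes([cp])
    if cp < 0x800:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

# Matches Python's built-in encoder:
assert utf8_encode(ord("é")) == "é".encode("utf-8")
assert utf8_encode(0x1F600) == "\U0001F600".encode("utf-8")
```

A nice consequence of the layout: ASCII bytes never appear inside a multi-byte sequence, so C code scanning for `/` or `\0` keeps working unmodified.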
pilif over 14 years ago

Totally agreeing on the benefits. But remember one thing: you English-speaking people out there have it easy because of points 1-3 in the article.

This stops being the case the moment you are dealing with any non-ASCII character. At that point, some assumptions stop being valid; for instance, it is no longer true that every document is a valid UTF-8 encoded document.

If you treat arbitrarily encoded data as UTF-8 then, depending on your environment, you will either have exceptions thrown at you or see question marks all over the place.

Combine this with the fact that browsers sometimes are not quite sure what they are doing: I've seen them send Latin-1 while telling the server it's UTF-8, and the other way around.

The Rails people tried to detect UTF-8-ness using a snowman and, lately, a checkmark (http://railssnowman.info/).

Finally, keep in mind that it's impossible to accurately detect the encoding of data that isn't UTF-8 without actually doing language analysis.

Soon you'll notice that UTF-8 isn't magically solving problems.

It's funny how many English-speaking people talking to English-speaking audiences think that slapping a "; charset=utf-8" onto their Content-Type headers suddenly makes their site UTF-8 compliant. As long as their content is 7-bit English and their users send data in 7-bit English, they might as well have left the charset alone or set it to ASCII, since ASCII is valid UTF-8 as long as the high bit isn't set.

The application I'm maintaining, while multilingual, is thankfully targeting countries whose languages can be expressed in Latin-1, so that's what we are (still) using. I made various attempts at going UTF-8, but between lying browsers and external third-party APIs still insisting on Latin-1, these efforts never bore fruit.
Comment #2072479 not loaded
Comment #2070948 not loaded
Comment #2071120 not loaded
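The failure mode pilif describes (mislabeled Latin-1 bytes decoded as UTF-8), and the reason English-only sites get away with it, can both be reproduced in a few lines. A small Python sketch, not from the thread:

```python
latin1_bytes = "café".encode("latin-1")   # b'caf\xe9'

# Decoding Latin-1 data as UTF-8 raises, exactly as pilif describes;
# other environments silently substitute replacement characters instead.
try:
    latin1_bytes.decode("utf-8")
except UnicodeDecodeError as e:
    print(e)  # 0xE9 starts a multi-byte sequence, but no valid continuation follows

# Pure 7-bit ASCII is byte-identical under ASCII, Latin-1, and UTF-8,
# which is why an English-only site "works" whatever charset it declares:
ascii_text = "plain english"
assert (ascii_text.encode("ascii")
        == ascii_text.encode("latin-1")
        == ascii_text.encode("utf-8"))
```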
jharsman over 14 years ago

If you're representing text that contains lots of characters outside ASCII, say Chinese, UTF-8 will consume much more storage than necessary. There are many languages where non-ASCII characters are extremely common.

He misses a very useful property of UTF-8 as well: it never contains null bytes. If you instead use, e.g., UTF-16 for mostly-ASCII text, it will contain lots of nulls, which trips up all sorts of heuristics for detecting binary files in various programs.

UTF-8 is a very clever way to avoid problems on systems suffering under the mistaken assumption that text is 8-bit byte strings (*cough* UNIX *cough*), but that doesn't make it the ideal choice every time.

It is still very common for cross-platform tools not to handle file names with non-ASCII characters, for example. Both Mercurial and Git suffered from this last I looked. The reason is that they treat file names as byte strings instead of text in some encoding, and therefore cannot translate to the proper encoding on platforms that treat file names as Unicode text, like Windows and OS X. OS X also uses a somewhat unconventional normalization form, which means you need to handle normalization as well.
Comment #2071491 not loaded
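Both of jharsman's observations, the storage cost for CJK text and the null-byte difference, are easy to verify. A quick Python sketch (my example strings, not from the thread):

```python
s = "mostly ASCII text"
assert b"\x00" not in s.encode("utf-8")     # UTF-8 never emits NUL for non-NUL text
assert b"\x00" in s.encode("utf-16-le")     # UTF-16 pads every ASCII char with a zero byte,
                                            # tripping binary-file heuristics

# The storage trade-off for Chinese text:
zh = "统一码"                                # three BMP CJK characters
assert len(zh.encode("utf-8")) == 9         # 3 bytes per character in UTF-8
assert len(zh.encode("utf-16-le")) == 6     # 2 bytes per character in UTF-16
```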
justin_vanw over 14 years ago

This article is wrong in lots of ways, but I'll quickly debunk one point.

> 5. Substring search is just byte string search.

Nope. The same character sequence can have multiple UTF-8 representations. Yes, this comes up in the real world, especially on the web.

Reference: http://en.wikipedia.org/wiki/Unicode_equivalence#Normal_forms

The only sane way to handle Unicode is to load it using a codec in the language of your choice and manipulate it using the tools your language supplies for manipulating strings, OR read all the insane specs for Unicode and implement your own. Treating Unicode as bytes will end in tears.
Comment #2071562 not loaded
Comment #2072183 not loaded
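The multiple-representation problem justin_vanw raises comes from Unicode normalization forms: "é" can be the single code point U+00E9 or "e" followed by the combining accent U+0301. A short sketch with Python's standard unicodedata module:

```python
import unicodedata

precomposed = "café"                                      # é as U+00E9
decomposed = unicodedata.normalize("NFD", precomposed)    # e + combining U+0301

# Canonically equivalent text, yet different code point sequences,
# so naive byte-string substring search misses one of the forms:
assert precomposed != decomposed
assert "é".encode("utf-8") not in decomposed.encode("utf-8")

# Normalizing both sides to the same form first makes search work:
assert unicodedata.normalize("NFC", decomposed) == precomposed
```

This is why the sane path is the one the commenter names: decode with a codec and use your language's string tools (which can normalize), rather than comparing raw bytes.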
endian over 14 years ago

On the subject of cool byte-string properties of encodings: http://keyjson.org

sorted(map(encode, values)) == sorted(values)

</ shameless plug >
riffraff over 14 years ago

What does

> UTF-8 sequences sort in code point order. You can verify this by inspecting the encodings in the table above. This means that Unix tools like join, ls, and sort (without options) don't need to handle UTF-8 specially.

mean? Isn't ordering a property that is external to the encoding and language dependent? (Though AFAICT ls on OS X seems to sort the Italian and Hungarian alphabets fine, even if it's unable to handle character length properly :) )

EDIT: got it, _code point order_. I'm an idiot.
Comment #2071464 not loaded
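The property riffraff is quoting (sorting UTF-8 strings as plain bytes gives the same order as sorting by code point, though not a locale-aware collation) can be demonstrated directly. A small Python sketch with characters of every UTF-8 sequence length:

```python
# Code points spanning 1-, 2-, 3-, and 4-byte UTF-8 sequences,
# deliberately listed out of order:
chars = ["中", "A", "😀", "é", "z"]

by_code_point = sorted(chars)                                   # Python compares code points
by_utf8_bytes = sorted(chars, key=lambda c: c.encode("utf-8"))  # plain byte comparison

assert by_code_point == by_utf8_bytes
```

This holds because a longer UTF-8 sequence always starts with a numerically larger lead byte, so byte-wise comparison never ranks a higher code point below a lower one. It is exactly why byte-oriented tools like sort(1) get code point order for free.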
alecco over 14 years ago

I used to think UTF was a great thing, but I came to realize it isn't good to go through all this trouble just to keep using a 70s US-centric API. UTF is error prone, and the problems are often subtle and hard to catch.

Also, the same API has serious issues with buffer overflows and off-by-one errors. It would be great to move on.
Comment #2071289 not loaded
Comment #2071078 not loaded