The UTF-8-Everywhere Manifesto

381 points by bearpool, about 13 years ago

21 comments

pilif, about 13 years ago

Really good article. You'll get nothing from me but heartfelt agreement. I especially liked that the article gave numbers on how inefficient UTF8 would be for storing Asian text (not very, apparently).

Also insightful, but obvious in hindsight: not even in UTF-32 can you index a specific character in constant time, due to the various digraphs.

The one property I really love about UTF8 is that you get a free consistency check, as not every arbitrary byte sequence is a valid UTF8 string. This is a really good help for detecting encoding errors very early (still to this day, applications are known to lie about the encoding of their output).

And of course, there's no endianness issue, removing the need for a BOM, which makes it possible for tools that operate at the byte level to still do the right job.

If only it had better support outside of Unix. For example, try opening a UTF8-encoded CSV file (using characters outside of ASCII, of course) in Mac Excel (the latest version; earlier versions didn't know UTF8 at all) for a WTF experience somewhere between comical and painful.

If there is one thing I could criticize about UTF8, it would be its similarity to ASCII (which is also its greatest strength), causing many applications and APIs to boldly declare UTF8 compatibility when all they really offer is ASCII compatibility, emitting a mess (or blowing up) once they have to deal with code points outside that range.

I'm jokingly calling this US-UTF8 when I encounter it (all too often, unfortunately), but maybe the proliferation of "cool" characters like what we recently got with Emoji is going to help with this over time.
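
The free consistency check mentioned above can be sketched in a few lines of Python. This is only an illustration: the helper name `is_valid_utf8` and the sample text are made up for the example, and the check relies on Python's built-in strict UTF-8 decoder.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")  # strict error handling is the default
        return True
    except UnicodeDecodeError:
        return False

# The same text, once encoded as UTF-8 and once as Latin-1.
utf8_bytes = "naïve café".encode("utf-8")
latin1_bytes = "naïve café".encode("latin-1")

print(is_valid_utf8(utf8_bytes))    # True
print(is_valid_utf8(latin1_bytes))  # False: 0xEF is not followed by continuation bytes
```

A passing check does not prove the data really is UTF-8, but mislabelled non-ASCII text in a legacy 8-bit encoding rarely passes by accident.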

gwillen, about 13 years ago

Ok, let me be the first approving top-level comment: This document is correct. The author of this document is smart. You should follow this document.

As jwz said about backups: "Shut up. I know things. You will listen to me. Do it anyway."

luriel, about 13 years ago

Yes! I have been meaning to write something like this for years.

There is only one thing I would add: Never add a BOM to a UTF-8 file!! It is redundant, useless and breaks all kinds of things by attaching garbage to the start of your files.

Edit: Here is the interesting story of how Ken Thompson invented UTF-8: http://doc.cat-v.org/bell_labs/utf-8_history
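
To make the "garbage at the start of your files" point concrete, here is a small Python sketch. It uses Python's `utf-8-sig` codec, which prepends a BOM when encoding; the shebang line and the `strip_bom` helper are assumptions chosen for the example.

```python
import codecs

text = "#!/bin/sh\necho hello\n"
with_bom = text.encode("utf-8-sig")   # codec that writes a leading BOM
without_bom = text.encode("utf-8")

print(with_bom[:3] == codecs.BOM_UTF8)  # True: the file now starts with EF BB BF
print(with_bom.startswith(b"#!"))       # False: the shebang is no longer at byte 0
print(without_bom.startswith(b"#!"))    # True

def strip_bom(data: bytes) -> bytes:
    """Drop a leading UTF-8 BOM if present (hypothetical helper)."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data
```

Any tool that looks for a magic prefix at byte 0 is affected in the same way.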

pcwalton, about 13 years ago

Sadly, the pervasiveness of JavaScript means that UTF-16 interoperability will be needed at least as long as the Web is alive. JavaScript strings are fundamentally UTF-16. This is why we've tentatively decided to go with UTF-16 in Servo (the experimental browser engine) -- converting to UTF-8 every time text needed to go through the layout engine would kill us in benchmarks.

For new APIs in which legacy interoperability isn't needed, I completely approve of this document.

cygx, about 13 years ago

Personally, I prefer UTF-8 as well. However, I think this whole debate about choice of encoding gets blown out of proportion.

Consider the following diagram:

<pre><code>
                        [user-perceived characters] <-+
                                      ^               |
                                      |               |
                                      v               |
            [characters] <-> [grapheme clusters]      |
                 ^                    ^               |
                 |                    |               |
                 v                    v               |
[bytes] <-> [codepoints]          [glyphs] <----------+
</code></pre>

Choice of encoding only affects the conversion from bytes to codepoints, which is pretty straightforward: The subtleties lie elsewhere...
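
A short Python sketch of the layers in that diagram, using an accented letter written as a base letter plus a combining mark. The sample string is an assumption picked for illustration; only the standard `unicodedata` module is used.

```python
import unicodedata

# 'e' followed by U+0301 COMBINING ACUTE ACCENT: one user-perceived character.
s = "e\u0301"

print(len(s.encode("utf-8")))      # 3 bytes
print(len(s.encode("utf-16-le")))  # 4 bytes
print(len(s))                      # 2 code points, whatever the encoding

# Normalization, not the choice of encoding, changes the code point count.
composed = unicodedata.normalize("NFC", s)
print(len(composed))               # 1 code point (U+00E9)
```

The byte counts differ per encoding, but the harder question of what counts as one character is the same in all of them.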

raverbashing, about 13 years ago

Disagree.

"UTF-16 is the worst of both worlds—variable length and too wide"

Really, the author tries to convince the reader, but it's not that clear-cut.

One of the advantages of UTF-16 is knowing right away it's UTF-16, as opposed to deciding whether it's UTF-8, ASCII, or some other encoding. Sure, for transmission it's a waste of space (still, with today's computer capabilities, text size is a non-issue even with UTF-32).

"It's not fixed width"

But for most text, it is. Sure, you can do UTF-32 and it may not be a bad idea (today).

Yes, Windows has to deal with several complications and with backwards compatibility, so it's a bag of hurt. Still, they went the right way (internally, it's Unicode, period).

"in plain Windows edit control (until Vista), it takes two backspaces to delete a character which takes 4 bytes in UTF-16"

If I'm not mistaken this is by design. A 4-byte character is usually typed as a combination of characters, so if you want to change the last part of the combination you just type one backspace.
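
For readers who want to check what a 4-byte UTF-16 character actually consists of, here is a Python sketch. The helper names are hypothetical and the emoji is just a convenient code point outside the Basic Multilingual Plane. It shows that the two 16-bit halves are a surrogate pair and that neither half decodes to a character by itself.

```python
import struct

def utf16_code_units(text: str) -> list:
    """Return the UTF-16 code units of text as integers (hypothetical helper)."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))

def decodes_alone(unit: int) -> bool:
    """Check whether a single 16-bit unit is valid UTF-16 by itself."""
    try:
        struct.pack("<H", unit).decode("utf-16-le")
        return True
    except UnicodeDecodeError:
        return False

emoji = "\U0001F600"  # one code point outside the BMP
units = utf16_code_units(emoji)

print([hex(u) for u in units])            # ['0xd83d', '0xde00']
print([decodes_alone(u) for u in units])  # [False, False]
```

Deleting only one half therefore leaves an unpaired surrogate behind rather than a simpler character.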

makecheck, about 13 years ago

Markus Kuhn's web page has a lot of useful UTF-8 info and valuable links (e.g. samples of UTF-8 corner cases that people often miss).

http://www.cl.cam.ac.uk/~mgk25/unicode.html

haberman, about 13 years ago

Totally agree re: UTF-8 vs other Unicode encodings.

But are there still hold-outs who don't like Unicode? Last I heard some CJK users were unhappy about Han Unification: http://en.wikipedia.org/wiki/Han_unification

evincarofautumn, about 13 years ago

For those who don’t know it, UTF8-CPP[1] is a good lightweight header-only library for UTF conversions, mostly STL-compatible.

[1] http://utfcpp.sourceforge.net/

tommi, about 13 years ago

That collection of best practices can hardly be considered a "UTF-8 Everywhere Manifesto", as it focuses on Windows and C++. It's good, but I'd rather see a more manifesto-like document covering all cases on a domain like that.

antidoh, about 13 years ago

Text is maddening, the modern Tower of Babel.

Is there a definitive reference, or small handful of references, to learn all that's worth knowing about text, from ASCII to UTF-∞ and beyond?

mkup, about 13 years ago

I use UTF-8 for transmitted data and disk I/O, and I use UCS-4 (wchar_t on Linux/FreeBSD) for the internal representation of strings in my software.

I generally agree with this article, but I disagree with it on the point that UTF-8 is the only appropriate encoding for strings stored in memory, and I also disagree that wchar_t should be removed from the C++ standard or given a sizeof of 1, as in the Android NDK.

Let me explain why.

In UTF-8 a single Unicode character may be encoded in multiple ways. For example NUL (U+0000) can be encoded as 00 or as C0 80. The second encoding is illegal because it's longer than necessary and forbidden by the standard, but a naive parser may extract NUL out of it. If UTF-8 input was not properly sanitized, or there is a bug in the charset converter, this may result in an exploit like SQL injection or arbitrary filesystem access or something like that: a malicious party can encode not only NUL, but also ", /, \ etc. this way.

Also, a UTF-8 string can't be cut at an arbitrary position. Byte groups (UTF-8 runes) must be processed as a whole, so each must appear either on the left side or on the right side of the cut.

Reversing a UTF-8 string is tricky, especially when illegal character sequences are present in the input string and the corresponding code points (U+FFFD) must be preserved in the output string.

I think UTF-8 for network-transmitted data and disk I/O is inevitable, but our software should keep all in-memory strings in UCS-4 only, and take adequate security precautions in all places where conversion between UTF-8 and UCS-4 happens.

And sizeof(wchar_t)==4 in the GCC ABI is not a design defect; wchar_t exists for a good reason. I admit that sizeof(wchar_t)==2 on Windows is utterly broken.
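
Both concerns above can be illustrated with a small Python sketch. The helper names `decodes_strictly` and `safe_cut` are hypothetical. The first part relies on Python's strict decoder refusing the overlong form C0 80; the second assumes the input is already valid UTF-8 and backs a cut position up to a sequence boundary.

```python
def decodes_strictly(data: bytes) -> bool:
    """Return True if a strict UTF-8 decoder accepts data."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(decodes_strictly(b"\x00"))      # True: the only legal encoding of NUL
print(decodes_strictly(b"\xc0\x80"))  # False: overlong form is rejected

def safe_cut(data: bytes, limit: int) -> bytes:
    """Cut valid UTF-8 at or before limit without splitting a sequence."""
    if limit >= len(data):
        return data
    # Continuation bytes look like 10xxxxxx; back up until the first
    # excluded byte starts a new sequence.
    while limit > 0 and (data[limit] & 0xC0) == 0x80:
        limit -= 1
    return data[:limit]

word = "héllo".encode("utf-8")  # b'h\xc3\xa9llo'
print(safe_cut(word, 2))        # b'h' rather than a dangling b'h\xc3'
print(safe_cut(word, 3))        # b'h\xc3\xa9'
```

The boundary test needs only one byte of lookahead, which is what makes UTF-8 self-synchronizing; it does not address grapheme clusters, which can still be split.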

erichocean, about 13 years ago

The strangest thing about Unicode (any flavor) is that NULL, aka \0, aka "all zeros", is a valid character.

If you claim to support Unicode, you have to support NULL characters; otherwise, you support a subset.

I find most OS utilities that "accept" Unicode fail to accept the NULL character.

FWIW, UTF-8 has a few invalid characters (characters that can never appear in a valid UTF-8 string). Any one of them could be used as an "end of string" terminator if so desired, for situations where the string length is not known up front.

We could even standardize which one (hint hint). I suggest -1 (all 1s).

UPDATE: I meant "strange" as in "surprising", especially for those coming from a C background, like me.
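
Which byte values never occur in valid UTF-8 can be checked by brute force. The Python sketch below encodes every Unicode scalar value (all code points except the surrogate range) in one pass and lists the byte values that were never produced; the variable names are made up for the example.

```python
# Every code point except the surrogates, which UTF-8 may not encode.
all_scalars = "".join(
    chr(cp) for cp in range(0x110000) if not 0xD800 <= cp <= 0xDFFF
)

used = set(all_scalars.encode("utf-8"))
unused = sorted(set(range(256)) - used)

print([hex(b) for b in unused])
# ['0xc0', '0xc1', '0xf5', '0xf6', '0xf7', '0xf8', '0xf9',
#  '0xfa', '0xfb', '0xfc', '0xfd', '0xfe', '0xff']
```

The "all 1s" byte 0xFF suggested above is in that list, so it could serve as an out-of-band terminator in a private protocol, though nothing standardizes such a use.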

sopooneo, about 13 years ago

Can someone explain to me how UTF-8 is endianness independent? I don't mean that I am arguing the fact, I just don't understand how it is possible. Don't you have to know which order to interpret the bits in each byte? And isn't that endianness?
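
A Python sketch that may help with this question. Endianness concerns the order of the bytes that make up one multi-byte unit, not the order of bits inside a byte. UTF-16 code units are two bytes wide, so there are two possible layouts; UTF-8 code units are single bytes, and the order of the bytes in a sequence is fixed by the encoding itself. The sample string is an arbitrary choice.

```python
s = "é€"  # U+00E9 and U+20AC

# Two byte orders for the same 16-bit code units.
print(s.encode("utf-16-le").hex(" "))  # e9 00 ac 20
print(s.encode("utf-16-be").hex(" "))  # 00 e9 20 ac

# UTF-8 has one byte sequence, identical on every machine.
print(s.encode("utf-8").hex(" "))      # c3 a9 e2 82 ac
```

This is why UTF-16 and UTF-32 streams need a BOM or an out-of-band label, while UTF-8 does not.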

chj, about 13 years ago

cannot agree more! it will be a much better world if we all use utf8 for external string representation. i don't care about what your app uses internally, but if it generates output, please use utf8.

CJefferson, about 13 years ago

Is there a simple set of rules for people who currently have code that uses ASCII, to check for UTF-8 cleanness?

In particular, what should I watch out for to make an ASCII parser UTF-8 clean?
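
One property that helps here, sketched in Python: every byte of a multi-byte UTF-8 sequence has its high bit set, so a parser that only looks for ASCII delimiters keeps working on UTF-8 input. The `split_fields` helper and the sample record are assumptions for illustration; the Shift-JIS line is included only as a contrast with an encoding that lacks this property.

```python
def split_fields(line: bytes, sep: bytes = b",") -> list:
    """Byte-level split, written as if the input were ASCII (hypothetical)."""
    return line.split(sep)

record = "名前,Zoë,東京".encode("utf-8")
fields = [f.decode("utf-8") for f in split_fields(record)]
print(fields)  # ['名前', 'Zoë', '東京']

# No byte of a non-ASCII character can collide with an ASCII delimiter.
multibyte = "名前ë東京".encode("utf-8")
print(all(b >= 0x80 for b in multibyte))   # True

# Contrast: in Shift-JIS the second byte of this kanji is 0x5C, a backslash.
print(b"\\" in "表".encode("shift_jis"))   # True
```

What still needs care is anything that treats a byte count as a character count: truncation, column widths, fixed-size buffers, and per-byte case conversion.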

breck, about 13 years ago

How could we avoid acronyms like 'utf-8'?

We can do better than that. Unicode8?

scoith, about 13 years ago

That page is misleading when it comes to Japanese text: UTF-8 sucks for Japanese text. UTF-8 and UTF-16 aren't the only two choices in the whole world, as their own choice of Shift-JIS for the comparison demonstrates.
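
The size difference being argued about is easy to measure. The Python sketch below uses an arbitrary sample of eight kana and kanji; it covers only raw text, so it says nothing about documents that mix in ASCII markup, where the ratio shifts toward UTF-8.

```python
text = "日本語のテキスト"  # 8 characters, all kana or kanji

sizes = {
    enc: len(text.encode(enc))
    for enc in ("shift_jis", "utf-8", "utf-16-le")
}
print(sizes)  # {'shift_jis': 16, 'utf-8': 24, 'utf-16-le': 16}
```

For pure Japanese text UTF-8 is about 1.5 times the size of Shift-JIS or UTF-16, which is the overhead the comment above objects to.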

fleitz, about 13 years ago

tl;dr: Use UTF-8 when you need to use Unicode with legacy APIs, never anywhere else.

UNIX isn't UTF-8 because UTF-8 is better; UNIX is UTF-8 because you can pass UTF-8 strings to functions that expect ASCII and it kinda works. This is really the only thing you need to know about UTF-8 and why it's better.

There are few pieces of software that don't have to talk to legacy APIs that store strings natively in UTF-8.

C# and Java are probably the best examples of software that was engineered from the ground up and thus uses UTF-16 internally, because it's much less likely to run into issues like String.length returning 32 for a string containing only 31 characters. If you use UTF-8, expect this result any time a string contains a real, genuine apostrophe.

"UTF-8 and UTF-32 result the same order when sorted lexicographically. UTF-16 does not."

This is complete and utter bullshit: to sort a string lexicographically you need to decode it, and if you've decoded the strings into Unicode then they sort the exact same way.

There are lots of gotchas for sorting Unicode strings, including normalization, because you can write semantically equivalent strings in Unicode multiple ways, e.g. ligatures.

If you're sorting bit strings that happen to contain UTF-8/32 then you're not sorting lexicographically and your results will be screwed up anyway.
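
The quoted claim is about binary (code unit) order of the encoded strings, not about locale-aware collation, and it can be checked directly. The Python sketch below compares two strings chosen for the purpose: one character near the top of the BMP and one outside it. It also measures the length of a typographic apostrophe in both encodings.

```python
bmp = "\uFF5E"         # FULLWIDTH TILDE, inside the BMP
astral = "\U0001F600"  # an emoji, outside the BMP
words = [astral, bmp]

by_code_point = sorted(words)  # Python compares str by code point
by_utf8 = sorted(words, key=lambda s: s.encode("utf-8"))
by_utf32 = sorted(words, key=lambda s: s.encode("utf-32-be"))
by_utf16 = sorted(words, key=lambda s: s.encode("utf-16-be"))

print(by_utf8 == by_code_point)   # True
print(by_utf32 == by_code_point)  # True
print(by_utf16 == by_code_point)  # False: surrogates sort below U+E000..U+FFFF

apostrophe = "\u2019"  # RIGHT SINGLE QUOTATION MARK
print(len(apostrophe.encode("utf-8")))           # 3 bytes
print(len(apostrophe.encode("utf-16-le")) // 2)  # 1 code unit
print(len(astral.encode("utf-16-le")) // 2)      # 2 code units
```

So a UTF-16 length also stops matching the character count once text leaves the BMP; neither encoding gives a character count for free, and neither binary order is a substitute for proper collation.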

alecco, about 13 years ago

ASCII and UTF-8 are too US-centric. That's why adoption in places like China is so low.

Also, if the encoding is variable-length anyway, why can't we just do it properly and improve size for the same computational cost?

natch, about 13 years ago

Strings (NSString) on Apple platforms are UTF-16. The Apple platforms are not exactly lagging behind in either multilingual support or text processing. I wonder what this team of three people knows that Apple doesn't? Or is it the other way around, that Apple knows something they don't, and when it comes to shipping products that work in the real world, Apple has figured out how to do it?
