科技回声

2 comments

gilgoomesh, over 11 years ago

This article deals mostly with Windows and is from 2003, so it fails to emphasise the current standard practice as much as it should:

    Use UTF-8 everywhere you can.

UTF-8 is:

* the most backwards compatible (it can be passed through many tools intended for ASCII-only text, with a few limitations, including avoiding composed Latin glyphs)
* the most likely to give an appropriate result if the end-user incorrectly interprets it
* the most space-efficient encoding (on average)
* free of endianness problems
* the de facto encoding for most Mac and Linux C APIs
* verifiable with a high degree of accuracy (unlike many other encodings, which can't be verified at all)

Specifically:

* If you have to pick an encoding, always try to use UTF-8 unless you're only storing text to pass into an API which requires something different.
* The Winapi (aka Win32) is the only commonly used API that regularly requires something other than UTF-8 (the Windows Unicode APIs use UTF-16, not UCS-2 as indicated in the Spolsky article). Windows' UTF-16 requirement is a pain for platform independence, so be careful. However, you should still aim to use UTF-8 for all text files on Windows and only use UTF-16 for the Windows API calls (never use the locale-specific non-Unicode encodings).
* There are a few language+environment combinations that literally can't open Unicode filenames. These include MinGW C++, which has no platform-independent way of opening file streams with Unicode filenames. You need to fall back to C _wfopen and UTF-16 to open files correctly.

Note: you don't always have to choose the encoding. For example, the Mac class NSString and the C# String class use UTF-16 internally, but you don't normally need to care what they do internally, since any time you access the internal characters you specify the desired encoding. You should usually extract characters in UTF-8.
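To illustrate the "verifiable" point above, here is a minimal sketch of checking whether a byte sequence is well-formed UTF-8. It assumes a runtime that provides the standard `TextEncoder` and `TextDecoder` globals (modern browsers, recent Node.js); the helper name `isValidUtf8` is made up for this example and is not from the comment.

```javascript
// Hypothetical helper (not from the original comment): returns true when
// the given bytes form well-formed UTF-8.
function isValidUtf8(bytes) {
  try {
    // With fatal: true, decode() throws a TypeError on malformed input
    // instead of silently substituting U+FFFD.
    new TextDecoder("utf-8", { fatal: true }).decode(bytes);
    return true;
  } catch (err) {
    return false;
  }
}

// "café" encoded as UTF-8: é becomes the two bytes 0xC3 0xA9.
const utf8Bytes = new TextEncoder().encode("café");

// "café" encoded as ISO-8859-1: é is the single byte 0xE9, which in UTF-8
// would start a multi-byte sequence that never completes.
const latin1Bytes = Uint8Array.from([0x63, 0x61, 0x66, 0xe9]);

console.log(isValidUtf8(utf8Bytes));   // true
console.log(isValidUtf8(latin1Bytes)); // false
```

A check like this only shows that the bytes are structurally valid UTF-8. Pure ASCII input always passes, and short non-UTF-8 inputs can occasionally pass by coincidence, which is why the claim is "a high degree of accuracy" rather than certainty.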

pygy_, over 11 years ago

A good summary, but for one important detail: in UTF-16, some code points (those lying on the so-called "astral planes", i.e. not on the "basic multilingual plane") take 32 bits.

The Emoji, for example, lie on the first higher plane: 🍒🎄🐰🚴. Firefox and Safari display them properly, Chrome doesn't, no idea for IE and Opera.

UCS-2 is a strict 16-bit encoding (a subset of UTF-16), and it cannot represent all characters.

It is the encoding used by JavaScript, which can be problematic when characters that need two 16-bit units are used. For example, `"🐙🐚🐛🐜🐝🐞🐟".length` is 14 even though there are only seven characters, and you could slice that string in the middle of a character.
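The length and slicing problem described above can be sketched as follows. This assumes an ES2015 or later engine, where string iteration walks code points rather than 16-bit units; the helper names `countCodePoints` and `sliceByCodePoint` are invented for this illustration.

```javascript
// Hypothetical helpers (names invented for this sketch).

// Counts code points rather than 16-bit code units.
function countCodePoints(str) {
  // Spreading a string uses its iterator, which yields whole code points.
  return [...str].length;
}

// Slices by code point index, so a surrogate pair is never cut in half.
function sliceByCodePoint(str, start, end) {
  return Array.from(str).slice(start, end).join("");
}

const animals = "🐙🐚🐛🐜🐝🐞🐟";

console.log(animals.length);           // 14 (UTF-16 code units)
console.log(countCodePoints(animals)); // 7  (code points)

// Naive slicing cuts the first emoji in half and leaves a lone surrogate.
console.log(animals.slice(0, 1) === "\ud83d"); // true

// Code-point-aware slicing keeps the character intact.
console.log(sliceByCodePoint(animals, 0, 1) === "🐙"); // true
```

Iterating by code point only addresses surrogate pairs. Sequences built from several code points (combining marks, flags, emoji joined with zero-width joiners) would still be split by this approach and need grapheme-level segmentation.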

What Every Software Developer Must Know About Unicode (2003)
