What Every Software Developer Must Know About Unicode (2003)

96 points, posted by jervisfm over 11 years ago

2 comments

gilgoomesh, over 11 years ago
This article deals mostly with Windows and is from 2003, so it fails to emphasise the current standard practice as much as it should:

    Use UTF-8 everywhere you can.

UTF-8 is:

* the most backwards compatible (it can be passed through many tools intended for ASCII-only text with a few limitations, including avoiding composed Latin glyphs)
* most likely to give an appropriate result if the end-user incorrectly interprets it
* the most space-efficient encoding (on average)
* free of endianness problems
* the de facto encoding for most Mac and Linux C APIs
* verifiable with a high degree of accuracy (unlike many other encodings, which can't be verified at all)

Specifically:

* If you have to pick an encoding, always try to use UTF-8 unless you're only storing text to pass into an API which requires something different.
* The WinAPI (aka Win32) is the only commonly used API that regularly requires something other than UTF-8 (the Windows Unicode APIs use UTF-16, not UCS-2 as indicated in the Spolsky article). Windows' UTF-16 requirement is a pain for platform independence, so be careful. However, you should still aim to use UTF-8 for all text files on Windows and only use UTF-16 for the Windows API calls (never use the locale-specific non-Unicode encodings).
* There are a few language+environment combinations that literally *can't* open Unicode filenames. These include MinGW C++, which has no platform-independent way of opening file streams with Unicode filenames. You need to fall back to C `_wfopen` and UTF-16 to open files correctly.

Note: you don't always have to choose the encoding. For example, the Mac class NSString and the C# String class use UTF-16 internally, but you don't normally need to care what they do internally, since any time you access the internal characters you specify the desired encoding. You should usually extract characters in UTF-8.
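
As a rough illustration of the "verifiable" point above, here is a minimal sketch in JavaScript. It assumes a runtime that provides the standard `TextDecoder` and `TextEncoder` globals (modern browsers and Node.js do). The helper name `isValidUtf8` and the sample strings are invented for this example and do not come from the comment.

```javascript
// Hypothetical helper: returns true if the bytes are well-formed UTF-8.
function isValidUtf8(bytes) {
  try {
    // With fatal: true, decode() throws a TypeError on malformed input
    // instead of silently substituting U+FFFD replacement characters.
    new TextDecoder("utf-8", { fatal: true }).decode(bytes);
    return true;
  } catch (err) {
    return false;
  }
}

// "naïve café" encoded as UTF-8 (TextEncoder always produces UTF-8).
const utf8Bytes = new TextEncoder().encode("naïve café");

// "café" as it would be stored in Latin-1: the final 0xE9 byte looks like
// the start of a multi-byte UTF-8 sequence that never finishes.
const latin1Bytes = Uint8Array.from([0x63, 0x61, 0x66, 0xe9]);

console.log(isValidUtf8(utf8Bytes));   // true
console.log(isValidUtf8(latin1Bytes)); // false
```

A check like this only says the bytes are structurally valid UTF-8. Pure ASCII input passes too, and a pass does not prove the author intended UTF-8, which is why the comment says "a high degree of accuracy" and not certainty.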
pygy_, over 11 years ago
A good summary, but for one important detail: in UTF-16, some code points (those lying on the so-called "astral planes", i.e. not on the "Basic Multilingual Plane") take 32 bits, encoded as a pair of 16-bit code units.

The emoji, for example, lie on the first higher plane: 🍒🎄🐰🚴. Firefox and Safari display them properly, Chrome doesn't; no idea about IE and Opera.

UCS-2 is a strict 16-bit encoding (a subset of UTF-16), and it cannot represent all characters.

It is the encoding used by JavaScript, which can be problematic when double-width characters are used. For example, `"🐙🐚🐛🐜🐝🐞🐟".length` is 14 even though there are only seven characters, and you could slice that string in the middle of a character.
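
The length and slicing problem described above can be reproduced with a short sketch. It assumes an ES2015-or-later JavaScript engine, where the string iterator walks code points instead of 16-bit code units. The helper `sliceByCodePoint` is a made-up name for illustration, not a built-in.

```javascript
const s = "🐙🐚🐛🐜🐝🐞🐟";

// .length counts 16-bit code units; each of these emoji needs two.
console.log(s.length);      // 14

// Spreading uses the string iterator, which yields whole code points.
console.log([...s].length); // 7

// slice() indexes code units, so it can cut a surrogate pair in half.
const broken = s.slice(0, 1);
const unit = broken.charCodeAt(0);
// The leftover unit is a lone high surrogate (range 0xD800 to 0xDBFF).
console.log(unit >= 0xd800 && unit <= 0xdbff); // true

// Hypothetical helper: slice by code point instead of by code unit.
function sliceByCodePoint(str, start, end) {
  return Array.from(str).slice(start, end).join("");
}

console.log(sliceByCodePoint(s, 0, 2)); // 🐙🐚
```

Counting code points is still not the same as counting what a reader sees as one character: combining marks and joined emoji sequences span several code points, so this only addresses the surrogate-pair issue raised in the comment.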