What Every Programmer Absolutely, Positively Needs To Know About Encodings (2011)

175 points by subset, over 3 years ago

7 comments

oshiar53-0, over 3 years ago
Fun fact: GB 18030 is a Unicode Transformation Format.

Example: \N{THINKING FACE}\N{FACE WITH TEARS OF JOY}\N{FACE SCREAMING IN FEAR}\N{SMILING FACE WITH SMILING EYES AND THREE HEARTS}\N{PERSON DOING CARTWHEEL}\N{FACE WITH NO GOOD GESTURE}\N{ZERO WIDTH JOINER}\N{FEMALE SIGN}\N{VARIATION SELECTOR-16}\N{EYES}\N{ON WITH EXCLAMATION MARK WITH LEFT RIGHT ARROW ABOVE}\N{SQUARED COOL}\N{VARIATION SELECTOR-16}

In UTF-8:

    00000000: f09f a494 f09f 9882 f09f 98b1 f09f a5b0  ................
    00000010: f09f a4b8 f09f 9985 e280 8de2 9980 efb8  ................
    00000020: 8ff0 9f91 80f0 9f94 9bf0 9f86 92ef b88f  ................

In GB 18030:

    00000000: 9530 cd34 9439 fc38 9530 8335 9530 d636  .0.4.9.8.0.5.0.6
    00000010: 9530 d130 9530 8535 8136 a439 a1e2 8431  .0.0.0.5.6.9...1
    00000020: 8235 9439 cf38 9439 e537 9439 8b32 8431  .5.9.8.9.7.9.2.1
    00000030: 8235                                     .5
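
The first two code points of the example can be reproduced with a short Python sketch. It assumes a Python build that ships the standard `gb18030` codec. The helper `gb18030_supplementary` is a hypothetical name introduced here, and it only covers code points at or above U+10000, where the four-byte form is a plain linear mapping.

```python
def gb18030_supplementary(code_point: int) -> bytes:
    """Four-byte GB 18030 form of a code point in U+10000..U+10FFFF.

    Illustrative helper (not part of any library): the mapping is a
    mixed-radix count starting at bytes 90 30 81 30 for U+10000.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only supplementary-plane code points are handled")
    index = code_point - 0x10000
    index, b4 = divmod(index, 10)    # fourth byte: 0x30..0x39
    index, b3 = divmod(index, 126)   # third byte:  0x81..0xFE
    b1, b2 = divmod(index, 10)       # second byte: 0x30..0x39
    return bytes([0x90 + b1, 0x30 + b2, 0x81 + b3, 0x30 + b4])


thinking_face = "\U0001F914"   # THINKING FACE
tears_of_joy = "\U0001F602"    # FACE WITH TEARS OF JOY

utf8_bytes = thinking_face.encode("utf-8")
gb_bytes = thinking_face.encode("gb18030")

print(utf8_bytes.hex())  # f09fa494
print(gb_bytes.hex())    # 9530cd34

# The arithmetic helper agrees with the built-in codec.
assert gb18030_supplementary(ord(thinking_face)) == gb_bytes
assert gb18030_supplementary(ord(tears_of_joy)) == tears_of_joy.encode("gb18030")

# Like UTF-8, GB 18030 round-trips the code point without loss.
assert gb_bytes.decode("gb18030") == thinking_face
```

The printed bytes match the start of the two hex dumps above.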
torstenvl, over 3 years ago
The world of text encodings is pretty insane, especially when you start getting into the realm of what seems like endless variations on multi-level encodings, like the bajillion different character set encodings for quwei/kuten CJK encodings.

I'm only scratching the surface right now, and just wrote a CPG 932 → Unicode lookup utility. https://pastebin.com/4PYmEjQZ

(For internal testing, forgive any sloppiness, but feel free to use the kuten table if you happen to have a niche project with one-way mapping. The tables themselves are facts and not creative expression, so should not be copyrightable, but I'm dedicating the project to the public domain anyway.)
bruce511, over 3 years ago
A detail skimmed over in the article, but one of significant importance, is this:

"One code point in Unicode does not necessarily map to one character on the screen."

A "character" can, and often does, get constructed from multiple code points.

This doesn't help an already complicated sorting issue (who knew "sort these names alphabetically" could be an ambiguous statement).
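
A minimal Python sketch of both points, using only built-ins. The sample strings and names are illustrative assumptions, not taken from the article. `len()` on a Python `str` counts code points, not what a reader would see as one character, and the default `sorted()` compares code point values rather than following any language's alphabet.

```python
# Each sample is normally displayed as a single glyph.
samples = {
    "e + combining acute accent": "e\u0301",
    "woman + zero width joiner + girl": "\U0001F469\u200D\U0001F467",
    "regional indicators J + P (a flag)": "\U0001F1EF\U0001F1F5",
}
code_point_counts = {label: len(text) for label, text in samples.items()}
for label, count in code_point_counts.items():
    print(f"{label}: {count} code points")

# Sorting by code point puts U+00E9 (é) after U+007A (z).
names = ["zebra", "\u00e9clair", "apple"]
by_code_point = sorted(names)
print(by_code_point)  # ['apple', 'zebra', 'éclair']
```

An ordering a human would call alphabetical needs locale-aware collation, which is outside what this sketch shows.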
seles, over 3 years ago
According to the article, PHP can handle other encodings by just treating strings as byte sequences and not caring what the encoding is. Their example:

$string = "漢字";

But if you are using, say, UTF-8 and one of those Chinese characters has a byte with the value 34 (the ASCII value of "), then wouldn't the string terminate prematurely?

Edit: to answer my own question, a quote from Wikipedia: ASCII bytes do not occur when encoding non-ASCII code points into UTF-8.
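
The Wikipedia statement can be checked directly with a small Python sketch. The variable names are illustrative. The Shift_JIS contrast at the end is an addition not found in the comment: it assumes Python's standard `shift_jis` codec and shows an encoding where the concern would be justified.

```python
text = "漢字"
encoded = text.encode("utf-8")
print(encoded.hex())  # e6bca2e5ad97

# Every byte of a non-ASCII code point in UTF-8 has its high bit set,
# so none of them can collide with the double quote (34) or any other
# ASCII byte.
all_high = all(byte >= 0x80 for byte in encoded)
contains_quote = ord('"') in encoded
print(all_high, contains_quote)  # True False

# Contrast (assumption added for illustration): in Shift_JIS the
# katakana SO (U+30BD) has 0x5C, the ASCII backslash, as its second byte.
sjis = "\u30bd".encode("shift_jis")
print(sjis.hex())  # 835c
```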
agumonkey, over 3 years ago
A nice complement to nedbat's https://nedbatchelder.com/text/unipain/unipain.html#1
badrabbit, over 3 years ago
Should mention endianness as well, and EBCDIC. There are big-endian versions of UTF-*.
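
A short Python sketch of what this looks like in practice. It assumes the standard `utf-16-be`, `utf-16-le`, `utf-16` and `cp037` codecs. `cp037` is one of several EBCDIC code pages and is picked here only as an example.

```python
text = "A"

big = text.encode("utf-16-be")      # most significant byte first
little = text.encode("utf-16-le")   # least significant byte first
print(big, little)                  # b'\x00A' b'A\x00'

# The plain "utf-16" codec writes a byte order mark (BOM) first so a
# decoder can tell which order follows. The order is platform-dependent.
with_bom = text.encode("utf-16")
bom = with_bom[:2]
print(bom in (b"\xff\xfe", b"\xfe\xff"))  # True

# EBCDIC is not ASCII-compatible: "A" is 0xC1, not 0x41.
ebcdic = text.encode("cp037")
print(ebcdic.hex())  # c1
```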
ncmncm, over 3 years ago
PHP-centric, not mentioned in the title. Most of it is relevant to everybody, but it is jarring to run into stuff about PHP. Isn't that dead yet?

Of greater moment is that the article keeps talking about "characters", which is an undefined term in Unicode. Unicode offers you code points, code units, graphemes, grapheme clusters, and ... other things, none of which maps to the grouping of dots you see on your screen (and probably cannot imagine how to type in).

"Character" has outlived its sell-by date. Let it be retired and buried with dignity, but with a good thick slab of concrete on top.

It also fails to mention "expanded form" and "canonical form", and other ways that two completely different sequences of bits mean, at some level, the same text. Different forms are convenient for different things; there is a shortest possible representation nice for sending and storing, and a maximally decomposed representation that might be best for editing if you like adding and removing diereses ("umlauts") and accents piecemeal.

And it fails to mention WTF-8, a way to package up byte sequences that are not valid UTF-8, but may have valid UTF-8 characters that you want to display in case they offer the poor human a clue as to what was intended. WTF-8 sequences often arise in file systems and databases that don't enforce any particular encoding, but just store whatever bytes the benighted programs users run provide as, e.g., names for files. You *wish* you could display them in sorted order. There had better be a way to point at it, because there is no way to type it. But you have to store it, because that is the only way to tell the OS which file you wanted to rename or delete. Deletion is tempting, but we can't always, can we?
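
Two of these points can be sketched in Python. The first part uses `unicodedata.normalize` with the NFC (composed) and NFD (decomposed) forms, which appear to be what the comment calls "canonical" and "expanded" forms. The second part uses Python's `surrogateescape` error handler, which is Python's own mechanism for carrying undecodable bytes through a `str` and is not WTF-8. It stands in here only to illustrate the same file-name problem. The sample file name is an assumption.

```python
import unicodedata

composed = "\u00e9"        # é as one code point
decomposed = "e\u0301"     # e followed by a combining acute accent

# Different code point sequences, same text once normalized.
raw_equal = composed == decomposed
nfc_equal = unicodedata.normalize("NFC", decomposed) == composed
nfd_equal = unicodedata.normalize("NFD", composed) == decomposed
print(raw_equal, nfc_equal, nfd_equal)  # False True True

# A file name stored as Latin-1 bytes is not valid UTF-8.
raw_name = b"caf\xe9.txt"

# errors="replace" is fine for display but loses the original byte.
display_name = raw_name.decode("utf-8", errors="replace")

# errors="surrogateescape" keeps the bad byte as a lone surrogate, so
# the exact bytes can be handed back to the OS later.
carried = raw_name.decode("utf-8", errors="surrogateescape")
restored = carried.encode("utf-8", errors="surrogateescape")
print(restored == raw_name)  # True
```

The lossless form is the one to keep for rename or delete calls, and the replaced form is only for showing the name to a person.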