TechEcho

What Every Programmer Absolutely, Positively Needs To Know About Encodings (2011)

175 points by subset over 3 years ago

7 comments

oshiar53-0 over 3 years ago

Fun fact: GB 18030 is a Unicode Transformation Format.

Example: \N{THINKING FACE}\N{FACE WITH TEARS OF JOY}\N{FACE SCREAMING IN FEAR}\N{SMILING FACE WITH SMILING EYES AND THREE HEARTS}\N{PERSON DOING CARTWHEEL}\N{FACE WITH NO GOOD GESTURE}\N{ZERO WIDTH JOINER}\N{FEMALE SIGN}\N{VARIATION SELECTOR-16}\N{EYES}\N{ON WITH EXCLAMATION MARK WITH LEFT RIGHT ARROW ABOVE}\N{SQUARED COOL}\N{VARIATION SELECTOR-16}

In UTF-8:

    00000000: f09f a494 f09f 9882 f09f 98b1 f09f a5b0  ................
    00000010: f09f a4b8 f09f 9985 e280 8de2 9980 efb8  ................
    00000020: 8ff0 9f91 80f0 9f94 9bf0 9f86 92ef b88f  ................

In GB 18030:

    00000000: 9530 cd34 9439 fc38 9530 8335 9530 d636  .0.4.9.8.0.5.0.6
    00000010: 9530 d130 9530 8535 8136 a439 a1e2 8431  .0.0.0.5.6.9...1
    00000020: 8235 9439 cf38 9439 e537 9439 8b32 8431  .5.9.8.9.7.9.2.1
    00000030: 8235                                     .5
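
The first four bytes of each dump can be reproduced with a short Python sketch. It assumes Python's built-in `utf-8` and `gb18030` codecs and encodes only the first emoji from the example; the variable names are illustrative.

```python
# Encode the same code point with two Unicode Transformation Formats.
text = "\N{THINKING FACE}"  # U+1F914, the first character in the example

utf8_bytes = text.encode("utf-8")
gb_bytes = text.encode("gb18030")

print(utf8_bytes.hex(" "))  # f0 9f a4 94
print(gb_bytes.hex(" "))    # 95 30 cd 34

# Decoding with the matching codec recovers the original text.
assert utf8_bytes.decode("utf-8") == text
assert gb_bytes.decode("gb18030") == text
```

Both encodings use four bytes for this code point, but the byte values differ, so a decoder has to know which encoding it is reading.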
torstenvl over 3 years ago

The world of text encodings is pretty insane, especially when you start getting into the realm of what seems like endless variations on multi-level encodings, like the bajillion different character set encodings for quwei/kuten CJK encodings.

I'm only scratching the surface right now, and just wrote a CPG 932 → Unicode lookup utility. https://pastebin.com/4PYmEjQZ

(For internal testing, forgive any sloppiness, but feel free to use the kuten table if you happen to have a niche project with one-way mapping. The tables themselves are facts and not creative expression, so should not be copyrightable, but I'm dedicating the project to the public domain anyway.)
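
For readers unfamiliar with kuten addressing, here is a minimal sketch of a one-way kuten → Unicode lookup. It does not reproduce the linked utility. It leans on Python's built-in `euc_jp` and `cp932` codecs instead of a hand-built table, and the helper name `kuten_to_unicode` is made up for illustration. It assumes a JIS X 0208 row (ku) and cell (ten), each in the range 1..94.

```python
def kuten_to_unicode(ku: int, ten: int) -> str:
    """Map a JIS X 0208 row/cell pair to a Unicode string (hypothetical helper)."""
    if not (1 <= ku <= 94 and 1 <= ten <= 94):
        raise ValueError("ku and ten must both be in 1..94")
    # EUC-JP stores a JIS X 0208 character as the byte pair (0xA0 + ku, 0xA0 + ten).
    return bytes([0xA0 + ku, 0xA0 + ten]).decode("euc_jp")


# Row 16, cell 1 is the first kanji in JIS X 0208.
ch = kuten_to_unicode(16, 1)
print(ch, f"U+{ord(ch):04X}")

# The same character as stored in code page 932 (Shift JIS family).
print(ch.encode("cp932").hex())
```

Unassigned cells raise `UnicodeDecodeError`, which a real lookup tool would need to handle.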
bruce511 over 3 years ago

A detail skimmed over in the article, but one of significant importance, is that:

"one code point in Unicode does not necessarily map to one character on the screen."

A "character" can be, and often is, constructed from multiple code points.

This doesn't help an already complicated sorting issue (who knew "sort these names alphabetically" could be an ambiguous statement).
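
A small Python sketch of both points, using only the standard library. The sample names and the `rough_key` helper are invented for illustration, and the key function is a crude stand-in for real locale-aware collation, not a replacement for it.

```python
import unicodedata

# One user-perceived character built from two code points.
e_acute = "e\N{COMBINING ACUTE ACCENT}"
print(len(e_acute))  # 2

# A ZWJ emoji sequence: four code points, typically drawn as one glyph.
gesture = (
    "\N{FACE WITH NO GOOD GESTURE}\N{ZERO WIDTH JOINER}"
    "\N{FEMALE SIGN}\N{VARIATION SELECTOR-16}"
)
print(len(gesture))  # 4

names = ["Zoe", "\u00c9mile", "Zo\u00eb", "emil"]  # Zoe, Émile, Zoë, emil

# Plain sorted() compares code points, so uppercase sorts before lowercase
# and accented letters sort after unaccented ones.
by_code_point = sorted(names)


def rough_key(s: str) -> str:
    """Crude, locale-free key: decompose, drop combining marks, fold case."""
    decomposed = unicodedata.normalize("NFD", s)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.casefold()


by_rough_key = sorted(names, key=rough_key)
print(by_code_point)
print(by_rough_key)
```

The two orderings disagree, and neither is "the" alphabetical order: Swedish, for example, sorts ä after z, so the right answer depends on the reader's locale.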
seles over 3 years ago

According to the article, PHP can handle other encodings by just treating strings as byte sequences and not caring what the encoding is. Their example:

$string = "漢字";

But if you are using, say, UTF-8 and one of those Chinese characters has a byte with the value 34 (the ASCII value of "), then wouldn't the string terminate prematurely?

Edit: to answer my own question, quote from Wikipedia: ASCII bytes do not occur when encoding non-ASCII code points into UTF-8
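
The Wikipedia statement is easy to check. The sketch below uses Python instead of PHP, and the Shift JIS contrast is an addition that does not come from the thread; it shows an encoding where the concern would be justified.

```python
text = "漢字"

utf8 = text.encode("utf-8")
# Every byte of a multi-byte UTF-8 sequence has its high bit set,
# so none of them can be mistaken for '"' (0x22, decimal 34).
assert all(b >= 0x80 for b in utf8)
assert ord('"') not in utf8

# Contrast: in Shift JIS the second byte of a character may fall in
# the ASCII range. Here it equals the backslash, 0x5C.
sjis = "表".encode("shift_jis")
print(sjis.hex())  # 955c
assert sjis[1] == ord("\\")
```

A byte-oriented parser that knows nothing about the encoding is therefore safe with UTF-8 source text, but can misread the Shift JIS bytes as containing an escape character.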
agumonkey over 3 years ago

A nice complement to nedbat's https://nedbatchelder.com/text/unipain/unipain.html#1
badrabbit over 3 years ago

Should mention endianness as well, and EBCDIC. There are big-endian versions of UTF-*
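
To make the byte-order point concrete, here is a short Python sketch using codecs from the standard library. The choice of `cp037` as the EBCDIC example is an assumption for illustration; it is one of several EBCDIC code pages.

```python
ch = "A"  # U+0041

# UTF-16 and UTF-32 use multi-byte code units, so byte order matters.
print(ch.encode("utf-16-be").hex())  # 0041
print(ch.encode("utf-16-le").hex())  # 4100
print(ch.encode("utf-32-be").hex())  # 00000041
print(ch.encode("utf-32-le").hex())  # 41000000

# UTF-8 is a sequence of single bytes, so it has no byte-order variants.
print(ch.encode("utf-8").hex())      # 41

# A byte order mark (U+FEFF) at the start lets a decoder detect the order.
with_bom = "\ufeffA".encode("utf-16-be")
assert with_bom == b"\xfe\xff\x00A"
assert with_bom.decode("utf-16") == "A"

# EBCDIC is not ASCII-compatible: "A" is 0xC1, not 0x41.
print(ch.encode("cp037").hex())      # c1
```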
ncmncm over 3 years ago

PHP-centric, not mentioned in the title. Most of it is relevant to everybody, but it is jarring to run into stuff about PHP. Isn't that dead yet?

Of greater moment is that the article keeps talking about "characters", which is an undefined term in Unicode. Unicode offers you code points, code units, graphemes, grapheme clusters, and ... other things, none of which maps to the grouping of dots you see on your screen (and probably cannot imagine how to type in).

"Character" has outlived its sell-by date. Let it be retired and buried with dignity, but with a good thick slab of concrete on top.

It also fails to mention "expanded form" and "canonical form", and other ways that two completely different sequences of bits mean, at some level, the same text. Different forms are convenient for different things; there is a shortest possible representation nice for sending and storing, and a maximally decomposed representation that might be best for editing if you like adding and removing diereses ("umlauts") and accents piecemeal.

And it fails to mention WTF-8, a way to package up byte sequences that are not valid UTF-8, but may have valid UTF-8 characters that you want to display in case they offer the poor human a clue as to what was intended. WTF-8 sequences often arise in file systems and databases that don't enforce any particular encoding, but just store whatever bytes the benighted programs users run provide as, e.g., names for files. You *wish* you could display them in sorted order. There had better be a way to point at it, because there is no way to type it. But you have to store it, because that is the only way to tell the OS which file you wanted to rename or delete. Deletion is tempting, but we can't, always, can we?
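
The two forms described above presumably correspond to what Unicode calls the NFD (decomposed) and NFC (composed) normalization forms; that mapping is an assumption about the commenter's wording. The sketch below shows them with Python's `unicodedata` module. Its second half shows Python's `surrogateescape` error handler, which is not WTF-8 but a different mechanism for a similar problem: carrying a byte string that is not valid UTF-8 through text-handling code without losing it. The file name is invented.

```python
import unicodedata

composed = "\u00fc"     # ü as one code point
decomposed = "u\u0308"  # u followed by COMBINING DIAERESIS

# Different code point sequences, same text after normalization.
assert composed != decomposed
assert unicodedata.normalize("NFC", decomposed) == composed
assert unicodedata.normalize("NFD", composed) == decomposed

# In decomposed form the diaeresis can be removed on its own.
assert decomposed[:-1] == "u"

# A file name that is not valid UTF-8 (these are Latin-1 bytes).
raw_name = b"caf\xe9.txt"

# surrogateescape maps each undecodable byte to a lone surrogate,
# so the exact original bytes can be recovered later.
as_text = raw_name.decode("utf-8", errors="surrogateescape")
round_tripped = as_text.encode("utf-8", errors="surrogateescape")
assert round_tripped == raw_name
```

The decoded name still shows its valid parts (`caf` and `.txt`), and re-encoding it with the same handler yields the original bytes, which is what the OS needs in order to rename or delete the file.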