Character Encoding and UTF-8

17 points by fredrb almost 3 years ago

4 comments

tialaramex almost 3 years ago
There's some confused or misunderstood stuff in here but I guess it's just somebody doing the equivalent of musing out loud about what they've learned? I noticed:

Those aren't "ASCII code pages". "Code pages" are a way to talk about character encodings, mostly these days used by Microsoft in its Windows operating systems, but historically because IBM's manuals would dedicate a whole page to each such encoding. They aren't "ASCII" code pages, although many of them reserve the first 128 codes for the same things ASCII put there.

"The upper 128 bits from the ASCII table" is presumably a mistake and means the upper 128 code values maybe?

It's called "us-ascii" because that's the name IANA assigned to the ASCII encoding. IANA keeps registries of a lot of stuff... here's the one with character sets in it: https://www.iana.org/assignments/character-sets/character-sets.xhtml
aorth almost 3 years ago
> If someone from Brazil writes a message using the letter é to multiple people, they would read as ة in Arabic, и in Russian and as a corner pipe (╔) if they’re using IBM’s code page 850.

Apologies for the minor nitpick: и is in Cyrillic, not Russian. Cyrillic is the script, Russian is the language. There are other languages that use Cyrillic besides Russian (and the script itself was developed around Greece/Bulgaria before Russia even existed).

For Arabic it's the language and the script so you're OK there!
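For anyone who wants to see the quoted effect, here is a minimal Python sketch. One assumption to flag: the byte is taken to be 0xC9, because that is the value that reproduces the three characters in the quote; in Latin-1 that byte is the uppercase É rather than é. The codec names are Python's built-in ones, and `decode_with` is just an illustrative helper.

```python
def decode_with(raw, codecs):
    """Decode the same bytes under several single-byte encodings."""
    return {name: raw.decode(name) for name in codecs}


# Assumed byte value (0xC9); the quoted article does not state it.
raw = b"\xc9"
codecs = ("latin-1", "iso8859_6", "koi8_r", "cp850")

for name, text in decode_with(raw, codecs).items():
    print(f"{name:10} -> {text}")
```

Each codec maps the same byte to a different character, which is why the receiver has to know (or guess) which code page the sender used.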
svnpenn almost 3 years ago
> You need to know the encoding of any text, otherwise it’s impossible to decipher the message (although it’s common for applications to assume the encoding).

Even with the caveat in parentheses, this is quite misleading. For example, the following line is some text, with no specified encoding:

> hello world

Now, while it's true this could be some exotic encoding, or maybe just random binary data, I wouldn't call it impossible to decipher. More accurate would be "impossible to decipher with 100% certainty". The same issue exists with Protocol Buffers, or any format that is not self-describing. The data is not a black box, it's just annoying to deal with.
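A rough sketch of what "decipherable, but not with certainty" can look like in practice, in Python. `guess_encoding` is a hypothetical helper, not any library's API, and it only knows two guesses: pure 7-bit input is called ASCII, and anything that survives a strict UTF-8 decode is called UTF-8. Everything else is left undecided.

```python
from typing import Optional


def guess_encoding(data: bytes) -> Optional[str]:
    """Best-effort guess; None means neither check passed."""
    # No byte has the high bit set: treat it as ASCII.
    if all(b < 0x80 for b in data):
        return "ascii"
    # bytes.decode is strict by default, so invalid UTF-8 raises.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return None


print(guess_encoding(b"hello world"))
print(guess_encoding("é".encode("utf-8")))
print(guess_encoding(b"\xe9"))
```

The result is still only a guess: the same bytes could have been produced under some other encoding, which is the "100% certainty" caveat.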
jcranmer almost 3 years ago
There's... a lot of wrong stuff here. Tackling some of the highlights:

> ASCII code pages map the upper 128 positions (0x7F:0xFF) of the ASCII byte. Each page holds a different character set. This is one way internationalisation can be achieved.

This is at best a poor explanation, and at worst outright wrong. The actual key thing is charset--there's a wide variety of charsets. Because ASCII is an inherently 7-bit charset, a lot of charsets were created by setting the first 128 characters to be ASCII and mapping different characters into the remaining 128 positions. IBM (I believe) came up with the term 'code page' to refer to the different character sets they came up with.

> Unicode provides a unique code for *every* character, regardless of the language.

That's not really true. Unicode keeps track of "code points". Several code points may together make up what we think of as a character--consider that something like à can consist of either a precomposed "à" code point or an "a" + "` diacritic" sequence. Thus there's an entire concern about Unicode normalization that a lot of people prefer to sweep under the rug.

> When creating a new file using touch, your computer will interpret that file as binary file.

Okay, what's happening here is you've got a command here, the file command, whose entire job is to look at a file and *guess* what the contents of that file are. For text files, part of that guessing process often involves *guessing* what the character encoding of the file is. That guessing is not always correct--there's the infamous "the printer can't print on Tuesdays" bug that was caused by the date string in the printer file, on Tuesdays, causing the file command to think it was an entirely different type of file [1]. There's another famous bug where starting a text file with a 4-letter word, two three-letter words, and another 4-letter word would cause Notepad to think the text file was in UTF-16 instead of ASCII [2].

With regards to guessing charsets, this is not always a particularly feasible process. Some charsets are more reliable to guess than others are. UTF-8, for example, tends to stick out--continuation bytes form a pattern that most charsets are unlikely to keep up with for long. Guessing ASCII for text that contains no 8-bit values set is pretty safe, since almost every charset is designed with ASCII-subset-safety in mind, and those that aren't (EBCDIC, UTF-7, UTF-16/UTF-32) are found in relatively constrained environments [3].

[1] https://beza1e1.tuxen.de/lore/print_on_tuesday.html

[2] https://en.wikipedia.org/wiki/Bush_hid_the_facts

[3] ISO-2022-* charsets are mode-switching, relying on the ESC character as part of the sequence to switch to different encodings. So you also have to consider the ESC character as a non-7-bit encoding for reliable ASCII detection.
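To make the code point remark earlier in this comment concrete, here is a small Python sketch using the standard `unicodedata` module. It only shows the à case from the comment; the variable names are illustrative.

```python
import unicodedata

precomposed = "\u00e0"   # à as a single code point
decomposed = "a\u0300"   # "a" followed by COMBINING GRAVE ACCENT

# Both render as à, but they are different code point sequences.
print(precomposed == decomposed)
print(len(precomposed), len(decomposed))

# Normalizing both sides to the same form makes them compare equal.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", precomposed)
print(nfc == precomposed, nfd == decomposed)
```

Comparing or hashing text without picking a normalization form first is where this usually bites.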