TechEcho

4 comments

tialaramex almost 3 years ago

There's some confused or misunderstood stuff in here, but I guess it's just somebody doing the equivalent of musing out loud about what they've learned? I noticed:

Those aren't "ASCII code pages". "Code pages" are a way to talk about character encodings. These days the term is mostly used by Microsoft in its Windows operating systems, but it exists historically because IBM's manuals would dedicate a whole page to each such encoding. They aren't "ASCII" code pages, although many of them reserve the first 128 codes for the same things ASCII put there.

"The upper 128 bits from the ASCII table" is presumably a mistake and means the upper 128 code values, maybe?

It's called "us-ascii" because that's the name IANA assigned to the ASCII encoding. IANA keeps registries of a lot of stuff... here's the one with character sets in it: https://www.iana.org/assignments/character-sets/character-sets.xhtml
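
A minimal Python sketch of the two points above, not taken from the article: it checks that Python resolves the IANA name "us-ascii" to its own ASCII codec, and that a few code pages agree with ASCII on the first 128 code values while differing on the upper 128. The list `CODE_PAGES` and the helper names are arbitrary choices for illustration.

```python
import codecs

# Python resolves the IANA name "us-ascii" to its own canonical codec name.
canonical_name = codecs.lookup("us-ascii").name

# The 128 code values that ASCII defines.
lower_half = bytes(range(0x00, 0x80))
# The 128 code values of an 8-bit byte that ASCII does not define.
upper_half = bytes(range(0x80, 0x100))

# An arbitrary sample of code pages, chosen only for illustration.
CODE_PAGES = ["cp437", "cp850", "cp1252", "koi8_r"]


def lower_half_matches_ascii(encoding: str) -> bool:
    """True if the encoding decodes bytes 0x00..0x7F exactly as ASCII does."""
    return lower_half.decode(encoding) == lower_half.decode("ascii")


def upper_half_text(encoding: str) -> str:
    """Decode bytes 0x80..0xFF; unassigned values become U+FFFD."""
    return upper_half.decode(encoding, errors="replace")


if __name__ == "__main__":
    print("us-ascii resolves to:", canonical_name)
    for page in CODE_PAGES:
        print(page, lower_half_matches_ascii(page), upper_half_text(page)[:16])
```

Every page in the sample shares the lower half with ASCII, and each assigns something different to the upper half, which is the sense in which they are code pages rather than "ASCII code pages".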


aorth almost 3 years ago

> If someone from Brazil writes a message using the letter é to multiple people, they would read as ة in Arabic, и in Russian and as a corner pipe (╔) if they’re using IBM’s code page 850.

Apologies for the minor nitpick: и is in Cyrillic, not Russian. Cyrillic is the script, Russian is the language. There are other languages that use Cyrillic besides Russian (and the script itself was developed around Greece/Bulgaria before Russia even existed).

For Arabic it's the language and the script so you're OK there!
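
The effect in the quoted sentence can be reproduced with a short Python sketch. One assumption to flag: the three characters in the quote line up with the byte 0xC9, which is uppercase É in ISO-8859-1 (lowercase é is 0xE9), so the sketch uses 0xC9. The choice of ISO-8859-6 for Arabic and KOI8-R for Cyrillic is also a guess at what the article had in mind; the helper name is made up for illustration.

```python
# One byte, four interpretations. The encodings are assumptions about
# which character sets the quoted sentence refers to.
RAW = bytes([0xC9])

ENCODINGS = {
    "latin-1": "Western European (ISO-8859-1)",
    "iso8859_6": "Arabic (ISO-8859-6)",
    "koi8_r": "Cyrillic (KOI8-R)",
    "cp850": "IBM code page 850",
}


def decode_under_each(raw: bytes) -> dict:
    """Decode the same bytes under every encoding listed in ENCODINGS."""
    return {name: raw.decode(name) for name in ENCODINGS}


if __name__ == "__main__":
    for name, text in decode_under_each(RAW).items():
        print(f"{ENCODINGS[name]:32} {text}")
```

The byte on the wire never changes; only the table used to read it does.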


svnpenn almost 3 years ago

> You need to know the encoding of any text, otherwise it’s impossible to decipher the message (although it’s common for applications to assume the encoding).

Even with the caveat in parentheses, this is quite misleading. For example, the following line is some text, with no specified encoding:

> hello world

Now, while it's true this could be some exotic encoding, or maybe just random binary data, I wouldn't call it impossible to decipher. More accurate would be "impossible to decipher with 100% certainty". The same issue exists with Protocol Buffers, or any format that is not self-describing. The data is not a black box, it's just annoying to deal with.
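
A rough Python sketch of that argument, under the assumption that "deciphering" means trying a handful of candidate encodings and seeing which ones agree. The candidate list is arbitrary, and `cp500` (an EBCDIC variant) is included only as an example of an encoding that is not ASCII-compatible.

```python
# Candidate encodings to try; the selection is illustrative, not exhaustive.
CANDIDATES = ["ascii", "utf-8", "latin-1", "cp1252", "cp500"]


def decode_under(data: bytes, encodings: list) -> dict:
    """Map each encoding to the decoded text, or None if decoding fails."""
    results = {}
    for encoding in encodings:
        try:
            results[encoding] = data.decode(encoding)
        except UnicodeDecodeError:
            results[encoding] = None
    return results


results = decode_under(b"hello world", CANDIDATES)

# Encodings under which the bytes read as the obvious English text.
agreeing = [enc for enc, text in results.items() if text == "hello world"]

if __name__ == "__main__":
    for encoding, text in results.items():
        print(f"{encoding:8} {text!r}")
    print("agree on 'hello world':", agreeing)
```

Four of the five candidates give the same readable text, which is strong evidence but not proof; the EBCDIC reading shows why certainty is out of reach without a label.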


jcranmer almost 3 years ago

There's... a lot of wrong stuff here. Tackling some of the highlights:

> ASCII code pages map the upper 128 positions (0x7F:0xFF) of the ASCII byte. Each page holds a different character set. This is one way internationalisation can be achieved.

This is at best a poor explanation, and at worst outright wrong. The actual key thing is charset--there's a wide variety of charsets. Because ASCII is an inherently 7-bit charset, a lot of charsets were created by setting the first 128 characters to be ASCII and mapping different characters into the remaining byte values. IBM (I believe) came up with the term 'code page' to refer to the different character sets they came up with.

> Unicode provides a unique code for every character, regardless of the language.

That's not really true. Unicode keeps track of "code points". Several code points may together make up what we think of as a character--consider that something like à can consist of either a precomposed "à" code point or an "a" + "` diacritic" sequence. Thus there's an entire concern about Unicode normalization that a lot of people prefer to sweep under the rug.

> When creating a new file using touch, your computer will interpret that file as binary file.

Okay, what's happening here is you've got a command here, the file command, whose entire job is to look at a file and guess what the contents of that file are. For text files, part of that guessing process often involves guessing what the character encoding of the file is. That guessing is not always correct--there's the infamous "the printer can't print on Tuesdays" bug that was caused by the date string in the printer file, on Tuesdays, causing the file command to think it was an entirely different type of file [1].
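
To make the normalization point above concrete, here is a small Python sketch using the standard `unicodedata` module. The variable names are made up for illustration; U+00E0 is the precomposed à and U+0300 is the combining grave accent.

```python
import unicodedata

# Two ways to spell the same user-visible character "à".
precomposed = "\u00e0"   # single code point: LATIN SMALL LETTER A WITH GRAVE
decomposed = "a\u0300"   # "a" followed by COMBINING GRAVE ACCENT

# They render alike but are different code point sequences.
same_without_normalizing = precomposed == decomposed

# NFC composes where possible; NFD decomposes where possible.
nfc_of_decomposed = unicodedata.normalize("NFC", decomposed)
nfd_of_precomposed = unicodedata.normalize("NFD", precomposed)

if __name__ == "__main__":
    print("equal as-is:", same_without_normalizing)
    print("lengths:", len(precomposed), len(decomposed))
    print("equal after NFC:", nfc_of_decomposed == precomposed)
    print("equal after NFD:", nfd_of_precomposed == decomposed)
```

Comparing, hashing, or searching text without choosing a normalization form first is where this tends to cause trouble.
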

There's another famous bug where starting a text file with a 4-letter word, two three-letter words, and another 4-letter word would cause Notepad to think the text file was in UTF-16 instead of ASCII [2].

With regards to guessing charsets, this is not always a particularly feasible process. Some charsets are more reliable to guess than others are. UTF-8, for example, tends to stick out--continuation bytes form a pattern that most charsets are unlikely to keep up with for long. Guessing ASCII for text in which no byte has the 8th bit set is pretty safe, since almost every charset is designed with ASCII-subset-safety in mind, and those that aren't (EBCDIC, UTF-7, UTF-16/UTF-32) are found in relatively constrained environments [3].

[1] https://beza1e1.tuxen.de/lore/print_on_tuesday.html

[2] https://en.wikipedia.org/wiki/Bush_hid_the_facts

[3] ISO-2022-* charsets are mode-switching, relying on the ESC character as part of the sequence to switch to different encodings. So for reliable ASCII detection you also have to treat the ESC character as a sign that the text may not be plain ASCII.
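
The guessing rules described above can be sketched in a few lines of Python. This is a toy heuristic under stated assumptions, not what the file command or any real detector does: the function name and the returned labels are invented here, and it only distinguishes plain ASCII, 7-bit data containing ESC (per footnote [3]), valid UTF-8, and everything else.

```python
ESC = 0x1B  # escape byte used by ISO-2022-* mode switching


def guess_encoding(data: bytes) -> str:
    """Very rough guess at the encoding of some bytes (illustrative only)."""
    if all(byte < 0x80 for byte in data):
        # 7-bit clean. ESC hints at a mode-switching ISO-2022-* charset.
        if ESC in data:
            return "7-bit, possibly ISO-2022"
        return "ascii"
    try:
        # Strict decoding fails unless lead and continuation bytes line up,
        # which text in other 8-bit charsets rarely manages for long.
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "unknown 8-bit"
    return "utf-8"


if __name__ == "__main__":
    samples = [
        b"hello world",
        "h\u00e9llo".encode("utf-8"),
        "h\u00e9llo".encode("latin-1"),
        "\u65e5\u672c\u8a9e".encode("iso2022_jp"),
    ]
    for sample in samples:
        print(sample, "->", guess_encoding(sample))
```

A successful UTF-8 decode is still only a guess; short inputs in particular can be valid UTF-8 by accident.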

Character Encoding and UTF-8
