It reminds me of a fun fact, which I noticed when writing a TTF font parser <a href="https://github.com/photopea/Typr.js" rel="nofollow">https://github.com/photopea/Typr.js</a><p>TTF files have a 4-byte field, where the font manufacturer can put "information about himself" (like the identification). Adobe puts the ASCII string "ADBE" into these four bytes.<p>There is another field for the font manufacturer, which has only two bytes. Guess what Adobe puts into these two bytes? 0xadbe :D
Am I the only one struggling to understand the hex dump? The author says "0x79 is the z in the ASCII table." That's wrong. 'z' is 0x7a.<p>The author also says "In UTF-8 all characters after 0x79 are at least two bytes long." That's also wrong. All characters after 0x7f get encoded as two or more bytes.
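To make the correction above easy to check, here is a minimal Python sketch. The helper name `utf8_len` is made up for illustration; it only relies on the built-in `chr` and `str.encode`.

```python
def utf8_len(code_point: int) -> int:
    """Return how many bytes UTF-8 needs to encode one code point."""
    return len(chr(code_point).encode("utf-8"))


# Code points around the ASCII boundary (0x7f is the last single-byte one).
for cp in (0x79, 0x7A, 0x7F, 0x80, 0x7FF, 0x800):
    print(f"U+{cp:04X} {chr(cp)!r}: {utf8_len(cp)} byte(s)")
```

0x79 comes out as 'y' and 0x7a as 'z', and the jump from one byte to two happens between 0x7f and 0x80.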
So what most probably happened is that FileReader.readAsBinaryString()° defaults to FileReader.readAsText()°° since it's deprecated. At least that is what I saw in Chromium. As soon as I used readAsArrayBuffer the problem went away.<p>°) <a href="https://developer.mozilla.org/en-US/docs/Web/API/FileReader/readAsBinaryString" rel="nofollow">https://developer.mozilla.org/en-US/docs/Web/API/FileReader/...</a><p>°°) <a href="https://developer.mozilla.org/en-US/docs/Web/API/FileReader/readAsText" rel="nofollow">https://developer.mozilla.org/en-US/docs/Web/API/FileReader/...</a>
If you dive head-first into Python's string behaviour, you'll eventually learn the hard way, via UnicodeDecodeError, what the difference is between a stream of bytes/octets and a text made of Unicode code points.
Much the same as a timestamp without a timezone is not worth much, a text stored as a stream of bytes is not worth much unless you know which encoding it is in.
PHP also has nice footguns in that area.
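A small Python sketch of the bytes-versus-text point above, in case it helps. The sample string and the helper `decode_or_none` are made up for illustration. The same six bytes give three different results depending on which encoding you assume.

```python
from typing import Optional


def decode_or_none(data: bytes, encoding: str) -> Optional[str]:
    """Decode bytes with the given encoding; return None if they are invalid in it."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# Five characters of text become six bytes, because "ï" takes two bytes in UTF-8.
raw = "naïve".encode("utf-8")

for enc in ("utf-8", "ascii", "latin-1"):
    print(enc, "->", repr(decode_or_none(raw, enc)))
```

Decoding as ASCII fails outright, while Latin-1 "succeeds" and silently produces the wrong text, which is arguably the worse outcome.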
Sounds like you're confusing the integer code point and the integer representation of characters.<p>Many programming languages internally represent chars as UTF-8 or UTF-16, so when using libraries to read bytes into chars everything gets mangled.<p>Check out this guide for a more in-depth look at the mangling that can happen. <a href="http://cweb.github.io/unicode-security-guide/background/" rel="nofollow">http://cweb.github.io/unicode-security-guide/background/</a>
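The mangling described above can be reproduced in a few lines of Python. This is only a sketch under assumptions: the eight sample bytes are an arbitrary JPEG-like header picked for illustration, not data from the article, and Python's `errors="replace"` handler stands in for whatever lenient decoder a browser or library might apply.

```python
# Arbitrary binary sample (JPEG-like header bytes, chosen only for illustration).
original = bytes([0xFF, 0xD8, 0xFF, 0xE0, 0x00, 0x10, 0x4A, 0x46])

# Treating binary data as UTF-8 text: invalid bytes become U+FFFD.
as_text = original.decode("utf-8", errors="replace")
round_tripped = as_text.encode("utf-8")

# A one-byte-per-character codec such as Latin-1 maps every byte to one
# code point, so the round trip is lossless.
latin1_round_trip = original.decode("latin-1").encode("latin-1")

print("original:     ", original.hex())
print("via UTF-8:    ", round_tripped.hex())
print("via Latin-1:  ", latin1_round_trip.hex())
```

Once the bytes have gone through the lenient UTF-8 decode, the original values are gone for good: every invalid byte has turned into the same replacement character, so no later re-encoding can recover them.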
I had a similar thing with glitched images once. We had to retrofit a middleware onto a site that would obfuscate email addresses. It used a regex to spot valid emails and replaced them with a hash. It also knew which URLs and form parameters expected emails and used a lookup table to translate them back. This was sufficient to anonymise usernames without breaking any functionality on the site.<p>Turns out, we forgot to check the content type, and valid emails according to the regex we had used were surprisingly common in binaries.
First guess: The author is running into something related to <a href="https://developer.mozilla.org/en-US/docs/Web/API/XMLHttpRequest/Sending_and_Receiving_Binary_Data#Receiving_binary_data_in_older_browsers" rel="nofollow">https://developer.mozilla.org/en-US/docs/Web/API/XMLHttpRequ...</a>.
> 0x79 is the z in the ASCII table.<p>The most confusing technically correct statement of the year.<p>Edit: Sorry, scratch "technically correct". Need more coffee.