There's... a lot of wrong stuff here. Tackling some of the highlights:

> ASCII code pages map the upper 128 positions (0x7F:0xFF) of the ASCII byte. Each page holds a different character set. This is one way internationalisation can be achieved.

This is at best a poor explanation, and at worst outright wrong. The actual key concept is the charset--and there's a wide variety of charsets. Because ASCII is an inherently 7-bit charset, a lot of charsets were created by setting the first 128 characters to be ASCII and mapping different characters into the upper 128 positions. IBM (I believe) came up with the term "code page" to refer to the different character sets they came up with.

> Unicode provides a unique code for *every* character, regardless of the language.

That's not really true. Unicode keeps track of "code points". Several code points may together make up what we think of as a character--consider that something like à can consist of either a precomposed "à" code point or an "a" + "` diacritic" sequence. Thus there's an entire concern about Unicode normalization that a lot of people prefer to sweep under the rug.

> When creating a new file using touch, your computer will interpret that file as binary file.

Okay, what's happening here is you've got a command, the file command, whose entire job is to look at a file and *guess* what its contents are. For text files, part of that guessing process often involves *guessing* what the character encoding of the file is. That guessing is not always correct--there's the infamous "the printer can't print on Tuesdays" bug, caused by the date string in the printer file, which on Tuesdays made the file command think it was an entirely different type of file [1]. There's another famous bug where starting a text file with a 4-letter word, two 3-letter words, and a 5-letter word would cause Notepad to think the text file was in UTF-16 instead of ASCII [2].

With regards to guessing charsets, this is not always a particularly feasible process. Some charsets are more reliable to guess than others. UTF-8, for example, tends to stick out--its continuation bytes form a pattern that most other charsets are unlikely to keep up with for long. Guessing ASCII for text that contains no bytes with the high bit set is pretty safe, since almost every charset is designed with ASCII-subset-safety in mind, and those that aren't (EBCDIC, UTF-7, UTF-16/UTF-32) are found in relatively constrained environments [3].

[1] https://beza1e1.tuxen.de/lore/print_on_tuesday.html

[2] https://en.wikipedia.org/wiki/Bush_hid_the_facts

[3] ISO-2022-* charsets are mode-switching, relying on the ESC character as part of the sequence that switches to a different encoding. So for reliable ASCII detection you also have to treat the ESC character as a sign of a non-ASCII encoding.
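
To make the code-page point concrete, here's a quick Python sketch (Python and its codec names are chosen purely for illustration): the same high byte decodes to a different character under each legacy charset, which is exactly why you have to know which code page a file was written in.

    raw = bytes([0xE9])  # one byte with the high bit set

    for codec in ("cp1252", "cp1251", "cp437", "iso8859_7"):
        print(codec, "->", raw.decode(codec))
    # cp1252    -> é  (Western European)
    # cp1251    -> й  (Cyrillic)
    # cp437     -> Θ  (the original IBM PC code page)
    # iso8859_7 -> ι  (Greek)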
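
And the normalization point about "à", sketched with Python's unicodedata module (again, just an illustration): the two spellings are different code point sequences until you normalize them to a common form.

    import unicodedata

    precomposed = "\u00e0"   # "à" as a single code point, U+00E0
    decomposed  = "a\u0300"  # "a" + U+0300 COMBINING GRAVE ACCENT

    print(precomposed == decomposed)          # False: different code points
    print(len(precomposed), len(decomposed))  # 1 2

    # Normalizing both to the same form (NFC here) makes them compare equal.
    print(unicodedata.normalize("NFC", precomposed) ==
          unicodedata.normalize("NFC", decomposed))  # True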
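
As for UTF-8 sticking out, here's a rough sketch of the lead-byte/continuation-byte check (simplified: a strict validator also rejects some overlong and surrogate encodings that this lets through, and in practice you'd just attempt data.decode("utf-8")):

    def looks_like_utf8(data: bytes) -> bool:
        i = 0
        while i < len(data):
            b = data[i]
            if b < 0x80:             # plain ASCII byte
                i += 1
                continue
            if 0xC2 <= b <= 0xDF:    # lead byte of a 2-byte sequence
                need = 1
            elif 0xE0 <= b <= 0xEF:  # lead byte of a 3-byte sequence
                need = 2
            elif 0xF0 <= b <= 0xF4:  # lead byte of a 4-byte sequence
                need = 3
            else:                    # stray continuation byte or invalid lead
                return False
            if i + need >= len(data):
                return False         # sequence runs off the end of the data
            # every byte after the lead must look like 0b10xxxxxx
            if any(not (0x80 <= data[i + j] <= 0xBF) for j in range(1, need + 1)):
                return False
            i += need + 1
        return True

    print(looks_like_utf8("héllo".encode("utf-8")))  # True
    print(looks_like_utf8(bytes([0xE9])))            # False: a bare cp1252 "é"

Random legacy-charset text fails this check almost immediately, which is why a UTF-8 guess that survives a whole file is a fairly safe bet.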