Note: This post is basically a TL;DR of https://www.theregister.com/2013/10/04/verity_stob_unicode/ by Verity Stob.

One of the reasons there is so much confusion about encodings vs Unicode is that Unicode was initially an encoding. It was thought that 65K characters was enough to represent all the characters in actual use across languages, so you just needed to change from an 8-bit char to a 16-bit char and all would be well (apart from the issue of endianness). Thus Unicode initially specified what each symbol would look like encoded in 16 bits (see http://unicode.org/history/unicode88.pdf, particularly section 2). Windows NT, Java, and ICU all embraced this.

Then it turned out that you needed a lot more characters than 65K, and instead of each character being 16 bits you would need 32-bit characters (or else weird 3-byte data types). Whereas people could justify going from 8 bits to 16 bits as the cost of not having to worry about charsets, most developers balked at 32 bits for every character. In addition, you now had a bunch of early adopters (Java and Windows NT) that had already committed to 16-bit characters. So encodings like UTF-16 were hacked on, which use surrogate pairs of 16-bit code units for the code points that don't fit in 16 bits (see the sketch at the end of this comment).

I think if the problem had been better understood at the start, namely that there are far more characters than will ever fit in 16 bits, then something like UTF-8 would likely have been chosen as the canonical encoding and we could have avoided a lot of these issues. Alas, such is the benefit of 20/20 hindsight.
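For anyone curious what "surrogate pairs" actually do, here is a minimal sketch (in Python, which the comment doesn't mention; it's just convenient for poking at bytes) of the arithmetic UTF-16 uses for a code point above U+FFFF, next to the UTF-8 bytes for the same character:

    # A code point outside the original 16-bit range (U+1F600, an emoji)
    cp = 0x1F600

    # UTF-16 surrogate-pair arithmetic: subtract 0x10000, then split the
    # remaining 20 bits across a high (lead) and low (trail) surrogate.
    v = cp - 0x10000
    high = 0xD800 + (v >> 10)     # 0xD83D
    low  = 0xDC00 + (v & 0x3FF)   # 0xDE00
    print(hex(high), hex(low))

    # Python's built-in encoders agree: two 16-bit units for UTF-16,
    # four bytes for UTF-8.
    print(chr(cp).encode('utf-16-le'))  # b'=\xd8\x00\xde' (0xD83D, 0xDE00, little-endian)
    print(chr(cp).encode('utf-8'))      # b'\xf0\x9f\x98\x80'

The point being: once some characters take two 16-bit units anyway, the fixed-width advantage that justified 16-bit chars in the first place is gone, which is exactly why UTF-8 looks like the better trade-off in hindsight.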