
The Unlikely Story of UTF-8: The Text Encoding of the Web

7 points by BryanLunduke almost 2 years ago

2 comments

rahimnathwani almost 2 years ago
When they did this, Unicode already existed, and assigned a code point to each character. There were fewer than 65k code points.

Naively, it seems like creating a scheme to pack these code points would be trivial: just represent each character as a series of bytes. But it's not so simple! As I understand it:

- they wanted backward compatibility with ASCII, which used only a single byte to represent each character

- they wanted to use memory efficiently: common characters shouldn't use 2 bytes

- they wanted to gracefully handle errors: a single corrupted byte shouldn't result in the rest of the string being parsed as garbage
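
A minimal sketch of these three properties, using only Python's built-in str.encode and bytes.decode; the example strings and the corrupted byte index are illustrative assumptions, not taken from the comment:

    # 1. ASCII backward compatibility: ASCII text encodes to the identical single bytes.
    ascii_text = "hello"
    assert ascii_text.encode("utf-8") == b"hello"  # one byte per character, same as ASCII

    # 2. Memory efficiency: byte length grows with the code point, so common
    #    characters stay short while rarer ones take more bytes.
    for ch in ["A", "é", "€", "😀"]:
        print(ch, "->", len(ch.encode("utf-8")), "byte(s)")
    # A -> 1, é -> 2, € -> 3, 😀 -> 4

    # 3. Error resilience: corrupting one byte damages only that character;
    #    the rest of the string still decodes correctly.
    data = bytearray("naïve café".encode("utf-8"))
    data[3] = 0xFF  # clobber a continuation byte of "ï" (hypothetical corruption)
    print(data.decode("utf-8", errors="replace"))
    # -> 'na��ve café'  (only the damaged sequence is replaced)

The third property works because UTF-8 lead bytes and continuation bytes are distinguishable by their high bits (continuation bytes always start with 10), so a decoder can resynchronize at the next character boundary after corrupted input.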
ajstarks almost 2 years ago
See: https://flickr.com/photos/ajstarks/albums/72157631470798870