I finally took the time to read how Unicode code points are represented in UTF-8. I wish I had done this sooner. It explains how a 4-byte sequence can carry 2^21 ≈ 2.1 million code points, what happened to the other 4*8 - 21 = 11 bits, and probably why the "char" type in Rust is 4 bytes instead of 1. Enjoy.
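For anyone who wants to see the bit layout in code, here is a rough sketch of a hand-rolled encoder. The function name `encode_utf8_by_hand` is made up for illustration; in real code you would just call the standard library's `char::encode_utf8`. In the 4-byte form, the lead byte spends 5 bits on the `11110` marker and each of the three continuation bytes spends 2 bits on `10`, which is where the 5 + 2 + 2 + 2 = 11 "missing" bits go.

```rust
/// Hand-rolled UTF-8 encoder, for illustration only (hypothetical helper,
/// not a standard library function).
/// Returns a 4-byte buffer plus the number of bytes actually used.
fn encode_utf8_by_hand(c: char) -> ([u8; 4], usize) {
    // A Rust `char` is a Unicode scalar value, so it always fits in a u32.
    let cp = c as u32;
    match cp {
        // 1 byte:  0xxxxxxx  (7 payload bits)
        0..=0x7F => ([cp as u8, 0, 0, 0], 1),
        // 2 bytes: 110xxxxx 10xxxxxx  (5 + 6 = 11 payload bits)
        0x80..=0x7FF => (
            [0xC0 | (cp >> 6) as u8, 0x80 | (cp & 0x3F) as u8, 0, 0],
            2,
        ),
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx  (4 + 6 + 6 = 16 payload bits)
        0x800..=0xFFFF => (
            [
                0xE0 | (cp >> 12) as u8,
                0x80 | ((cp >> 6) & 0x3F) as u8,
                0x80 | (cp & 0x3F) as u8,
                0,
            ],
            3,
        ),
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx  (3 + 6 + 6 + 6 = 21 payload bits)
        _ => (
            [
                0xF0 | (cp >> 18) as u8,
                0x80 | ((cp >> 12) & 0x3F) as u8,
                0x80 | ((cp >> 6) & 0x3F) as u8,
                0x80 | (cp & 0x3F) as u8,
            ],
            4,
        ),
    }
}
```

Calling `encode_utf8_by_hand('€')` should give the three bytes `E2 82 AC`, the same as `'€'.encode_utf8(&mut buf)` from the standard library. Note that a `char` itself is not stored as UTF-8: it is a fixed-width 4-byte value, while `String` and `&str` hold the variable-length UTF-8 bytes.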