Is this super-summary correct, to understand how what's shown on the user's screen is expanded into single bytes?

1) A user sees some character on their screen => that's a "grapheme", which is a collection of...

2) ...1 to N "Unicode code points", where a single "Unicode code point" can use...

3) ...1 to 6 "UTF-8" bytes.

Is that right (in the case of UTF-8 storage)?

(I feel like I'm missing an intermediate step...)

(Indirectly related to "You can't just assume UTF-8" https://news.ycombinator.com/item?id=40195009 , comment https://news.ycombinator.com/item?id=40206149 , the link mentioned in that comment being https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/ )

Thx :o)
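To make the layers concrete, here's a small Python sketch (my own illustration, not from the linked articles, standard library only) showing one on-screen character that can be either one or two code points, and the UTF-8 bytes each form takes:

    import unicodedata

    # One user-perceived character ("é") can be one code point or two,
    # depending on the normalization form.
    nfc = unicodedata.normalize("NFC", "e\u0301")   # precomposed: U+00E9
    nfd = unicodedata.normalize("NFD", "\u00e9")    # decomposed: U+0065 + U+0301

    for label, s in (("NFC", nfc), ("NFD", nfd)):
        print(label,
              "code points:", [f"U+{ord(c):04X}" for c in s],
              "UTF-8 bytes:", s.encode("utf-8").hex(" "))

    # NFC code points: ['U+00E9']            UTF-8 bytes: c3 a9
    # NFD code points: ['U+0065', 'U+0301']  UTF-8 bytes: 65 cc 81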
Code points only take 1 to 4 UTF-8 bytes. UTF-8's bit pattern could in principle extend to 6 bytes, but the Unicode code space stops at U+10FFFF (1,114,112 code points), and even U+10FFFF needs only 4 bytes in its shortest form. You could encode a code point "overlong" with extra bytes, but UTF-8 requires the shortest form, so overlong sequences are invalid and should be rejected as potentially harmful.
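As a quick check (again my own standard-library sketch, not something from the thread), you can see the 1–4 byte boundaries and the rejection of an overlong form directly:

    # UTF-8 length grows with the code point value: 1 byte up to U+007F,
    # 2 up to U+07FF, 3 up to U+FFFF, 4 up to U+10FFFF.
    for cp in (0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF):
        print(f"U+{cp:06X} -> {len(chr(cp).encode('utf-8'))} byte(s)")

    # Overlong encodings are invalid: 0xC0 0xAF would be an overlong "/" (U+002F),
    # and a strict decoder must reject it.
    try:
        b"\xc0\xaf".decode("utf-8")
    except UnicodeDecodeError as e:
        print("rejected overlong encoding:", e)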
Also, I think the step you feel you're missing is the one where code points are combined, shaped into glyphs and ligatures, and laid out on screen. That's handled by a text-shaping and layout stack: Google Chrome uses HarfBuzz for shaping and Skia for rendering, while GTK apps go through Pango (which itself uses HarfBuzz). https://en.wikipedia.org/wiki/Complex_text_layout
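To see why that extra layer matters (one more stdlib-only sketch of my own): a single on-screen glyph like the family emoji is one grapheme cluster, several code points, and even more UTF-8 bytes, and splitting text into user-perceived characters needs UAX #29 segmentation, which the shaping/layout stack deals with alongside fonts:

    # One user-perceived character (a ZWJ emoji sequence), but several code points.
    family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"  # man + ZWJ + woman + ZWJ + girl

    print("code points:", len(family))                  # 5
    print("UTF-8 bytes:", len(family.encode("utf-8")))  # 18

    # The standard library has no grapheme-cluster segmentation (UAX #29);
    # a third-party module such as `regex` provides it via the \X pattern, e.g.:
    #   regex.findall(r"\X", family)  ->  [family]  (one grapheme cluster)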