Is this super-summary correct, to understand how what's shown on the user's screen is expanded into single bytes?

1) A user sees some character on their screen => that's a "grapheme", which is a collection of...

2) ...1 to N "Unicode code points", where a single "Unicode code point" can use...

3) ...1 to 6 "UTF-8" bytes.

Is that right (in the case of UTF-8 storage)?

(I feel like I'm missing an intermediate step...)

(Indirectly related to "You can't just assume UTF-8" https://news.ycombinator.com/item?id=40195009 , comment https://news.ycombinator.com/item?id=40206149 , the link mentioned in that comment being https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/ )

Thx :o)
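To make the layers concrete, here's a small Python sketch (my own illustration, not from the linked articles, standard library only) showing one on-screen character that can be either one or two code points, and the UTF-8 bytes each form takes:

    import unicodedata

    # One user-perceived character ("é") can be one code point or two,
    # depending on the normalization form.
    nfc = unicodedata.normalize("NFC", "e\u0301")   # precomposed: U+00E9
    nfd = unicodedata.normalize("NFD", "\u00e9")    # decomposed: U+0065 + U+0301

    for label, s in (("NFC", nfc), ("NFD", nfd)):
        print(label,
              "code points:", [f"U+{ord(c):04X}" for c in s],
              "UTF-8 bytes:", s.encode("utf-8").hex(" "))

    # NFC code points: ['U+00E9']            UTF-8 bytes: c3 a9
    # NFD code points: ['U+0065', 'U+0301']  UTF-8 bytes: 65 cc 81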
Code points only take 1 to 4 UTF-8 bytes. UTF-8's bit pattern could in principle extend to 6 bytes, but the Unicode code space stops at U+10FFFF (1,114,112 code points), and even U+10FFFF needs only 4 bytes in its shortest form. You could encode a code point "overlong" with extra bytes, but UTF-8 requires the shortest form, so overlong sequences are invalid and should be rejected as potentially harmful.
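As a quick check (again my own standard-library sketch, not something from the thread), you can see the 1–4 byte boundaries and the rejection of an overlong form directly:

    # UTF-8 length grows with the code point value: 1 byte up to U+007F,
    # 2 up to U+07FF, 3 up to U+FFFF, 4 up to U+10FFFF.
    for cp in (0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF):
        print(f"U+{cp:06X} -> {len(chr(cp).encode('utf-8'))} byte(s)")

    # Overlong encodings are invalid: 0xC0 0xAF would be an overlong "/" (U+002F),
    # and a strict decoder must reject it.
    try:
        b"\xc0\xaf".decode("utf-8")
    except UnicodeDecodeError as e:
        print("rejected overlong encoding:", e)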
Also, I think the step you feel you're missing is the one where code points are combined, shaped into glyphs and ligatures, and laid out on screen. That's handled by a text-shaping and layout stack: Google Chrome uses HarfBuzz for shaping and Skia for rendering, while GTK apps go through Pango (which itself uses HarfBuzz). https://en.wikipedia.org/wiki/Complex_text_layout
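To see why that extra layer matters (one more stdlib-only sketch of my own): a single on-screen glyph like the family emoji is one grapheme cluster, several code points, and even more UTF-8 bytes, and splitting text into user-perceived characters needs UAX #29 segmentation, which the shaping/layout stack deals with alongside fonts:

    # One user-perceived character (a ZWJ emoji sequence), but several code points.
    family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"  # man + ZWJ + woman + ZWJ + girl

    print("code points:", len(family))                  # 5
    print("UTF-8 bytes:", len(family.encode("utf-8")))  # 18

    # The standard library has no grapheme-cluster segmentation (UAX #29);
    # a third-party module such as `regex` provides it via the \X pattern, e.g.:
    #   regex.findall(r"\X", family)  ->  [family]  (one grapheme cluster)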