Ask HN: Super-summary to go from "grapheme" to "bytes"?

2 points by zepearl, about 1 year ago
Is this super-summary correct, to understand how what's shown on the user's screen is expanded into single bytes?

1) A user sees some character on his/her screen => that's a "grapheme", which is a collection of...

2) ...1 to N "Unicode code points", where a single "Unicode code point" can use...

3) ...1 to 6 "UTF-8" bytes.

Is that right (in the case of UTF-8 storage)?

(I feel like I'm missing an intermediate step...)

(indirectly related to "You can't just assume UTF-8" https://news.ycombinator.com/item?id=40195009 , comment https://news.ycombinator.com/item?id=40206149 , link mentioned in that comment being https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/ )

Thx :o)
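The chain in the question can be checked directly with a small Python sketch. It is only an illustration: the helper name `describe` and the sample string are made up for this example. It takes one user-perceived character written as two code points and shows the UTF-8 bytes each code point becomes.

```python
import unicodedata

def describe(text: str) -> list[tuple[str, str, bytes]]:
    """Return (code point, Unicode name, UTF-8 bytes) for each code point in text."""
    rows = []
    for ch in text:  # iterating a Python str yields code points, not graphemes
        rows.append((f"U+{ord(ch):04X}",
                     unicodedata.name(ch, "<unnamed>"),
                     ch.encode("utf-8")))
    return rows

# One grapheme as seen on screen ("é"), stored here as two code points:
# a base letter followed by a combining accent.
grapheme = "e\u0301"

for code_point, name, encoded in describe(grapheme):
    print(code_point, name, encoded.hex(" "))
# U+0065 LATIN SMALL LETTER E 65
# U+0301 COMBINING ACUTE ACCENT cc 81

print(len(grapheme), "code points,", len(grapheme.encode("utf-8")), "UTF-8 bytes")
# 2 code points, 3 UTF-8 bytes
```

Python's standard library has no grapheme segmentation, so the sketch starts from a string already known to be a single grapheme rather than detecting the boundary itself.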

2 comments

nuc1e0n, about 1 year ago
A code point can only take 1 to 4 UTF-8 bytes. UTF-8's original bit pattern can extend up to 6 bytes, but the highest valid Unicode code point is U+10FFFF (1,114,111 in decimal), and it takes 4 bytes to encode in UTF-8 in its shortest, non-overlong form. RFC 3629 caps UTF-8 at 4 bytes for that reason. I guess you could encode it overlong, but UTF-8 must always use the shortest form, so anything else should be considered invalid and potentially harmful.
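A quick way to see the 1-to-4-byte limit and the overlong rule is to ask a strict decoder. The sketch below uses Python's built-in UTF-8 codec; the helper name `is_valid_utf8` and the sample byte strings are chosen only for illustration.

```python
# Largest code point that fits in each UTF-8 length.
boundaries = {1: 0x7F, 2: 0x7FF, 3: 0xFFFF, 4: 0x10FFFF}

for expected_len, cp in boundaries.items():
    encoded = chr(cp).encode("utf-8")
    assert len(encoded) == expected_len
    print(f"U+{cp:04X} -> {encoded.hex(' ')} ({len(encoded)} bytes)")

def is_valid_utf8(data: bytes) -> bool:
    """True if data decodes under Python's strict UTF-8 decoder."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# "/" (U+002F) is the single byte 0x2F; 0xC0 0xAF is an overlong 2-byte form of it.
print(is_valid_utf8(b"\x2f"))                  # True
print(is_valid_utf8(b"\xc0\xaf"))              # False
# A 5-byte sequence in the old 6-byte bit pattern is rejected as well.
print(is_valid_utf8(b"\xf8\x88\x80\x80\x80"))  # False
```

Other decoders may be more lenient, so this shows what a conforming decoder does rather than what every library guarantees.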
nuc1e0n, about 1 year ago
Also, I think the step you feel you are missing is text shaping and layout: combining code points into glyphs (ligatures included) and laying the text out on screen. IIRC Google Chrome uses a library called HarfBuzz for the shaping and Skia for drawing the glyphs; Pango is the library GTK applications use for this. https://en.wikipedia.org/wiki/Complex_text_layout