Libgrapheme: A simple freestanding C99 library for Unicode

108 points by harporoeder over 2 years ago

6 comments

manifoldgeo over 2 years ago
I've been trying to get into C this past week, and this is a great coincidence to see this today! I was just thinking how convenient it is to type emojis right into strings in Python and print them. I assumed C didn't have much Unicode compatibility, though I didn't research it.

I gave libgrapheme a try, and it compiled just as the instructions said it would. The hello-world program also mostly worked, but my terminal malformed several things. For example, the American flag emoji rendered in my terminal as [U][S], and the family emoji rendered as three distinct emoji faces (side by side) rather than one grouped one.

I went to a website that lets me copy emojis to my clipboard, and I directly copy-pasted the American flag into my terminal, and I still got [U][S], so I think the problem is just with the terminal and not the library.

edit: Indeed, this is a problem in Gnome terminal. I found a Bugzilla link[0] that is still open. The official name for the grouped emoji type is "ZWJ sequence"[1], short for Zero-Width Joiner, and it appears that not a lot of terminals support them. If anyone knows of a good one for Linux, please let me know!

Great stuff, thank you for sharing!

References:

[0]: https://gitlab.gnome.org/GNOME/vte/-/issues/2317

[1]: https://emojipedia.org/emoji-zwj-sequence/
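To make the [U][S] and three-faces rendering concrete, here is a minimal Python sketch (standard library only, not libgrapheme) that lists the code points inside the two emoji. The helper name `describe` and the choice of a man/woman/girl family are assumptions for illustration; the family emoji in the hello-world program may be a different one. Note that the flag is a pair of regional indicator symbols rather than a ZWJ sequence, while the family is three emoji joined by U+200D.

```python
import unicodedata

ZWJ = "\u200d"  # ZERO WIDTH JOINER


def describe(text):
    """Return (code point, Unicode name) pairs for every code point in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]


# US flag: two regional indicator symbols, no joiner involved.
flag = "\U0001F1FA\U0001F1F8"
# Family (man, woman, girl): three emoji glued together with ZWJ.
family = "\U0001F468" + ZWJ + "\U0001F469" + ZWJ + "\U0001F467"

for label, text in (("flag", flag), ("family", family)):
    print(label, "->", len(text), "code points")
    for code, name in describe(text):
        print("   ", code, name)
```

A terminal that draws each code point separately would show the two regional indicators as boxed letters and the three family members as separate faces, which matches the behaviour described above. Treating each sequence as a single user-perceived character is the job of grapheme cluster segmentation.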
hnfong over 2 years ago
The claims about "word segmentation" caught my eye, because it's not even well defined in my native language. (One example among the difficulties: there are a lot of constructs like "rid-fucking-iculous". Should it be "rid/fucking/iculous" or "fucking ridiculous"?)

After checking the source to see how they perform this literally impossible feat, it turns out they were implementing a Unicode standard that basically tries to do something useful (for some families of languages) but has a whole *wall* of caveats: https://unicode.org/reports/tr29/#Word_Boundary_Rules

I can see users of the word segmentation function shooting their feet really badly with at least CJK languages, especially with a deceptively simple API like that. In general, "word segmentation" doesn't make sense on a language-neutral level. libgrapheme even admits to this somewhat:

"For some languages, for instance, it is necessary to have a dictionary on hand to always accurately determine when a word begins and ends. The defaults provided by the standard, though, already do a great job respecting the language's boundaries in the general case"

I disagree about the "great job" part. It's probably the best job a Unicode spec can do, but it's not going to be great for like at least 20% of the world's population... (Also, a minor nit is that no dictionary is accurate, so even with a dictionary the results are not really "always accurate", just mostly accurate, unless you hit edge cases [for CJK at least], upon which advanced NLP techniques are needed.)

IMHO they should have a big disclaimer in their GRAPHEME_NEXT_WORD_BREAK(3) manual page warning users about the caveats (like, "if you're using this for anything other than $(these families of languages) make sure you really know what you're doing before using this function").
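A small Python sketch of why language-neutral rules struggle here. This is a deliberately simplified stand-in, not the UAX #29 algorithm and not libgrapheme's implementation: it assumes that runs of non-Han letters and digits form one word and that each Han ideograph stands alone, which is roughly what rules without a dictionary are left with. The function names `is_han` and `naive_words` are hypothetical.

```python
import unicodedata


def is_han(ch):
    """True for CJK unified ideographs, judged by the Unicode character name."""
    return unicodedata.name(ch, "").startswith("CJK UNIFIED IDEOGRAPH")


def naive_words(text):
    """Simplified, language-neutral segmentation (NOT the full UAX #29 rules).

    Runs of non-Han letters and digits form one segment, every Han ideograph
    becomes its own segment, and everything else is treated as a separator.
    """
    segments, run = [], ""
    for ch in text:
        if ch.isalnum() and not is_han(ch):
            run += ch          # extend the current space-delimited style word
            continue
        if run:                # a separator or ideograph ends the current run
            segments.append(run)
            run = ""
        if is_han(ch):
            segments.append(ch)
    if run:
        segments.append(run)
    return segments


print(naive_words("The quick brown fox"))
print(naive_words("我喜欢吃苹果"))
```

The English sentence comes out as four words, while the Chinese sentence ("I like to eat apples") comes out as six single-character segments. A reader would more likely group it as 我 / 喜欢 / 吃 / 苹果, and getting there needs a dictionary or a language-specific model.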
gigel82 over 2 years ago
This is interesting, particularly for implementing Intl in JS engines without the mega-heavy ICU. But I wonder how portable it really is.

Sometimes I have to dig very deep to find that what folks call "portable C" is actually POSIX-dependent.

It doesn't appear to be the case after going through the code for a bit, so that's promising.
__d over 2 years ago
Does anyone know offhand whether this does comparisons? And normalization?
HexDecOctBin over 2 years ago
Is there any similar C library that deals with Normalisation and Collation?
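For readers unsure what normalization buys you when comparing strings, here is a minimal Python sketch using the standard `unicodedata` module. It says nothing about what libgrapheme does or does not provide, and the helper `same_text` is a hypothetical name. Collation (locale-aware ordering) is a separate problem and is not covered here.

```python
import unicodedata

composed = "\u00e9"      # é as a single precomposed code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT


def same_text(a, b, form="NFC"):
    """Compare two strings after bringing both to the same normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


print(composed == decomposed)           # raw code point comparison
print(same_text(composed, decomposed))  # comparison after normalization
```

The two strings render identically, yet the raw comparison reports them as different because the code point sequences differ. Only after normalizing both sides to the same form do they compare equal.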
1letterunixname over 2 years ago
Oh my. Don't roll your own crypto or Unicode library. There are too many edge cases that "simplicity" simply doesn't scale well with. Use what works and what's already been done rather than reinventing the wheel without a clear and necessary purpose.