I've been trying to get into C this past week, and it's a great coincidence to see this today! I was just thinking how convenient it is to type emojis right into strings in Python and print them. I assumed C didn't have much Unicode support, though I hadn't researched it.<p>I gave libgrapheme a try, and it compiled just as the instructions said it would. The hello-world program also mostly worked, but my terminal malformed several things. For example, the American flag emoji rendered as [U][S], and the family emoji rendered as three distinct emoji faces (side by side) rather than one grouped one.<p>I went to a website that lets me copy emojis to my clipboard, pasted the American flag directly into my terminal, and still got [U][S], so I think the problem is with the terminal rather than the library.<p>edit: Indeed, this is a problem in GNOME Terminal. I found a GitLab issue[0] that is still open. The official name for the grouped emoji type is "emoji ZWJ sequence"[1], where ZWJ stands for Zero-Width Joiner, and it appears not many terminals support them. (The flag is technically a different case, a pair of regional indicator symbols rather than a ZWJ sequence, but plenty of terminals mishandle both.) If anyone knows of a good one for Linux, please let me know!<p>Great stuff, thank you for sharing!<p>References:<p>[0]: <a href="https://gitlab.gnome.org/GNOME/vte/-/issues/2317" rel="nofollow">https://gitlab.gnome.org/GNOME/vte/-/issues/2317</a><p>[1]: <a href="https://emojipedia.org/emoji-zwj-sequence/" rel="nofollow">https://emojipedia.org/emoji-zwj-sequence/</a>
The claims about "word segmentation" caught my eye, because it's not even well defined in my native language. (One example of the difficulties: there are a lot of constructs like "rid-fucking-iculous". Should that segment as "rid/fucking/iculous" or as "fucking" + "ridiculous"?)<p>After checking the source to see how they perform this literally impossible feat, it turns out they were implementing a Unicode standard that basically tries to do something useful (for some families of languages) but comes with a whole <i>wall</i> of caveats: <a href="https://unicode.org/reports/tr29/#Word_Boundary_Rules" rel="nofollow">https://unicode.org/reports/tr29/#Word_Boundary_Rules</a><p>I can see users of the word segmentation function shooting themselves in the foot really badly with at least CJK languages, especially with a deceptively simple API like that. In general, "word segmentation" doesn't make sense on a language-neutral level. libgrapheme even admits as much:<p>""" For some languages, for instance, it is necessary to have a dictionary on hand to always accurately determine when a word begins and ends. The defaults provided by the standard, though, already do a great job respecting the language's boundaries in the general case """<p>I disagree with the "great job" part. It's probably the best job a Unicode spec can do, but it's not going to be great for at least something like 20% of the world's population... (A minor nit: no dictionary is complete either, so even with one the results aren't really "always accurate", just mostly accurate until you hit edge cases [for CJK at least], at which point you need actual NLP techniques.)<p>IMHO they should put a big disclaimer in the GRAPHEME_NEXT_WORD_BREAK(3) manual page warning users about the caveats (something like: "if you're using this for anything other than $(these families of languages), make sure you really know what you're doing before using this function").
This is interesting, particularly for implementing Intl in JS engines without the mega-heavy ICU. But I wondered how portable it really is.<p>Sometimes I have to dig very deep to find that what folks call "portable C" is actually POSIX-dependent.<p>That doesn't appear to be the case here after going through the code for a bit, so that's promising.
Oh my. Don't roll your own crypto or Unicode library. There are too many edge cases for "simplicity" to scale well against. Use what works and what has already been done rather than reinventing the wheel without a clear and necessary purpose.