The Unicode standard describes in Annex 29 [1] how to properly split strings into grapheme clusters. And here [2] is a JavaScript implementation. This is a solved problem.<p>[1] <a href="http://www.unicode.org/reports/tr29/" rel="nofollow">http://www.unicode.org/reports/tr29/</a><p>[2] <a href="https://github.com/orling/grapheme-splitter" rel="nofollow">https://github.com/orling/grapheme-splitter</a>
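To sketch what that looks like in practice: newer JS engines also ship UAX #29 segmentation built in, as Intl.Segmenter, so you don't even need the library. This assumes a runtime with Intl.Segmenter support (Node 16+ or a recent browser); emoji are written as escape sequences since HN strips them:<p><pre><code>// Count user-perceived characters (grapheme clusters) per UAX #29
// using the built-in Intl.Segmenter, an alternative to the
// grapheme-splitter library linked above.
const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });

function countGraphemes(str) {
  // Each segment produced by the segmenter is one grapheme cluster.
  return [...segmenter.segment(str)].length;
}

// U+1F469 U+200D U+2764 U+200D U+1F48B U+200D U+1F469
// ("woman kissing woman" as a ZWJ sequence)
const kiss = '\u{1F469}\u{200D}\u{2764}\u{200D}\u{1F48B}\u{200D}\u{1F469}';
console.log(countGraphemes(kiss)); // 1 grapheme cluster
console.log([...kiss].length);     // 7 codepoints
</code></pre>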
The thing that frustrates me the most about Unicode emoji is the astounding number of combining characters. For combining characters in written languages, you can do an NFC normalization and, with moderate success, get a 1 codepoint = 1 grapheme mapping, but "Emoji 2.0" introduced some ridiculous emoji compositions with the ZWJ character.<p>To use the author's example:<p>woman - 1 codepoint<p>black woman - 2 codepoints, woman + dark Fitzpatrick modifier<p>️woman kissing woman - 7 codepoints, woman + ZWJ + heart + ZWJ + kissy lips + ZWJ + woman<p>It's like composing Mayan pictographs, except you have to include an invisible character in between each component.<p>Here's another fun one: country flags. Unicode has special characters 🇱 🇮 🇰 🇪 🇹 🇭 🇮 🇸 that you can combine into country codes to create a flag. 🇰+🇷 = 🇰🇷<p>edit: looks like HN strips emoji? Changed the emoji in the example into English words. They are all supposed to render as a single "character".
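The codepoint counts above can be checked directly; here's a sketch using escape sequences (again, because HN strips raw emoji). The ES6 spread operator iterates by codepoint, not by UTF-16 code unit:<p><pre><code>const woman      = '\u{1F469}';
const blackWoman = '\u{1F469}\u{1F3FF}'; // woman + dark Fitzpatrick modifier
// woman + ZWJ + heart + ZWJ + kissy lips + ZWJ + woman
const kiss       = '\u{1F469}\u{200D}\u{2764}\u{200D}\u{1F48B}\u{200D}\u{1F469}';
const flagKR     = '\u{1F1F0}\u{1F1F7}'; // regional indicators K + R

console.log([...woman].length);      // 1
console.log([...blackWoman].length); // 2
console.log([...kiss].length);       // 7
console.log([...flagKR].length);     // 2 codepoints, rendered as one flag
</code></pre>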
Before emoji, fonts and colors were independent. Combining the two creates a mess. Try using emoji in an editor with syntax coloring. We got into this because some people thought that single-color emoji were racist.[1] So now there are five skin tone options. The no-option case is usually rendered as bright yellow, which comes from the old AOL client. They got it from the happy-face icon of the 1970s.<p>Here's the current list of valid emoji, including upcoming ones being added in the next revision.[2]<p>A reasonable test for passwords is to run them through an IDNA checker, which checks whether a string is acceptable as a domain name component. This catches most weird stuff, such as mixed left-to-right and right-to-left symbols, zero-width markers, homoglyphs, and emoji.<p>[1] <a href="https://www.washingtonpost.com/news/the-intersect/wp/2015/02/24/are-apples-new-yellow-face-emoji-racist/?utm_term=.ec65e2f8ef2d" rel="nofollow">https://www.washingtonpost.com/news/the-intersect/wp/2015/02...</a>
[2] <a href="http://unicode.org/emoji/charts-beta/full-emoji-list.html" rel="nofollow">http://unicode.org/emoji/charts-beta/full-emoji-list.html</a>
There are multiple ways of counting "length" of a string. Number of UTF-8 bytes, number of UTF-16 code units, number of codepoints, number of grapheme clusters. These are all distinct yet valid concepts of "length."<p>For the purpose of allocating buffers, I can see the obvious use in knowing number of bytes, UTF-16 code units, or the number of codepoints. I also see the use in being able to iterate through grapheme clusters, for instance for rendering a fragment of text, or for parsing. Perhaps someone can shed light on a compelling use case for knowing the number of grapheme clusters in a particular string, because I haven't been able to think of one.<p>I'm not sure about calculating password lengths: if the point is entropy, the number of bytes seems good enough to me!<p>The password field bug is possibly compelling, but I don't think it's obvious what a password field <i>should</i> do. Should it represent keystrokes? Codepoints? Grapheme clusters? Ligatures? Replace all the glyphs with bullets during font rendering?<p>(Similarly, perhaps someone could explain why they think reversing a string should be a sensible operation. That this is hard to do is something I occasionally hear echoing around the internet. The best I've heard is that you can reuse the default forward lexicographic ordering on reversed strings for a use I've forgotten.)
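For a concrete illustration, here are three of those four lengths for a single pile-of-poop emoji (U+1F4A9) in Node.js — all different, all "correct":<p><pre><code>const s = '\u{1F4A9}';

console.log(Buffer.byteLength(s, 'utf8')); // 4 -- UTF-8 bytes
console.log(s.length);                     // 2 -- UTF-16 code units
console.log([...s].length);                // 1 -- codepoints
// Grapheme clusters: also 1 here, but in general that
// requires full UAX #29 segmentation, not a simple count.
</code></pre>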
If you want to do Unicode correctly, you shouldn't ask for the "length" of a string. There is no true definition of length. If you want to know how many bytes it uses in storage, ask for that. If you want to know how wide it will be on the screen, ask for that. Do not iterate over strings character by character.
> The current largest codepoint? Why that would be a cheese wedge at U+1F9C0. How did we ever communicate before this?<p>Sounds cute, but inaccurate.<p>If we count the last two planes that are reserved for private use (aka, applications/users can use them for whatever domain problems they like), that would be U+10FFFD.<p>If we count the variation selector codepoints (used for things like selecting the emoji or text presentation of certain characters), U+E01EF.<p>If we count the last honestly-for-real-written-language character assigned, it would be 𪘀 U+2FA1D CJK COMPATIBILITY IDEOGRAPH-2FA1D.<p>But I suppose none of that sounds as fun as an emoji (which are really a very small part of the Unicode standard).
Tom Scott did a nice YouTube video related to this: <a href="https://www.youtube.com/watch?v=sTzp76JXsoY" rel="nofollow">https://www.youtube.com/watch?v=sTzp76JXsoY</a>
This appears to be a rehash of what Mathias Bynens was talking about a few years ago.<p><a href="http://vimeo.com/76597193" rel="nofollow">http://vimeo.com/76597193</a><p><a href="https://mathiasbynens.be/notes/javascript-unicode" rel="nofollow">https://mathiasbynens.be/notes/javascript-unicode</a>
I've gone through exactly the same discovery process when implementing faux stamps (something between images and Emoji) in my xmpp app yesterday.<p>My idea was to increase the font size of a message that only consists of Emoji, depending on the number of Emoji in the message, like this:<p><a href="https://xmpp.pix-art.de/imagehost/display/file/2017-03-09_09-36-09_r8m468so4vh7.jpg" rel="nofollow">https://xmpp.pix-art.de/imagehost/display/file/2017-03-09_09...</a><p>The code turned out more complex than first expected, mirroring the same problems OP encountered:<p><a href="https://github.com/ge0rg/yaxim/blob/gradle/src/org/yaxim/androidclient/util/XMPPHelper.java#L66-L93" rel="nofollow">https://github.com/ge0rg/yaxim/blob/gradle/src/org/yaxim/and...</a>
The Zero-Width-Joiner allows for some really strange things: <a href="https://blog.emojipedia.org/ninja-cat-the-windows-only-emoji/" rel="nofollow">https://blog.emojipedia.org/ninja-cat-the-windows-only-emoji...</a>.<p>One can basically achieve an unlimited number of emojis by concatenating the current ones.
I ran into this 2 years ago on Swift when I was creating an emojified version of Twitter. I wanted to ensure that each message sent had at least 1 emoji and I quickly realized that validating a string with 1 emoji was not as simple as:<p><pre><code> if (lastString.characters.count == 2) {
// pseudo code to allow string and activate send button
}
</code></pre>
This was the app I was working on [1]; code is finished, but I'm not launching it (probably ever). The whole emoji length piece was quite frustrating because my assumption of character counting went right out of the window when I had people testing the app in Test Flight.<p>[1] - <a href="https://joeblau.com/emo/" rel="nofollow">https://joeblau.com/emo/</a>
> I have no idea if there’s a good reason for the name “astral plane.” Sometimes, I think people come up with these names just to add excitement to their lives.<p><a href="https://en.wikipedia.org/wiki/Plane_(esotericism)#The_Planes" rel="nofollow">https://en.wikipedia.org/wiki/Plane_(esotericism)#The_Planes</a>
The issue doesn't really seem to be the emojis, but rather the variation sequences. They seem really awkward to work with, though I can sort of see why they're necessary. But the fact that we need special libraries to answer fairly basic queries about Unicode text doesn't bode well.
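To make the awkwardness concrete, here's a sketch of one variation sequence: U+2764 HEAVY BLACK HEART alone typically gets a text presentation, while U+2764 followed by U+FE0F (VARIATION SELECTOR-16) gets the red emoji presentation. Same character to a reader, different codepoint counts to a program, and the selector even survives normalization:<p><pre><code>const textHeart  = '\u2764';
const emojiHeart = '\u2764\uFE0F'; // heart + VARIATION SELECTOR-16

console.log([...textHeart].length);  // 1
console.log([...emojiHeart].length); // 2
// NFC normalization does not strip the variation selector,
// so the two strings never compare equal:
console.log(textHeart === emojiHeart.normalize('NFC')); // false
</code></pre>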
I see your 2 and raise you 2:<p>"(this is a color-hued hand from Apple that doesn't render on HN)".length == 4<p>I ran into the length==2 bug when truncating some text, it led to errors trying to url encode a string :)<p>The author's `fancyCount2` still returns a size of 2 for these kinds of emoji, but I'm not too surprised.
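The length==4 case is reproducible with escapes (HN strips the actual emoji): a skin-toned hand is two astral codepoints, hence four UTF-16 code units, and a codepoint-based counter still reports 2:<p><pre><code>// thumbs up (U+1F44D) + Fitzpatrick type-4 modifier (U+1F3FD)
const hand = '\u{1F44D}\u{1F3FD}';

console.log(hand.length);      // 4 -- UTF-16 code units
console.log([...hand].length); // 2 -- codepoints; counting by
                               // codepoint still gives 2 here
</code></pre>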
I think the article "A Programmer's Introduction to Unicode" that was shared here recently is a good read and explains Unicode well.<p><a href="https://news.ycombinator.com/item?id=13790575" rel="nofollow">https://news.ycombinator.com/item?id=13790575</a>
Just ran into this yesterday when I discovered that an emoji character wouldn't fit into Rust's `char` type. I just changed the type to `&'static str` but I still wish there was a single `grapheme` type or something like that.
In Go (adding the package clause and imports the snippet needs to compile):<p><pre><code> package main

 import "fmt"
 import "unicode/utf8"

 func main() {
   shit := "\U0001f4a9"
   fmt.Printf("len of %s is %d\n", shit, utf8.RuneCountInString(shit))
 }
</code></pre>
</code></pre>
$ len of 💩 is 1<p>Though I can't say that this is all that intuitive either...
Just going to leave this link here: <a href="https://mathiasbynens.be/notes/javascript-unicode" rel="nofollow">https://mathiasbynens.be/notes/javascript-unicode</a>
If this interests you, read the source of Java's AbstractStringBuilder.reverse(). It's interesting and very short. I'm not sure it can deal with ZWJ emoji sequences, though.
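The same pitfall is easy to demonstrate in JS: reversing by UTF-16 code unit (split('')) tears surrogate pairs apart, while reversing by codepoint survives astral characters but would still scramble ZWJ sequences and combining marks:<p><pre><code>const s = 'a\u{1F4A9}b';

// split('') yields UTF-16 code units, so the surrogate pair
// for U+1F4A9 gets reversed into garbage:
const naive = s.split('').reverse().join('');

// Spread yields codepoints, so single astral characters survive:
const byCodepoint = [...s].reverse().join('');

console.log(byCodepoint === 'b\u{1F4A9}a'); // true
console.log(naive === 'b\u{1F4A9}a');       // false: lone surrogates
</code></pre>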
Here are my 2 cents: you can decompose a Unicode string with the ES6 spread operator:<p>[..."(insert 5 poo emoji here)"].length === 5<p>[..."(insert 5 poo emoji here)"][1] === "(poo emoji)"
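Reconstructing that with escape sequences (since HN strips the raw emoji):<p><pre><code>const poo  = '\u{1F4A9}';
const five = poo.repeat(5);

console.log(five.length);          // 10 (UTF-16 code units)
console.log([...five].length);     // 5  (codepoints, via spread)
console.log([...five][1] === poo); // true
</code></pre>Note this only works because each poo emoji is a single codepoint; spread would still split a ZWJ sequence or a flag into its components.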
lodash's toArray and split both support emoji, with good unit tests. I also wrote emoji-aware for this purpose:<p><a href="https://www.npmjs.com/package/emoji-aware" rel="nofollow">https://www.npmjs.com/package/emoji-aware</a>
> Sometimes, I think people come up with these names just to add excitement to their lives.<p>Let's get outta here guys, we've been rumbled!
Unicode is fucked. All these bullshit emojis remind me of the 1980s when ASCII was 7 bits but every computer manufacturer (Atari, Commodore, Apple, IBM, TI, etc...) made their own set of characters for the 128 values of a byte beyond ASCII. Of course Unicode is a global standard so your pile-of-poop emoji will still be a pile-of-poop on every device even if the amount of steam is different for some people.<p>It's beyond me why this is happening. Who decides which bullshit symbols get into the standard anyway?