The Unicode standard describes in Annex 29 [1] how to properly split strings into grapheme clusters. And here [2] is a JavaScript implementation. This is a solved problem.<p>[1] <a href="http://www.unicode.org/reports/tr29/" rel="nofollow">http://www.unicode.org/reports/tr29/</a><p>[2] <a href="https://github.com/orling/grapheme-splitter" rel="nofollow">https://github.com/orling/grapheme-splitter</a>
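To sketch what that looks like in practice: newer JS engines also ship UAX #29 segmentation built in, as Intl.Segmenter, so you don't even need the library. This assumes a runtime with Intl.Segmenter support (Node 16+ or a recent browser); emoji are written as escape sequences since HN strips them:<p><pre><code>// Count user-perceived characters (grapheme clusters) per UAX #29
// using the built-in Intl.Segmenter, an alternative to the
// grapheme-splitter library linked above.
const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });

function countGraphemes(str) {
  // Each segment produced by the segmenter is one grapheme cluster.
  return [...segmenter.segment(str)].length;
}

// U+1F469 U+200D U+2764 U+200D U+1F48B U+200D U+1F469
// ("woman kissing woman" as a ZWJ sequence)
const kiss = '\u{1F469}\u{200D}\u{2764}\u{200D}\u{1F48B}\u{200D}\u{1F469}';
console.log(countGraphemes(kiss)); // 1 grapheme cluster
console.log([...kiss].length);     // 7 codepoints
</code></pre>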
The thing that frustrates me the most about Unicode emoji is the astounding number of combining characters. For combining characters in written languages, you can do an NFC normalization and, with moderate success, get a 1 codepoint = 1 grapheme mapping, but "Emoji 2.0" introduced some ridiculous emoji compositions with the ZWJ character.<p>To use the author's example:<p>woman - 1 codepoint<p>black woman - 2 codepoints, woman + dark Fitzpatrick modifier<p>️woman kissing woman - 7 codepoints, woman + ZWJ + heart + ZWJ + kissy lips + ZWJ + woman<p>It's like composing Mayan pictographs, except you have to include an invisible character in between each component.<p>Here's another fun one: country flags. Unicode has special characters 🇱 🇮 🇰 🇪 🇹 🇭 🇮 🇸 that you can combine into country codes to create a flag. 🇰+🇷 = 🇰🇷<p>edit: looks like HN strips emoji? Changed the emoji in the example into English words. They are all supposed to render as a single "character".
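The codepoint counts above can be checked directly; here's a sketch using escape sequences (again, because HN strips raw emoji). The ES6 spread operator iterates by codepoint, not by UTF-16 code unit:<p><pre><code>const woman      = '\u{1F469}';
const blackWoman = '\u{1F469}\u{1F3FF}'; // woman + dark Fitzpatrick modifier
// woman + ZWJ + heart + ZWJ + kissy lips + ZWJ + woman
const kiss       = '\u{1F469}\u{200D}\u{2764}\u{200D}\u{1F48B}\u{200D}\u{1F469}';
const flagKR     = '\u{1F1F0}\u{1F1F7}'; // regional indicators K + R

console.log([...woman].length);      // 1
console.log([...blackWoman].length); // 2
console.log([...kiss].length);       // 7
console.log([...flagKR].length);     // 2 codepoints, rendered as one flag
</code></pre>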
Before emoji, fonts and colors were independent. Combining the two creates a mess. Try using emoji in an editor with syntax coloring. We got into this because some people thought that single-color emoji were racist.[1] So now there are five skin tone options. The no-option case is usually rendered as bright yellow, which comes from the old AOL client. They got it from the happy-face icon of the 1970s.<p>Here's the current list of valid emoji, including upcoming ones being added in the next revision.[2]<p>A reasonable test for passwords is to run them through an IDNA checker, which checks whether a string is acceptable as a domain name component. This catches most weird stuff, such as mixed left-to-right and right-to-left symbols, zero-width markers, homoglyphs, and emoji.<p>[1] <a href="https://www.washingtonpost.com/news/the-intersect/wp/2015/02/24/are-apples-new-yellow-face-emoji-racist/?utm_term=.ec65e2f8ef2d" rel="nofollow">https://www.washingtonpost.com/news/the-intersect/wp/2015/02...</a>
[2] <a href="http://unicode.org/emoji/charts-beta/full-emoji-list.html" rel="nofollow">http://unicode.org/emoji/charts-beta/full-emoji-list.html</a>
There are multiple ways of counting "length" of a string. Number of UTF-8 bytes, number of UTF-16 code units, number of codepoints, number of grapheme clusters. These are all distinct yet valid concepts of "length."<p>For the purpose of allocating buffers, I can see the obvious use in knowing number of bytes, UTF-16 code units, or the number of codepoints. I also see the use in being able to iterate through grapheme clusters, for instance for rendering a fragment of text, or for parsing. Perhaps someone can shed light on a compelling use case for knowing the number of grapheme clusters in a particular string, because I haven't been able to think of one.<p>I'm not sure about calculating password lengths: if the point is entropy, the number of bytes seems good enough to me!<p>The password field bug is possibly compelling, but I don't think it's obvious what a password field <i>should</i> do. Should it represent keystrokes? Codepoints? Grapheme clusters? Ligatures? Replace all the glyphs with bullets during font rendering?<p>(Similarly, perhaps someone could explain why they think reversing a string should be a sensible operation. That this is hard to do is something I occasionally hear echoing around the internet. The best I've heard is that you can reuse the default forward lexicographic ordering on reversed strings for a use I've forgotten.)
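For a concrete illustration, here are three of those four lengths for a single pile-of-poop emoji (U+1F4A9) in Node.js — all different, all "correct":<p><pre><code>const s = '\u{1F4A9}';

console.log(Buffer.byteLength(s, 'utf8')); // 4 -- UTF-8 bytes
console.log(s.length);                     // 2 -- UTF-16 code units
console.log([...s].length);                // 1 -- codepoints
// Grapheme clusters: also 1 here, but in general that
// requires full UAX #29 segmentation, not a simple count.
</code></pre>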
If you want to do Unicode correctly, you shouldn't ask for the "length" of a string. There is no true definition of length. If you want to know how many bytes it uses in storage, ask for that. If you want to know how wide it will be on the screen, ask for that. Do not iterate over strings character by character.
> The current largest codepoint? Why that would be a cheese wedge at U+1F9C0. How did we ever communicate before this?<p>Sounds cute, but inaccurate.<p>If we count the last two planes that are reserved for private use (aka, applications/users can use them for whatever domain problems they like), that would be U+10FFFD.<p>If we count the variation selector codepoints (used for things like selecting the emoji or text presentation of certain characters), U+E01EF.<p>If we count the last honestly-for-real-written-language character assigned, it would be 𪘀 U+2FA1D CJK COMPATIBILITY IDEOGRAPH-2FA1D.<p>But I suppose none of that sounds as fun as an emoji (which are really a very small part of the Unicode standard).
Tom Scott did a nice YouTube video related to this: <a href="https://www.youtube.com/watch?v=sTzp76JXsoY" rel="nofollow">https://www.youtube.com/watch?v=sTzp76JXsoY</a>
This appears to be a rehash of what Mathias Bynens was talking about a few years ago.<p><a href="http://vimeo.com/76597193" rel="nofollow">http://vimeo.com/76597193</a><p><a href="https://mathiasbynens.be/notes/javascript-unicode" rel="nofollow">https://mathiasbynens.be/notes/javascript-unicode</a>
I've gone through exactly the same discovery process when implementing faux stamps (something between images and Emoji) in my xmpp app yesterday.<p>My idea was to increase the font size of a message that only consists of Emoji, depending on the number of Emoji in the message, like this:<p><a href="https://xmpp.pix-art.de/imagehost/display/file/2017-03-09_09-36-09_r8m468so4vh7.jpg" rel="nofollow">https://xmpp.pix-art.de/imagehost/display/file/2017-03-09_09...</a><p>The code turned out more complex than first expected, mirroring the same problems OP encountered:<p><a href="https://github.com/ge0rg/yaxim/blob/gradle/src/org/yaxim/androidclient/util/XMPPHelper.java#L66-L93" rel="nofollow">https://github.com/ge0rg/yaxim/blob/gradle/src/org/yaxim/and...</a>
The Zero-Width-Joiner allows for some really strange things: <a href="https://blog.emojipedia.org/ninja-cat-the-windows-only-emoji/" rel="nofollow">https://blog.emojipedia.org/ninja-cat-the-windows-only-emoji...</a>.<p>One can basically achieve an unlimited number of emojis by concatenating the current ones.
I ran into this 2 years ago on Swift when I was creating an emojified version of Twitter. I wanted to ensure that each message sent had at least 1 emoji and I quickly realized that validating a string with 1 emoji was not as simple as:<p><pre><code> if (lastString.characters.count == 2) {
// pseudo code to allow string and activate send button
}
</code></pre>
This was the app I was working on [1]; code is finished, but I'm not launching it (probably ever). The whole emoji length piece was quite frustrating because my assumption of character counting went right out of the window when I had people testing the app in Test Flight.<p>[1] - <a href="https://joeblau.com/emo/" rel="nofollow">https://joeblau.com/emo/</a>
> I have no idea if there’s a good reason for the name “astral plane.” Sometimes, I think people come up with these names just to add excitement to their lives.<p><a href="https://en.wikipedia.org/wiki/Plane_(esotericism)#The_Planes" rel="nofollow">https://en.wikipedia.org/wiki/Plane_(esotericism)#The_Planes</a>
The issue doesn't really seem to be the emojis, but rather the variation sequences. They seem really awkward to work with, though I can sort of see why they're necessary. But the fact that we need special libraries to answer fairly basic queries about Unicode text doesn't bode well.
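To make the awkwardness concrete, here's a sketch of one variation sequence: U+2764 HEAVY BLACK HEART alone typically gets a text presentation, while U+2764 followed by U+FE0F (VARIATION SELECTOR-16) gets the red emoji presentation. Same character to a reader, different codepoint counts to a program, and the selector even survives normalization:<p><pre><code>const textHeart  = '\u2764';
const emojiHeart = '\u2764\uFE0F'; // heart + VARIATION SELECTOR-16

console.log([...textHeart].length);  // 1
console.log([...emojiHeart].length); // 2
// NFC normalization does not strip the variation selector,
// so the two strings never compare equal:
console.log(textHeart === emojiHeart.normalize('NFC')); // false
</code></pre>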
I see your 2 and raise you 2:<p>"(this is a color-hued hand from Apple that doesn't render on HN)".length == 4<p>I ran into the length==2 bug when truncating some text, it led to errors trying to url encode a string :)<p>The author's `fancyCount2` still returns a size of 2 for these kinds of emoji, but I'm not too surprised.
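The length==4 case is reproducible with escapes (HN strips the actual emoji): a skin-toned hand is two astral codepoints, hence four UTF-16 code units, and a codepoint-based counter still reports 2:<p><pre><code>// thumbs up (U+1F44D) + Fitzpatrick type-4 modifier (U+1F3FD)
const hand = '\u{1F44D}\u{1F3FD}';

console.log(hand.length);      // 4 -- UTF-16 code units
console.log([...hand].length); // 2 -- codepoints; counting by
                               // codepoint still gives 2 here
</code></pre>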
I think the article "A Programmer's Introduction to Unicode" that was shared here recently is a good read and explains Unicode well.<p><a href="https://news.ycombinator.com/item?id=13790575" rel="nofollow">https://news.ycombinator.com/item?id=13790575</a>
Just ran into this yesterday when I discovered that an emoji character wouldn't fit into Rust's `char` type. I just changed the type to `&'static str` but I still wish there was a single `grapheme` type or something like that.
In Go (adding the package clause and imports the snippet needs to compile):<p><pre><code> package main

 import "fmt"
 import "unicode/utf8"

 func main() {
   shit := "\U0001f4a9"
   fmt.Printf("len of %s is %d\n", shit, utf8.RuneCountInString(shit))
 }
</code></pre>
</code></pre>
$ len of 💩 is 1<p>Though I can't say that this is all that intuitive either...
Just going to leave this link here: <a href="https://mathiasbynens.be/notes/javascript-unicode" rel="nofollow">https://mathiasbynens.be/notes/javascript-unicode</a>
If this interests you, read the source of Java's AbstractStringBuilder.reverse(). It's interesting and very short. I'm not sure it can deal with ZWJ emoji sequences, though.
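The same pitfall is easy to demonstrate in JS: reversing by UTF-16 code unit (split('')) tears surrogate pairs apart, while reversing by codepoint survives astral characters but would still scramble ZWJ sequences and combining marks:<p><pre><code>const s = 'a\u{1F4A9}b';

// split('') yields UTF-16 code units, so the surrogate pair
// for U+1F4A9 gets reversed into garbage:
const naive = s.split('').reverse().join('');

// Spread yields codepoints, so single astral characters survive:
const byCodepoint = [...s].reverse().join('');

console.log(byCodepoint === 'b\u{1F4A9}a'); // true
console.log(naive === 'b\u{1F4A9}a');       // false: lone surrogates
</code></pre>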
Here are my 2 cents: you can decompose a Unicode string with the ES6 spread operator:<p>[..."(insert 5 poo emoji here)"].length === 5<p>[..."(insert 5 poo emoji here)"][1] === "(poo emoji)"
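Reconstructing that with escape sequences (since HN strips the raw emoji):<p><pre><code>const poo  = '\u{1F4A9}';
const five = poo.repeat(5);

console.log(five.length);          // 10 (UTF-16 code units)
console.log([...five].length);     // 5  (codepoints, via spread)
console.log([...five][1] === poo); // true
</code></pre>Note this only works because each poo emoji is a single codepoint; spread would still split a ZWJ sequence or a flag into its components.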
lodash's toArray and split both support emoji, with good unit tests. I also wrote emoji-aware for this purpose:<p><a href="https://www.npmjs.com/package/emoji-aware" rel="nofollow">https://www.npmjs.com/package/emoji-aware</a>
> Sometimes, I think people come up with these names just to add excitement to their lives.<p>Let's get outta here guys, we've been rumbled!
Unicode is fucked. All these bullshit emojis remind me of the 1980s when ASCII was 7 bits but every computer manufacturer (Atari, Commodore, Apple, IBM, TI, etc...) made their own set of characters for the 128 values of a byte beyond ASCII. Of course Unicode is a global standard so your pile-of-poop emoji will still be a pile-of-poop on every device even if the amount of steam is different for some people.<p>It's beyond me why this is happening. Who decides which bullshit symbols get into the standard anyway?