Emoji.length == 2

279 points by stanzheng about 8 years ago

30 comments

danbruc about 8 years ago
The Unicode standard describes in Annex 29 [1] how to properly split strings into grapheme clusters. And here [2] is a JavaScript implementation. This is a solved problem.

[1] http://www.unicode.org/reports/tr29/

[2] https://github.com/orling/grapheme-splitter
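A minimal sketch of grapheme-cluster splitting in JavaScript, for illustration only. It assumes a runtime that ships the built-in `Intl.Segmenter` (for example Node.js 16 or later) and does not use the grapheme-splitter library linked above; the helper name `graphemes` is made up for this example.

```javascript
// Split a string into user-perceived characters (extended grapheme clusters).
// Assumption: Intl.Segmenter is available in the runtime.
function graphemes(text) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return Array.from(segmenter.segment(text), (part) => part.segment);
}

const poo = "\u{1F4A9}";            // one astral code point, two UTF-16 code units
const flag = "\u{1F1F0}\u{1F1F7}";  // regional indicators K + R
const accented = "e\u0301";         // e + combining acute accent

console.log(poo.length, graphemes(poo).length);           // 2 1
console.log(flag.length, graphemes(flag).length);         // 4 1
console.log(accented.length, graphemes(accented).length); // 2 1
```

How longer ZWJ sequences are grouped depends on the Unicode data bundled with the runtime, so results for newer emoji may differ between versions.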
darkengine about 8 years ago
The thing that frustrates me the most about Unicode emoji is the astounding number of combining characters. For combining characters in written languages, you can do an NFC normalization and, with moderate success, get a 1 codepoint = 1 grapheme mapping, but "Emoji 2.0" introduced some ridiculous emoji compositions with the ZWJ character.

To use the author's example:

woman - 1 codepoint

black woman - 2 codepoints, woman + dark Fitzpatrick modifier

woman kissing woman - 7 codepoints, woman + ZWJ + heart + ZWJ + kissy lips + ZWJ + woman

It's like composing Mayan pictographs, except you have to include an invisible character in between each component.

Here's another fun one: country flags. Unicode has special characters 🇱 🇮 🇰 🇪 🇹 🇭 🇮 🇸 that you can combine into country codes to create a flag. 🇰+🇷 = 🇰🇷

edit: looks like HN strips emoji? Changed the emoji in the example into English words. They are all supposed to render as a single "character".
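The compositions described above can be rebuilt from individual code points. This is a sketch, not a reference: the constant names are invented for the example, and the kiss sequence follows the seven-code-point description in the comment. The sequence listed in the Unicode emoji data additionally places U+FE0F after the heart, which would make eight.

```javascript
// Code points named in the comment (constant names are illustrative).
const WOMAN = 0x1f469;
const DARK_SKIN = 0x1f3ff;  // Fitzpatrick type-6 modifier
const ZWJ = 0x200d;         // zero width joiner
const HEART = 0x2764;
const KISS_MARK = 0x1f48b;
const REGIONAL_K = 0x1f1f0;
const REGIONAL_R = 0x1f1f7;

const samples = {
  woman: String.fromCodePoint(WOMAN),
  darkWoman: String.fromCodePoint(WOMAN, DARK_SKIN),
  kiss: String.fromCodePoint(WOMAN, ZWJ, HEART, ZWJ, KISS_MARK, ZWJ, WOMAN),
  flagKR: String.fromCodePoint(REGIONAL_K, REGIONAL_R),
};

// Array.from iterates a string by code point, not by UTF-16 code unit.
function codePointCount(text) {
  return Array.from(text).length;
}

for (const [name, text] of Object.entries(samples)) {
  console.log(name, codePointCount(text), text.length);
}
// woman 1 2
// darkWoman 2 4
// kiss 7 10
// flagKR 2 4
```

The second number in each row is what `.length` reports, which is why none of these "single characters" has length 1 in JavaScript.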
Animats about 8 years ago
Before emoji, fonts and colors were independent. Combining the two creates a mess. Try using emoji in an editor with syntax coloring. We got into this because some people thought that single-color emoji were racist.[1] So now there are five skin tone options. The no-option case is usually rendered as bright yellow, which comes from the old AOL client. They got it from the happy-face icon of the 1970s.

Here's the current list of valid emoji, including upcoming ones being added in the next revision.[2]

A reasonable test for passwords is to run them through an IDNA checker, which checks whether a string is acceptable as a domain name component. This catches most weird stuff, such as mixed left-to-right and right-to-left symbols, zero-width markers, homoglyphs, and emoji.

[1] https://www.washingtonpost.com/news/the-intersect/wp/2015/02/24/are-apples-new-yellow-face-emoji-racist/?utm_term=.ec65e2f8ef2d

[2] http://unicode.org/emoji/charts-beta/full-emoji-list.html
kmill about 8 years ago
There are multiple ways of counting "length" of a string. Number of UTF-8 bytes, number of UTF-16 code units, number of codepoints, number of grapheme clusters. These are all distinct yet valid concepts of "length."

For the purpose of allocating buffers, I can see the obvious use in knowing number of bytes, UTF-16 code units, or the number of codepoints. I also see the use in being able to iterate through grapheme clusters, for instance for rendering a fragment of text, or for parsing. Perhaps someone can shed light on a compelling use case for knowing the number of grapheme clusters in a particular string, because I haven't been able to think of one.

I'm not sure about calculating password lengths: if the point is entropy, the number of bytes seems good enough to me!

The password field bug is possibly compelling, but I don't think it's obvious what a password field *should* do. Should it represent keystrokes? Codepoints? Grapheme clusters? Ligatures? Replace all the glyphs with bullets during font rendering?

(Similarly, perhaps someone could explain why they think reversing a string should be a sensible operation. That this is hard to do is something I occasionally hear echoing around the internet. The best I've heard is that you can reuse the default forward lexicographic ordering on reversed strings for a use I've forgotten.)
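The four notions of "length" listed above can be computed side by side. A rough sketch, assuming a runtime with the standard `TextEncoder` and `Intl.Segmenter` globals; the function name `lengths` and the sample string are made up for illustration.

```javascript
// Report four different "lengths" of the same string.
function lengths(text) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return {
    utf8Bytes: new TextEncoder().encode(text).length,      // storage size in UTF-8
    utf16Units: text.length,                               // what .length reports
    codePoints: Array.from(text).length,                   // Unicode code points
    graphemes: Array.from(segmenter.segment(text)).length, // user-perceived characters
  };
}

// e + combining acute accent, followed by a pile-of-poo emoji
const sample = "e\u0301\u{1F4A9}";
console.log(lengths(sample));
// utf8Bytes: 7, utf16Units: 4, codePoints: 3, graphemes: 2
```

Which of the four is the right one depends on the question being asked: buffer size, API offsets, or what the user sees.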
TorKlingberg about 8 years ago
If you want to do Unicode correctly, you shouldn't ask for the "length" of a string. There is no true definition of length. If you want to know how many bytes it uses in storage, ask for that. If you want to know how wide it will be on the screen, ask for that. Do not iterate over strings character by character.
chungy about 8 years ago
> The current largest codepoint? Why that would be a cheese wedge at U+1F9C0. How did we ever communicate before this?

Sounds cute, but inaccurate.

If we count the last two planes that are reserved for private use (aka, applications/users can use them for whatever domain problems they like), that would be U+10FFFD.

If we count the variation selector codepoints (used for things like changing skin tone, or the look of certain other characters), U+E01EF.

If we count the last honestly-for-real-written-language character assigned, it would be 𪘀 U+2FA1D CJK COMPATIBILITY IDEOGRAPH-2FA1D.

But I suppose none of that sounds as fun as an emoji (which are really a very small part of the Unicode standard).
zach417 about 8 years ago
Tom Scott did a nice YouTube video related to this: https://www.youtube.com/watch?v=sTzp76JXsoY
teknologist about 8 years ago
This appears to be a rehash of what Mathias Bynens was talking about a few years ago.

http://vimeo.com/76597193

https://mathiasbynens.be/notes/javascript-unicode
ge0rg about 8 years ago
I've gone through exactly the same discovery process when implementing faux stamps (something between images and Emoji) in my xmpp app yesterday.

My idea was to increase the font size of a message that only consists of Emoji, depending on the number of Emoji in the message, like this:

https://xmpp.pix-art.de/imagehost/display/file/2017-03-09_09-36-09_r8m468so4vh7.jpg

The code turned out more complex than first expected, mirroring the same problems OP encountered:

https://github.com/ge0rg/yaxim/blob/gradle/src/org/yaxim/androidclient/util/XMPPHelper.java#L66-L93
mhils about 8 years ago
The Zero-Width-Joiner allows for some really strange things: https://blog.emojipedia.org/ninja-cat-the-windows-only-emoji/.

One can basically achieve an unlimited number of emojis by concatenating the current ones.
joeblau about 8 years ago
I ran into this 2 years ago on Swift when I was creating an emojified version of Twitter. I wanted to ensure that each message sent had at least 1 emoji and I quickly realized that validating a string with 1 emoji was not as simple as:

```swift
if (lastString.characters.count == 2) {
    // pseudo code to allow string and activate send button
}
```

This was the app I was working on [1]; code is finished, but I'm not launching it (probably ever). The whole emoji length piece was quite frustrating because my assumption of character counting went right out of the window when I had people testing the app in Test Flight.

[1] - https://joeblau.com/emo/
hwc about 8 years ago
How can that entire article never mention the term UTF-16?
tantalor about 8 years ago
> I have no idea if there’s a good reason for the name “astral plane.” Sometimes, I think people come up with these names just to add excitement to their lives.

https://en.wikipedia.org/wiki/Plane_(esotericism)#The_Planes
openasocket about 8 years ago
The issue doesn't really seem to be the emojis, but rather the variation sequences, which seem to be really awkward to work with, but I can sort of see why they're necessary. But the fact that we need special libraries to answer fairly basic queries about Unicode text doesn't bode well.
codezero about 8 years ago
I see your 2 and raise you 2:

"(this is a color-hued hand from Apple that doesn't render on HN)".length == 4

I ran into the length==2 bug when truncating some text, it led to errors trying to url encode a string :)

The author's `fancyCount2` still returns a size of 2 for these kinds of emoji, but I'm not too surprised.
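A sketch of how truncation can lead to a URL-encoding error, and one way around it. The stripped hand emoji is assumed here to be a waving hand (U+1F44B) plus a medium skin tone modifier (U+1F3FD); the helper `truncateByCodePoints` is hypothetical and not taken from the article.

```javascript
// Assumed stand-in for the stripped emoji: hand + skin tone modifier.
const hand = "\u{1F44B}\u{1F3FD}";
console.log(hand.length); // 4

const text = "ok\u{1F4A9}"; // 3 code points, 4 UTF-16 code units

// Slicing by code unit cuts the surrogate pair in half...
const broken = text.slice(0, 3);
let failed = false;
try {
  encodeURIComponent(broken); // ...and a lone surrogate cannot be encoded
} catch (error) {
  failed = error instanceof URIError;
}
console.log(failed); // true

// Truncate on code point boundaries instead of code unit boundaries.
function truncateByCodePoints(input, maxCodePoints) {
  return Array.from(input).slice(0, maxCodePoints).join("");
}

const safe = truncateByCodePoints(text, 3);
console.log(encodeURIComponent(safe)); // ok%F0%9F%92%A9
```

Cutting on code point boundaries avoids the encoding error but can still separate a base emoji from its modifier; cutting on grapheme cluster boundaries would be needed to keep those together.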
sorenjan about 8 years ago
I think the article "A Programmer's Introduction to Unicode" that was shared here recently is a good read and explains Unicode well.

https://news.ycombinator.com/item?id=13790575
pc2g4d about 8 years ago
Just ran into this yesterday when I discovered that an emoji character wouldn't fit into Rust's `char` type. I just changed the type to `&'static str` but I still wish there was a single `grapheme` type or something like that.
gtrubetskoy about 8 years ago
In Go:

```go
package main

import "fmt"
import "unicode/utf8"

func main() {
	shit := "\U0001f4a9"
	// RuneCountInString counts code points, not bytes.
	fmt.Printf("len of %s is %d\n", shit, utf8.RuneCountInString(shit))
}
```

$ len of 💩 is 1

Though I can't say that this is all that intuitive either...
remx about 8 years ago
Just going to leave this link here: https://mathiasbynens.be/notes/javascript-unicode
Traubenfuchs about 8 years ago
If this interests you, read the source of Java's AbstractStringBuilder.reverse(). It's interesting and very short. I am not sure it can deal with multi-emoji-emoji though.
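For comparison, here is a JavaScript sketch of three ways to reverse a string. It is not a port of the Java source; the function names are made up, and the grapheme version assumes `Intl.Segmenter` is available. Java's documentation says `reverse()` treats surrogate pairs as single characters, which corresponds roughly to the code point version here.

```javascript
// Reverse by UTF-16 code unit: splits surrogate pairs.
function reverseNaive(text) {
  return text.split("").reverse().join("");
}

// Reverse by code point: keeps astral characters intact.
function reverseByCodePoint(text) {
  return Array.from(text).reverse().join("");
}

// Reverse by grapheme cluster: keeps combining sequences intact.
function reverseByGrapheme(text) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return Array.from(segmenter.segment(text), (part) => part.segment)
    .reverse()
    .join("");
}

const withEmoji = "a\u{1F4A9}b";
console.log(reverseNaive(withEmoji) === "b\u{1F4A9}a");       // false
console.log(reverseByCodePoint(withEmoji) === "b\u{1F4A9}a"); // true

const withAccent = "e\u0301x"; // e + combining acute accent, then x
console.log(reverseByCodePoint(withAccent) === "x\u0301e");   // true: the accent moves onto the x
console.log(reverseByGrapheme(withAccent) === "xe\u0301");    // true: the accent stays on the e
```

Whether a reversal by code point handles a "multi-emoji-emoji" depends on the sequence: anything joined by ZWJ or followed by a modifier is reordered unless the reversal works on grapheme clusters.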
xem about 8 years ago
Here are my 2 cents: you can decompose a Unicode string with the ES6 spread operator:

[..."(insert 5 poo emoji here)"].length === 5

[..."(insert 5 poo emoji here)"][1] === "(poo emoji)"
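The same idea written with escape sequences so that it survives emoji stripping. A small sketch; the `family` sequence (woman + ZWJ + girl) is an added assumption to show where spreading stops helping.

```javascript
const fivePoo = "\u{1F4A9}".repeat(5);

console.log(fivePoo.length);                    // 10
console.log([...fivePoo].length);               // 5
console.log([...fivePoo][1] === "\u{1F4A9}");   // true

// Spreading iterates code points, so a ZWJ sequence still comes apart.
const family = String.fromCodePoint(0x1f469, 0x200d, 0x1f467);
console.log([...family].length);                // 3
```

So the spread operator fixes the surrogate pair problem but not the grapheme cluster problem.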
lsv1 about 8 years ago
As a developer dealing with the encoding of user input made in UTF-8 into legacy systems which only support ASCII... I prefer this.
rsmets about 8 years ago
(U+200B), zero width space, should be outlawed... got me good a couple years ago! Had to do a hexdump to see what was going on.
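A hexdump is not the only way to spot such characters. A minimal sketch that lists code points in U+ notation and flags a few zero-width ones; the function name and the `INVISIBLE` pattern are hypothetical, and the pattern is deliberately short rather than exhaustive.

```javascript
// Render every code point of a string as U+XXXX so invisible ones show up.
function toCodePointNotation(text) {
  return Array.from(text, (ch) =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  );
}

// Illustrative, non-exhaustive set: ZWSP, ZWNJ, ZWJ, word joiner, BOM.
const INVISIBLE = /[\u200B-\u200D\u2060\uFEFF]/u;

const suspicious = "ab\u200Bc"; // looks like "abc" on screen
console.log(toCodePointNotation(suspicious).join(" "));
// U+0061 U+0062 U+200B U+0063
console.log(INVISIBLE.test(suspicious)); // true
console.log(INVISIBLE.test("abc"));      // false
```

Detecting is safer than blindly stripping: U+200D is the joiner that holds ZWJ emoji sequences together, so removing it changes what the user typed.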
beaugunderson about 8 years ago
lodash's toArray and split both support emoji, with good unit tests. I also wrote emoji-aware for this purpose:

https://www.npmjs.com/package/emoji-aware
nutbutter about 8 years ago
The golf course flag equals one, obviously, because it's a hole-in-one. :)
jtymann about 8 years ago
Makes me wonder whether or not that should be considered a bug.
TheRealPomax about 8 years ago
But the real question is why he needed password length constraints instead of password strength constraints...
marichards about 8 years ago
create table twitter(tweet varchar(? ... that's it, I give up, time to become an Uber driver
wcummings about 8 years ago
> Sometimes, I think people come up with these names just to add excitement to their lives.

Let's get outta here guys, we've been rumbled!
phkahler about 8 years ago
Unicode is fucked. All these bullshit emojis remind me of the 1980s when ASCII was 7 bits but every computer manufacturer (Atari, Commodore, Apple, IBM, TI, etc...) made their own set of characters for the 128 values of a byte beyond ASCII. Of course Unicode is a global standard so your pile-of-poop emoji will still be a pile-of-poop on every device even if the amount of steam is different for some people.

It's beyond me why this is happening. Who decides which bullshit symbols get into the standard anyway?