Emoji Under the Hood

430 points by kogir about 4 years ago

28 comments

lifthrasiir about 4 years ago
&gt; One weird inconsistency I’ve noticed is that hair color is done via ZWJ, while skin tone is just modifier emoji with no joiner. Why? Seriously, I am asking you: why? I have no clue.<p>Mainly because skin tone modifiers [1] predate the ZWJ mechanism [2]. For hair colors there were two contending proposals [3] [4], one of which doesn&#x27;t use ZWJ, and the ZWJ proposal was accepted because new modifiers (as opposed to ZWJ sequences) needed the architectural change [5].<p>[1] <a href="https:&#x2F;&#x2F;www.unicode.org&#x2F;L2&#x2F;L2014&#x2F;14213-skin-tone-mod.pdf" rel="nofollow">https:&#x2F;&#x2F;www.unicode.org&#x2F;L2&#x2F;L2014&#x2F;14213-skin-tone-mod.pdf</a><p>[2] <a href="https:&#x2F;&#x2F;www.unicode.org&#x2F;L2&#x2F;L2015&#x2F;15029r-zwj-emoji.pdf" rel="nofollow">https:&#x2F;&#x2F;www.unicode.org&#x2F;L2&#x2F;L2015&#x2F;15029r-zwj-emoji.pdf</a><p>[3] <a href="https:&#x2F;&#x2F;www.unicode.org&#x2F;L2&#x2F;L2017&#x2F;17082-natural-hair-color.pdf" rel="nofollow">https:&#x2F;&#x2F;www.unicode.org&#x2F;L2&#x2F;L2017&#x2F;17082-natural-hair-color.pd...</a><p>[4] <a href="https:&#x2F;&#x2F;www.unicode.org&#x2F;L2&#x2F;L2017&#x2F;17193-hair-colour-proposal.pdf" rel="nofollow">https:&#x2F;&#x2F;www.unicode.org&#x2F;L2&#x2F;L2017&#x2F;17193-hair-colour-proposal....</a><p>[5] <a href="https:&#x2F;&#x2F;www.unicode.org&#x2F;L2&#x2F;L2017&#x2F;17283-response-hair.pdf" rel="nofollow">https:&#x2F;&#x2F;www.unicode.org&#x2F;L2&#x2F;L2017&#x2F;17283-response-hair.pdf</a>
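To make the difference described above concrete, here is a minimal sketch that builds one sequence of each kind. The variable names and the `codepoints` helper are illustrative only; whether a sequence renders as a single glyph depends on the font and platform.

```python
ZWJ = "\u200d"  # ZERO WIDTH JOINER

woman = "\U0001F469"        # WOMAN
medium_skin = "\U0001F3FD"  # EMOJI MODIFIER FITZPATRICK TYPE-4
red_hair = "\U0001F9B0"     # EMOJI COMPONENT RED HAIR

# Skin tone: the modifier directly follows the base emoji, no joiner.
skin_toned = woman + medium_skin

# Hair colour: the component is attached through a ZWJ sequence.
red_haired = woman + ZWJ + red_hair


def codepoints(text):
    """List the code points of `text` in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]


print(codepoints(skin_toned))
print(codepoints(red_haired))
```

The modifier form is one code point shorter, which is consistent with the point above that it was designed before ZWJ sequences existed.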
devadvance about 4 years ago
Fantastic post that builds up knowledge along the way. A fun case where this type of knowledge was relevant: when creating emoji short links with a couple characters (symbols), I made sure to snag both URLs: one with the emoji (codepoint + `U+FE0F`) and one with just the symbol codepoint.<p>Another thing worth calling out: you can get involved in emoji creation and Unicode in general. You can do this directly, or by working with groups like Emojination [0].<p>[0] <a href="http:&#x2F;&#x2F;www.emojination.org&#x2F;" rel="nofollow">http:&#x2F;&#x2F;www.emojination.org&#x2F;</a>
rkangel about 4 years ago
The article is great, but there is one slightly misleading bit at the start:<p>&gt; The most popular encoding we use is called Unicode, with the two most popular variations called UTF-8 and UTF-16.<p>Unicode is a list of codepoints - the characters talked about in the rest of the article. These live in a number space that&#x27;s very big (~2^21 as discussed).<p>You can talk about these codepoints in the abstract as this article does, but at some point you need to put them in a computer - store them on disk or transmit them over a network connection. To do this you need a way to make a stream of bytes store a series of Unicode codepoints. This is an &#x27;encoding&#x27;; UTF-8, UTF-16, UTF-32 etc. are different encodings.<p>UTF-32 is the simplest and most &#x27;obvious&#x27; encoding to use. 32 bits is more than enough to represent every codepoint, so just use a 32-bit value to represent each codepoint, and keep them in a big array. This has a lot of value in simplicity, but it means that text ends up taking up a lot of space. Most western text (e.g. this page) fits in the first 128 codepoints, so for the majority of values most of the bits will be 0.<p>UTF-16 is an abomination that is largely Microsoft&#x27;s fault and is the default Unicode encoding on Windows. It is based on the fact that most text in most languages fits in the first 65,536 Unicode codepoints - referred to as the &#x27;Basic Multilingual Plane&#x27;. This means that you can use a 16-bit value to represent most codepoints, so Unicode is stored as an array of 16-bit values (&quot;wide strings&quot; in MS APIs). Obviously not <i>all</i> Unicode values fit in, so there is the capability to use two 16-bit values (a surrogate pair) to represent a codepoint. There are many problems with UTF-16, but my favourite is that it really helps you to have &#x27;unicode surprises&#x27; in your code. Something in your stack that assumes single-byte characters and barfs on higher Unicode values is a well-known problem, and you find it in testing fairly often. Because UTF-16 uses a single value for the vast majority of normal codepoints, it makes that worse: the failure only happens in a very small number of cases that you will inevitably only discover in production.<p>UTF-8 is generally agreed to be the best encoding (particularly among people who don&#x27;t work for Microsoft). It is a fully variable-length encoding, so a single codepoint can take 1, 2, 3 or 4 bytes. It has lots of nice properties, but one is that codepoints that are &lt;= 127 encode using a single byte. This means that proper ASCII is valid UTF-8.
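The size trade-offs described above can be checked with a short sketch. The helper name `encoded_sizes` and the sample characters are chosen here for illustration; the `-le` codec variants are used only so that Python does not prepend a byte-order mark to the output.

```python
# Bytes needed per code point in each encoding.
ENCODINGS = ("utf-8", "utf-16-le", "utf-32-le")


def encoded_sizes(text):
    """Return the number of bytes `text` occupies in each encoding."""
    return {enc: len(text.encode(enc)) for enc in ENCODINGS}


# ASCII letter, Latin-1 letter, BMP symbol, and an emoji outside the BMP.
for ch in ("A", "\u00e9", "\u20ac", "\U0001F600"):
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```

Only the last sample needs two 16-bit units in UTF-16, which is the rarely exercised path the comment warns about.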
vanderZwan about 4 years ago
&gt; <i>Flags don’t have dedicated codepoints. Instead, they are two-letter ligatures. (...) There are 258 valid two-letter combinations. Can you find them all?</i><p>Well this nerd-sniped me pretty hard<p><a href="https:&#x2F;&#x2F;next.observablehq.com&#x2F;@jobleonard&#x2F;which-unicode-flags-are-reversible" rel="nofollow">https:&#x2F;&#x2F;next.observablehq.com&#x2F;@jobleonard&#x2F;which-unicode-flag...</a><p>That was a fun little exercise, but enough time wasted, back to work.
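For readers who want to try the same exercise, a minimal sketch of how a two-letter code maps to a flag sequence follows. The function name `flag` is hypothetical, and the code does not check whether a given pair is one of the valid combinations; an unassigned pair simply produces two regional indicator symbols that a font will probably show as letters.

```python
REGIONAL_INDICATOR_A = 0x1F1E6  # REGIONAL INDICATOR SYMBOL LETTER A


def flag(country_code):
    """Map a two-letter code such as "US" to its regional indicator pair."""
    code = country_code.upper()
    if len(code) != 2 or not code.isascii() or not code.isalpha():
        raise ValueError("expected two ASCII letters")
    return "".join(
        chr(REGIONAL_INDICATOR_A + ord(letter) - ord("A")) for letter in code
    )


def reversed_flag(country_code):
    """Build the sequence for the same two letters in reverse order."""
    return flag(country_code[::-1])


print(flag("us"), reversed_flag("us"))
```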
aglionby about 4 years ago
Great post, entertainingly written.<p>Back in 2015, Instagram did a blog post on similar challenges they came across implementing emoji hashtags [1]. Spoiler alert: they programmatically constructed a huge regex to detect them.<p>[1] <a href="https:&#x2F;&#x2F;instagram-engineering.com&#x2F;emojineering-part-ii-implementing-hashtag-emoji-7b653b221c82" rel="nofollow">https:&#x2F;&#x2F;instagram-engineering.com&#x2F;emojineering-part-ii-imple...</a>
truefossil about 4 years ago
I wonder why Mediterranean nations switched from ideograms to alphabet as soon as one was invented. Probably they did not have enough surplus grain to feed something like the Unicode consortium?
peteretep about 4 years ago
An excellent article, although:<p>&gt; “Ü” is a single grapheme cluster, even though it’s composed of two codepoints: U+0055 UPPER-CASE U followed by U+0308 COMBINING DIAERESIS.<p>would be a great opportunity to talk about normal form, because there’s also a single code point version: “latin capital letter u with diaeresis”.
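The two spellings of “Ü” mentioned above, and the normalization forms that convert between them, can be sketched with Python's standard `unicodedata` module. The variable names are illustrative.

```python
import unicodedata

decomposed = "U\u0308"  # LATIN CAPITAL LETTER U + COMBINING DIAERESIS
composed = "\u00dc"     # LATIN CAPITAL LETTER U WITH DIAERESIS

# The two strings look the same but compare unequal code point by code point.
print(decomposed == composed)

# NFC composes where a precomposed code point exists; NFD decomposes.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)
print(nfc == composed, nfd == decomposed)
```

Normalizing both sides to the same form before comparing is the usual way to treat the two spellings as equal.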
breck about 4 years ago
I thought I knew Emoji, but there was a lot I didn’t know. Thank you, a very enjoyable and enlightening read. Also, “dingbats”! I’ve rarely seen that word since I was a kid (when I had no idea what that voodoo was but loved it).
BlueGh0st about 4 years ago
I wish I could read this without getting a migraine. The &quot;darkmode&quot; joke was funny until I realized there was no actual way to turn it on.
Hawzen about 4 years ago
&gt; The most popular encoding we use is called Unicode<p>Unicode is a character set, not an encoding. UTF-8, UTF-16, etc. are encodings of that character set.
Robizzle01 about 4 years ago
Love it, thanks for writing it up.<p>Regarding Windows and flags, I heard it was a geopolitical issue. Basically, to support flag emoji you’d have to decide whether or not to recognize some states (e.g. Taiwan) which can anger other states. Not sure if that’s the real reason or not.<p>A couple questions I still have: 1. Why make flags multiple code points when there’s plenty of unused address space to assign a single code point? 2. Any entertaining backstories regarding platform specific non-standard emoji, such as Windows ninja cat (<a href="https:&#x2F;&#x2F;emojipedia.org&#x2F;ninja-cat&#x2F;" rel="nofollow">https:&#x2F;&#x2F;emojipedia.org&#x2F;ninja-cat&#x2F;</a>)? Why would they use those code points rather than ? 3. Is it possible to modify Windows to render emoji using Apple’s font (or a modified Segoe that looks like Apple’s)? 4. Which emoji look the most different depending on platform? Are there any that cause miscommunication? 5. Do any glyphs render differently based on background color, e.g. dark mode?
woko about 4 years ago
&gt; Unicode allocates 2²¹ (~2 mil) characters called codepoints. Sorry, programmers, but it’s not a multiply of 8 .<p>Why would 2^21 not be a multiple of 2^3?
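As a quick arithmetic check on the question above, the number itself does factor cleanly:

```latex
2^{21} = 2^{3} \cdot 2^{18} = 8 \times 262144 = 2097152
```

So 2^21 is a multiple of 8. One possible reading of the quoted sentence, offered here as an assumption rather than something the quote states, is that it refers to the exponent: 21 bits do not fill a whole number of 8-bit bytes, since 21 = 2 × 8 + 5.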
mojuba about 4 years ago
Can someone explain what the rules are for substring(m, n), given all the madness that&#x27;s today&#x27;s Unicode? Is it standardized, or is it up to the implementations?
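As one data point for the question above, here is a sketch of what a single implementation does. Python's `str` indexes by code point, so slicing can cut through a grapheme cluster; other languages differ (JavaScript strings, for example, index UTF-16 code units). The helper `utf16_units` is hypothetical and only counts how many units such an implementation would see.

```python
text = "U\u0308ber"               # "Über" written with a combining diaeresis
flag_us = "\U0001F1FA\U0001F1F8"  # two regional indicator symbols

# len() counts code points, not user-perceived characters.
print(len(text), len(flag_us))

first = text[:1]         # keeps "U" but leaves the combining mark behind
half_flag = flag_us[:1]  # a lone regional indicator, no longer a flag


def utf16_units(s):
    """Count the 16-bit code units `s` would occupy in UTF-16."""
    return len(s.encode("utf-16-le")) // 2


print(utf16_units(flag_us))
```

Slicing on grapheme cluster boundaries needs segmentation rules on top of this, which Python's built-in `str` does not apply.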
mannerheim about 4 years ago
&gt; Currently they are used for these three flags only: England, Scotland and Wales:<p>Not quite true, you can get US state flags with this as well.
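The mechanism being discussed can be sketched as follows: a black flag, then one tag character per letter of a region code, then a cancel tag. The function name `subdivision_flag` is hypothetical, and the code `ustx` for Texas is an assumed example; whether any sequence beyond England, Scotland and Wales actually renders as a flag depends on the platform's font.

```python
BLACK_FLAG = "\U0001F3F4"  # WAVING BLACK FLAG
CANCEL_TAG = "\U000E007F"  # CANCEL TAG, terminates the sequence
TAG_BASE = 0xE0000         # tag characters mirror ASCII at this offset


def subdivision_flag(region_code):
    """Build an emoji tag sequence for a code such as "gbeng"."""
    code = region_code.lower()
    if not (code.isascii() and code.isalnum()):
        raise ValueError("expected ASCII letters and digits only")
    tags = "".join(chr(TAG_BASE + ord(ch)) for ch in code)
    return BLACK_FLAG + tags + CANCEL_TAG


england = subdivision_flag("gbeng")
texas = subdivision_flag("ustx")  # assumed code, may not render as a flag
print([f"U+{ord(ch):04X}" for ch in england])
```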
bewuethr about 4 years ago
This is a really nice overview!<p>I have one nit about an omission: in addition to the emoji presentation selector, FE0F, which forces &quot;presentation as emoji&quot;, there&#x27;s also the text presentation selector, FE0E, which does the opposite [1].<p>The Emoji_Presentation property [2] determines when either is required; code points with both an emoji and a text presentation and the property set to &quot;Yes&quot; default to emoji presentation without a selector and require FE0E for text presentation; code points with the property set to &quot;No&quot; default to text presentation and require FE0F for emoji presentation.<p>There&#x27;s a list [3] with all emoji that have two presentations, and the first three rows of the Default Style Values table [4] shows which emoji default to which style.<p>[1]: <a href="https:&#x2F;&#x2F;unicode.org&#x2F;reports&#x2F;tr51&#x2F;#Emoji_Variation_Sequences" rel="nofollow">https:&#x2F;&#x2F;unicode.org&#x2F;reports&#x2F;tr51&#x2F;#Emoji_Variation_Sequences</a><p>[2]: <a href="http:&#x2F;&#x2F;unicode.org&#x2F;reports&#x2F;tr51&#x2F;#Emoji_Properties_and_Data_Files" rel="nofollow">http:&#x2F;&#x2F;unicode.org&#x2F;reports&#x2F;tr51&#x2F;#Emoji_Properties_and_Data_F...</a><p>[3]: <a href="https:&#x2F;&#x2F;unicode.org&#x2F;emoji&#x2F;charts&#x2F;emoji-variants.html" rel="nofollow">https:&#x2F;&#x2F;unicode.org&#x2F;emoji&#x2F;charts&#x2F;emoji-variants.html</a><p>[4]: <a href="https:&#x2F;&#x2F;unicode.org&#x2F;emoji&#x2F;charts-13.0&#x2F;emoji-style.html" rel="nofollow">https:&#x2F;&#x2F;unicode.org&#x2F;emoji&#x2F;charts-13.0&#x2F;emoji-style.html</a>
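A minimal sketch of the two selectors described above, using the gear symbol as the base character. The helper names are illustrative, and the code only builds the sequences; which presentation a platform actually draws is up to its fonts and renderer.

```python
VS15 = "\ufe0e"  # VARIATION SELECTOR-15, requests text presentation
VS16 = "\ufe0f"  # VARIATION SELECTOR-16, requests emoji presentation

GEAR = "\u2699"  # GEAR, a code point with both presentations


def as_text(ch):
    """Append the text presentation selector to a base character."""
    return ch + VS15


def as_emoji(ch):
    """Append the emoji presentation selector to a base character."""
    return ch + VS16


def strip_selectors(s):
    """Remove both selectors, e.g. to treat the variants as one key."""
    return s.replace(VS15, "").replace(VS16, "")


print(as_text(GEAR), as_emoji(GEAR))
```

Something like `strip_selectors` is one way to make the bare symbol and its emoji variant resolve to the same entry, which is the situation behind the two short-link URLs mentioned earlier in the thread.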
artur_makly about 4 years ago
What I really want to know is the story behind how these emoji came to be. Who was tasked with coming up with this sticker list of symbols? What was the decision&#x2F;strategy behind the selection of these symbols? Etc. It seems soooo arbitrary at first glance.<p>And how do we as a community propose new icons while considering others to be removed&#x2F;replaced?
ijidak about 4 years ago
This is eye opening. So many frustrations I&#x27;ve had with emoji over the years are explained by this post.<p>Big thank you to the OP.
yuntei about 4 years ago
And now, to see how emoji rendering is completely broken: put the text variant and the emoji variant of a gear (U+2699) in some HTML, set the font to Menlo in one element and Monaco in another, then view it in Chrome, desktop Safari, and iOS Safari. Also select and right-click it in Chrome, and maybe paste it into the comment section of various websites. Every combination of text variant and emoji variant will be displayed seemingly at random :)
tomduncalf about 4 years ago
Really interesting and well written (and entertaining!) post. I was vaguely aware of most of it but hadn’t appreciated how the ZWJ system for more complex emojis made up of basic ones means the meaning can be discerned even if your device doesn’t support the new emoji. Clever approach!
zimpenfish about 4 years ago
Apropos of nothing, macOS 11.3 beta and iOS 14.5 do support the combined emojis near the bottom - instead of &lt;heart&gt;&lt;fire&gt;, I actually get &lt;flaming heart&gt; as expected.
MrGilbert about 4 years ago
Reading about the 2 million codepoints: Is there a good set of open-source licensed fonts which cover as many codepoints as possible? Just curiosity, no real use case at the moment. I don&#x27;t think it would make sense to create one huge font for this, right?
z3t4 about 4 years ago
Related: implementing Emoji support in a text editor: <a href="https:&#x2F;&#x2F;xn--zta-qla.com&#x2F;en&#x2F;blog&#x2F;editor10.htm" rel="nofollow">https:&#x2F;&#x2F;xn--zta-qla.com&#x2F;en&#x2F;blog&#x2F;editor10.htm</a>
avipars about 4 years ago
Really interesting article. Why haven&#x27;t platforms banned Ų̷̡̡̨̫͍̟̯̣͎͓̘̱̖̱̣͈͍̫͖̮̫̹̟̣͉̦̬̬͈͈͔͙͕̩̬̐̏̌̉́̾͑̒͌͊͗́̾̈̈́̆̅̉͌̋̇͆̚̚̚͠ͅ or figured out a way to parse&#x2F;contain the character to its container?
kaeruct about 4 years ago
I&#x27;m confused about the part saying flags don&#x27;t work on Windows because I can see them on Firefox (on Windows). They don&#x27;t work on Edge though.
itsmeamario about 4 years ago
Great quality post. I&#x27;d like to see more things like this on HN. Interesting and I learnt a lot about emojis and UTF.
mshenfield about 4 years ago
It&#x27;s a post about emojis, but I feel like I understand Unicode better now?
imtiyaz about 4 years ago
I&#x27;ve never gone into these nitty-gritty details before. Very well explained. Thanks Nikita.
remux about 4 years ago
Great post!