This is why UTF-8 is great. If it works for any Unicode character, it will work for them all. Surrogate pairs are rare enough that they are poorly tested. With UTF-8, if there are issues with multi-byte characters, they are obvious enough to get fixed.<p>UTF-16 is not a very good encoding. It only exists for legacy reasons. It has the same major drawback as UTF-8 (variable-length encoding) but none of the benefits (ASCII compatibility, size efficiency).
Apropos: <a href="http://mathiasbynens.be/notes/javascript-encoding" rel="nofollow">http://mathiasbynens.be/notes/javascript-encoding</a><p>TL;DR:<p>- Javascript engines are free to internally represent strings as either UCS-2 or UTF-16. Engines that choose to go UCS-2 tend to replace all glyphs outside of the BMP with the replacement char (U+FFFD). Firefox, IE, Opera, and Safari all do this (with some inconsistencies).<p>- <i>However</i>, from the point of view of the actual JS code that gets executed, strings are always UCS-2 (sort of). In UTF-16, code points outside the BMP are encoded as surrogate pairs (4 bytes). But -- if you have a Javascript string that contains such a character, it will be treated as two consecutive 2-byte characters.<p><pre><code> var x = '𝌆';
x.length; // 2
x[0]; // \uD834
x[1]; // \uDF06
</code></pre>
Note that if you insert said string into the DOM, it will still render correctly (you'll see a single character instead of two ?s).
Sometimes you need to know about encodings, even if you're just a consumer. Putting just one non-7-bit character in your SMS message will silently change its encoding from 7-bit (160 chars) to 8-bit (140 chars) or even 16-bit (70 chars), which might make the phone split it into multiple chunks. The resulting chunks are billed as separate messages.
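A rough sketch of that billing math. The per-segment limits (160/140/70, dropping to 153/67 for multipart headers) are from the GSM standards; the GSM-7 detection below is a crude printable-ASCII approximation, not the real GSM alphabet:

```javascript
// Rough sketch: estimate how many SMS parts a message will be billed as.
// GSM-7 fits 160 chars in a single SMS; UCS-2 only 70. Multipart messages
// lose space to a concatenation header, so each part holds 153 or 67.
function smsSegments(text) {
  const gsm7 = /^[\x20-\x7E\n\r]*$/.test(text); // crude stand-in for the GSM-7 alphabet
  const single = gsm7 ? 160 : 70;
  const perPart = gsm7 ? 153 : 67;
  return text.length <= single ? 1 : Math.ceil(text.length / perPart);
}

smsSegments('a'.repeat(160));              // 1
smsSegments('a'.repeat(158) + '\u{1F604}'); // 3 — one emoji triples the cost
```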
The quick summary, for people who don't like ignoring all those = signs, is that V8 uses UCS-2 internally to represent strings, and therefore can't handle Unicode characters which lie outside the Basic Multilingual Plane -- including Emoji.
If you search for V8 UCS-2 you'll find a lot of discussion on this issue dating back at least a few years. There are ways to work around V8's lack of support for surrogate pairs. See this V8 issue for ideas: <a href="https://code.google.com/p/v8/issues/detail?id=761" rel="nofollow">https://code.google.com/p/v8/issues/detail?id=761</a><p>My question is why does V8 (or anything else) still use UCS-2?
Took me a bit to realize that this is talking about the Voxer iOS app (<a href="http://voxer.com/" rel="nofollow">http://voxer.com/</a>), not Github (<a href="https://github.com/blog/816-emoji" rel="nofollow">https://github.com/blog/816-emoji</a>).
<i>>Wow, you read though all of that? You rock. I'm humbled that you gave me so much of your attention.</i><p>That was actually really fun to read, even as a now non-technical guy. I can't put a finger on it, but there was something about his style that gave off a really friendly vibe even through all the technical jargon. That's a definite skill!
This is dated January 2012. By the looks of things, this was fixed in March 2012[1]<p>[1] <a href="https://code.google.com/p/v8/issues/detail?id=761#c33" rel="nofollow">https://code.google.com/p/v8/issues/detail?id=761#c33</a>
Please, if you're going to post text to a Gist at least use the .md extension:<p><a href="https://gist.github.com/4151124" rel="nofollow">https://gist.github.com/4151124</a>
A couple of reasons why it makes sense for V8 and other vendors to use UCS2:<p>- The spec says UCS2 or UTF16. Those are the only options.<p>- UCS2 allows random access to characters, UTF-16 does not.<p>- Remember how the JS engines were fighting for speed on arbitrary benchmarks, and nobody cared about anything else for 5 years? UCS2 helps string benchmarks be fast!<p>- Changing from UCS2 to UTF-16 might "break the web", something browser vendors hate (and so do web developers)<p>- Java was UCS2. Then Java 5 changed to UTF-16. Why didn't JS change to UTF-16? Because a Java VM only has to run one program at once! In JS, you can't specify a version, an encoding, and one engine has to run everything on the web. No migration path to other encodings!
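The random-access point above can be sketched in JavaScript itself (using `for..of`, which postdates this discussion): with fixed-width UCS-2, `s[n]` is O(1); once astral characters occupy two code units, finding the n-th <i>code point</i> requires a linear scan.

```javascript
// Finding the n-th code point in a UTF-16 string is O(n): you must scan,
// because an astral character occupies two code units.
function nthCodePoint(s, n) {
  let i = 0;
  for (const ch of s) {          // for..of iterates by code point, not code unit
    if (i++ === n) return ch;
  }
  return undefined;
}

nthCodePoint('a\u{1D306}b', 1); // '𝌆' — whereas s[1] is a lone high surrogate
```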
We seem to be seeing this more and more with Node-based applications. It's a symptom of the platform being too immature. This is why you shouldn't adopt these sorts of stacks unless there's some feature they provide that none of the more mature stacks support yet. And even then, you should probably ask yourself if you really need that feature.
They control their clients, so they could've just re-encoded emoji with a custom escaping scheme, made the backend transparently relay it in escaped form, and decoded it back to full code points at the other end.<p>Or am I missing something obvious here?
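A hypothetical sketch of such a scheme (not Voxer's actual protocol, and `codePointAt`/`String.fromCodePoint` postdate this thread): astral characters become ASCII-safe tokens that a UCS-2-only pipeline can relay untouched.

```javascript
// Hypothetical escaping scheme: replace each surrogate pair with a literal
// "\u{1f604}"-style ASCII token before it enters the UCS-2-only backend.
function escapeAstral(s) {
  return s.replace(/[\uD800-\uDBFF][\uDC00-\uDFFF]/g,
    pair => '\\u{' + pair.codePointAt(0).toString(16) + '}');
}

// Inverse: turn the tokens back into real characters on the receiving client.
function unescapeAstral(s) {
  return s.replace(/\\u\{([0-9a-f]+)\}/g,
    (_, hex) => String.fromCodePoint(parseInt(hex, 16)));
}

unescapeAstral(escapeAstral('hi \u{1F604}')) === 'hi \u{1F604}'; // round-trips
```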
Small nitpick, but Objective-C does not require a particular string encoding internally. In Mac OS and iOS, NSString uses one of the cfinfo flags to specify whether the internal representation is UTF-16 or ASCII (as a space-saving mechanism).
The specific problems the author describes don't seem to be present today; perhaps they were fixed. That's not to say these conversions aren't a source of issues, just that I don't see any show-stopper problems currently in Node, V8, or JavaScript.<p>In JavaScript, a string is a series of UTF-16 code units, so the smiley face is written '\ud83d\ude04'. This string has length 2, not 1, and behaves like a length-2 string as far as regexes, etc., which is too bad. But even though you don't get the character-counting APIs you might want, the JavaScript engine knows this is a surrogate pair and represents a single code point (character). (It just doesn't do much with this knowledge.)<p>You can assign '\ud83d\ude04' to document.body.innerHTML in modern Chrome, Firefox, or Safari. In Safari you get a nice Emoji; in stock Chrome and Firefox, you don't, but the empty space is selectable and even copy-and-pastable as a smiley! So the character is actually there, it just doesn't render as a smiley.<p>The bug that may have been present in V8 or Node is: what happens if you take this length-2 string and write it to a UTF8 buffer, does it get translated correctly? Today, it does.<p>What if you put the smiley directly into a string literal in JS source code, not \u-escaped? Does that work? Yes, in Chrome, Firefox, and Safari.
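The buffer round-trip described above is easy to check in current Node:

```javascript
// Node sketch: the surrogate pair round-trips through a UTF-8 buffer intact.
const smiley = '\ud83d\ude04';  // U+1F604 as two UTF-16 code units
console.log(smiley.length);     // 2 — counted in code units, not characters
const buf = Buffer.from(smiley, 'utf8');
console.log(buf.length);        // 4 — one 4-byte UTF-8 sequence, not two 3-byte halves
console.log(buf.toString('utf8') === smiley); // true — translated correctly
```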
The UCS-2 heritage is kind of annoying. In Java, for example, chars (the primitive type, which the Character class just wraps) are 16 bits. So one instance of a Character may not be a full "character" but rather half of a surrogate pair. This creates a small gotcha where the length of a string might not equal the number of characters it contains. And you can't split/splice a char array naively (because you might split it in the middle of a surrogate pair).
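JavaScript inherits exactly the same gotcha from its UTF-16 model; a quick sketch (the `\u{...}` literal and spread-by-code-point are post-ES5 syntax):

```javascript
// Same gotcha in JavaScript: length counts code units, and a naive split
// can land in the middle of a surrogate pair.
const s = 'a\u{1D306}b';      // 'a' + '𝌆' (U+1D306, astral) + 'b'
console.log(s.length);        // 4 — code units, not characters
console.log(s.slice(0, 2));   // 'a\uD834' — strands a lone high surrogate
console.log([...s].length);   // 3 — the string iterator walks whole code points
```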
Maybe nitpicking, but I don't think Softbank came up with emoji. Emoji existed well before Softbank bought the Japanese Vodafone, and even before Vodafone bought J-Phone.<p>So emoji were probably invented at J-Phone, while Softbank was mostly taking care of Yahoo Japan.
Here's the thread in the v8 bug tracker about this issue: <a href="http://code.google.com/p/v8/issues/detail?id=761" rel="nofollow">http://code.google.com/p/v8/issues/detail?id=761</a><p>Is there a reason that the workaround in comment 8 won't address some of these issues?
Somewhat meta, but this is a case where showing the subdomain on HN submissions would be nice. The title is vague enough that I assumed it was something to do with _Github_ not processing Emoji (which would be sort of a strange state of affairs...).
I love this article. So often it has been difficult to explain to people why one set of characters can work while others will not. This lays out some great historical info that will be helpful going forward.
UCS-2 is only used by programs which jumped the gun and implemented Unicode before it was all done. (It was 16 bits for a while, with Asian languages sharing code points so that the font in use determined whether the text was displayed as Chinese vs. Japanese vs. etc.) What century was V8 written in that they thought UCS-2 was an acceptable thing to implement?<p>Good rule of thumb for implementers: get over it and use 32 bits internally. Always use UTF-8 when encoding into a byte stream. Add UTF-16 encoding if you must interface with archaic libraries.
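The "always UTF-8 on the wire" rule is small enough to sketch by hand; this follows the standard UTF-8 bit layout for code points up to U+10FFFF (in real code you'd use a library or Buffer, of course):

```javascript
// Sketch: encode one code point (up to 21 bits) into UTF-8 bytes by hand.
// 1 byte  for U+0000..U+007F, 2 for ..U+07FF, 3 for ..U+FFFF, 4 above that.
function utf8Encode(cp) {
  if (cp < 0x80)    return [cp];
  if (cp < 0x800)   return [0xC0 | cp >> 6, 0x80 | cp & 0x3F];
  if (cp < 0x10000) return [0xE0 | cp >> 12, 0x80 | (cp >> 6) & 0x3F,
                            0x80 | cp & 0x3F];
  return [0xF0 | cp >> 18, 0x80 | (cp >> 12) & 0x3F,
          0x80 | (cp >> 6) & 0x3F, 0x80 | cp & 0x3F];
}

utf8Encode(0x1F604); // [0xF0, 0x9F, 0x98, 0x84] — the smiley, one 4-byte sequence
```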
Failures in Unicode support usually seem to result from the standard's persistently shortsighted design (well intentioned and carefully considered though it undoubtedly is). It's a "good enough" solution to a very difficult problem, but I wonder if we won't see Unicode supplanted in the next decade.<p>All that aside: emoji should not be in Unicode. Full stop.
Here's the message decoded from quoted-printable:
<a href="https://gist.github.com/4151707#file_emoji_sad_decoded.txt" rel="nofollow">https://gist.github.com/4151707#file_emoji_sad_decoded.txt</a>
Wow, the first half of the text is basically full of crap and claims which don't even remotely match reality, and now I'm reaching the technical section which can only get even more wrong.
A two-character sequence for a smiley face that should be compatible with everything in existence:<p><pre><code> :)
</code></pre>
Problem solved. Why is this front page material (#6 as of this writing)?