Why can't you reverse a string with a flag emoji?

189 points by da12 over 3 years ago

58 comments

coreyp_1 over 3 years ago

If you think the Unicode flag emoji take a lot of bytes, then consider the family emoji! (https://unicode.org/emoji/charts/full-emoji-list.html#family)

I'm in the process of designing a scripting language and implementing it in C++. I plan to put together a YouTube series about it. (Doesn't everyone want to see Bison and Flex mixed with proper unit tests and C++20 code?)

Due to my future intended use case, I needed good support for Unicode. I thought that I could write it myself, and I was wrong. I wasted two weeks (in my spare time, mostly evenings) trying to cobble together things that should work, identifying patterns, figuring out how to update it as Unicode itself is updated, thinking about edge cases, i18n, zalgo text, etc. And then I finally reached the point where I knew enough to know that I was making the wrong choice.

I'm now using ICU (https://icu.unicode.org/). It's huge, it was hard to get it working in my environment, and there are very few examples of its usage online, but after the initial setup dues are paid, it WORKS.

Aside: Yes, I know I'm crazy for implementing a programming language that I intend for serious usage. Yes, I have good reasons for doing it, and yes I have considered alternatives. But it's fun, so I'm doing it anyways.

Moral of the story: Dealing with Unicode is hard, and if you think it shouldn't be that hard, then you probably don't know enough about the problem!

WA9ACE over 3 years ago

I feel like I'm obligated to share this almost 20 year old Spolsky post that gave me my understanding of characters.

https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/

zarzavat over 3 years ago

This reminds me of an interesting bug I saw where I was seeing a strange flag in some Arabic text. However when I copied the string and pasted it into a text editor, the flag of Saudi Arabia appeared instead (which made much more sense). After some vexillologic research on Wikipedia I identified the original flag as American Samoa and it suddenly all made sense. Turns out some broken RTL support was flipping the SA into AS at presentation.

yoyohello13 over 3 years ago

Maybe I'm missing some prerequisite knowledge here, but why would I assume `flag="us"` is an emoji? Looking at that first block of code, there is no reason for me to think "us" is a single character.

Edit: Turns out my browser wasn't rendering the flags.

emodendroket over 3 years ago

What I'd like to know is, given the explosion of the character set for emoji, does the rationale for Han unification still make sense? The case for not allowing national variants seems less and less compelling with every emoji they add.

This is a bit of a hobby horse, but imagine if every time you read an article in English on your phone some of the letters were replaced with "equivalent" Greek or Cyrillic ones and you can get an idea of the annoyance. Yeah, you can still read it with a bit of thought, but who wants to read that way?

treesknees over 3 years ago

But you can, and did, reverse a string. It seems you would need more details, such as a request to reverse the meaning or interpretation of the string, which is what the author is getting at.

If someone challenges you to reverse an image, what do you do? Do you invert the colors? Mirror horizontally? Mirror vertically? Just reverse the byte order?

jerf over 3 years ago

So, in terms of acing interviews, increasingly one of the best answers to the question "Write some code that reverses a string" is that in a world of Unicode, "reversing a string" is no longer possible or meaningful.

You'll probably be told "oh, assume US ASCII" or something, but in the meantime, if you can back that up when they dig into it, you'll look really smart.

dhosek over 3 years ago

On the challenge front, there are things like á which might be a single code point or two code points (a+´). Then there are the really challenging things like ᾷ where, if the components are individual characters, the order of ͺ and ῀ is not guaranteed to be consistent.
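
A minimal Python sketch of the two spellings, using only the standard library's `unicodedata` module. The variable names are illustrative, and the combining marks are assumed to be U+0301 (acute), U+0342 (perispomeni) and U+0345 (ypogegrammeni).

```python
import unicodedata

precomposed = "\u00e1"   # á as a single code point
decomposed = "a\u0301"   # a followed by COMBINING ACUTE ACCENT

print(len(precomposed), len(decomposed))                         # 1 2
print(precomposed == decomposed)                                 # False
print(unicodedata.normalize("NFC", decomposed) == precomposed)   # True

# Two orderings of the marks on Greek alpha. As raw strings they differ,
# but normalization puts the marks into one canonical order.
order_a = "\u03b1\u0342\u0345"
order_b = "\u03b1\u0345\u0342"
same_after_nfd = (
    unicodedata.normalize("NFD", order_a) == unicodedata.normalize("NFD", order_b)
)
print(order_a == order_b, same_after_nfd)
```

Normalizing both sides before comparing (or before reversing) is the usual way to make the "one code point or two" question stop mattering.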

otagekki over 3 years ago

If flag emojis are really a combination of 2 special characters, the reversal of the U.S. flag should result in having the Soviet Union flag.

chrismorgan over 3 years ago

UTF-8 does not represent Unicode code points, but rather Unicode scalar values. The difference between the two is surrogates, the way that UTF-16 ruined Unicode: code points are 0₁₆ to 10FFFF₁₆, scalar values are 0₁₆ to D7FF₁₆ and E000₁₆ to 10FFFF₁₆. Yes, the author quoted Wikipedia, but Wikipedia is wrong on this point; surprisingly comprehensively wrong: the UTF-8 page completely ignores the distinction, and even the page on code points doesn’t mention scalar values! This error propagates to other places, too: for example, “and there are a total of 1,112,064 possible code points”: no, that’s how many scalar values there are; code points also include the 2,048 surrogates, so there are 1,114,112 code points.
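
The arithmetic behind those two totals can be checked in a few lines of Python. This is only a sketch of the counting; the constant names are made up for illustration.

```python
# Code points run from U+0000 to U+10FFFF; surrogates are U+D800 to U+DFFF.
CODE_POINTS = 0x10FFFF + 1
SURROGATES = 0xDFFF - 0xD800 + 1
SCALAR_VALUES = CODE_POINTS - SURROGATES

print(CODE_POINTS, SURROGATES, SCALAR_VALUES)   # 1114112 2048 1112064

# A Python str can hold a lone surrogate code point,
# but the UTF-8 codec refuses to encode it.
lone_surrogate = "\ud800"
try:
    lone_surrogate.encode("utf-8")
    encodable = True
except UnicodeEncodeError:
    encodable = False
print(encodable)   # False
```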

Mesopropithecus over 3 years ago

Unfortunately the HN text input won't let me do this, but a funny starter for the article would have been this:

'(Spanish flag)'[::-1]

basically ''.join([chr(127466), chr(127480)]) vs. ''.join([chr(127466), chr(127480)])[::-1]

I'll add this to my collection of party tricks and show myself out.

Cool article!
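
Since the flags themselves don't survive here, a short Python sketch that prints the code point names instead of the glyphs. The variable names are illustrative; the two numbers are the ones from the comment.

```python
import unicodedata

spain = "".join([chr(127466), chr(127480)])   # regional indicators E + S
flipped = spain[::-1]                         # regional indicators S + E

for label, text in (("original", spain), ("reversed", flipped)):
    print(label, [unicodedata.name(ch) for ch in text])
```

The reversed pair spells SE, which is why the Spanish flag turns into the Swedish one on systems that render flags.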

Crazyontap over 3 years ago

This section of the linked Wikipedia article (1) is quite amazing on how the family emoji is rendered using a zero-width joiner.

(1) https://en.wikipedia.org/wiki/Emoji#Joining

edit: forgot HN doesn't render emojis. Better read it directly on Wikipedia I guess.

tl over 3 years ago

This is a nice dive into limitations in Python's Unicode handling and, at the end, how to work around some problems. But you could use languages with proper Unicode support like Swift or Elixir (weirdly HN is fighting flags in comment code, which makes examples harder to demonstrate).

happytoexplain over 3 years ago

I guessed that it would become the USSR flag (US -> SU), but apparently Unicode doesn't define that one! I wonder why. That would have been humorous.

progbits over 3 years ago

Semi-related (about length of emoji "characters", not reversing): https://hsivonen.fi/string-length/

Previously discussed:

https://news.ycombinator.com/item?id=20914184

https://news.ycombinator.com/item?id=26591373

As for this article & Python - as usual it is biasing towards convenience and implicit behavior rather than properly handling all edge cases.

Compare with Rust where you can't "reverse" a string - that is not a defined operation. But you can either break it into a sequence of characters or graphemes and then reverse that, with expected results: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=6335bb2c5cacf308ed5190deaba0121f

(Sadly the grapheme segmentation is not part of the standard library, at least yet)

jiveturkey over 3 years ago

Interesting article. Written for beginners, conversationally. Has excessive amounts of whitespace, for "readability" I guess. But at the same time it dives quite deep, and I don't think this "style" of presentation matches up with the amount of time a more novice reader is going to devote to a single long-form article.

As to the content, for all the deep dive, a simple link to https://unicode.org/reports/tr51/#Flags and what an emoji is would have saved so much exposition. I also wish he'd touched on normalization. With the amount of time he's demanding from readers he could have mentioned this important subject. Because then he could discuss why (starting from his emoji example) a-grave (à) might or might not be reversible, depending on how the character is composed.

Also wish he'd pointed to some libraries that can do such reversals.

utopcell over 3 years ago

There are Unicode characters that reverse parsing order themselves. This has been the basis of a code injection attack, analyzed in [1].

[1] "Trojan Source: Invisible Vulnerabilities": https://trojansource.codes/trojan-source.pdf

uniqueuid over 3 years ago

Upper and lower codepoints are really way too obscure and can create issues you didn't even know you had.

I once had the very unpleasant experience of debugging a case where data saved with R on Windows and loaded on macOS ended up with individually double-encoded codepoints.

Not fun.

techwiz137 over 3 years ago

It's pretty funny that reversing the American flag yields Soviet Union (SU).

cmyr over 3 years ago

Something I haven't seen mentioned yet is one of the most annoying things about regional indicator symbols, which is that interpreting them correctly requires arbitrary backtracking, and handling this correctly is very annoying for things like text fields.

Basically: A single, unpaired RIS counts as a single grapheme. Similarly, a pair of RIS count as a single grapheme. Now imagine your cursor position is after an RIS, and you arrow backwards (assuming LTR text, imagine your cursor is to the right of an RIS, and you press the left arrow). Your textbox should now move the cursor to the left by one grapheme. How do you figure out where this is, in code units? You basically have to scan backwards until you find the first non-RIS codepoint, and then you have to match them up into pairs to figure out if your left-arrow movement should correspond to a movement of one or two codepoints.

This is a longstanding source of bugs, and if you're bored you can play around with pasting a huge sequence of flags into a textfield and then trying to navigate around it with the arrow keys/mouse. There are some broken implementations out there.

edit: while I'm thinking about this I will point out that an alternative design, which would have solved this problem (and which was first pointed out to me by @raphlinus) would have been to have two separate sets of RI symbols, one for 'first position' and one for 'second position'; then you could always determine the appropriate cursor position without needing context. Isn't hindsight a wonderful thing?
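
The backward scan described above can be sketched in Python roughly as follows. This is a simplified illustration under stated assumptions, not a real text-field implementation: `cursor_left` and `is_ris` are hypothetical names, positions are counted in code points (which is how Python indexes `str`) rather than code units, and every non-RIS code point is treated as its own grapheme.

```python
RIS_FIRST, RIS_LAST = 0x1F1E6, 0x1F1FF   # REGIONAL INDICATOR SYMBOL LETTER A..Z


def is_ris(ch: str) -> bool:
    return RIS_FIRST <= ord(ch) <= RIS_LAST


def cursor_left(text: str, pos: int) -> int:
    """Return the cursor index after one left-arrow press from index pos."""
    if pos <= 0:
        return 0
    if not is_ris(text[pos - 1]):
        return pos - 1
    # Scan back to the start of the run of regional indicators.
    run_start = pos
    while run_start > 0 and is_ris(text[run_start - 1]):
        run_start -= 1
    run_len = pos - run_start
    # Pairs are formed from the start of the run, so parity decides whether
    # the indicator just left of the cursor closes a pair or stands alone.
    return pos - 2 if run_len % 2 == 0 else pos - 1


US = "\U0001F1FA\U0001F1F8"
CA = "\U0001F1E8\U0001F1E6"
text = "a" + US + CA + "b"
print(cursor_left(text, 5), cursor_left(text, 3), cursor_left(text, 1))   # 3 1 0
```

Note that the loop has to walk the whole run of indicators before it can answer, which is the arbitrary backtracking the comment complains about.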

sundarurfriend over 3 years ago

Julia docs do a (surprisingly) good job of being clear and explicit about this: the docstring for `reverse(AbstractString)` says:

> Reverses a string. Technically, this function reverses the codepoints in a string and its main utility is for reversed-order string processing [...]. See also [...] `graphemes` from module Unicode to operate on user-visible "characters" (graphemes) rather than codepoints.

Properly reversing a string of flags (or any other grapheme clusters) is just a `using Unicode: graphemes` away.

mappu over 3 years ago

If you like this, you may also like why len(emoji) is still not 1 in Python 3 despite all the Unicode breakage: https://storytime.ivysaur.me/posts/grapheme-clusters/

I do feel like these are all 'gotcha' questions - I haven't seen any real-world requirement to reverse a string and then have it be displayed in a useful way.

qqii over 3 years ago

> Challenge: How would you go about writing a function that reverses a string while leaving symbols encoded as sequences of code points intact? Can you do it from scratch? Is there a package available in your language that can do it for you? How did that package solve the problem?

So are there any good libraries that can deal with code points that are merged together into a single pictograph and reverse them "as expected"?

kevin_thibedeau over 3 years ago

This misses the real problem with flag emoji in that they are composed of codepoints that can be in any order. With other emoji you get a base codepoint with potential combining characters. Using a table of combining character ranges you can skip over them and isolate the logical glyph sequences. You don't need surrounding context to parse them out like flags need.

a_c over 3 years ago

Understanding Unicode would make the question more obvious.

https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/

Waterluvian over 3 years ago

I wish languages did a far better job clearly distinguishing between their operations:

1. You are acting in byte space and it’s pretty unambiguous what should happen. We are not acknowledging the semantics of language and alphabets.

2. You’re acting in language space and these operations will behave the way you probably think they should (depending on your cultural expectations, probably)

Beldin over 3 years ago

Interestingly, on my phone the so-called flag is not a flag at all, but "US" in outline.

So Python behaves as expected: the 2 character string, when reversed, becomes "SU". Similar stuff happens with the other "flag" strings.

I'm sure emojis in my phone are outdated. I'm not sure how that affects whether I see a flag or letters.

codingkev over 3 years ago

Yes, this allows for easy building of flag emojis as long as you know the ISO 3166 two-letter country code.

Example: https://github.com/kennell/flagz/blob/master/flagz.py
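
A minimal Python sketch of the idea (not necessarily how the linked flagz.py does it): each ASCII letter of the country code is shifted into the regional indicator block starting at U+1F1E6. The function name `flag_emoji` is hypothetical, and the code does not check that the two letters are an assigned ISO 3166 code.

```python
REGIONAL_INDICATOR_A = 0x1F1E6   # REGIONAL INDICATOR SYMBOL LETTER A


def flag_emoji(country_code: str) -> str:
    """Map a two-letter code such as 'US' to its pair of regional indicators."""
    if len(country_code) != 2 or not (country_code.isascii() and country_code.isalpha()):
        raise ValueError(f"expected two ASCII letters, got {country_code!r}")
    offset = REGIONAL_INDICATOR_A - ord("A")
    return "".join(chr(ord(letter) + offset) for letter in country_code.upper())


print(flag_emoji("us") == "\U0001F1FA\U0001F1F8")   # True
```

Whether the result shows up as a flag or as two boxed letters depends on the font and platform, not on the string.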

michaelsbradley over 3 years ago

See chapter 7 in Hacking the Planet (with Notcurses) for a short treatment of encodings, extended grapheme clusters, etc.

https://nick-black.com/htp-notcurses.pdf#page53

xmprt over 3 years ago

This is a cool article about Unicode encoding; however, I still feel like it should be possible to reverse strings with flag emojis. I don't see why computers can't handle multi-rune symbols in the same way that they handle multi-byte runes. We could combine all the runes that should be a single symbol and make sure that we're maintaining the ordering of those runes in the reversed string. Of course that means that naive string reversing doesn't work anymore, but naive string reversing wouldn't work in the world of UTF-8 either if we just went byte by byte.

raffy over 3 years ago

Kinda related: I am developing a library for ENS (Ethereum Name Service) name normalization: https://github.com/adraffy/ens-normalize.js

I'm trying to find the best combination of UTS-46, UTS-51, UTS-39, and prior work on IDN resolution w/r/t confusables: https://adraffy.github.io/ens-normalize.js/test/report-confusables.html

Personally, I found the Unicode spec very messy. Critical information is all over the place. You can see the direct effect of this when you compare various packages across different languages and discover that every library disagrees in multiple places. Even JS String.normalize() isn't consistent in the latest version of most browsers: https://adraffy.github.io/ens-normalize.js/test/report-nf.html (fails in Chrome, Safari)

The major difference between ENS and DNS is emoji are front and center. ENS resolves by computing a hash of a name in a canonicalized form. Since resolution must happen decentralized, simply punting to punycode and relying on custom logic for Unicode handling isn't possible. On-chain records are 1:1, so there's no fuzzy matching either. Additionally, ENS is actively registering names, so any improvement to the system must preserve as many names as possible.

At the moment, I'm attempting to improve upon the confusables in the Common/Greek/Latin/Cyrillic scripts, and will combine these new groupings with mixed-script limitations similar to IDN handling in Chromium.

Interactive Demo: https://adraffy.github.io/ens-normalize.js/test/resolver.html

Also this emoji report is pretty cool: https://adraffy.github.io/ens-normalize.js/test/report-emoji.html

jupp0r over 3 years ago

With all the criticism I normally have for Rust, I must say that its type-safe handling of UTF-8 and its unambiguous distinction between byte strings and UTF-8 strings are extremely helpful in handling situations mentioned in the article correctly (and also efficiently).

Yes it's a pain, but the way the standard library designed its types forces you to handle conversions correctly, for example when byte arrays are converted to UTF-8 strings and may contain invalid UTF-8 sequences.

ineedasername over 3 years ago

It's an emoji... Are there any emojis with only one character? My assumption going in would be that any emoji is > 1 character. Admittedly, despite lots of string processing, I never have to deal with emojis so I guess I'm not sure.

An interesting exercise would be emoji detection during string reversal to preserve the original emoji. I thought something like that would be the crux of the article.

Am I wrong about single character emojis?

faebi over 3 years ago

Why reverse them if one barely can implement, display and edit them correctly. I never could make them work perfectly in VIM. Also I had to open a bug in Firefox recently:

Flag emojis and others are displayed in double the size on Windows 10 using Firefox Nightly https://bugzilla.mozilla.org/show_bug.cgi?id=1746795

aidenn0 over 3 years ago

> The answer is: it depends. There isn't a canonical way to reverse a string, at least that I'm aware of.

Unicode defines grapheme clusters[1] that represent "user-perceived characters"; separating a string into those and reversing them seems like a pretty good way to go about it.

1: http://www.unicode.org/reports/tr29/

jug over 3 years ago

I'm not surprised the flag had two components, but I _was_ surprised the US flag was made by literally U and S, haha!

I definitely thought it'd be something like [I am a Flag] and [The flag ID between 0 and 65535]. And reversing it would be [Flag ID] + [I am a Flag] which would not be a defined "component" and instead rendered as the individual two nonsense characters.

nextstep over 3 years ago

Compare all of this nonsense to how it’s done in Swift. String APIs in Swift are great: intuitive and do what you expect.

alfredxing over 3 years ago

Related: I did a deep dive a couple years ago on emoji codepoints and how they're encoded in the Apple emoji font file, with the end goal of extracting the embedded images: https://github.com/alfredxing/emoji

sltkr over 3 years ago

So what was the deal with the Scottish flag?

zwerdlds over 3 years ago

In normal conditions you can check for a ZWJ, but with regional coding chars, you would have to consider the regional chars block as a single char in the reversal. Given that it isn't necessarily locale dependent but presentation layer dependent, there might not be enough info to decide how to act.

codezero over 3 years ago

You also can't URL Encode a string (In JS at least) if you truncate an emoji at the beginning or end of it.

zanzibar735 over 3 years ago

Of course you can reverse a string with a flag emoji. You just need to treat a "string" as a collection of Extended Grapheme Clusters, and then you reverse the order of the EGCs. So if the string is `a<flag unicode bytes>b`, the output should be `b<flag unicode bytes>a`.
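
A rough Python sketch of that approach. Python's standard library has no grapheme segmentation, so the `clusters` helper below is a hand-rolled approximation of the UAX #29 rules, not a full implementation: it only glues on combining marks, variation selectors, skin-tone modifiers, tag characters, whatever follows a ZWJ, and pairs of regional indicators. All function names are made up for illustration; a real project would more likely reach for ICU or a dedicated segmentation library.

```python
import unicodedata

ZWJ = "\u200d"


def is_regional_indicator(ch: str) -> bool:
    return 0x1F1E6 <= ord(ch) <= 0x1F1FF


def extends_previous(ch: str) -> bool:
    """True for code points that attach to whatever precedes them."""
    cp = ord(ch)
    return (
        unicodedata.category(ch) in ("Mn", "Mc", "Me")  # combining marks, variation selectors
        or ch == ZWJ
        or 0x1F3FB <= cp <= 0x1F3FF   # emoji skin-tone modifiers
        or 0xE0020 <= cp <= 0xE007F   # tag characters (used by subdivision flags)
    )


def clusters(text: str) -> list:
    """Split text into approximate grapheme clusters (simplified rules)."""
    out = []
    for ch in text:
        if out and (extends_previous(ch) or out[-1].endswith(ZWJ)):
            out[-1] += ch
        elif (is_regional_indicator(ch) and out
              and len(out[-1]) == 1 and is_regional_indicator(out[-1])):
            out[-1] += ch   # second half of a flag pair
        else:
            out.append(ch)
    return out


def reverse_graphemes(text: str) -> str:
    return "".join(reversed(clusters(text)))


US = "\U0001F1FA\U0001F1F8"
print(reverse_graphemes("a" + US + "b") == "b" + US + "a")   # True
```

Because pairing always starts from the beginning of a run of regional indicators, a long run of flags is split into the same pairs a renderer would pick, and each pair keeps its internal order when the list is reversed.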

bandyaboot over 3 years ago

Would be interesting to see the list of flag emojis that, when reversed, become a different flag emoji.

demetrius over 3 years ago

It's sad that Unicode doesn't include flags for dissolved countries. If it did, reversing a US flag would make a Soviet Union flag (code SU). This would make the text much more fun.

heystefan over 3 years ago

Oooh I know this one, I've read it here last year: https://tonsky.me/blog/emoji/

ezfe over 3 years ago

Works in Swift, which is the benefit of Swift having the most painful String API possible:

let v = "Flag: "

String(v.reversed()) // Output: :galF

v.count // Output: 7

nitely over 3 years ago

You can, but you need to break the string into graphemes first.

kart23 over 3 years ago

Google also interprets emojis funny. Google the Estonian and South Sudan flag (f"{chr(127466)*2+chr(127480)*2}") and you get results for Spain.

architectdrone over 3 years ago

humorously, on my local machine, I only see the string "us", and was rather confused when he was asserting that it was a single character :D

randpx over 3 years ago

Try reversing the Canadian flag (CA) and you get the Ascension Island Flag (AC). Great article, but completely misses the point.

qwerty456127 over 3 years ago

If the US flag is 2 special symbols saying US, why doesn't reversing it just produce the flag of the Soviet Union?

ts4z over 3 years ago

Let me cheat a bit and say Unicode comes in three flavors: UTF-8, UCS-2 aka UTF-16, and UTF-32. UTF-8 is byte-oriented, UTF-16 is double-byte oriented, and UTF-32 nobody uses because you waste half the word almost all of the time.

You can't reverse the bytes in UTF-8 or UTF-16, because you'll scramble the encoding. But you could parse the string, codepoint-at-a-time, handling the specifics of UTF-8, or UTF-16 with its surrogate pairs, and reverse those. This sounds equivalent to reversing UTF-32, and I believe is what the original poster was imagining.

Except you can't do that, because Unicode has composing characters. Now, I'm American and too stupid to type anything other than ASCII, but I know about n+~ = ñ. If you have the pre-composed version of ñ, you can reverse the codepoint (it's one codepoint). If you don't have it, and you have n+dead ~, you can't reverse it, or in the word "año" you might put the ~ on the "o". (Even crazier things happen when you get to the ligatures in Arabic; IIRC one of those is about 20 codepoints.)

So we can't just reverse codepoints, even in ancient versions of Unicode. Other posters have talked about the even more exotic stuff like Emoji + skin tone. It's necessary to be very careful.

Now, the old fart in me says that ASCII never had this problem. But the old fart in me knows about CRLF in text protocols, and that's never LFCR; and that if you want to make a ñ in ASCII you must send n ^H ~. I guess you can reverse that, but if you want to do more exotic things it becomes less obvious.

(IIRC UCS-2 is the deadname, now we call it UTF-16 to remind us to always handle surrogate pairs correctly, which we don't.)

TLDR: Strings are hard.
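
Both failure modes are easy to reproduce in Python. A small sketch, assuming the "dead ~" is the combining tilde U+0303; the variable names are illustrative.

```python
word = "an\u0303o"   # "año" spelled with n + COMBINING TILDE

# 1. Reversing the UTF-8 bytes scrambles the two-byte sequence for U+0303.
reversed_bytes = word.encode("utf-8")[::-1]
try:
    reversed_bytes.decode("utf-8")
    decodes = True
except UnicodeDecodeError:
    decodes = False
print(decodes)   # False

# 2. Reversing code points stays valid, but the tilde now follows the "o".
reversed_points = word[::-1]
print([hex(ord(ch)) for ch in reversed_points])   # ['0x6f', '0x303', '0x6e', '0x61']
```

The second result renders as "õna", which is exactly the misplaced tilde described above.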

smegsicle over 3 years ago

did they think all those skintone emojis are individual codepoints?

mlindner over 3 years ago

The person tries to define character when there isn't actually any definition of what that even means. Character is a term limited to languages that actually use them and not all text is made up of characters.

nottorp over 3 years ago

So basically Unicode along with C++ are great job security if you do bother to learn them.

There's another word that comes to mind when thinking about those two: metastasis.

hougaard over 3 years ago

In other news, water is wet :)

exdsq over 3 years ago

Am I missing something or is this Day 1 of a programming course in C?

midjji over 3 years ago

And this is why char should have been byte from the start.