Mimic – abusing Unicode to create tragedy

408 points by epsylon over 9 years ago

39 comments

omgtehlion over 9 years ago

Slightly tangential: In Russia there is a government procurement portal, where government organizations have to post their requests to enforce competition and get the best prices. The usual tactic [1] of corrupt officials was replacing Cyrillic (Russian) letters with their Latin homoglyphs so that only affiliated companies could find and win the contract.

[1] http://www.bbc.com/russian/rolling_news/2013/04/130409_rn_state_auctions_improve


b0ner_t0ner over 9 years ago

Taylor Swift? Never heard of her: https://www.google.com/search?q=Τаylοr+Ѕwіft :D


wodenokoto over 9 years ago

So, one might wonder why these homographs have different code points. After all, the French A and the English A share the same code point.

It's really difficult to do the right thing here. If the Greek question mark shared a code point with the semicolon, it would obstruct search and replace for question marks.

Subtle differences in how Japanese and Chinese are written have led to differently written characters sharing the same code point. It's nice that you can easily look up most Japanese characters in a Chinese dictionary and see how they are used in China, but it has become frustratingly hard to get subtleties in their written form right. The Chinese version may have one line strike through another line, while the Japanese only has it touching.

I honestly don't know how to go about posting an example of how the same code points have different written forms!

But it seems like it would be nice if code editors warned about text outside ASCII. You usually only want that in strings and comments.


acdha over 9 years ago

Mac users might appreciate the great UnicodeChecker: http://earthlingsoft.net/UnicodeChecker/

It offers a convenient utility to diff arbitrary strings, which is also quite handy for e.g. detecting normalization discrepancies, and installs a service so you can highlight a character in any app and use “Display character information” to see what it actually is.

I have a Python command-line version in my PATH which displays the character info for arbitrary input strings: https://github.com/acdha/unix_tools/blob/master/bin/unicode-characters.py


motti over 9 years ago

This sort of stuff can be the basis for many XSS attacks, see http://websec.github.io/unicode-security-guide/character-transformations/

For instance, \u2329, \uFE64, \uFF1C and \u3008 can be best-fitted automatically to \u003C (the regular '<' mark in HTML).
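
As a rough illustration of the comment above, here is a minimal Python sketch using the standard `unicodedata` module. It demonstrates a related but different transformation, NFKC compatibility normalization, which turns some (not all) of the listed look-alikes into a plain `<`. Best-fit mapping itself is a code-page conversion behavior and is not what this code performs; the helper name `gains_markup` is made up for the example.

```python
import unicodedata

# Characters that are significant in HTML markup.
HTML_SPECIAL = set("<>&\"'")

def gains_markup(text):
    """Return True if NFKC normalization introduces HTML-special characters
    that were not present in the original text."""
    normalized = unicodedata.normalize("NFKC", text)
    before = HTML_SPECIAL & set(text)
    after = HTML_SPECIAL & set(normalized)
    return bool(after - before)

# U+FF1C (fullwidth) and U+FE64 (small) both fold to "<" under NFKC.
print(gains_markup("\uFF1Cscript\uFF1E"))
# U+2329 only maps to U+3008 under NFKC, so it does not become "<" here.
print(gains_markup("\u2329script\u232A"))
```

The practical point is ordering: a filter that looks for `<` first and normalizes afterwards can be bypassed by input like the first example, so any such check would need to run on the text in its final, transformed form.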


mattlondon over 9 years ago

I had something similar happen in the wild to me.

I work for a "major search engine" that does a lot of advertising & marketing stuff. To get the most out of it, we need customers to implement some JavaScript on their ecommerce sites.

As is often the case, JavaScript code that needs to get implemented on an ecommerce site often gets copy-pasted or emailed around a lot internally within a customer before it reaches the right person who can add it to the site's pages.

In this example, somewhere along the way, a normal JavaScript snippet got all of the semi-colons changed from ; to ;.

In case you've not already spotted it, ; is not a ; but is actually "Greek Question Mark" (http://www.fileformat.info/info/unicode/char/037e/index.htm).

It was very confusing why Chrome was moaning about a semi-colon being an illegal token. I had a genuine "Am I going mad? Seriously?" moment before I realised what was happening.
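
One way to catch this kind of corruption before it reaches a browser is to list every non-ASCII character in a snippet along with its Unicode name. The sketch below does that with Python's standard `unicodedata` module; the function name `report_non_ascii` and the sample snippet are invented for illustration.

```python
import unicodedata

def report_non_ascii(source):
    """List (line, column, codepoint, name) for every non-ASCII character.
    Lines and columns are 1-based."""
    findings = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 127:
                name = unicodedata.name(ch, "<unnamed>")
                findings.append((line_no, col, f"U+{ord(ch):04X}", name))
    return findings

# The second statement ends with U+037E instead of an ASCII semicolon.
snippet = "var a = 1;\nvar b = 2\u037e\n"
for finding in report_non_ascii(snippet):
    print(finding)
```

Note that U+037E has a canonical decomposition to the ordinary semicolon, so any tool that applies NFC or NFKC normalization along the way will silently turn it back into `;`. That can hide the problem when the text is pasted into a page or editor that normalizes.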


sheraz over 9 years ago

There is a special place in hell for anyone doing this. I'm going to watch this repo and blacklist pull requests from anyone who forks it :-)


pierrec over 9 years ago

I can foresee a new phenomenon arising in stackoverflow-style sites and coding discussion forums:

"My simple piece of code looks perfect and should work without problems. Yet it won't compile! Help!"

Answer: "Try running `./mimic --reverse` on your source."


austinjp over 9 years ago

I'm reminded how very useful I've found Text::Unidecode in the past. http://search.cpan.org/~sburke/Text-Unidecode-1.27/lib/Text/Unidecode.pm


adrianN over 9 years ago

On a Mac you (used to?) get a non-ascii space when you hit the space bar while holding Alt or something like that. Easy to fat-finger it in any case and looks the same in most text editors. It's a great source of fun for novice Mac-using programmers to find out why the compiler complains.


sly010 over 9 years ago

Ironically I have weird OCD where I always assume I made a typo, so I keep deleting and retyping code a few dozen characters at a time, often in lines where I see nothing wrong. Over time this has just become something my hands do whenever my brain needs time to think about something else. So in a way I developed natural immunity to said Unicode tricks ;)


Animats over 9 years ago

There's a set of rules used on domain names to stop homoglyph abuse there. [1][2] Applying those rules to language identifiers would prevent this problem. It's also useful to apply those rules to login names for forum/social systems. The rules prevent mixed language identifiers, mixed left to right and right to left text, and similar annoyances.

[1] https://tools.ietf.org/html/rfc5893
[2] http://unicode.org/reports/tr46/
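
To give a feel for what a "no mixed scripts" check on identifiers or login names could look like, here is a crude Python sketch. It is not an implementation of RFC 5893 or UTS #46. It assumes that the first word of a letter's Unicode name (LATIN, CYRILLIC, GREEK, ...) is a good enough stand-in for its script, which is only a heuristic, and the function names are made up.

```python
import unicodedata

def scripts_used(identifier):
    """Approximate the set of scripts used by the letters of an identifier.

    Assumption: the first word of a letter's Unicode name stands in for its
    script. Digits, underscores and other non-letters are ignored.
    """
    scripts = set()
    for ch in identifier:
        if ch.isalpha():
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts

def is_mixed_script(identifier):
    """True if the identifier appears to mix letters from several scripts."""
    return len(scripts_used(identifier)) > 1

print(is_mixed_script("paypal"))        # all Latin letters
print(is_mixed_script("p\u0430ypal"))   # second letter is Cyrillic U+0430
```

A real deployment would want the actual Unicode script property and the published rule sets, since some legitimate names mix scripts and some confusable pairs live within a single script.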

rbinv over 9 years ago

I guess someone should develop an IDE/editor plugin that marks non-ASCII characters outside of string literals.
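
The core of such a plugin is small if the language ships a tokenizer. As a sketch only, and limited to Python source, the standard `tokenize` module can be used to flag non-ASCII text in every token except string literals and comments. The function name `non_ascii_outside_strings` is invented for this example, and source that fails to tokenize at all is not handled.

```python
import io
import tokenize

# Token types whose text is allowed to contain non-ASCII characters.
# FSTRING_MIDDLE only exists on Python 3.12+, hence the getattr fallback.
ALLOWED = {tokenize.STRING, tokenize.COMMENT}
ALLOWED.add(getattr(tokenize, "FSTRING_MIDDLE", tokenize.STRING))

def non_ascii_outside_strings(source):
    """Return (line, column, token_text) for each token that contains
    non-ASCII characters and is neither a string literal nor a comment.
    Lines are 1-based and columns 0-based, as reported by tokenize."""
    hits = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in ALLOWED:
            continue
        if not tok.string.isascii():
            hits.append((tok.start[0], tok.start[1], tok.string))
    return hits

# Line 1 has an accented letter inside a string (fine); line 2 uses a
# Cyrillic U+0440 as a variable name (flagged) and again in a comment (fine).
source = "name = 'caf\u00e9'\n\u0440 = 1  # \u0440\n"
print(non_ascii_outside_strings(source))
```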


AUmrysh over 9 years ago

I think the line about "Mimic substitutes common ASCII characters for obscure homographs" has it backward. Shouldn't it say Mimic substitutes obscure homographs for common ASCII characters?


cstross over 9 years ago

Also GREAT if you're trying to identify untaken phishing domain names to register for your next scam!


reinderien over 9 years ago

Mimic author here... sorry, humanity...

Svenstaro over 9 years ago

Wow, now that's just pure evil.


ant6n over 9 years ago

One could name variables and functions with homoglyphs to later identify whether code was copied (e.g. to find out whether somebody copied some GPL code).


tucif over 9 years ago

Spotify used to have a security problem with these kinds of characters: https://labs.spotify.com/2013/06/18/creative-usernames/

cruise02 over 9 years ago

> Replace a semicolon (;) with a greek question mark (;) in your friend's C# code and watch them pull their hair out over the syntax error

I'm not sure how frustrating this would be. Wouldn't most people just delete the character immediately and type a new one?


pmlnr over 9 years ago

This somewhat reminds me of this little entry on how "tolerant" JavaScript is... https://mathiasbynens.be/notes/javascript-identifiers

gnud over 9 years ago

This can actually be used productively, to see how your app reacts to weird input :)

cammsaul over 9 years ago

The repo's README mentions a vim plugin to highlight Unicode homoglyphs. As an Emacs user, I did a quick M-x package-list-packages, thinking I'd find at least half a dozen equivalent Emacs packages.

To my dismay, there were none. So I spent the rest of my afternoon correcting this glaring deficiency. Fellow Emacs users, protect yourself from Unicode trolls and grab it here: https://github.com/camsaul/emacs-unicode-troll-stopper

Procrastes over 9 years ago

This seems like a useful tool for fuzz testing your dev ops person, or if you are the dev ops person, for fuzz testing development. Fuzz for all!

Kristine1975 over 9 years ago

Piping the result through TTS creates weird results (on OS X):<pre><code> echo "hello world" | mimic --me-harder 100 | say</code></pre>


n-gauge over 9 years ago

Just chuck the code into an XML validator. Any character > 127 will be flagged as invalid.

foolfoolz over 9 years ago

is anyone aware of the reverse of this, a homoglyph normalization library? I'd love to be able to take strings that visually look the same and compare them against one master list, such as for spam detection
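
The usual shape of such a library is a "skeleton" function: map every character to a canonical look-alike, then compare skeletons instead of raw strings. Unicode's security mechanisms report (UTS #39) publishes a confusables data file for this purpose. The Python sketch below does not load that file; it uses a deliberately tiny, hand-written table (`LOOKALIKES`, hypothetical and far from complete) purely to show the idea.

```python
import unicodedata

# Hypothetical, deliberately tiny look-alike table for illustration only.
# A real tool would load the full Unicode confusables data instead.
LOOKALIKES = str.maketrans({
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
    "\u03bf": "o",  # GREEK SMALL LETTER OMICRON
    "\u037e": ";",  # GREEK QUESTION MARK
})

def skeleton(text):
    """Fold compatibility forms with NFKC, then replace known look-alikes
    with their ASCII counterparts."""
    folded = unicodedata.normalize("NFKC", text)
    return folded.translate(LOOKALIKES)

# Two visually identical strings compare equal once reduced to skeletons.
spoofed = "p\u0430yp\u0430l"   # Cyrillic U+0430 in place of both "a"s
print(skeleton(spoofed) == skeleton("paypal"))
```

For spam detection the master list would store skeletons rather than raw strings, so each incoming string needs only one `skeleton` call and a set lookup.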

r721 over 9 years ago

In cases like those I use unicodelookup.com to list suspicious characters :)


bradbeattie over 9 years ago

Add the following to your ~/.vimrc to always highlight non-ascii characters:<pre><code> au BufWinEnter * let w:matchnonascii=matchadd('ErrorMsg', "[\x7f-\xff]", -1)</code></pre>

Induane over 9 years ago

These dang democrats done banned Ben Carson from google man! https://www.google.com/search?q=Ben+Сarѕоn

perlancar2 over 9 years ago

Made a perl port: https://metacpan.org/pod/mimic (currently 50% faster)

TazeTSchnitzel over 9 years ago

In some languages which allow non-ASCII but aren't Unicode-aware (PHP, for instance), you can add significant, invisible zero-width spaces to identifiers.
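
Invisible characters like these can at least be detected mechanically. As a small sketch in Python (not PHP), the standard `unicodedata` module reports the zero-width space and its relatives in the general category `Cf` (format), so scanning an identifier for that category finds them. The function name `find_invisible` is made up for the example.

```python
import unicodedata

def find_invisible(identifier):
    """Return (index, name) for each format-category (Cf) character,
    which covers zero-width spaces, joiners and similar invisibles."""
    return [
        (i, unicodedata.name(ch, f"U+{ord(ch):04X}"))
        for i, ch in enumerate(identifier)
        if unicodedata.category(ch) == "Cf"
    ]

# These two identifiers render identically in most editors.
print(find_invisible("username"))
print(find_invisible("user\u200bname"))
```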

florian-f over 9 years ago

webXL over 9 years ago

Hmm... I wonder if this can be used in browser source maps.

cevaris over 9 years ago

Some people just want to see the world burn...

mrzool over 9 years ago

Some men just want to watch the world burn.

ChrisArgyle over 9 years ago

And now I know what I'm doing for April 1st next year.

ehosca over 9 years ago

i smell a Notepad++ extension

grabcocque over 9 years ago

YOU ARE A TERRIBLE PERSON AND I LIKE YOU