Slightly tangential:<p>In Russia there is a government procurement portal where government organizations have to post their requests, to enforce competition and get the best prices.<p>A usual tactic [1] of corrupt officials was to replace Cyrillic (Russian) letters with the corresponding Latin homoglyphs, so that only affiliated companies could find and win the contract.<p>[1] <a href="http://www.bbc.com/russian/rolling_news/2013/04/130409_rn_state_auctions_improve" rel="nofollow">http://www.bbc.com/russian/rolling_news/2013/04/130409_rn_st...</a>
Taylor Swift? Never heard of her: <a href="https://www.google.com/search?q=Τаylοr+Ѕwіft" rel="nofollow">https://www.google.com/search?q=Τаylοr+Ѕwіft</a><p>:D
So, one might wonder why these homographs have different code points. After all, the French A and the English A share the same code point.<p>It's really difficult to do the right thing here. If the Greek question mark shared a code point with the semicolon, it would obstruct search and replace for question marks.<p>Subtle differences in how Japanese and Chinese are written have led to differently written characters sharing the same code point. It's nice that you can easily look up most Japanese characters in a Chinese dictionary and see how they are used in China, but it has become frustratingly hard to get subtleties in their written form right. The Chinese version may have one line strike through another, while the Japanese version only has it touching.<p>I honestly don't know how to go about posting how the same code points have different written forms!<p>But it seems like it would be nice if code editors warned about text outside ASCII. You usually only want that in strings and comments.
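A rough sketch of the editor warning suggested above, written as a standalone check for Python source rather than as an editor plugin. It uses the standard `tokenize` module to skip string and comment tokens and reports everything else that contains non-ASCII text. The function name `non_ascii_outside_literals` is made up for this example, and a real editor integration would look quite different.

```python
import io
import tokenize


def non_ascii_outside_literals(source):
    """Return (line, column, text) for non-ASCII tokens that are not strings or comments."""
    allowed = {tokenize.STRING, tokenize.COMMENT}
    # Python 3.12+ tokenizes f-string bodies separately; treat them as string content too.
    fstring_middle = getattr(tokenize, "FSTRING_MIDDLE", None)
    if fstring_middle is not None:
        allowed.add(fstring_middle)

    hits = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in allowed:
            continue
        if not tok.string.isascii():
            hits.append((tok.start[0], tok.start[1], tok.string))
    return hits


if __name__ == "__main__":
    # Line 1 has non-ASCII only in a string and a comment; line 2 uses a Cyrillic "a" as a name.
    sample = 'x = "caf\u00e9"  # na\u00efve\n\u0430 = 1\n'
    for line, col, text in non_ascii_outside_literals(sample):
        print(f"line {line}, column {col}: {text!r}")
```

Only the identifier on the second line is reported; the accented characters inside the string and the comment are left alone.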
Mac users might appreciate the great UnicodeChecker:<p><a href="http://earthlingsoft.net/UnicodeChecker/" rel="nofollow">http://earthlingsoft.net/UnicodeChecker/</a><p>It offers a convenient utility to diff arbitrary strings, which is also quite handy for e.g. detecting normalization discrepancies, and installs a service so you can highlight a character in any app and use “Display character information” to see what it actually is.<p>I have a Python command-line version in my PATH which displays the character info for arbitrary input strings: <a href="https://github.com/acdha/unix_tools/blob/master/bin/unicode-characters.py" rel="nofollow">https://github.com/acdha/unix_tools/blob/master/bin/unicode-...</a>
This sort of stuff can be the basis for many XSS attacks, see <a href="http://websec.github.io/unicode-security-guide/character-transformations/" rel="nofollow">http://websec.github.io/unicode-security-guide/character-tra...</a><p>For instance, \u2329, \uFE64, \uFF1C and \u3008 can be best-fitted automatically to \u003C (the regular '<' mark in HTML)
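To make the transformation risk concrete, here is a small sketch using Python's standard `unicodedata` and `html` modules. NFKC normalization is used as a stand-in for the best-fit mappings mentioned above; it is not the same table, and it only folds two of the four characters (\uFE64 and \uFF1C) to '<'. The helper name `safe_escape` is hypothetical. The point is the ordering: escaping before a later normalization step leaves a hole, so normalize first and escape last.

```python
import html
import unicodedata

# The four look-alikes named above.
LOOKALIKES = ["\u2329", "\ufe64", "\uff1c", "\u3008"]


def safe_escape(text):
    """Normalize first, then escape, so folded characters cannot reintroduce markup."""
    return html.escape(unicodedata.normalize("NFKC", text))


if __name__ == "__main__":
    for ch in LOOKALIKES:
        folded = unicodedata.normalize("NFKC", ch)
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)} -> {folded!r}")

    payload = "\uff1cscript\uff1e"
    # Escaping alone leaves the fullwidth brackets untouched ...
    print(html.escape(payload))
    # ... so a later normalization would turn them into live markup.
    print(unicodedata.normalize("NFKC", html.escape(payload)))
    print(safe_escape(payload))
```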
I had something similar happen to me in the wild.<p>I work for a "major search engine" that does a lot of advertising & marketing stuff. To get the most out of it, we need customers to implement some JavaScript on their ecommerce sites.<p>As is often the case, JavaScript code that needs to get implemented on an ecommerce site gets copy-pasted or emailed around a lot internally within a customer before it reaches the right person who can add it to the site's pages.<p>In this example, somewhere along the way, a normal JavaScript snippet got all of its semicolons changed from ; to ;.<p>In case you've not already spotted it, ; is not a ; but is actually "Greek Question Mark" (<a href="http://www.fileformat.info/info/unicode/char/037e/index.htm" rel="nofollow">http://www.fileformat.info/info/unicode/char/037e/index.htm</a>).<p>It was very confusing why Chrome was moaning about a semicolon being an illegal token. I had a genuine "Am I going mad? Seriously?" moment before I realised what was happening.
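For anyone hitting the same thing, a minimal sketch of how one might locate such characters in a pasted snippet with Python's standard `unicodedata` module. The function name `report_non_ascii` and the sample snippet are invented for illustration. It also shows that NFC normalization maps U+037E to an ordinary semicolon, which is one way the character can silently appear or disappear as text passes through different tools.

```python
import unicodedata


def report_non_ascii(text):
    """Return (line, column, code point, name) for every non-ASCII character, 1-based."""
    findings = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 0x7F:
                name = unicodedata.name(ch, "<unnamed>")
                findings.append((line_no, col, f"U+{ord(ch):04X}", name))
    return findings


if __name__ == "__main__":
    # First statement ends in U+037E, second in a real semicolon.
    snippet = "var a = 1\u037e\nvar b = 2;"
    for finding in report_non_ascii(snippet):
        print(finding)

    # Canonical normalization replaces the Greek question mark with ';'.
    print(unicodedata.normalize("NFC", "\u037e") == ";")
```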
I can foresee a new phenomenon arising in stackoverflow-style sites and coding discussion forums:<p>"<i>My simple piece of code looks perfect and should work without problems. Yet it won't compile! Help!</i>"<p>Answer:<p>"<i>Try running `./mimic --reverse` on your source.</i>"
I'm reminded how very useful I've found Text::Unidecode in the past.<p><a href="http://search.cpan.org/~sburke/Text-Unidecode-1.27/lib/Text/Unidecode.pm" rel="nofollow">http://search.cpan.org/~sburke/Text-Unidecode-1.27/lib/Text/...</a>
On a Mac you (used to?) get a non-ASCII space when you hit the space bar while holding Alt or something like that. It's easy to fat-finger in any case, and it looks the same as a regular space in most text editors. It's a great source of fun for novice Mac-using programmers trying to find out why the compiler complains.
Ironically, I have a weird OCD where I always assume I made a typo, so I keep deleting and retyping code a few dozen characters at a time, often in lines where I see nothing wrong. Over time this has just become something my hands do whenever my brain needs time to think about something else. So in a way I developed natural immunity to said Unicode tricks ;)
There's a set of rules used on domain names to stop homoglyph abuse there.[1][2] Applying those rules to language identifiers would prevent this problem. It's also useful to apply those rules to login names for forum/social systems. The rules prevent mixed language identifiers, mixed left to right and right to left text, and similar annoyances.<p>[1] <a href="https://tools.ietf.org/html/rfc5893" rel="nofollow">https://tools.ietf.org/html/rfc5893</a>
[2] <a href="http://unicode.org/reports/tr46/" rel="nofollow">http://unicode.org/reports/tr46/</a>
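A very rough sketch of the mixed-script part of that idea, applied to identifiers or login names. Python's standard library does not expose the Unicode Script property, so this approximates the script by taking the first word of each letter's character name (for example LATIN, CYRILLIC, GREEK). That shortcut is an assumption of this example and is not how the linked specifications define the check; the function names are hypothetical too.

```python
import unicodedata


def scripts_used(identifier):
    """Approximate the set of scripts used by the letters of an identifier."""
    scripts = set()
    for ch in identifier:
        if ch.isalpha():  # digits, underscores and punctuation are ignored
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def is_mixed_script(identifier):
    """True when the letters come from more than one (approximate) script."""
    return len(scripts_used(identifier)) > 1


if __name__ == "__main__":
    for name in ["paypal", "p\u0430ypal", "user_name1"]:
        print(repr(name), sorted(scripts_used(name)), is_mixed_script(name))
```

The second name contains a Cyrillic "а" (U+0430) and is flagged as mixed; the other two are pure Latin.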
I think the line about "Mimic substitutes common ASCII characters for obscure homographs" has it backward. Shouldn't it say Mimic substitutes obscure homographs for common ASCII characters?
Spotify used to have a security problem with these kinds of characters:<p><a href="https://labs.spotify.com/2013/06/18/creative-usernames/" rel="nofollow">https://labs.spotify.com/2013/06/18/creative-usernames/</a>
> Replace a semicolon (;) with a greek question mark (;) in your friend's C# code and watch them pull their hair out over the syntax error<p>I'm not sure how frustrating this would be. Wouldn't most people just delete the character immediately and type a new one?
This somewhat reminds me of this little entry on how "tolerant" JavaScript is...<p><a href="https://mathiasbynens.be/notes/javascript-identifiers" rel="nofollow">https://mathiasbynens.be/notes/javascript-identifiers</a>
The repo's README mentions a vim plugin to highlight Unicode homoglyphs. As an Emacs user, I did a quick M-x package-list-packages, thinking I'll find at least half a dozen equivalent Emacs packages.<p>To my dismay, there were <i>none</i>. So I spent the rest of my afternoon correcting this glaring deficiency. Fellow Emacs users, protect yourself from Unicode trolls and grab it here: <a href="https://github.com/camsaul/emacs-unicode-troll-stopper" rel="nofollow">https://github.com/camsaul/emacs-unicode-troll-stopper</a>
Is anyone aware of the reverse of this, a homoglyph normalization library? I'd love to be able to take strings that visually look the same and compare them against one master list, such as for spam detection.
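Not a library, but a minimal sketch of the idea: fold each string to a "skeleton" and compare skeletons. The `CONFUSABLES` table below is a tiny hand-picked subset made up for this example; a real implementation would presumably load the full confusables data that Unicode publishes alongside its security mechanisms report. The function names `skeleton` and `looks_like` are hypothetical as well.

```python
import unicodedata

# Tiny illustrative subset: look-alike character -> ASCII letter it resembles.
CONFUSABLES = {
    "\u03a4": "T",  # GREEK CAPITAL LETTER TAU
    "\u03bf": "o",  # GREEK SMALL LETTER OMICRON
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0405": "S",  # CYRILLIC CAPITAL LETTER DZE
    "\u0455": "s",  # CYRILLIC SMALL LETTER DZE
    "\u0456": "i",  # CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I
    "\u0421": "C",  # CYRILLIC CAPITAL LETTER ES
}


def skeleton(text):
    """Fold compatibility forms with NFKC, then replace known look-alikes."""
    normalized = unicodedata.normalize("NFKC", text)
    return "".join(CONFUSABLES.get(ch, ch) for ch in normalized)


def looks_like(candidate, master_list):
    """Return the master-list entries whose skeleton matches the candidate's."""
    target = skeleton(candidate).casefold()
    return [entry for entry in master_list if skeleton(entry).casefold() == target]


if __name__ == "__main__":
    names = ["Taylor Swift", "Ben Carson"]
    print(looks_like("\u03a4\u0430yl\u03bfr \u0405w\u0456ft", names))
    print(looks_like("Ben \u0421ar\u0455\u043en", names))
```

Comparing skeletons rather than raw strings means the master list can stay in plain ASCII while spoofed variants still match.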
Add the following to your ~/.vimrc to always highlight non-ASCII characters:<p><pre><code> au BufWinEnter * let w:matchnonascii=matchadd('ErrorMsg', '[^\x00-\x7F]', -1)</code></pre>
These dang democrats done banned Ben Carson from google man!<p><a href="https://www.google.com/search?q=Ben+Сarѕоn" rel="nofollow">https://www.google.com/search?q=Ben+Сarѕоn</a>
Made a perl port: <a href="https://metacpan.org/pod/mimic" rel="nofollow">https://metacpan.org/pod/mimic</a> (currently 50% faster)
In some languages which allow non-ASCII but aren't Unicode-aware (PHP, for instance), you can add significant, invisible zero-width spaces to identifiers.
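A small sketch of how one might spot such invisible characters before they end up in an identifier. It uses Python's standard `unicodedata` module and flags anything in the Unicode "Cf" (format) category, which includes the zero-width space U+200B. The function name `invisible_chars` is invented for this example, and "Cf" is only an approximation of "invisible".

```python
import unicodedata


def invisible_chars(text):
    """Return (index, code point) for every format-category (Cf) character in text."""
    return [
        (index, f"U+{ord(ch):04X}")
        for index, ch in enumerate(text)
        if unicodedata.category(ch) == "Cf"
    ]


if __name__ == "__main__":
    plain = "foobar"
    sneaky = "foo\u200bbar"  # zero-width space hidden in the middle
    print(plain == sneaky)
    print(invisible_chars(plain))
    print(invisible_chars(sneaky))
```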