The homoglyph attack is a very old and often abused technique. A related, though not identical, one is bit flipping, where a single character in a domain name is swapped and you prey on those who misspell it. It turns out, however, that it isn't even necessary for someone to make a blatant error like that...<p><a href="https://www.youtube.com/watch?v=ZPbyDSvGasw" rel="nofollow">https://www.youtube.com/watch?v=ZPbyDSvGasw</a> [DEFCON 21: DNS May Be Hazardous To Your Health]
Some of these attacks, such as the exe.doc one, are the fault of Windows using in-band signaling (the file name extension) to indicate that a file is executable. That trick doesn't work on OS X or Linux, where file attributes determine whether a file can be executed.<p>The equivalent domain name issues are a lot tougher, and are going to require a character lookalike table or some other system of rules to warn the user.
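To illustrate the attribute point above, here is a minimal sketch that assumes a POSIX system (Linux or OS X). The helper names `is_executable` and `demo` are made up for this example. It shows that a name ending in ".exe" grants nothing by itself; only the execute bits in the file mode matter.

```python
import os
import stat
import tempfile

def is_executable(path):
    """True if any execute bit (user, group, or other) is set on the file."""
    mode = os.stat(path).st_mode
    return bool(mode & (stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH))

def demo():
    """Create a file named like a Windows executable and toggle its mode bits."""
    with tempfile.TemporaryDirectory() as tmp:
        # The ".exe" suffix is only part of the name; it carries no permission.
        path = os.path.join(tmp, "report.exe")
        with open(path, "w") as fh:
            fh.write("#!/bin/sh\necho hello\n")
        os.chmod(path, 0o644)   # read/write only
        before = is_executable(path)
        os.chmod(path, 0o755)   # now executable
        after = is_executable(path)
    return before, after

if __name__ == "__main__":
    print(demo())
```

On Windows the result would differ, since there the extension is exactly what the OS looks at.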
I knew about the A vs. Α vs. А issue, where visually similar/identical characters map to different domain names. But I didn't know IDNs could also map visually <i>different</i> characters to the <i>same</i> domain name. I would've guessed that full-width characters would be punycoded as well, rather than treated as their ASCII equivalents. Is this done with any other characters?
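On the full-width question: as far as I understand it, the folding comes from Unicode compatibility normalization (NFKC), which IDN processing applies before anything gets punycoded. The sketch below uses Python's standard `unicodedata` module to show a few characters that collapse the same way. `compat_fold` is a hypothetical helper name, and this only approximates one step of IDN processing, not the whole thing.

```python
import unicodedata

def compat_fold(label):
    """Apply NFKC compatibility normalization, then lowercase.

    A rough approximation of the mapping step in IDN processing,
    not a full implementation of it.
    """
    return unicodedata.normalize("NFKC", label).lower()

if __name__ == "__main__":
    samples = [
        "\uff47\uff4f\uff4f\uff47\uff4c\uff45.com",  # fullwidth "google" + ".com"
        "\u2102",  # DOUBLE-STRUCK CAPITAL C
        "\u2460",  # CIRCLED DIGIT ONE
        "\ufb01",  # LATIN SMALL LIGATURE FI
    ]
    for s in samples:
        print(repr(s), "->", repr(compat_fold(s)))
```

So it is not just full-width forms: ligatures, circled digits, and letterlike symbols fold to plain ASCII as well.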
Asking if UTF-8 is safe on the basis of these examples seems like asking if we should throw the baby out with the bath water -- along with the toys and the tub for good measure.<p>The potential for abuse is evident, but it seems like these primarily ought to be fixed in userland. For instance, by giving cues that highlight characters from widely different scripts (Latin vs. Cyrillic) or by ignoring RTL for extensions when a string starts in LTR.<p>(Not to mention, in the latter case, if users are opening random docs attached to spammy emails, UTF-8 is the last of your problems.)
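A userland check for the RTL extension trick could look something like the sketch below. All names in it (`BIDI_CONTROLS`, `has_bidi_control`, `real_extension`, `strip_bidi_controls`) are hypothetical, and the set of characters is just the explicit bidirectional formatting controls I know of. The idea is to work on the logical character order, which is what the OS uses, rather than on what gets rendered.

```python
# Explicit bidirectional formatting characters: embeddings, overrides, isolates.
BIDI_CONTROLS = {
    "\u202a", "\u202b", "\u202c", "\u202d", "\u202e",
    "\u2066", "\u2067", "\u2068", "\u2069",
}

def has_bidi_control(name):
    """True if the name contains any explicit bidi formatting character."""
    return any(ch in BIDI_CONTROLS for ch in name)

def strip_bidi_controls(name):
    """Return the name with the bidi formatting characters removed."""
    return "".join(ch for ch in name if ch not in BIDI_CONTROLS)

def real_extension(name):
    """Extension in logical order: the text after the last dot, if any."""
    _, dot, ext = name.rpartition(".")
    return ext if dot else ""

if __name__ == "__main__":
    # U+202E (RIGHT-TO-LEFT OVERRIDE) makes "cod.exe" display reversed,
    # so this name tends to be shown as something like "reportexe.doc".
    spoofed = "report\u202ecod.exe"
    print(has_bidi_control(spoofed), real_extension(spoofed))
```

A file manager could warn, or show the stripped name, whenever `has_bidi_control` is true for a name whose real extension is executable.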
I don't think the problem is Unicode; the problem is trusting your ability to determine the ownership of a URL (and thus the trust that should be inherited from its owner) based on its name. Plenty of phishing attacks work with domains like "yahoo-password-reset.com".<p>If you're not seeing a valid TLS session with a certificate signed by an issuer you trust not to allow these shenanigans, it really doesn't matter what characters you're seeing in the URL bar.
To me, we're in a Unicode transition period - 10 years ago it was almost completely unsupported, and as it is adopted more and more, we're finding the places where it can cause issues.<p>Part of the problem is that a lot of the languages and tools we use predate the widespread use of Unicode and don't handle it very well. The Python 3 approach is by far the best one I've come across (would be interested to hear of other examples), and they needed to make a backwards-incompatible change to handle it in a way that made it harder to screw up.<p>It is a complex technology, and inevitably there are going to be holes, but as in a lot of other cases, it is worth it (necessary, even), and as we move forward our tooling, languages, libraries and practices will get better and reduce the risk.
The internet is a complex technology that can never be completely secure. Doesn't mean it's not worth it though.
I was reading this and I'm like, Unicode (I assume UTF-8) isn't really that complicated at all. The UTF-8 encoding is straightforward, no more complex than simple run-length coding. I'm also thinking that Unicode is basically a list of glyphs in every language plus a few control codes for rendering glyphs correctly, BOM, etc.<p>It's like saying a dictionary contains dangerous information.<p>I think the problem is software that enables Unicode input but is not willing to handle all the different types of input. For example, it seems like a bad idea to even let people input combined words of different languages; that's why we have input methods that filter out bad combinations. Dumping this on the font renderer without making sure the difference is highlighted seems like a bad idea too.
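To back up the "UTF-8 is straightforward" point, here is a hand-rolled encoder sketch. `utf8_encode` is a made-up name; in real code you would just call `str.encode("utf-8")`. It is a simplification in one respect: unlike Python's built-in codec, it does not reject surrogate code points.

```python
def utf8_encode(code_point):
    """Encode one Unicode code point as UTF-8 bytes (surrogates not rejected)."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if code_point < 0x80:
        # 1 byte: 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])

if __name__ == "__main__":
    for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
        print(hex(cp), utf8_encode(cp).hex())
```

The encoding itself is the easy part; the attacks in the article live at the level of what the characters mean and how they render.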
These Unicode attacks are interesting, and Unicode is far too useful to stop using.
The question is what can we do to fix some of these issues?
Like the RTL character. It shouldn't be blocked, as it has a valid use case, but is there a non-malicious use case for it when surrounded by normal Latin characters?
e.g.:
abc[RTL]def<p>If it's just one RTL character then that should be fairly easy to filter out. Of course, if that's how a filter works, then there will be other Unicode characters you can add to the mix that still make it look the same to an average user and pass that particular filter.<p>One could identify the Unicode characters that belong to a particular character set (say Latin) and see if some text contains more than one character set. Then invoke the filter if a text has more than 2 different character sets. Of course I can see that getting in the way of some use cases as well (text with translations in 3 languages, for example).
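The character set idea above can be sketched roughly as follows. Python's `unicodedata` does not expose the Unicode Script property, so this guesses the script from the character name, which is only a heuristic. `KNOWN_SCRIPTS`, `rough_script`, `scripts_in`, and `looks_mixed` are names invented for the example, and the `limit` parameter is there so that legitimately multilingual text can be allowed through.

```python
import unicodedata

# Script names to look for inside Unicode character names (not exhaustive).
KNOWN_SCRIPTS = {
    "LATIN", "CYRILLIC", "GREEK", "ARMENIAN", "HEBREW", "ARABIC",
    "CJK", "HIRAGANA", "KATAKANA", "HANGUL",
}

def rough_script(ch):
    """Guess a letter's script from its Unicode name; '' for non-letters."""
    if not ch.isalpha():
        return ""
    words = unicodedata.name(ch, "").split()
    for word in words:
        if word in KNOWN_SCRIPTS:
            return word
    # Fall back to the first word of the name for scripts not listed above.
    return words[0] if words else ""

def scripts_in(text):
    """Set of guessed scripts used by the letters in the text."""
    return {s for s in map(rough_script, text) if s}

def looks_mixed(text, limit=1):
    """True if the text uses more than `limit` different scripts."""
    return len(scripts_in(text)) > limit

if __name__ == "__main__":
    print(scripts_in("paypal"))        # all Latin letters
    print(scripts_in("p\u0430ypal"))   # second letter is CYRILLIC SMALL LETTER A
```

Running it per word or per domain label, rather than over a whole document, would probably avoid most of the false positives from translated text.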
While I appreciated the article, another one of his caught my eye and I thoroughly enjoyed it: <a href="http://www.jefftk.com/p/teach-yourself-any-instrument" rel="nofollow">http://www.jefftk.com/p/teach-yourself-any-instrument</a>
This reminded me of an interview with an adware author, in which he told a story about creating unwritable registry keys and file names 'by exploiting an “impedance mismatch” between the Win32 API and the NT API':<p><a href="http://philosecurity.org/2009/01/12/interview-with-an-adware-author" rel="nofollow">http://philosecurity.org/2009/01/12/interview-with-an-adware...</a><p>The adware registered a key in the Windows Registry with a Unicode null character in the middle of the string, so that the Windows UI failed to display or modify that string.
Does Windows really display that "exe.doc" RTL example with the icon for a Word document? Or is the exe file just set to use that for its icon in order to complete the illusion?
If anyone’s writing a library to deal with homoglyph attacks I recommend Unicode’s list of ‘confusables’ as a data source: <a href="http://www.unicode.org/Public/security/revision-06/confusables.txt" rel="nofollow">http://www.unicode.org/Public/security/revision-06/confusabl...</a>
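For anyone starting from that file, here is a rough sketch of how it might be consumed. It assumes the semicolon-separated layout (source code points ; target code points ; type, with `#` comments). The `SAMPLE` lines are illustrative entries written in that layout, not copied from the file, and `parse_confusables` and `simple_skeleton` are made-up names. The real skeleton algorithm in the Unicode security spec also involves normalization, and this sketch just lets the last mapping win if a source appears more than once.

```python
def parse_confusables(lines):
    """Build {source string: target string} from confusables-style lines."""
    table = {}
    for line in lines:
        # Drop a leading BOM if present, then comments and blank lines.
        line = line.lstrip("\ufeff").split("#", 1)[0].strip()
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        if len(fields) < 2:
            continue
        source = "".join(chr(int(cp, 16)) for cp in fields[0].split())
        target = "".join(chr(int(cp, 16)) for cp in fields[1].split())
        table[source] = target  # simplification: last mapping wins
    return table

def simple_skeleton(text, table):
    """Replace each character with its lookalike target, if it has one."""
    return "".join(table.get(ch, ch) for ch in text)

# Illustrative lines in the assumed layout; not taken from the real file.
SAMPLE = [
    "# source ; target ; type",
    "0430 ;\t0061 ;\tMA\t# CYRILLIC SMALL LETTER A -> LATIN SMALL LETTER A",
    "0440 ;\t0070 ;\tMA\t# CYRILLIC SMALL LETTER ER -> LATIN SMALL LETTER P",
    "",
]

if __name__ == "__main__":
    table = parse_confusables(SAMPLE)
    print(simple_skeleton("\u0440\u0430ypal", table))
```

Two strings would then be flagged as confusable when their skeletons are equal but the strings themselves are not.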
I like and appreciate the fact that the Spotify people used the one guy's findings to improve security (for everything that uses Twisted!) rather than just throwing the legal system at him.