Is Unicode Safe?

109 points by nsmalch over 11 years ago

18 comments

vezzy-fnord over 11 years ago
The homoglyph attack is a very old and often abused technique. A related, though not identical, one is bit flipping, where a single character is swapped in a domain name and you prey on those who make misspellings. It turns out, however, that it isn't even necessary for someone to make a blatant error like that...

https://www.youtube.com/watch?v=ZPbyDSvGasw [DEFCON 21: DNS May Be Hazardous To Your Health]
api over 11 years ago
Some of these attacks, such as the exe.doc one, are the fault of Windows using in-band signaling (the file name) to indicate that a file is executable. You can't do that on OS X or Linux, where file attributes determine whether a file can be executed.

The equivalent domain name issues are a lot tougher, and are going to require a character lookalike table or some other system of rules to warn the user.
_delirium over 11 years ago
I knew about the A vs. Α vs. А issue, where visually similar/identical characters map to different domain names. But I didn't know IDNs also could map visually *different* characters to the *same* domain names. I would've guessed that full-width characters would be punycoded as well, rather than treated as their ASCII equivalents. Is this done with any other characters?
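A minimal sketch of the two behaviours described in this comment, using only Python's standard `unicodedata` module. It assumes that the folding of full-width letters comes from NFKC compatibility normalization (which, as far as I know, is what the IDNA2003 Nameprep step applies); the variable names are illustrative only.

```python
import unicodedata

# Three capital letters that render alike but are distinct code points.
lookalikes = ["A", "\u0391", "\u0410"]  # Latin, Greek, Cyrillic
names = [unicodedata.name(ch) for ch in lookalikes]

# Full-width Latin letters carry a compatibility mapping to ASCII,
# so NFKC folds them into plain "example".
fullwidth = "\uff45\uff58\uff41\uff4d\uff50\uff4c\uff45"
folded = unicodedata.normalize("NFKC", fullwidth)

# The Cyrillic capital A has no such mapping and survives NFKC unchanged,
# which is why the lookalikes remain different domain names.
cyrillic_after = unicodedata.normalize("NFKC", "\u0410")

if __name__ == "__main__":
    for ch, name in zip(lookalikes, names):
        print(f"U+{ord(ch):04X} {name}")
    print(folded, cyrillic_after == "\u0410")
```

So the answer to "any other characters?" is probably any character with a compatibility decomposition, though which ones a given registry or browser accepts is a separate policy question.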
null_ptr over 11 years ago
Why does Unicode have such powerful control characters that can be used to construct misleading strings? Is there a non-malicious use case for them?
ddebernardy over 11 years ago
Asking if UTF-8 is safe on the basis of these examples seems like asking if we should throw the baby out with the bath water, along with the toys and the tub for good measure.

The potential for abuse is evident, but it seems like these primarily ought to be fixed in userland. For instance, by giving cues that highlight characters from widely different scripts (Latin vs. Cyrillic), or by ignoring RTL for extensions when a string starts in LTR.

(Not to mention, in the latter case, if users are opening random docs attached to spammy emails, UTF-8 is the last of your problems.)
twoodfin over 11 years ago
I don't think the problem is Unicode; the problem is trusting your ability to determine the ownership of a URL (and thus the trust that should be inherited from its owner) based on its name. Plenty of phishing attacks work with domains like "yahoo-password-reset.com".

If you're not seeing a valid TLS session with a certificate signed by an issuer you trust not to allow these shenanigans, it really doesn't matter what characters you're seeing in the URL bar.
rkangel over 11 years ago
To me, we're in a Unicode transition period: 10 years ago it was almost completely unsupported, and as it is adopted more and more, we're finding the places it can cause issues.

Part of the problem is that a lot of the languages and tools we use predate widespread use of Unicode and don't handle it very well. The Python 3 approach is by far the best one I've come across (would be interested to hear of other examples), and they needed to make a backwards-incompatible change to handle it in a way that made it harder to screw up.

It is a complex technology, and inevitably there are going to be holes, but as in a lot of other cases, it is worth it (necessary, even), and as we move forward our tooling, languages, libraries and practices will get better and reduce the risk. The internet is a complex technology that can never be completely secure. Doesn't mean it's not worth it though.
lyndonh over 11 years ago
I was reading this and I'm like, Unicode (I assume UTF-8) isn't really that complicated at all. The UTF-8 system is straightforward, no more complex than simple run length coding. I'm also thinking that Unicode is basically a list of glyphs in every language plus a few control codes for rendering glyphs correctly, BOM, etc.

It's like saying a dictionary contains dangerous information.

I think the problem is software that enables Unicode input but is not willing to handle all the different types of input. For example, it seems like a bad idea to even let people input combined words of different languages; that's why we have input methods that filter out bad combinations. It seems just as bad to dump this on the font renderer without making sure the difference is highlighted.
asdfaoeu over 11 years ago
Safari users check out:

http://‮moc.lapyap.m‭d2.shptech.com

(copy and paste link)
wila over 11 years ago
These Unicode attacks are interesting, and Unicode is far too useful to stop using it. The question is what can we do to fix some of these issues? Like the RTL character. It shouldn't be blocked as it has a valid use case, but is there a non-malicious use case for it when surrounded by normal Latin characters? E.g.: abc[RTL]def

If it's just one RTL character then that should be fairly easy to filter out. Of course, if that's the way a filter works then there will be other Unicode characters you can add to the mix that still make it look the same to an average user and pass that particular filter.

One could identify Unicode characters that belong to a particular character set (say Latin) and see if some text contains more than one character set. Then invoke the filter if a text has more than 2 different character sets. Of course I can see that getting in the way of some use cases as well (text with translations in 3 languages, for example).
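The "more than one character set" check could be sketched roughly as below. This is an assumption-heavy illustration: Python's standard library does not expose the Unicode Script property, so the first word of each character's name stands in for its script, and the function names `scripts_in` and `is_mixed_script` are made up for this example.

```python
import unicodedata

def scripts_in(text):
    """Return rough script labels for the letters in text.

    Heuristic only: the first word of the Unicode character name
    (LATIN, CYRILLIC, GREEK, ...) is used in place of a real
    Script property lookup.
    """
    found = set()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                found.add(name.split()[0])
    return found

def is_mixed_script(text, limit=1):
    """True when the letters in text come from more than `limit` labels."""
    return len(scripts_in(text)) > limit

if __name__ == "__main__":
    print(scripts_in("paypal"))             # all Latin letters
    print(scripts_in("p\u0430ypal"))        # second letter is Cyrillic
    print(is_mixed_script("p\u0430ypal"))
```

Raising `limit` to 2 gives the looser rule from the comment, at the cost of letting a two-script spoof through. The name-prefix heuristic also mislabels some characters (full-width letters report `FULLWIDTH`, for instance), so a real filter would want proper script data.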
gleenn over 11 years ago
Some of those attacks are just awesome. Pretty scary as a web programmer.
TazeTSchnitzel over 11 years ago
U+202E, the right-to-left override, is endless fun on web forums. Use it once and all the rest of the page will be flipped.
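One way a forum or file browser might guard against this is to look for explicit directional formatting characters before displaying user text. The sketch below relies on `unicodedata.bidirectional`, which reports a character's Bidi_Class; the helper names and the sample file name are hypothetical.

```python
import unicodedata

# Bidi_Class values of the explicit embedding, override and isolate controls.
BIDI_CONTROL_CLASSES = {
    "LRE", "RLE", "LRO", "RLO", "PDF", "LRI", "RLI", "FSI", "PDI",
}

def find_bidi_controls(text):
    """Return (index, code point) pairs for each directional control found."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if unicodedata.bidirectional(ch) in BIDI_CONTROL_CLASSES
    ]

def strip_bidi_controls(text):
    """Drop the directional controls, leaving the logical character order."""
    return "".join(
        ch for ch in text
        if unicodedata.bidirectional(ch) not in BIDI_CONTROL_CLASSES
    )

# Hypothetical file name: the override makes the tail render reversed,
# so it can look like it ends in "doc" while the real suffix is ".exe".
tricky = "invoice\u202ecod.exe"

if __name__ == "__main__":
    print(find_bidi_controls(tricky))
    print(strip_bidi_controls(tricky))
```

Stripping is a blunt choice: it also removes the controls from legitimate mixed-direction text, so flagging or escaping them may be a better fit for some sites. The left-to-right and right-to-left marks (U+200E, U+200F) have ordinary strong classes and are not caught by this check.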
hsmyers over 11 years ago
While I appreciated the article, another one of his caught my eye and I thoroughly enjoyed it: http://www.jefftk.com/p/teach-yourself-any-instrument
ttflee over 11 years ago
This reminded me of an interview with an adware author, in which he told a story about creating unwritable registry keys and file names 'by exploiting an “impedance mismatch” between the Win32 API and the NT API':

http://philosecurity.org/2009/01/12/interview-with-an-adware-author

The adware registered a key in the Windows Registry with a null Unicode character in the middle of the string, so that the Windows UI failed to display or modify that string.
rcthompson over 11 years ago
Does Windows really display that "exe.doc" RTL example with the icon for a Word document? Or is the exe file just set to use that for its icon in order to complete the illusion?
im3w1l over 11 years ago
𝐈𝐦𝐩𝐨𝐫𝐭𝐚𝐧𝐭 𝐀𝐧𝐧𝐨𝐮𝐧𝐜𝐞𝐦𝐞𝐧𝐭

is another Unicode classic.
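The bold text above is made of Mathematical Bold letters, not styled ASCII, which is why it slips past plain-text formatting rules and naive keyword filters. A small sketch, assuming the contiguous Mathematical Bold ranges starting at U+1D400 (capitals) and U+1D41A (lowercase); `to_math_bold` is a hypothetical helper.

```python
import unicodedata

def to_math_bold(text):
    """Map ASCII letters to the Mathematical Bold block; leave the rest alone."""
    out = []
    for ch in text:
        if "A" <= ch <= "Z":
            out.append(chr(0x1D400 + ord(ch) - ord("A")))
        elif "a" <= ch <= "z":
            out.append(chr(0x1D41A + ord(ch) - ord("a")))
        else:
            out.append(ch)
    return "".join(out)

styled = to_math_bold("Important Announcement")

# A plain substring check misses the styled text...
naive_hit = "Announcement" in styled
# ...but NFKC folds the bold letters back to ASCII first.
normalized_hit = "Announcement" in unicodedata.normalize("NFKC", styled)

if __name__ == "__main__":
    print(styled, naive_hit, normalized_hit)
```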
robin_reala over 11 years ago
If anyone’s writing a library to deal with homoglyph attacks I recommend Unicode’s list of ‘confusables’ as a data source: http://www.unicode.org/Public/security/revision-06/confusables.txt
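A rough sketch of how such a library might consume that file. It assumes each data row has the layout `source ; target ; type # comment` with space-separated hexadecimal code points; the two sample rows are hand-written illustrations of that layout, not copied from the file, and `parse_confusables` and `skeleton` are hypothetical names.

```python
def parse_confusables(lines):
    """Build a {source: target} map from rows in the assumed layout."""
    mapping = {}
    for raw in lines:
        line = raw.split("#", 1)[0].strip()   # drop trailing comments
        if not line:
            continue
        fields = [f.strip() for f in line.split(";")]
        if len(fields) < 2:
            continue
        source = "".join(chr(int(cp, 16)) for cp in fields[0].split())
        target = "".join(chr(int(cp, 16)) for cp in fields[1].split())
        mapping[source] = target              # later rows overwrite earlier ones
    return mapping

def skeleton(text, mapping):
    """Replace each character by its confusable target, if it has one."""
    return "".join(mapping.get(ch, ch) for ch in text)

# Illustrative rows only (Cyrillic a and es mapped to Latin a and c).
SAMPLE_ROWS = [
    "# sample rows in the assumed layout",
    "0430 ;\t0061 ;\tMA\t# CYRILLIC SMALL LETTER A -> LATIN SMALL LETTER A",
    "0441 ;\t0063 ;\tMA\t# CYRILLIC SMALL LETTER ES -> LATIN SMALL LETTER C",
]

mapping = parse_confusables(SAMPLE_ROWS)

if __name__ == "__main__":
    spoofed = "p\u0430yp\u0430l"
    print(skeleton(spoofed, mapping) == skeleton("paypal", mapping))
```

Two strings whose skeletons match are candidates for confusion. This is a simplification: it handles only single-character sources and skips the normalization steps a full implementation would apply before and after the mapping.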
nnnnni over 11 years ago
I like and appreciate the fact that the Spotify people used the one guy's findings to improve security (for everything that uses Twisted!) rather than just throwing the legal system at him.