I am not sure whether these are dark corners of Unicode or just complex language support. When dealing with text, you need to educate yourself about text. International text is much more complex than localized text, but if you intend to support it, learn it. Sure, frameworks exist that hide a lot of the complexity, but then you hit a specific bug or corner case where the framework fails you, and since you are not familiar with the intricacies of the language-specific problem, you are clueless about how to proceed and fix it, whereas if you had invested the time to learn about it, it would have been much easier.<p>A similar (albeit rather simpler and more limited) problem is calendrical calculus, where people who have little to no grasp of how to perform correct date operations build complex calendar applications and fail spectacularly on some edge cases.<p>Call me crazy, but if you are dealing with text, set aside some time for research before you start your development.
For locale-aware, correct sorting of Unicode strings based on the Unicode Collation Algorithm, some open source libraries Twitter released are pretty awesome.<p>ruby: <a href="https://github.com/twitter/twitter-cldr-rb" rel="nofollow">https://github.com/twitter/twitter-cldr-rb</a><p>javascript: <a href="https://github.com/twitter/twitter-cldr-js" rel="nofollow">https://github.com/twitter/twitter-cldr-js</a><p>Human written language is pretty complicated. The Unicode standards (including the Common Locale Data Repository, the Unicode Collation Algorithm, normalization forms, and the associated standards and algorithms) are a pretty damn amazing approach to dealing with it. It's not perfect, but it's amazing that it's as well-designed and complete as it is. It's also not easy to implement solutions based on the Unicode standards from scratch, because it's complicated.
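(If you just need locale-aware collation in the browser or Node today, the built-in Intl.Collator from ECMA-402 covers a lot of this as well; a minimal sketch, assuming an engine with Intl support:)<p><pre><code>// Locale-aware sort: 'ä' sorts near 'a' in German, but after 'z' in Swedish.
const words = ['zebra', 'ähre', 'apfel'];

const de = new Intl.Collator('de').compare;
const sv = new Intl.Collator('sv').compare;

console.log([...words].sort(de)); // [ 'ähre', 'apfel', 'zebra' ]
console.log([...words].sort(sv)); // [ 'apfel', 'zebra', 'ähre' ]
</code></pre>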
I don't agree with the section about JavaScript strings. Those are proper strings, just encoded in UTF-16.<p>> JavaScript’s string type is backed by a sequence of unsigned 16-bit integers, so it can’t hold any codepoint higher than U+FFFF and instead splits them into surrogate pairs.<p>You just contradicted yourself. Surrogate pairs are exactly what allows UTF-16 to encode any codepoint.<p>Once you start talking about in-memory representation, you need to agree on an encoding, UTF-8 and UTF-16 being the most common. wchar_t could be UTF-16 or UCS-2.
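(A minimal sketch of how the surrogate-pair machinery surfaces in practice; this is standard JavaScript behavior, nothing specific to the article:)<p><pre><code>const s = '\u{1F600}';            // 😀 U+1F600, above U+FFFF

console.log(s.length);            // 2 -- counts UTF-16 code units, not codepoints
console.log(s.charCodeAt(0).toString(16));   // 'd83d' -- high surrogate
console.log(s.charCodeAt(1).toString(16));   // 'de00' -- low surrogate
console.log(s.codePointAt(0).toString(16));  // '1f600' -- the actual codepoint

// for...of iterates by codepoint, so the pair comes out as one character:
console.log([...s].length);       // 1
</code></pre>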
Interesting that Firefox takes the decomposed Hangul and renders it as whole syllables, while Chrome shows it as a sequence of individual jamo. <a href="http://mcc.id.au/temp/hangul.png" rel="nofollow">http://mcc.id.au/temp/hangul.png</a>
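(For anyone who wants to reproduce this: normalize() will produce the decomposed conjoining-jamo sequence whose rendering differs between the two browsers. A minimal sketch in JavaScript:)<p><pre><code>const syllable = '\uD55C';               // 한, precomposed (NFC)
const jamo = syllable.normalize('NFD');  // decomposed into conjoining jamo

console.log([...jamo].map(c => c.codePointAt(0).toString(16)));
// [ '1112', '1161', '11ab' ] -- HIEUH, A, final NIEUN

console.log(jamo.normalize('NFC') === syllable); // true -- round-trips
</code></pre>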
He's rendering normalized text, but normalization is only for string comparisons...<p>I don't understand why emoji are width 1 either; really, the EastAsianWidth.txt data from the Unicode standard needs to match what fixed-width terminal emulators actually do.<p>I've been dealing with all of this recently in JOE: <a href="http://sourceforge.net/p/joe-editor/mercurial/ci/default/tree/NEWS.md" rel="nofollow">http://sourceforge.net/p/joe-editor/mercurial/ci/default/tre...</a><p>In particular, JOE now finally renders combining characters correctly. It stores a string for each character cell, which includes the start character and any following combining characters. If any of them change, JOE re-emits the entire sequence.<p>But which characters are combining characters? I expect \p{Mn} and \p{Me}, but the range U+1160 - U+11FF needs to be included as well and isn't. It's crazy that these are not counted as combining characters. Now I'm going to have to check how the zero-width joiner is handled in terminal emulators. JOE does not yet change the start character after a joiner into a combining character, ugh..
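(A minimal sketch of the mismatch described above, using JavaScript's Unicode property escapes; the conjoining jamo are general category Lo, not Mn/Me, even though they combine visually:)<p><pre><code>const isMark = ch => /\p{Mn}|\p{Me}/u.test(ch);

console.log(isMark('\u0301')); // true  -- COMBINING ACUTE ACCENT (Mn)
console.log(isMark('\u20DD')); // true  -- COMBINING ENCLOSING CIRCLE (Me)
console.log(isMark('\u1160')); // false -- HANGUL JUNGSEONG FILLER (Lo!)
console.log(isMark('\u11AB')); // false -- HANGUL JONGSEONG NIEUN (Lo!)

// So an editor that clusters cells on Mn/Me alone will split conjoining
// jamo; the U+1160..U+11FF range has to be special-cased.
</code></pre>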
<i>“Also, I strongly recommend you install the Symbola font, which contains basic glyphs for a vast number of characters. They may not be pretty, but they’re better than seeing the infamous Unicode lego.”</i><p>I disagree with the notion that Symbola is not a pretty font. As I mentioned here¹, the glyphs Symbola has for the Mathematical Alphanumeric Symbols block are quite beautiful². (It may help that I’m using the non-hinted version on a HiDPI display, though… still, that implies it will look even better when printed on paper with an inkjet or laser printer, since those still produce more DPI than the typical HiDPI monitor.)<p>――――――<p>¹ — <a href="https://news.ycombinator.com/item?id=10198620" rel="nofollow">https://news.ycombinator.com/item?id=10198620</a><p>² — <a href="http://f.cl.ly/items/2h2p0r1F1h2E1y2o2y0c/Screen%20Shot%202015-09-10%20at%2011.51.55%20AM.png" rel="nofollow">http://f.cl.ly/items/2h2p0r1F1h2E1y2o2y0c/Screen%20Shot%2020...</a>
It's a good article. There is a direct analogy with the question of whether HTML is a semantic markup language or a binary graphics-art format: the two groups barely overlap other than in their failures, and are mostly not very interested in each other.
> <i>I strongly recommend you install the Symbola font, which contains basic glyphs for a vast number of characters. They may not be pretty, but they’re better than seeing the infamous Unicode lego.</i><p>Well, I installed the Symbola font as he suggested, but I'm still seeing lots of Unicode lego in the article.<p>I'm using Windows 7 and the latest version of Firefox; I set Symbola as the default font in Firefox and unchecked the box that says, "Allow pages to choose their own fonts, instead of my selections above".<p>What could I be doing wrong? I would assume that if the author recommends the Symbola font, he's checked that Symbola has glyphs for all the symbols he's using.
I predict someone will complain, as usual, that Unicode could and should be regular and programmer-friendly and everything.<p>My response would be this: <a href="http://xkcd.com/1576/" rel="nofollow">http://xkcd.com/1576/</a><p>Unicode is merely as complex as that which it encodes: human language.
This is why you should provide a locale to most string functions in C#, for example. I think this article is mostly about bad Unicode support in programming languages.
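(JavaScript has the same pitfall; a minimal sketch with the classic Turkish dotted/dotless i, where the default case mappings are wrong for Turkish text:)<p><pre><code>// Case mapping is locale-sensitive: Turkish has a dotless i.
console.log('i'.toUpperCase());            // 'I'
console.log('i'.toLocaleUpperCase('tr'));  // 'İ' (U+0130, capital dotted I)

console.log('I'.toLowerCase());            // 'i'
console.log('I'.toLocaleLowerCase('tr'));  // 'ı' (U+0131, dotless i)
</code></pre>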