"This is SUCH a classic sloppy programmer mistake that I'm disappointed"<p>Oh come off of it. This happens everywhere on the web on probably something like 25% of websites. And it's NOT always the consuming program's fault: very often somebody upstream, e.g. the hosting company, the person that wrote the HTML, the source of an RSS feed being inserted into the page etc. etc. forgot to encode something the way somebody else expected, and you as the poor guy at the end of the chain gets a document with multiple encodings improperly embedded into it. Inevitably you have to make some bad decisions and not all corner cases are handled.<p>Somebody once reverse-engineered the state chart for how Internet Explorer handles documents with conflicting encoding declarations and I kid you not, it must have had >20 branches spanning a good few pages. Officially, the correct order of precedence is (<a href="http://www.w3.org/International/questions/qa-html-encoding-declarations" rel="nofollow">http://www.w3.org/International/questions/qa-html-encoding-d...</a>):<p>1. HTTP Content-Type header<p>2. byte-order mark (BOM)<p>3. XML declaration<p>4. meta element<p>5. link charset attribute<p>but that's not how every browser does it, because the W3C sort of declared that after things on the Real Internet (TM) had already gotten out of hand. I hate to resuscitate Joel posts but Unicode is not easy to implement right.
Maybe I'm just a terrible programmer, but I think the author may be overemphasizing the seriousness of the bug a little. To me, this is one of those throwaway issues that you keep in the back of your mind, unless I'm coding something that is extremely data-centric and critical in that sense. But it's not like it's on one of my "top 20 must run tests" or anything. It's always one of those issues I ignore or assume is correct until I find out it isn't. When I find out, it's simply a matter of tossing in a code page translation at either the input or output end and I'm done with it.<p>Or maybe I've just been fortunate enough to be in an environment where an occasional goof of this caliber doesn't have any serious consequences.
'So do political nerds get to moan because the author referred to Boehner as "the Senator"?'<p><a href="https://twitter.com/douglas/status/89080894018686976" rel="nofollow">https://twitter.com/douglas/status/89080894018686976</a>
Hey I work for the company responsible for the visualization behind the president and the content on <a href="http://askobama.twitter.com" rel="nofollow">http://askobama.twitter.com</a><p>Let me take this very excellent opportunity to say that we are looking to hire a full time "front end" developer. You'll get to work on badass projects like the Obama Town Hall. Ideally, you'd be located in Austin. Find me on Twitter @efalcao to learn more.
Did they see the encoding mistake before they showed it?<p>Because I wonder how difficult it would be to create a string that says something innocuous in UTF-8 (e.g., "When will you bring the troops home #AskObama") but in ASCII would read as something totally different, but legible (e.g., "the secret priests would take great Cthulhu from his tomb to revive His subjects and resume his rule of earth...")
Errors like this are what my coworkers and I jokingly refer to as US-UTF8 (no offense meant). In a country that's dominated by ASCII, "supporting" UTF8 means "emitting the same data as usual but declaring it as UTF8".<p>Sure, there might be some misunderstandings with special punctuation characters, as evidenced by the article, but such issues generally get low priority.<p>In countries where the language isn't representable in ASCII, we can't use US-UTF8 and have to resort to "real-UTF8", which means dealing with legacy systems that don't do UTF8 (which is what happened in the article we're currently commenting on), dealing with browsers that lie about encoding, and dealing with the fact that a string's length isn't its byte length any more, even if it doesn't contain "fancy" punctuation characters.<p>All that makes me wish I could do US-UTF8 too :-)
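A small Python illustration of the "string length isn't byte length" point above. The sample text and the `truncate_utf8` helper are made up for the example; it is a sketch of one way to cut a string to a byte budget without splitting a multi-byte character, not a recommendation for any particular system.

```python
def truncate_utf8(text, max_bytes):
    """Cut text so its UTF-8 form fits in max_bytes, never splitting a character."""
    clipped = text.encode("utf-8")[:max_bytes]
    # errors="ignore" drops the incomplete multi-byte sequence left at the cut.
    return clipped.decode("utf-8", errors="ignore")


text = "naïve café"
print(len(text))                  # 10 characters
print(len(text.encode("utf-8")))  # 12 bytes: ï and é take two bytes each
print(truncate_utf8("café", 4))   # caf (the two-byte é does not fit)
```

With pure ASCII input both lengths match, which is why the US-UTF8 approach appears to work until the first accented letter shows up.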
What a cool shot. The prez with a common bug on his screen. And a bug I can fix! Still, even though this is an easy fix, he's going to need to open up a ticket.
Part of the problem is that UTF-8 makes things really, really simple, and bulletproof, and then people have to go and create problems again.<p>Listen. Any time you use an encoding other than UTF-8, you are creating incompatibilities. If your stated intention is to facilitate communication, you are failing. You are a bad person. Stop doing it. The only possible excuse for using a non-UTF-8 encoding is to frustrate communication.<p>(It's too fucking bad HTTP mandates that the default charset is ISO-8859-1.)
What modern software stack uses Extended ASCII as its default encoding? The last time I dealt with this problem, it was in 2005 or 2006 and I was working with PHP.
What bugs me is not the mis-encoding (though that's a fail), but that people "struggled to understand" it... surely everyone's seen apostrophes turn into these special characters on enough web pages over the past ten years to recognize what's happening when it does.
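For anyone who hasn't seen it spelled out, here is a minimal Python sketch of the apostrophe garbling mentioned above. It assumes the consumer decodes UTF-8 bytes as Windows-1252; nothing in this thread confirms which legacy encoding was actually involved, and `garble` and `repair` are just names for the example.

```python
def garble(text, wrong="cp1252"):
    """Encode as UTF-8, then decode with the wrong (assumed) legacy encoding."""
    return text.encode("utf-8").decode(wrong)


def repair(garbled, wrong="cp1252"):
    """Undo garble(): recover the original bytes, then decode them as UTF-8."""
    return garbled.encode(wrong).decode("utf-8")


original = "don\u2019t"  # "don't" with a typographic apostrophe (U+2019)
broken = garble(original)
print(broken)                      # donâ€™t
print(repair(broken) == original)  # True
```

The curly apostrophe is three bytes in UTF-8 (E2 80 99), and each byte is shown as its own Windows-1252 character, hence the familiar three-character mess. The round trip only works when every byte is defined in the wrong encoding; Python's cp1252 codec, for instance, rejects a few byte values such as 0x9D, so a curly closing double quote would raise an error here instead of garbling.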
An error like this should not detract from the value Mass Relevance delivers. Clearly an event like this or similar events like the Oscars in which they take part are better, more engaging because of their involvement.
Thank you for the fascinating article! I've seen this bug in other places, and I never knew what it was (I usually brushed it off instead of digging deeper to find out).