Why the #AskObama tweet was garbled on screen

322 points by brianwillis almost 14 years ago

23 comments

pak almost 14 years ago

"This is SUCH a classic sloppy programmer mistake that I'm disappointed"

Oh come off of it. This happens everywhere on the web, on probably something like 25% of websites. And it's NOT always the consuming program's fault: very often somebody upstream (the hosting company, the person who wrote the HTML, the source of an RSS feed being inserted into the page, etc.) forgot to encode something the way somebody else expected, and you, as the poor guy at the end of the chain, get a document with multiple encodings improperly embedded into it. Inevitably you have to make some bad decisions, and not all corner cases are handled.

Somebody once reverse-engineered the state chart for how Internet Explorer handles documents with conflicting encoding declarations and, I kid you not, it must have had >20 branches spanning a good few pages. Officially, the correct order of precedence is (http://www.w3.org/International/questions/qa-html-encoding-declarations):

1. HTTP Content-Type header
2. byte-order mark (BOM)
3. XML declaration
4. meta element
5. link charset attribute

but that's not how every browser does it, because the W3C sort of declared that after things on the Real Internet (TM) had already gotten out of hand. I hate to resuscitate Joel posts, but Unicode is not easy to implement right.
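For illustration only, the precedence order pak lists could be sketched roughly like this in Python. The function name `pick_encoding`, the dictionary keys, and the UTF-8 fallback are all hypothetical choices made for this sketch, not anything a browser or the W3C page actually defines.

```python
# Sources of an encoding declaration, highest precedence first,
# in the order given in the comment above. The key names are made up.
PRECEDENCE = (
    "http_header",      # 1. HTTP Content-Type header
    "bom",              # 2. byte-order mark (BOM)
    "xml_declaration",  # 3. XML declaration
    "meta_element",     # 4. meta element
    "link_charset",     # 5. link charset attribute
)


def pick_encoding(declared: dict, fallback: str = "utf-8") -> str:
    """Return the declared encoding from the highest-precedence source.

    `declared` maps a source name to the encoding that source claims.
    Sources that are missing or empty are skipped. The fallback value
    is an assumption of this sketch.
    """
    for source in PRECEDENCE:
        value = declared.get(source)
        if value:
            return value.lower()
    return fallback


# Conflicting declarations: the HTTP header wins over the meta element.
print(pick_encoding({"meta_element": "UTF-8", "http_header": "ISO-8859-1"}))
```

Real browsers, as the comment says, do not all follow this order, so this is the "official" rule only.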

giberson almost 14 years ago

Maybe I'm just a terrible programmer, but I think the author may be overemphasizing the seriousness of the bug a little. To me, this is one of those throwaway issues that you keep in the back of your mind, unless I'm coding something that is extremely data-centric and critical in that sense. But it's not like it's on one of my "top 20 must run tests" or anything. It's always one of those issues I ignore or assume is correct until I find out it isn't. When I find out, it's simply a matter of tossing in a code page translation at either the input or output end and I'm done with it.

Or maybe I've just been fortunate enough to be in an environment where an occasional goof of this caliber doesn't have any serious consequences.
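A minimal sketch of what "tossing in a code page translation" at the input end might look like in Python, assuming you already know which encoding the incoming bytes really use. The helper name `transcode` is hypothetical.

```python
def transcode(data: bytes, source: str, target: str = "utf-8") -> bytes:
    """Decode bytes from their actual encoding and re-encode them.

    This only works if `source` is correct; guessing wrong is exactly
    how mangled apostrophes end up on screen.
    """
    return data.decode(source).encode(target)


# "café" as Windows-1252 bytes, translated to UTF-8 bytes.
print(transcode(b"caf\xe9", "windows-1252"))
```

The hard part in practice is knowing the source encoding, not the translation itself.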

yahelc almost 14 years ago

'So do political nerds get to moan because the author referred to Boehner as "the Senator"?'

https://twitter.com/douglas/status/89080894018686976

efalcao almost 14 years ago

Hey, I work for the company responsible for the visualization behind the president and the content on http://askobama.twitter.com

Let me take this very excellent opportunity to say that we are looking to hire a full-time "front end" developer. You'll get to work on badass projects like the Obama Town Hall. Ideally, you'd be located in Austin. Find me on Twitter @efalcao to learn more.

js2 almost 14 years ago

tl;dr:

    $ python
    >>> print u"\u2019".encode("utf-8").decode("Windows-1252")
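For anyone on Python 3, where `print` is a function and every `str` is already Unicode, a rough equivalent of the one-liner above might look like this. The helper names `garble` and `repair` are made up for this sketch, and the repair step is an addition that the original comment does not include.

```python
RIGHT_QUOTE = "\u2019"  # RIGHT SINGLE QUOTATION MARK, the "curly" apostrophe


def garble(text: str) -> str:
    """Encode as UTF-8, then wrongly decode those bytes as Windows-1252."""
    return text.encode("utf-8").decode("windows-1252")


def repair(text: str) -> str:
    """Undo garble(): re-encode as Windows-1252, then decode as UTF-8."""
    return text.encode("windows-1252").decode("utf-8")


garbled = garble(RIGHT_QUOTE)

# Print code points rather than the characters themselves, so the output
# does not depend on the terminal's own encoding.
print([hex(ord(ch)) for ch in garbled])
print(repair(garbled) == RIGHT_QUOTE)
```

The three code points printed are U+00E2, U+20AC and U+2122, which is the familiar three-character mess. Note that `garble` can raise `UnicodeDecodeError` for other input, because Python's Windows-1252 codec leaves a few byte values undefined.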

pavel_lishin almost 14 years ago

Did they see the encoding mistake before they showed it?

Because I wonder how difficult it would be to create a string that says something innocuous in UTF-8 (e.g., "When will you bring the troops home #AskObama") but in ASCII would read as something totally different, but legible (e.g., "the secret priests would take great Cthulhu from his tomb to revive His subjects and resume his rule of earth...")

pilif almost 14 years ago

Errors like this are what my coworkers and I jokingly refer to as US-UTF8 (no offense meant). In a country that's dominated by ASCII, "supporting" UTF8 means "emitting the same data as usual but declaring it as UTF8".

Sure, there might be some misunderstandings with special punctuation characters, as evidenced by the article, but such issues generally get low priority.

In countries where the language isn't representable in ASCII, we can't use US-UTF8, but have to resort to "real-UTF8", which means dealing with legacy systems that don't do UTF8 (which is what happened in the article we're currently commenting on), dealing with browsers that lie about encoding, and dealing with the fact that a string length isn't its byte length any more, even if it doesn't contain "fancy" punctuation characters.

All that makes me wish I could do US-UTF8 too :-)
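To make the "string length isn't its byte length" point concrete, here is a small Python sketch. The example word is an arbitrary choice, written with an escape so the snippet does not depend on the file's own encoding.

```python
text = "Z\u00fcrich"  # "Zürich": six characters, one of them outside ASCII

utf8_bytes = text.encode("utf-8")

print(len(text))        # number of characters (code points)
print(len(utf8_bytes))  # number of bytes once encoded as UTF-8
```

The two numbers differ because the u-umlaut takes two bytes in UTF-8, while the ASCII letters take one each.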

zach almost 14 years ago

Proof that even completely ordinary string data used in the most USA-centric domain imaginable STILL needs proper encoding.

markbnine almost 14 years ago

What a cool shot. The prez with a common bug on his screen. And a bug I can fix! Still, even though this is an easy fix, he's going to need to open up a ticket.

anigbrowl almost 14 years ago

I find it infuriating that this sort of thing is still a problem. I'm constantly seeing mangled apostrophes in places like Google Reader too.

kragen almost 14 years ago

Part of the problem is that UTF-8 makes things really, really simple, and bulletproof, and then people have to go and create problems again.

Listen. Any time you use an encoding other than UTF-8, you are creating incompatibilities. If your stated intention is to facilitate communication, you are failing. You are a bad person. Stop doing it. The only possible excuse for using a non-UTF-8 encoding is to frustrate communication.

(It's too fucking bad HTTP mandates that the default charset is ISO-8859-1.)
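As a rough illustration of that last point, this Python sketch reads the charset from a Content-Type header value and falls back to ISO-8859-1 when none is declared. The helper name `charset_of` and the constant are hypothetical. Using the standard library's `email.message.Message` to parse the header is just a convenient choice for the sketch.

```python
from email.message import Message

# The default the comment above refers to; treated here as an assumption.
HTTP_DEFAULT_CHARSET = "iso-8859-1"


def charset_of(content_type: str) -> str:
    """Return the charset declared in a Content-Type header value.

    Falls back to HTTP_DEFAULT_CHARSET when the header has no charset
    parameter.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset(failobj=HTTP_DEFAULT_CHARSET)


print(charset_of("text/html; charset=UTF-8"))
print(charset_of("text/html"))
```

A server that sends UTF-8 bytes without the charset parameter is relying on the client to guess, which is where the trouble starts.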

Jach almost 14 years ago

Zed Shaw, can you make a post called "Programmers Need To Learn Unicode Or I Will Kill Them All"?

luigi almost 14 years ago

What modern software stack uses Extended ASCII as its default encoding? The last time I dealt with this problem, it was in 2005 or 2006 and I was working with PHP.

zipdog almost 14 years ago

What bugs me is not the mis-encoding (though that's a fail), but that people "struggled to understand" it... surely everyone's seen apostrophes turn into these special characters on enough web pages over the past ten years to have recognized what's happening when it does.

gutini almost 14 years ago

An error like this should not detract from the value Mass Relevance delivers. Clearly an event like this or similar events like the Oscars in which they take part are better, more engaging because of their involvement.

dabent almost 14 years ago

That’s a lot of detail, but a very good explanation of what happened.

winsbe01 almost 14 years ago

thank you for the fascinating article! I've seen this bug in other places, and I never knew what it was (and usually brushed it off instead of digging deeper to find it).

Jach almost 14 years ago

Well, it could have been worse. They could have shown \'

nextparadigms almost 14 years ago

Why does it say 3 hours ago under the tweet? Wasn't this in real time?

mahmud almost 14 years ago

He had to do a twitter townhall because El Jefe couldn't get a G+ invite :-|

joeyh almost 14 years ago

tl;dr -- mojibake

ignifero almost 14 years ago

So does HN support utf8 ¢ðrrê¢†l¥?

ignifero almost 14 years ago

You have to give it to Microsoft. They use Word even for 140 letter documents now! (I understand spelling is a reason, but browsers have spelling now)
