> do not use template languages to generate XML.<p>Small correction: do not use <i>text</i> template languages (Jinja, Mustache, ERB — which seems to be the one used here, considering the `%= display_date %>` — raw PHP, Smarty, FreeMarker, what have you) to generate XML. There are templating languages whose <i>primary</i> use case is generating markup (including XML)[0], and unless they're broken to the point of uselessness they guarantee the output is well-formed XML.<p>> Schema-design-wise, the content:encoded and excerpt:encoded element names are deeply suspect, as if someone looked at RSS 2.0, squinted, shrugged, and invented their own ad hoc analogous namespace prefix, rather than understanding the role of elements in XML.<p>They seem to be using WordPress's WXR import/export format, hence the wp-namespaced elements. The "content" and "excerpt" namespace garbage comes straight from there, according to <a href="http://ipggi.wordpress.com/2011/03/16/the-wordpress-extended-rss-wxr-exportimport-xml-document-format-decoded-and-explained/" rel="nofollow">http://ipggi.wordpress.com/2011/03/16/the-wordpress-extended...</a><p>> <content:encoded> Is the replacement for the restrictive Rss <description> element. Enclosed within a character data enclosure is the complete WordPress formatted blog post, HTML tags and all.<p>> <excerpt:encoded> This is an unknown element. This is a summary or description of the post, often used by RSS/Atom feeds.<p>Considering the cottage industry built around WordPress interaction, shooting for interop was probably a good move (it should allow Posterous exports to be imported directly into WordPress?). Not sure they succeeded, though.<p>[0] Genshi, for instance: <a href="http://genshi.edgewall.org/" rel="nofollow">http://genshi.edgewall.org/</a>
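To make the failure mode concrete, here is a minimal sketch (Python stdlib; the post title is invented) of why pasting values into a text template breaks, while escaping as any markup-aware serializer would keeps the output well-formed:

```python
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET

# Hypothetical post title containing characters XML reserves.
title = 'Tips & tricks for <blink> fans'

# Text-template style: the value is pasted in verbatim, and the
# result is not well-formed XML.
templated = '<title>%s</title>' % title
try:
    ET.fromstring(templated)
    templated_parses = True
except ET.ParseError:
    templated_parses = False

# A markup-aware approach escapes reserved characters first, so the
# output parses no matter what the input contains.
serialized = '<title>%s</title>' % escape(title)
round_tripped = ET.fromstring(serialized).text
```

The parsed text comes back identical to the original value, which is the whole point of letting the serializer own the escaping.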
One thing that bugs me about this is the use of CDATA. CDATA sections are just about OK in hand-crafted XML, but in machine-generated XML they are absolutely pointless, and usually a hint that the coder doesn't know what they're doing.<p>For example, the author thinks the content inside the CDATA is escaped, but in fact it isn't necessarily - e.g. in this case they're including chunks of HTML which may contain <i>more</i> CDATA sections, and of course those don't nest (you need to terminate and restart the CDATA section). I've also seen cases where the enclosing document's encoding and the encoding of the CDATA content were incompatible.<p>The worst thing is specs with CDATA sections in their examples. Junior devs bend over backwards with things like XSL's disable-output-escaping to get a character-for-character match in test results, and then wonder why their code breaks in production.
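The nesting problem is easy to reproduce (Python sketch; the post body is invented): wrapping content that itself contains `]]>` in a CDATA section breaks the document unless you terminate and restart the section, exactly as described above:

```python
import xml.etree.ElementTree as ET

# Hypothetical post body that itself contains a CDATA terminator.
body = 'see <![CDATA[raw]]> for details'

# Naive wrapping: the inner "]]>" terminates the section early and
# the document no longer parses.
naive = '<content><![CDATA[%s]]></content>' % body
try:
    ET.fromstring(naive)
    naive_parses = True
except ET.ParseError:
    naive_parses = False

# The terminate-and-restart trick: split every "]]>" across two
# CDATA sections so no section ever contains the terminator.
safe = '<content><![CDATA[%s]]></content>' % body.replace(
    ']]>', ']]]]><![CDATA[>')
recovered = ET.fromstring(safe).text
```

Plain entity escaping avoids the problem entirely, which is one more reason CDATA buys nothing in machine-generated output.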
To be fair to the Posterous team, they are doing a good job of fixing the bugs in the export as they are reported.<p>Hopefully they will get all of them fixed before the final shutdown.<p>If you want an easy way to get your Posterous export file cleaned up into more valid XML, feel free to use the Import from Posterous option over at WordPress.com - <a href="http://en.support.wordpress.com/import/import-from-posterous/" rel="nofollow">http://en.support.wordpress.com/import/import-from-posterous...</a><p>We've spent some time writing code that cleans up the XML file so it can be imported into WordPress successfully.<p>You can then export a clean WXR file and import it elsewhere much more easily - <a href="http://en.support.wordpress.com/export/" rel="nofollow">http://en.support.wordpress.com/export/</a>
I don't see any problem with this XML that can't be easily overcome.<p>The complaint about GMT-offsetting the date is particularly petty, assuming the blog in question isn't about ephemerides. By and large, blog posts have dates. If you desperately need an hour offset from GMT, one might suggest you are the edge case, because by and large it doesn't matter.<p>Count me among those who would argue that the omission of a schema is a blessing.<p>I've wasted whole f*cking days of my life wrangling with so-called "non-amateur" XML. Invariably it was over-bloated XML with schemas that did nothing to help the discoverability or the processing of the data. Plain and simple: XML is over-spec'd, and many data publishers, aided by their inflexible toolsets, pushed their XML beyond reason.<p>Be careful what you wish for.<p>I would take this XML, map it, iterate it, done! End of story. I don't think there's much to complain about here.
He has a few valid complaints (by a few I mean one), but this is really not that bad compared to a lot of the XML floating around. No reason to be <i>shocked</i>.<p><i>"There are no namespace declarations. No self-respecting XML parser will have anything to do with this XML data."</i><p>I don't get this comment. I have never seen an XML parser that would refuse to parse XML without a namespace.<p>Am I missing something? Or is that just mindless hyperbole?
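For what it's worth, the distinction (sketched in Python here) appears to be between XML that uses no namespaces at all, which every parser accepts, and XML that uses a prefix without declaring it, which namespace-aware parsers reject as not well-formed:

```python
import xml.etree.ElementTree as ET

# No namespaces anywhere: parses fine, no declarations needed.
plain = ET.fromstring('<item><title>hello</title></item>')

# A prefix with no xmlns binding, as in the Posterous export:
# namespace-aware parsers refuse it with an "unbound prefix" error.
try:
    ET.fromstring('<item><content:encoded>hi</content:encoded></item>')
    prefix_parses = True
except ET.ParseError:
    prefix_parses = False
```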
There's nothing wrong with invalid XML - why is everyone complaining? C compilers should similarly take a stab in the dark at what the programmer meant when they encounter invalid syntax. And those linker errors always annoy me - the linker should just pick the closest matching symbol if the specified one can't be found.
"There are no namespace declarations. No self-respecting XML parser will have anything to do with this XML data."<p>I would argue that any self-respecting XML parser should parse it just fine, and shouldn't demand that the namespaces be declared at all.<p>"...invented their own ad hoc analogous namespace prefix, rather than understanding the role of elements in XML"<p>I don't think you understand the basic concept of XML much. It is meant to be a generic container that holds whatever you want. XML in and of itself doesn't enforce node naming. Sure, the official spec does, but people pretty much globally use whatever node names they want. Don't have a cow.<p>"I haven’t been able to determine the intended encoding of the files"<p>Well, maybe you should look into a parser that just parses as is, without attempting to force a specific encoding.<p>Check out XML::Bare on CPAN for Perl. It will parse pretty much anything you throw at it, in any encoding, and it leaves it up to you, the user, to decide what to do with the data after parsing.
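In the spirit of parsing "as is", one pragmatic move is to sniff the encoding yourself before handing bytes to a parser. A rough sketch (Python here; the function name and regex are mine, not from XML::Bare):

```python
import re

def sniff_encoding(raw: bytes) -> str:
    """Best-effort guess at an XML file's encoding: check for a
    UTF-8 BOM, then the XML declaration, then fall back to the
    spec's default of UTF-8."""
    if raw.startswith(b'\xef\xbb\xbf'):
        return 'utf-8'
    m = re.match(rb'<\?xml[^>]*encoding=["\']([A-Za-z0-9._-]+)["\']', raw)
    return m.group(1).decode('ascii') if m else 'utf-8'

declared = sniff_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?><a/>')
fallback = sniff_encoding(b'<feed/>')
```

Once you have a guess, you can decode with `errors='replace'` and let the user sort out the damage, rather than dying on the first bad byte.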
So it seems that we prefer XML which is easy to read. I have seen files like this way too often: <xml><item><key>1</key><value>Something</value></item><item....></xml><p>Then you have to combine whatever keys and values are in the item tags. I've found these files very annoying to handle. Especially when the key is X3 and the value is 83d, you have to look up every combination in some external mapping, because neither tells you anything directly. At least it's easy to create files that fulfill the schema, because the complexity is pushed outside the XML. Often these files are created by "upgrading" CSV to XML: let's just call column # the key and put whatever is in that column into the value tag. Yes, attributes could be used, but often aren't.<p>Then you have to know that if key X contains value Y, you also need to look for key Z and hope it contains value N or whatever.
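A sketch of what consuming such a file looks like in practice (Python; the codes and the mapping are invented): first flatten the generic item tags into a dict, then consult the out-of-band codebook because the keys themselves tell you nothing:

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    '<xml>'
    '<item><key>X3</key><value>83d</value></item>'
    '<item><key>Z9</key><value>N</value></item>'
    '</xml>'
)

# Collapse the generic key/value pairs into a plain dict...
pairs = {item.findtext('key'): item.findtext('value')
         for item in doc.findall('item')}

# ...then apply the external mapping the schema pushed out of the
# XML (entirely made up here, which is exactly the problem).
CODEBOOK = {('X3', '83d'): 'status: shipped'}
meaning = CODEBOOK.get(('X3', pairs.get('X3')))
```

All the real semantics live in `CODEBOOK`, outside the document, which is why these files validate trivially and explain nothing.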
I'd just like to take a moment to mention nested comments.<p>Oh, if I had £1 for every time I've had to sift through lines and lines of code because I can't just comment out an element. I just can't comprehend why they'd need to reserve -- inside a comment.
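The pain is easy to reproduce (Python sketch): because -- is reserved, commenting out an element that already contains a comment leaves a document that no longer parses:

```python
import xml.etree.ElementTree as ET

# An ordinary comment is fine.
ET.fromstring('<root><!-- a note --></root>')

# Commenting out an element that already contains a comment: the
# inner "-->" ends the outer comment early, so the parse fails.
nested = '<root><!-- <item><!-- old note --></item> --></root>'
try:
    ET.fromstring(nested)
    nested_parses = True
except ET.ParseError:
    nested_parses = False
```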
It's ironic how many problems (large and irritating enough to justify blog posts or public spats) could have been avoided if someone had bothered to test beforehand.<p>Anyone who did a trial export would have immediately seen the missing dates.
Yes, it's crap, but it would take a few minutes and a couple of sed scripts to clean this up - turn ns:tag into ns_tag or something to make it parseable.<p>Or you could prepend some fake namespace declarations.
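Both fixes are a few lines in any language; here is a rough Python equivalent of the two approaches (the fake namespace URI is arbitrary):

```python
import re
import xml.etree.ElementTree as ET

broken = '<rss><content:encoded>hi</content:encoded></rss>'

# Approach 1: sed-style rewrite of ns:tag into ns_tag, so no
# namespaces are involved at all.
flattened = re.sub(r'(</?)(\w+):', r'\1\2_', broken)
flat_text = ET.fromstring(flattened).findtext('content_encoded')

# Approach 2: prepend a fake declaration for each prefix in use;
# the document then parses as ordinary namespaced XML.
declared = broken.replace(
    '<rss>', '<rss xmlns:content="urn:x-fake:content">', 1)
ET.fromstring(declared)  # now well-formed
```

The regex rewrite is cruder but leaves you with plain tag names; the fake declarations keep the original names at the cost of namespace-qualified lookups downstream.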