> do not use template languages to generate XML.<p>Small correction: do not use <i>text</i> template languages (Jinja, Mustache, ERB — which seems to be the one used here, considering the `%= display_date %>` — raw PHP, Smarty, FreeMarker, what have you) to generate XML. There are templating languages whose <i>primary</i> use case is generating markup (including XML)[0], and unless they're broken to the point of uselessness they guarantee the output is well-formed XML.<p>> Schema-design-wise, the content:encoded and excerpt:encoded element names are deeply suspect, as if someone looked at RSS 2.0, squinted, shrugged, and invented their own ad hoc analogous namespace prefix, rather than understanding the role of elements in XML.<p>They seem to be using WordPress's WXR import/export format, hence the wp-namespaced elements. The "content" and "excerpt" namespace garbage comes straight from there, according to <a href="http://ipggi.wordpress.com/2011/03/16/the-wordpress-extended-rss-wxr-exportimport-xml-document-format-decoded-and-explained/" rel="nofollow">http://ipggi.wordpress.com/2011/03/16/the-wordpress-extended...</a><p>> <content:encoded> Is the replacement for the restrictive Rss <description> element. Enclosed within a character data enclosure is the complete WordPress formatted blog post, HTML tags and all.<p>> <excerpt:encoded> This is an unknown element. This is a summary or description of the post, often used by RSS/Atom feeds.<p>Considering the cottage industry built around WordPress interaction, shooting for interop was probably a good move (it should allow Posterous exports to be imported directly into WordPress?). Not sure they succeeded, though.<p>[0] Genshi, for instance: <a href="http://genshi.edgewall.org/" rel="nofollow">http://genshi.edgewall.org/</a>
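To make the failure mode concrete, here is a minimal sketch (Python stdlib; the post title is invented) of why pasting values into a text template breaks, while escaping as any markup-aware serializer would keeps the output well-formed:

```python
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET

# Hypothetical post title containing characters XML reserves.
title = 'Tips & tricks for <blink> fans'

# Text-template style: the value is pasted in verbatim, and the
# result is not well-formed XML.
templated = '<title>%s</title>' % title
try:
    ET.fromstring(templated)
    templated_parses = True
except ET.ParseError:
    templated_parses = False

# A markup-aware approach escapes reserved characters first, so the
# output parses no matter what the input contains.
serialized = '<title>%s</title>' % escape(title)
round_tripped = ET.fromstring(serialized).text
```

The parsed text comes back identical to the original value, which is the whole point of letting the serializer own the escaping.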
One thing that bugs me about this is the use of CDATA. CDATA sections are just about OK in hand-crafted XML, but in machine-generated XML they are absolutely pointless, and usually a hint that the coder doesn't know what they're doing.<p>For example, the author thinks the content inside the CDATA is escaped, but in fact it isn't necessarily - e.g. in this case they're including chunks of HTML which may contain <i>more</i> CDATA sections, and of course those don't nest (you need to terminate and restart the CDATA section). I've also seen cases where the enclosing document's encoding and the encoding of the CDATA content were incompatible.<p>The worst thing is specs with CDATA sections in their examples. Junior devs bend over backwards with things like XSL's disable-output-escaping to get a character-for-character match in test results, and then wonder why their code breaks in production.
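The nesting problem is easy to reproduce (Python sketch; the post body is invented): wrapping content that itself contains `]]>` in a CDATA section breaks the document unless you terminate and restart the section, exactly as described above:

```python
import xml.etree.ElementTree as ET

# Hypothetical post body that itself contains a CDATA terminator.
body = 'see <![CDATA[raw]]> for details'

# Naive wrapping: the inner "]]>" terminates the section early and
# the document no longer parses.
naive = '<content><![CDATA[%s]]></content>' % body
try:
    ET.fromstring(naive)
    naive_parses = True
except ET.ParseError:
    naive_parses = False

# The terminate-and-restart trick: split every "]]>" across two
# CDATA sections so no section ever contains the terminator.
safe = '<content><![CDATA[%s]]></content>' % body.replace(
    ']]>', ']]]]><![CDATA[>')
recovered = ET.fromstring(safe).text
```

Plain entity escaping avoids the problem entirely, which is one more reason CDATA buys nothing in machine-generated output.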
To be fair to the Posterous team, they are doing a good job of fixing the bugs in the export as they are reported.<p>Hopefully they will get all of them fixed before the final shutdown.<p>If you want an easy way to get your Posterous export file cleaned up into more valid XML, feel free to use the Import from Posterous option over at WordPress.com - <a href="http://en.support.wordpress.com/import/import-from-posterous/" rel="nofollow">http://en.support.wordpress.com/import/import-from-posterous...</a><p>We've spent some time writing code that cleans up the XML file so it can be imported into WordPress successfully.<p>You can then export a clean WXR file and import it elsewhere much more easily - <a href="http://en.support.wordpress.com/export/" rel="nofollow">http://en.support.wordpress.com/export/</a>
I don't see any problem with this XML that can't be easily overcome.<p>The complaint about GMT-offsetting the date is particularly petty, assuming the blog in question isn't about ephemerides. By and large, blog posts have dates. If you desperately need an hour offset from GMT, one might suggest you are the edge case, because by and large it doesn't matter.<p>Count me among those who would argue that the omission of a schema is a blessing.<p>I've wasted whole f*cking days of my life wrangling with so-called "non-amateur" XML. Invariably it was over-bloated XML with schemas that did nothing to help the discoverability or the processing of the data. Plain and simple: XML is over-spec'd, and many data publishers, aided by their inflexible toolsets, pushed their XML beyond reason.<p>Be careful what you wish for.<p>I would take this XML, map it, iterate it, done! End of story. I don't think there's much to complain about here.
He has a few valid complaints (by a few I mean one), but this is really not that bad compared to a lot of the XML floating around. No reason to be <i>shocked</i>.<p><i>"There are no namespace declarations. No self-respecting XML parser will have anything to do with this XML data."</i><p>I don't get this comment. I have never seen an XML parser that would refuse to parse XML without a namespace.<p>Am I missing something? Or is that just mindless hyperbole?
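For what it's worth, the distinction (sketched in Python here) appears to be between XML that uses no namespaces at all, which every parser accepts, and XML that uses a prefix without declaring it, which namespace-aware parsers reject as not well-formed:

```python
import xml.etree.ElementTree as ET

# No namespaces anywhere: parses fine, no declarations needed.
plain = ET.fromstring('<item><title>hello</title></item>')

# A prefix with no xmlns binding, as in the Posterous export:
# namespace-aware parsers refuse it with an "unbound prefix" error.
try:
    ET.fromstring('<item><content:encoded>hi</content:encoded></item>')
    prefix_parses = True
except ET.ParseError:
    prefix_parses = False
```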
There's nothing wrong with invalid XML - why is everyone complaining? C compilers should similarly take a stab in the dark at what the programmer meant when they encounter invalid syntax. And those linker errors always annoy me - the linker should just pick the closest matching symbol if the specified one can't be found.
"There are no namespace declarations. No self-respecting XML parser will have anything to do with this XML data."<p>I would argue that any self-respecting XML parser should parse it just fine, and shouldn't demand that the namespaces be declared at all.<p>"...invented their own ad hoc analogous namespace prefix, rather than understanding the role of elements in XML"<p>I don't think you understand the basic concept of XML much. It is meant to be a generic container that holds whatever you want. XML in and of itself doesn't enforce node naming. Sure, the official spec does, but people pretty much globally use whatever node names they want. Don't have a cow.<p>"I haven’t been able to determine the intended encoding of the files"<p>Well, maybe you should look into a parser that just parses as is, without attempting to force a specific encoding.<p>Check out XML::Bare on CPAN for Perl. It will parse pretty much anything you throw at it, in any encoding, and it leaves it up to you, the user, to decide what to do with the data after parsing.
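In the spirit of parsing "as is", one pragmatic move is to sniff the encoding yourself before handing bytes to a parser. A rough sketch (Python here; the function name and regex are mine, not from XML::Bare):

```python
import re

def sniff_encoding(raw: bytes) -> str:
    """Best-effort guess at an XML file's encoding: check for a
    UTF-8 BOM, then the XML declaration, then fall back to the
    spec's default of UTF-8."""
    if raw.startswith(b'\xef\xbb\xbf'):
        return 'utf-8'
    m = re.match(rb'<\?xml[^>]*encoding=["\']([A-Za-z0-9._-]+)["\']', raw)
    return m.group(1).decode('ascii') if m else 'utf-8'

declared = sniff_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?><a/>')
fallback = sniff_encoding(b'<feed/>')
```

Once you have a guess, you can decode with `errors='replace'` and let the user sort out the damage, rather than dying on the first bad byte.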
So it seems that we prefer XML which is easy to read. I have seen files like this way too often: <xml><item><key>1</key><value>Something</value></item><item....></xml><p>Then you have to combine whatever keys and values are in the item tags. I've found these files very annoying to handle. Especially when the key is X3 and the value is 83d, you have to look up every combination in some external mapping, because neither tells you anything directly. At least it's easy to create files that fulfill the schema, because the complexity is pushed outside the XML. Often these files are created by "upgrading" CSV to XML: let's just call column # the key and put whatever is in that column into the value tag. Yes, attributes could be used, but often aren't.<p>Then you have to know that if key X contains value Y, you also need to look for key Z and hope it contains value N or whatever.
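A sketch of what consuming such a file looks like in practice (Python; the codes and the mapping are invented): first flatten the generic item tags into a dict, then consult the out-of-band codebook because the keys themselves tell you nothing:

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    '<xml>'
    '<item><key>X3</key><value>83d</value></item>'
    '<item><key>Z9</key><value>N</value></item>'
    '</xml>'
)

# Collapse the generic key/value pairs into a plain dict...
pairs = {item.findtext('key'): item.findtext('value')
         for item in doc.findall('item')}

# ...then apply the external mapping the schema pushed out of the
# XML (entirely made up here, which is exactly the problem).
CODEBOOK = {('X3', '83d'): 'status: shipped'}
meaning = CODEBOOK.get(('X3', pairs.get('X3')))
```

All the real semantics live in `CODEBOOK`, outside the document, which is why these files validate trivially and explain nothing.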
I'd just like to take a moment to mention nested comments.<p>Oh, if I had £1 for every time I've had to sift through lines and lines of code because I can't just comment out an element. I just can't comprehend why they'd need to reserve -- inside a comment.
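The pain is easy to reproduce (Python sketch): because -- is reserved, commenting out an element that already contains a comment leaves a document that no longer parses:

```python
import xml.etree.ElementTree as ET

# An ordinary comment is fine.
ET.fromstring('<root><!-- a note --></root>')

# Commenting out an element that already contains a comment: the
# inner "-->" ends the outer comment early, so the parse fails.
nested = '<root><!-- <item><!-- old note --></item> --></root>'
try:
    ET.fromstring(nested)
    nested_parses = True
except ET.ParseError:
    nested_parses = False
```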
It's ironic how many problems (large and irritating enough to justify blog posts or public spats) could have been avoided if someone had bothered to test beforehand.<p>Anyone who did a trial export would have immediately seen the missing dates.
Yes, it's crap, but it would take a few minutes and a couple of sed scripts to clean this up - turn ns:tag into ns_tag or something to make it parseable.<p>Or you could prepend some fake namespace declarations.
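Both fixes are a few lines in any language; here is a rough Python equivalent of the two approaches (the fake namespace URI is arbitrary):

```python
import re
import xml.etree.ElementTree as ET

broken = '<rss><content:encoded>hi</content:encoded></rss>'

# Approach 1: sed-style rewrite of ns:tag into ns_tag, so no
# namespaces are involved at all.
flattened = re.sub(r'(</?)(\w+):', r'\1\2_', broken)
flat_text = ET.fromstring(flattened).findtext('content_encoded')

# Approach 2: prepend a fake declaration for each prefix in use;
# the document then parses as ordinary namespaced XML.
declared = broken.replace(
    '<rss>', '<rss xmlns:content="urn:x-fake:content">', 1)
ET.fromstring(declared)  # now well-formed
```

The regex rewrite is cruder but leaves you with plain tag names; the fake declarations keep the original names at the cost of namespace-qualified lookups downstream.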