Great approach, but you could have saved some work (and gained robustness) by using the W3C’s HTML-XML-utils. For example, there’s hxselect, which filters HTML/XML against a CSS selector, and hxpipe, which breaks XML input into a more grep/awk-friendly format. I’ve used these tools myself on multiple occasions, and they’ve saved me a huge amount of time.
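
To give a rough idea of what that looks like, here is a minimal sketch. It assumes html-xml-utils is installed; the sample markup, the `li.item` selector and the `extract_items` helper are made up for illustration and are not from the original post.

```shell
#!/bin/sh
# Hypothetical sample input standing in for a real page.
html='<html><body><ul><li class="item">one</li><li class="item">two</li></ul></body></html>'

# Hypothetical helper: read HTML on stdin, print the text of each matching <li>.
extract_items() {
  # hxnormalize -x rewrites the input as well-formed XML, which hxselect expects.
  # hxselect -c prints only the content of each match; -s sets the separator
  # printed after each match.
  hxnormalize -x | hxselect -c -s '\n' 'li.item'
}

# Only run the pipeline if both tools are actually available.
if command -v hxnormalize >/dev/null 2>&1 && command -v hxselect >/dev/null 2>&1; then
  printf '%s\n' "$html" | extract_items
fi
```

In practice you would probably pipe a downloaded page into the same pipeline instead of the inline sample, and swap in whatever selector matches the elements you care about.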